# Lecture 20 - Parallel Computing
First, consider the map pattern: applying the same function to each element of an array.

```python
# (Very) inefficient way to define a map function
def my_map(function, array):
    # create a container for the results
    output = []
    # loop over each element
    for element in array:
        # add the intermediate result to the container
        output.append(function(element))
    # return the now-filled container
    return output
```

We can use joblib for parallel computing. No step of the map call depends on the other steps, so we can take a function `bar` and apply it to each value simultaneously. We will use joblib for this purpose:

- The `Parallel` function from joblib is used to parallelise the task across as many jobs as we want.
- The `n_jobs` parameter specifies the number of jobs to run in parallel.
- The `delayed` function is used to delay the execution of the function `bar` until the parallelisation is ready.
- The `results` variable will contain the output of the parallel computation.

Here we use the `bar` function and `foo` array from before:

[0, 1, 4, 9, 16, 25]
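A minimal sketch of what such a call might look like, assuming `bar` squares its input and `foo = [0, 1, 2, 3, 4, 5]` (which matches the output above):

```python
from joblib import Parallel, delayed

def bar(x):
    # assumed definition: the output above suggests bar squares its input
    return x ** 2

foo = [0, 1, 2, 3, 4, 5]

# apply bar to each element of foo, using as many jobs as there are CPU cores
results = Parallel(n_jobs=-1)(delayed(bar)(x) for x in foo)
print(results)  # [0, 1, 4, 9, 16, 25]
```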
What joblib is doing here is creating 6 instances of the `bar` function and applying each one to a different element of the `foo` array. Let’s see another example of the difference between serial and parallel execution.
We will use the %timeit magic command to measure the time it takes to run a function
Serial:
445 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Parallel:

648 ms ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
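The code behind these timings is not shown here; a comparison of this kind might look roughly like the following, with `slow_function` as a hypothetical stand-in for the work being timed:

```python
import time
from joblib import Parallel, delayed

def slow_function(x):
    # hypothetical stand-in for a function that takes a noticeable amount of time
    time.sleep(0.1)
    return x ** 2

inputs = range(4)

# serial execution
%timeit [slow_function(x) for x in inputs]

# parallel execution with joblib; starting worker processes adds overhead,
# so small workloads can end up slower in parallel than in serial
%timeit Parallel(n_jobs=4)(delayed(slow_function)(x) for x in inputs)
```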
Imagine we have a set of .jpg images that we want to perform the same actions on, like rotate by 180 degrees and convert to a different format:

```python
from PIL import Image

def image_flipper(file_name):
    # extract the base file name
    base_name = file_name[0:-4]
    # open the file
    im = Image.open(file_name)
    # rotate by 180 degrees
    im_flipped = im.rotate(angle=180)
    # Save a PDF with a new file name
    im_flipped.save(base_name + "_flipped.pdf", format='PDF')
    return base_name + "_flipped.pdf"
```

We have the following files in the data folder:

./data/kings_cross.jpg
./data/charing_cross.jpg
./data/victoria.jpg
./data/waterloo.jpg
./data/euston.jpg
./data/fenchurch.jpg
./data/st_pancras.jpg
./data/london_bridge.jpg
./data/liverpool_street.jpg
./data/paddington.jpg
We can now apply the `image_flipper` function to each file in the list:

['./data/kings_cross_flipped.pdf',
'./data/charing_cross_flipped.pdf',
'./data/victoria_flipped.pdf',
'./data/waterloo_flipped.pdf',
'./data/euston_flipped.pdf',
'./data/fenchurch_flipped.pdf',
'./data/st_pancras_flipped.pdf',
'./data/london_bridge_flipped.pdf',
'./data/liverpool_street_flipped.pdf',
'./data/paddington_flipped.pdf']
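The call that produced the list above is not shown; with joblib it might look something like this (a sketch; the variable name `file_list` is illustrative):

```python
import glob
from joblib import Parallel, delayed

# collect the .jpg files in the data folder
file_list = glob.glob('./data/*.jpg')

# flip each image in parallel, one job per file, spread over the available cores
flipped_files = Parallel(n_jobs=-1)(
    delayed(image_flipper)(file_name) for file_name in file_list
)
flipped_files
```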
The joblib module makes it simple to run these steps together with a single command. Install joblib and NumPy if you haven’t done so yet:

```python
!pip install joblib numpy
```
Dask arrays behave like NumPy arrays but are evaluated lazily: we need to call the `.compute()` method to compute the result. They support most of the NumPy interface, including element-wise arithmetic (`+`, `*`, `exp`, `log`, etc.), reductions (`sum()`, `mean()`, `std()`) and tensor operations (`tensordot`).

557 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
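The array code behind this timing is not shown; a minimal sketch of working with a Dask array might look like this (the shape and chunk size are illustrative):

```python
import dask.array as da

# a large random array split into chunks that each fit comfortably in memory
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# operations only build a task graph ...
y = (x + x.T).mean(axis=0)

# ... nothing is calculated until we ask for the result
y.compute()
```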
Dask DataFrames are split into partitions and evaluated lazily. Here we create an example DataFrame of random time series data with `dask.datasets.timeseries()`:

| | name | id | x | y |
|---|---|---|---|---|
| npartitions=30 | ||||
| 2000-01-01 | string | int64 | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |
We can look at the first few rows, which does trigger an actual computation:

| | name | id | x | y |
|---|---|---|---|---|
| timestamp | ||||
| 2000-01-01 00:00:00 | Michael | 995 | -0.54 | -0.45 |
| 2000-01-01 00:00:01 | Wendy | 1017 | -0.45 | -0.16 |
| 2000-01-01 00:00:02 | Patricia | 1044 | -0.01 | -0.08 |
| 2000-01-01 00:00:03 | Oliver | 956 | -0.10 | -0.83 |
| 2000-01-01 00:00:04 | Zelda | 1009 | 0.46 | 0.82 |
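For reference, a sketch of the code that could produce the two views above:

```python
import dask

# one month of random time series data, one partition per day
df = dask.datasets.timeseries()

df          # lazy: only the structure (columns, dtypes, partitions) is shown
df.head()   # pulls the first few rows into memory
```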
For example, we can filter the rows where y is positive, select the name and x columns, and compute the standard deviation of the x column (we store the result as df3):

Dask Series Structure:
npartitions=1
float64
...
Dask Name: getitem, 8 expressions
Expr=(((Filter(frame=ArrowStringConversion(frame=FromMap(2e00396)), predicate=ArrowStringConversion(frame=FromMap(2e00396))['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
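The expression printed above corresponds roughly to a chain like the following (a sketch; the exact chain used in the lecture may differ):

```python
# keep only the rows where y is positive (lazy)
df2 = df[df.y > 0]

# select the name and x columns, then take the standard deviation of x (still lazy)
df3 = df2[['name', 'x']]['x'].std()
df3
```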
Notice that the value of df3 is still not shown; we need to call the `.compute()` method to display the result.

We can also group the data, for example computing an aggregate of the x column and the maximum of the y column by the name column.

If we want to reuse an intermediate result, we can keep it in memory with the `.persist()` method, and the data will then be available for future computations.

We can use the `resample` method to aggregate the data by a time period, for example the hourly mean of the x and y columns (a sketch of these operations follows the table):

| | x | y |
|---|---|---|
| timestamp | ||
| 2000-01-01 00:00:00 | 2.28e-03 | -0.02 |
| 2000-01-01 01:00:00 | 3.23e-03 | -0.01 |
| 2000-01-01 02:00:00 | 1.40e-02 | 0.02 |
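A sketch of how the operations described above might look, assuming the DataFrame `df` from before (the exact aggregations used in the lecture may differ):

```python
# force the lazy result to be computed and displayed
df3.compute()

# group by name: e.g. the sum of x and the maximum of y for each name
df.groupby('name').agg({'x': 'sum', 'y': 'max'}).compute()

# keep an intermediate result in memory for later reuse
df_filtered = df[df.y > 0].persist()

# hourly mean of the x and y columns
df[['x', 'y']].resample('1h').mean().compute()
```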
We can use the `rolling` method to calculate a rolling mean of the data, as in the sketch below.
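For example, a one-hour rolling mean might be computed like this (the window size is illustrative):

```python
# rolling mean over a one-hour window for the numeric columns
df[['x', 'y']].rolling('1h').mean().compute()
```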
Install dask if you haven’t done so yet:

```python
!pip install dask
```

To run SQL queries against Dask DataFrames we also need the dask-sql package:

```python
!pip install dask-sql
```

A `dask_sql.Context` is the Python equivalent to a SQL database. Usually a single `Context` is created and used for the duration of a Python script or notebook. Once a `Context` has been created, there are many ways to register tables in it, for example the `create_table` method. Here we register our DataFrame as the timeseries table and run a query that aggregates the x and y columns for each name (a sketch of the code follows the table):

| | x | y |
|---|---|---|
| name | ||
| Alice | -324.03 | -2.83e-03 |
| Bob | 6.97 | -5.11e-04 |
| Charlie | 28.39 | -3.17e-03 |
| Dan | 248.15 | 1.27e-03 |
| Edith | -1.04 | 1.92e-03 |
| ... | ... | ... |
| Victor | 237.12 | -9.49e-04 |
| Wendy | 302.69 | -2.44e-03 |
| Xavier | 23.48 | -2.97e-03 |
| Yvonne | -258.51 | 2.46e-03 |
| Zelda | 50.05 | -2.69e-04 |
26 rows × 2 columns
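A sketch of what the dask-sql code might look like (the particular aggregates, SUM and AVG, are assumptions; the lecture's query may differ):

```python
from dask_sql import Context

# one Context per script or notebook
c = Context()

# register the Dask DataFrame under the name "timeseries"
c.create_table("timeseries", df)

# queries return lazy Dask DataFrames, so we still call .compute() at the end
result = c.sql("""
    SELECT name, SUM(x) AS x, AVG(y) AS y
    FROM timeseries
    GROUP BY name
""")
result.compute()
```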
The .csv format is very common in data science (and for good reasons). Dask handles .csv files very well, but it is not the best option for large files. Here is the DataFrame we will write out as a set of .csv files:

| | name | id | x | y |
|---|---|---|---|---|
| npartitions=30 | ||||
| 2000-01-01 | object | int64 | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |
First we make sure the data directory exists and define a helper that names each output file by its date, then we write one .csv file per partition:

```python
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """ Provide date for filename given index

    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name);
```

This writes one file into the data directory for each day in the month of January 2000. We can read them all back with the `dd.read_csv` function (a sketch of the call follows the table):

| | timestamp | name | id | x | y |
|---|---|---|---|---|---|
| npartitions=30 | |||||
| | object | object | int64 | float64 | float64 |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
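The read call that could produce the structure above might look like this (a sketch; variable names are illustrative):

```python
import dask.dataframe as dd

# read all the daily .csv files back into a single Dask DataFrame
df_csv = dd.read_csv('data/*.csv')
df_csv
```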
While .csv files are nice, newer formats like Parquet are gaining popularity. Parquet stores data by column, so we can read back only the columns we need, for example name and x (a sketch of the code follows the table):

| | name | x |
|---|---|---|
| npartitions=30 | ||
| | object | float64 |
| | ... | ... |
| ... | ... | ... |
| | ... | ... |
| | ... | ... |
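A sketch of the Parquet round trip that could produce the selection above (the file path and variable names are illustrative):

```python
# write the DataFrame to a directory of Parquet files
df.to_parquet('data/timeseries.parquet')

# read back only the columns we need; because Parquet is column-oriented,
# the unused columns are never loaded from disk
df_parquet = dd.read_parquet('data/timeseries.parquet', columns=['name', 'x'])
df_parquet
```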
We can also parallelise our own Python code with the `dask.delayed` function, or equivalently by applying the `@dask.delayed` decorator to a function:

```python
import numpy as np
import dask

@dask.delayed
def delayed_calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

results = []
for _ in range(5):
    results.append(delayed_calculation())

# Compute all results at once
%timeit final_results = dask.compute(*results)
```

822 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can also chain delayed functions together to build a pipeline and call the `.compute()` method at the end:

```python
@dask.delayed
def generate_data(size):
    return np.random.rand(size)

@dask.delayed
def transform_data(data):
    return np.sqrt(data) + np.sin(data)

@dask.delayed
def aggregate_data(data):
    return {
        'mean': np.mean(data),
        'std': np.std(data),
        'max': np.max(data)
    }

# Compare execution
sizes = [1000000, 2000000, 3000000]

# Dask execution
dask_results = []
for size in sizes:
    data = generate_data(size)
    transformed = transform_data(data)
    stats = aggregate_data(transformed)
    dask_results.append(stats)

%timeit dask.compute(*dask_results)
```

33.5 ms ± 361 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use `.compute` and `.persist` sparingly: these functions can be expensive, so use them only when you need to.

72.8 ms ± 303 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.45 s ± 83.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
565 μs ± 20 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
4.59 ms ± 51.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)