Lecture 20 - Parallel Computing
Joblib and Dask for parallel computing
A classic example is the map pattern: applying the same function to every element of a sequence.
# (Very) inefficient way to define a map function
def my_map(function, array):
    # create a container for the results
    output = []
    # loop over each element
    for element in array:
        # add the intermediate result to the container
        output.append(function(element))
    # return the now-filled container
    return output
joblib for parallel computing
None of the steps in the map call depends on the other steps, so we could take the function bar and apply it to each value simultaneously. We can use joblib for this purpose. Install it if you haven't done so yet:
pip install joblib
The `Parallel` function from `joblib` is used to parallelise the task across as many jobs as we want. The `n_jobs` parameter specifies the number of jobs to run in parallel. The `delayed` function is used to delay the execution of the function `bar` until the parallelisation is ready. The `results` variable will contain the output of the parallel computation. Using the `bar` function and `foo` array from before:
[0, 1, 4, 9, 16, 25]
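The `Parallel` call itself is not included above; here is a minimal sketch of what it likely looked like, assuming `bar` squares its input and `foo = list(range(6))` (both are assumptions, chosen to be consistent with the output `[0, 1, 4, 9, 16, 25]`):

```python
from joblib import Parallel, delayed

# Assumed definitions, consistent with the output shown above
def bar(x):
    return x ** 2

foo = list(range(6))

# Run bar over every element of foo using 6 parallel jobs;
# delayed(bar)(element) builds a lazy call that Parallel then executes
results = Parallel(n_jobs=6)(delayed(bar)(element) for element in foo)
print(results)  # -> [0, 1, 4, 9, 16, 25]
```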
What `joblib` is doing here is creating 6 instances of the `bar` function and applying each one to a different element of the `foo` array. Let's see another example of the difference between serial and parallel execution.
Here, we will create a NumPy array with 10 million random numbers and perform some mathematical operations on it multiple times
Each call to `calculation` is independent of the others; such tasks are called embarrassingly parallel, meaning they can be easily parallelised.
We will use the %timeit magic command to measure the time it takes to run a function
Serial:
1.84 s ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Parallel:
628 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
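The code for `calculation` and the timing calls is not included above; a minimal sketch of a comparison that could produce timings like these (the body of `calculation` and the number of calls are assumptions):

```python
import numpy as np
from joblib import Parallel, delayed

# Hypothetical stand-in for the calculation described above:
# 10 million random numbers with some repeated maths applied to them
def calculation(size=10_000_000):
    arr = np.random.rand(size)
    for _ in range(5):
        arr = np.sqrt(arr) + 1.0
    return arr.mean()

# Serial: four independent calls, one after another
%timeit [calculation() for _ in range(4)]

# Parallel: the same four calls spread across four workers
%timeit Parallel(n_jobs=4)(delayed(calculation)() for _ in range(4))
```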
Imagine we have a set of `.jpg` images that we want to perform the same actions on, like rotate by 180 degrees and convert to a different format:
from PIL import Image

def image_flipper(file_name):
    # extract the base file name
    base_name = file_name[0:-4]
    # open the file
    im = Image.open(file_name)
    # rotate by 180 degrees
    im_flipped = im.rotate(angle=180)
    # Save a PDF with a new file name
    im_flipped.save(base_name + "_flipped.pdf", format='PDF')
    return base_name + "_flipped.pdf"

We have the following files in the `data` folder:
./data/kings_cross.jpg
./data/charing_cross.jpg
./data/victoria.jpg
./data/waterloo.jpg
./data/euston.jpg
./data/fenchurch.jpg
./data/st_pancras.jpg
./data/london_bridge.jpg
./data/liverpool_street.jpg
./data/paddington.jpg
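We can hand `image_flipper` to `joblib` to process every file in the list in parallel. A minimal sketch, assuming the paths above are collected with a glob pattern and that 4 workers are used (both are assumptions):

```python
from glob import glob
from joblib import Parallel, delayed

# Collect the .jpg files listed above
file_list = sorted(glob("./data/*.jpg"))

# Apply image_flipper to each file in parallel, one call per file
flipped = Parallel(n_jobs=4)(delayed(image_flipper)(f) for f in file_list)
print(flipped)
```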
The `image_flipper` function we just defined has O(n) complexity relative to the number of images. We can visualise how serial and parallel processing scale with the number of images:
import matplotlib.pyplot as plt
import numpy as np
# Simulate processing times
num_images = np.array([1, 10, 50, 100, 200])
sequential_time = num_images * 2 # 2 seconds per image
parallel_time = (num_images * 2) / 4 # 4 cores, ideal speedup
plt.figure(figsize=(8, 5))
plt.plot(num_images, sequential_time, 'o-', label='Serial O(n)', linewidth=2, markersize=8)
plt.plot(num_images, parallel_time, 's-', label='Parallel O(n/4)', linewidth=2, markersize=8)
plt.xlabel('Number of Images', fontsize=12)
plt.ylabel('Time (seconds)', fontsize=12)
plt.title('O(n) Scaling: Serial vs Parallel', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
Serial complexity: O(n); time grows linearly with the input
Parallel complexity: O(n/p + overhead) where p = number of cores
Parallel computing can’t improve Big O complexity, only the constant factors
An O(n²) algorithm is still O(n²) when parallelised, just with a smaller constant
Parallel computing is most valuable for embarrassingly parallel O(n) problems like:
Processing n images
Running n independent simulations
Applying a function to n data points
# Example: O(n²) nested loop (not embarrassingly parallel)
def inefficient_pairwise_sum(data):
    n = len(data)
    results = []
    for i in range(n):
        for j in range(n):
            results.append(data[i] + data[j])
    return results

# This is SLOW for n=1000
data = list(range(1000))
# DON'T run this - it would take forever!
# inefficient_pairwise_sum(data)

The `joblib` module makes it simple to run such independent steps together with a single command. Install `joblib` and NumPy if you haven't done so yet:
!pip install joblib numpy
The second half of this lecture uses Dask, which scales NumPy- and pandas-style workflows by splitting the data into chunks that are processed in parallel. For example, we can slice a Dask array `a` to get the first 10 rows of the 6th column:
We call the `.compute()` method to compute the result. Dask arrays support a large portion of the NumPy interface: arithmetic and scalar mathematics (`+`, `*`, `exp`, `log`, etc), reductions (`sum()`, `mean()`, `std()`), tensor contractions (`tensordot`) and more.
526 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
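The array-creation and timing cells are not included above; here is a minimal sketch of this kind of workflow, where the array name `a`, its shape and its chunk size are all assumptions:

```python
import dask.array as da

# A chunked 2-D array of random numbers (shape and chunks are illustrative)
a = da.random.random((100_000, 100), chunks=(10_000, 100))

# Slicing is lazy: this only builds a task graph
first_rows = a[:10, 5]

# .compute() triggers the actual computation and returns a NumPy result
print(first_rows.compute())

# NumPy-style operations work the same way, e.g. a reduction
print(a.mean().compute())
```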
Dask DataFrames scale the pandas interface in the same way. The `dask.datasets.timeseries()` demo dataset gives a DataFrame with the following structure:
| | name | id | x | y |
|---|---|---|---|---|
| npartitions=30 | ||||
| 2000-01-01 | object | int64 | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |
So far only the structure and dtypes are known; calling `.head()` computes and displays the first few rows:
| | name | id | x | y |
|---|---|---|---|---|
| timestamp | ||||
| 2000-01-01 00:00:00 | Wendy | 1009 | -0.53 | -0.30 |
| 2000-01-01 00:00:01 | Michael | 978 | -0.07 | -0.26 |
| 2000-01-01 00:00:02 | Frank | 952 | -0.69 | -0.23 |
| 2000-01-01 00:00:03 | Alice | 1013 | 0.01 | 0.09 |
| 2000-01-01 00:00:04 | George | 1017 | 0.16 | 0.74 |
As an example, we can filter the rows where `y` is positive, keep the `name` and `x` columns, and take the standard deviation of the `x` column:
Dask Series Structure:
npartitions=1
float64
...
Dask Name: getitem, 7 expressions
Expr=(((Filter(frame=FromMap(9a3b17a), predicate=FromMap(9a3b17a)['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
The result `df3` is still not shown because the computation is lazy; we call the `.compute()` method to display the result. In the same way we can group by the `name` column and compute the standard deviation of the `x` column and the maximum of the `y` column. If we want to keep a dataset in memory for repeated use, we can use the `.persist()` method to do this, and then the data will be available for future computations. Because the index is a timestamp, we can use the `resample` method to aggregate the data by a time period, for example the hourly mean of the `x` and `y` columns:
| | x | y |
|---|---|---|
| timestamp | ||
| 2000-01-01 00:00:00 | -1.28e-02 | -0.02 |
| 2000-01-01 01:00:00 | 8.83e-03 | 0.01 |
| 2000-01-01 02:00:00 | 2.43e-02 | 0.01 |
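The individual code cells for these operations are not included above; here is an illustrative sketch (the exact expressions, aggregations and resampling frequency are assumptions based on the surrounding text and output):

```python
import dask

# The Dask demo timeseries DataFrame (name, id, x, y over January 2000)
df = dask.datasets.timeseries()

# Filter, select columns and reduce; everything stays lazy until .compute()
df2 = df[df.y > 0]
df3 = df2[["name", "x"]].x.std()
df3.compute()

# Group by name: standard deviation of x and maximum of y
df.groupby("name").agg({"x": "std", "y": "max"}).compute()

# Keep the filtered data in memory for future computations
df2 = df2.persist()

# Hourly mean of the x and y columns via resample on the timestamp index
df2[["x", "y"]].resample("1h").mean().compute()
```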
We can also use the `rolling()` method to calculate a rolling mean of the data. Install Dask if you haven't done so yet:
!pip install dask
Find the right chunk size!
Create a Dask array with 10 million random numbers (or less if you have memory constraints)
Vary the chunk size and time the following operation:
Calculate mean(sqrt(x^2)) on the Dask array
See the code below for an example with three different chunk sizes. Which one worked best for you? Why do you think that is?
import numpy as np
import dask.array as da
size = 10_000_000
# Dask with SMALL chunks
da_data_small = da.random.random(size, chunks=100_000) # 100 chunks
%timeit da.sqrt(da_data_small**2).mean().compute()
# Dask with MEDIUM chunks
da_data_medium = da.random.random(size, chunks=2_000_000) # 5 chunks
%timeit da.sqrt(da_data_medium**2).mean().compute()
# Dask with LARGE chunks
da_data_large = da.random.random(size, chunks=5_000_000) # Only 2 chunks
%timeit da.sqrt(da_data_large**2).mean().compute()
Dask can also be queried with SQL through the dask-sql package:
pip install dask-sql
A `dask_sql.Context` is the Python equivalent to a SQL database; typically a single `Context` is created and used for the duration of a Python script or notebook. Once a `Context` has been created, there are many ways to register tables in it, the simplest being the `create_table` method. We can then query the registered `timeseries` table with SQL, for example aggregating `x` and `y` by name:
| | x | y |
|---|---|---|
| name | ||
| Alice | -8.42 | 1.45e-03 |
| Bob | -81.65 | -1.29e-03 |
| Charlie | -65.81 | -1.68e-03 |
| Dan | -85.68 | 2.35e-03 |
| Edith | -110.51 | 3.45e-03 |
| ... | ... | ... |
| Victor | 45.52 | 1.52e-03 |
| Wendy | -143.84 | -8.10e-04 |
| Xavier | 77.07 | -8.51e-04 |
| Yvonne | -143.36 | -2.57e-03 |
| Zelda | 298.10 | 1.21e-03 |
26 rows × 2 columns
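The query cell itself is not shown; a minimal sketch of the dask-sql workflow described above (the exact SQL statement is an assumption, chosen to match the shape of the result table):

```python
import dask
from dask_sql import Context

# The demo timeseries DataFrame used throughout this lecture
df = dask.datasets.timeseries()

# A single Context acts as our "database" for the whole notebook
c = Context()

# Register the Dask DataFrame under the table name "timeseries"
c.create_table("timeseries", df)

# Run a SQL query; the result is a lazy Dask DataFrame
result = c.sql("SELECT name, SUM(x) AS x, AVG(y) AS y FROM timeseries GROUP BY name")
result.compute()
```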
The `.csv` format is very common in data science (and for good reasons). Pandas handles `.csv` files very well, but it is not the best option for large files. Dask DataFrames can read and write many `.csv` files in parallel. Starting again from the timeseries DataFrame:
| | name | id | x | y |
|---|---|---|---|---|
| npartitions=30 | ||||
| 2000-01-01 | object | int64 | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |
import os
import datetime
if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """Provide date for filename given index

    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name);
This writes one `.csv` file per partition to the `data` directory, one for each day in the month of January 2000. We can read them back in with the `dd.read_csv` function:
| | timestamp | name | id | x | y |
|---|---|---|---|---|---|
| npartitions=30 | | | | | |
| | object | object | int64 | float64 | float64 |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
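The read call is a one-liner; a sketch, assuming the per-day files written above (the `parse_dates` argument is an assumption to turn the timestamp strings back into datetimes):

```python
import dask.dataframe as dd

# Read every per-day CSV back into a single Dask DataFrame
df = dd.read_csv("data/*.csv", parse_dates=["timestamp"])
df.head()
```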
While `.csv` files are nice, newer formats like Parquet are gaining popularity. Because Parquet is column-oriented, we can read back only the columns we need, for example `name` and `x`:
| | name | x |
|---|---|---|
| npartitions=30 | | |
| | object | float64 |
| | ... | ... |
| ... | ... | ... |
| | ... | ... |
| | ... | ... |
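A sketch of the round trip to Parquet (the path `data/timeseries.parquet` is an assumption; writing Parquet also requires an engine such as pyarrow to be installed):

```python
import dask.dataframe as dd

# Write the DataFrame to Parquet, one file per partition
df.to_parquet("data/timeseries.parquet")

# Read back only the columns we need; unused columns never leave the disk
df_parquet = dd.read_parquet("data/timeseries.parquet", columns=["name", "x"])
df_parquet.head()
```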
We can also parallelise arbitrary Python code with the `dask.delayed` function. The simplest way is to apply the `@dask.delayed` decorator to the function:
import dask
import numpy as np

@dask.delayed
def delayed_calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

results = []
for _ in range(5):
    results.append(delayed_calculation())

# Compute all results at once
%timeit final_results = dask.compute(*results)
917 ms ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that no work happens until we call the `.compute()` method at the end. We can chain several delayed functions together and compute the whole pipeline in one go:
@dask.delayed
def generate_data(size):
    return np.random.rand(size)

@dask.delayed
def transform_data(data):
    return np.sqrt(data) + np.sin(data)

@dask.delayed
def aggregate_data(data):
    return {
        'mean': np.mean(data),
        'std': np.std(data),
        'max': np.max(data)
    }

# Compare execution
sizes = [1000000, 2000000, 3000000]

# Dask execution
dask_results = []
for size in sizes:
    data = generate_data(size)
    transformed = transform_data(data)
    stats = aggregate_data(transformed)
    dask_results.append(stats)

%timeit dask.compute(*dask_results)
34.2 ms ± 630 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Some best practices:
Use `.compute()` and `.persist()` sparingly: these functions can be expensive, so use them only when you need to
Choose a sensible chunk size (use `chunks='auto'` if you're unsure)
Use `.parquet` files for large datasets: they are much more efficient than `.csv` files
76.8 ms ± 3.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.49 s ± 98.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
mean(sqrt(x^2))
35 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
20.8 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.4 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
51.8 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)