Lecture 20 - Parallel Computing
- `INSERT ... ON CONFLICT` to handle duplicates
- The `map` function and why it matters
- `joblib` for single-node parallelism
- `Parallel` and `delayed`
- Timing code with `%timeit`
- Dask for scalable computing
- `dask.delayed` for custom pipelines

(Aside: it was not excluded by `.npmignore`, so it shipped to every user who ran `npm install`. Set up your `.gitignore` properly! 😅)

There is a `map` function in the Python standard library. The built-in `map` function is much faster than mine (it's implemented in C), so of course you should use that one! 😂

`joblib` for parallel computing:

- Each `map` call is independent, so it is perfect for parallelism
- `joblib` makes this easy. Two things to know:
  - `Parallel(n_jobs=k)` runs `k` tasks at the same time
  - `delayed(f)` wraps `f` so joblib can schedule it
- Install with `pip install joblib`
- `n_jobs=-1` uses all available CPU cores automatically

Using the `bar` function and `foo` array from before, `joblib` creates 6 instances of `bar` and applies each one to a different element of `foo`.

`calculation` runs 10 heavy operations on 10 million numbers. Time it with `%timeit`.
The `calculation` function called on `n` inputs is O(n).
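A minimal sketch of the pattern above; the slide's `bar` function and `foo` array are not shown here, so their definitions (squaring the numbers 0 through 5) are assumptions:

```python
from joblib import Parallel, delayed

def bar(x):
    # Hypothetical stand-in for the slide's bar function
    return x ** 2

foo = [0, 1, 2, 3, 4, 5]

# Serial baseline: the built-in map applies bar to each element in turn
serial = list(map(bar, foo))

# Parallel version: joblib schedules one bar(x) task per element of foo,
# spread across all available CPU cores (n_jobs=-1)
parallel = Parallel(n_jobs=-1)(delayed(bar)(x) for x in foo)

print(serial)    # [0, 1, 4, 9, 16, 25]
print(parallel)  # identical result
```

For a function this cheap, the parallel version is typically slower, because spawning worker processes costs more than the work itself.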
import matplotlib.pyplot as plt
import numpy as np
# Simulate processing times
num_images = np.array([1, 10, 50, 100, 200])
sequential_time = num_images * 2 # 2 seconds per image
parallel_time = (num_images * 2) / 4 # 4 cores, ideal speedup
plt.figure(figsize=(8, 5))
plt.plot(num_images, sequential_time, 'o-', label='Serial O(n)', linewidth=2, markersize=8)
plt.plot(num_images, parallel_time, 's-', label='Parallel O(n/4)', linewidth=2, markersize=8)
plt.xlabel('Number of Images', fontsize=12)
plt.ylabel('Time (seconds)', fontsize=12)
plt.title('O(n) Scaling: Serial vs Parallel', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

With p cores, an O(n) problem becomes O(n/p + overhead).
Exercise: run `n` independent calculations on `n` inputs. Install `joblib` and NumPy if you haven't done so yet:
!pip install joblib numpy

Dask (website: https://www.dask.org/)
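Dask arrays mirror the NumPy API while staying lazy. A small sketch (the 1000×1000 shape is an assumption; the 100×100 chunking matches the description below):

```python
import numpy as np
import dask.array as da

# Wrap a NumPy array in a lazy Dask array split into 100x100 chunks
x = np.random.random((1000, 1000))
a = da.from_array(x, chunks=(100, 100))

# These calls only build a task graph; no work happens yet
total = a.sum()
first = a[:10, 5]   # first 10 rows of the 6th column

# .compute() runs the graph and returns plain NumPy results
print(total.compute())
print(first.compute())
```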
- `a` is a lazy wrapper around the original NumPy array, split into 100×100 chunks
- `a.sum()`, `a.mean()`, slicing, etc. all work
- Slice `a` to get the first 10 rows of the 6th column, then call the `.compute()` method to compute the result
- Element-wise operations (`+`, `*`, `exp`, `log`, etc.)
- Reductions (`sum()`, `mean()`, `std()`)
- Tensor contractions (`tensordot`)
- ...

| timestamp | name | id | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | Oliver | 968 | -0.29 | -0.19 |
| 2000-01-01 00:00:01 | Ray | 1017 | 0.68 | 0.75 |
| 2000-01-01 00:00:02 | George | 968 | 0.39 | -0.86 |
| 2000-01-01 00:00:03 | Zelda | 1020 | -0.74 | -0.97 |
| 2000-01-01 00:00:04 | Laura | 995 | -0.99 | -0.02 |
Select the rows where y > 0, then compute the standard deviation of x per group:

Dask Series Structure:
npartitions=1
float64
...
Dask Name: getitem, 8 expressions
Expr=(((Filter(frame=ArrowStringConversion(frame=Timeseries(827f5a6)), predicate=ArrowStringConversion(frame=Timeseries(827f5a6))['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
- `df3` is still not shown; call the `.compute()` method to display the result
- Exercise: compute the mean of `x` and the maximum of `y`, grouped by `name`
- Do the heavy filtering and aggregation in Dask, then call `.compute()` to get a regular Pandas DataFrame for the final steps

The same workflow in code, ending with `.compute()` to convert to Pandas:

import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
# Step 1-2: filter and aggregate with Dask
summary = (
df[df.x > 0]
.groupby("name")
.agg({"x": "mean", "y": "std"})
)
# Step 3: bring to Pandas
pdf = summary.compute()
# Step 4: use Pandas normally
pdf.sort_values("x", ascending=False).head(5)

| name | x | y |
|---|---|---|
| George | 0.5 | 0.58 |
| Charlie | 0.5 | 0.58 |
| Zelda | 0.5 | 0.58 |
| Yvonne | 0.5 | 0.58 |
| Frank | 0.5 | 0.58 |
Install dask if you haven't done so yet:

!pip install dask

Find the right chunk size!
Create a Dask array with 10 million random numbers (or less if you have memory constraints)
Vary the chunk size and time the following operation:
Calculate mean(sqrt(x^2)) on the Dask array
See the code below for an example with three different chunk sizes. Which one worked best for you? Why do you think that is?
import numpy as np
import dask.array as da
size = 10_000_000
# Dask with SMALL chunks
da_data_small = da.random.random(size, chunks=100_000) # 100 chunks
%timeit da.sqrt(da_data_small**2).mean().compute()
# Dask with MEDIUM chunks
da_data_medium = da.random.random(size, chunks=2_000_000) # 5 chunks
%timeit da.sqrt(da_data_medium**2).mean().compute()
# Dask with LARGE chunks
da_data_large = da.random.random(size, chunks=5_000_000) # Only 2 chunks
%timeit da.sqrt(da_data_large**2).mean().compute()

DuckDB:

- Install with `pip install duckdb`
- A Pandas DataFrame `pdf` becomes the SQL table `pdf`
- You can query `.parquet` and `.csv` files directly in SQL

┌───────────────────────┐
│ mean_x │
│ double │
├───────────────────────┤
│ -0.016252195829227004 │
└───────────────────────┘
┌─────────┬─────────────────────┬───────┐
│ name │ mean_x │ n │
│ varchar │ double │ int64 │
├─────────┼─────────────────────┼───────┤
│ Yvonne │ 0.0973837018818252 │ 39 │
│ Hannah │ 0.09098737819584524 │ 36 │
│ Ray │ 0.08676713725913592 │ 37 │
│ Tim │ 0.08060252152376055 │ 40 │
│ Norbert │ 0.05389034574545834 │ 38 │
└─────────┴─────────────────────┴───────┘
- `.csv` is very common in data science (and for good reasons)
- Pandas handles `.csv` well, but it loads the entire file into memory
- There are `.csv` files in the `data` directory, one for each day in January 2000
- Read them all at once with `dd.read_csv`
- While `.csv` files are nice, newer formats like Parquet are gaining popularity

| Feature | CSV | Parquet |
|---|---|---|
| Storage | Row-based | Column-based |
| Compression | None | Snappy/gzip |
| Column selection | Reads all | Reads only needed |
| Data types | Text only | Typed (int, float, date) |
| File size (1M rows) | ~100 MB | ~25 MB |
The `dask.delayed` function makes any Python function lazy: apply the `@dask.delayed` decorator to the function:

@dask.delayed
def delayed_calculation(size=10000000):
arr = np.random.rand(size)
for _ in range(10):
arr = np.sqrt(arr) + np.sin(arr)
return np.mean(arr)
results = []
for _ in range(5):
results.append(delayed_calculation())
# Compute all results at once
%timeit final_results = dask.compute(*results)

771 ms ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Nothing runs until you call the `.compute()` method at the end:

@dask.delayed
def generate_data(size):
return np.random.rand(size)
@dask.delayed
def transform_data(data):
return np.sqrt(data) + np.sin(data)
@dask.delayed
def aggregate_data(data):
return {
'mean': np.mean(data),
'std': np.std(data),
'max': np.max(data)
}
# Compare execution
sizes = [1000000, 2000000, 3000000]
# Dask execution
dask_results = []
for size in sizes:
data = generate_data(size)
transformed = transform_data(data)
stats = aggregate_data(transformed)
dask_results.append(stats)
%timeit dask.compute(*dask_results)

33.5 ms ± 888 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Best practices:

- Use `.compute()` and `.persist()` sparingly: each call triggers execution, so batch your work
- When in doubt, let Dask choose chunk sizes with `chunks='auto'`
- Prefer `.parquet` over `.csv` for large datasets

Exercise timings, serial vs. parallel:

47.5 ms ± 668 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.84 s ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`square(x)` is so fast that joblib's process-spawning overhead dominates. Parallel computing pays off when each task is heavy, not trivial.

Timings for `mean(sqrt(x^2))` with different chunk sizes:

40.9 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
17.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.7 ms ± 787 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
49.4 ms ± 736 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)