DATASCI 350 - Data Science Computing

Lecture 20 - Parallel Computing

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello again, my friends! 😊

Brief recap 📚

SQL joins and set operations

  • Last class we covered advanced SQL table operations:
    • INNER, LEFT, and FULL OUTER JOIN to combine tables horizontally
    • CROSS JOIN to produce every combination of two tables
    • SELF JOIN to compare rows within the same table
    • UNION, INTERSECT, and EXCEPT to combine query results vertically
    • UPSERT with INSERT ... ON CONFLICT to handle duplicates
    • VIEWS as saved queries you can treat like tables

Today’s agenda 📅

Lecture outline

  • Serial vs parallel execution
    • The map function and why it matters
    • Embarrassingly parallel problems
    • Big O notation and what parallelism can (and cannot) fix
  • joblib for single-node parallelism
    • Parallel and delayed
    • Timing serial vs parallel with %timeit
  • Dask for scalable computing
    • Lazy evaluation: arrays and DataFrames
    • Reading and writing CSV and Parquet files
    • dask.delayed for custom pipelines
  • Best practices: when to parallelise and when not to

Funny AI news of the day 😂

  • Anthropic shipped a debug file with 512K lines of source code inside a routine update (Decrypt)
  • The cause? They forgot to add it to .npmignore. So it shipped to every user who ran npm install
  • That’s why I keep telling you to use .gitignore properly! 😅
  • Spotted within minutes, posted on X
  • 21 million views before the team woke up
  • Code mirrored across GitHub; DMCA takedowns couldn’t keep up
  • Sigrid Jin (most active Claude Code user in the world, 25B tokens/year per the WSJ) rewrote it all in Python before sunrise
  • His repo hit 50K stars in 2 hours 🚀
  • Anthropic had built “Undercover Mode” to stop Claude from leaking secrets. Then a human leaked their own code 🤦🏻‍♂️

Serial vs parallel algorithms

Serial execution

  • Typical programs run lines one after another:
# Import packages
import numpy as np

# Define an array of numbers
foo = np.array([0, 1, 2, 3, 4, 5])

# Define a function that squares numbers
def bar(x):
    return x * x

# Loop over each element and perform an action on it
for element in foo:
    # Print the result of bar
    print(bar(element))
0
1
4
9
16
25

The map function

  • One tool that we will use later is called map
  • This lets us apply a function to each element in a list or array:
# (Very) inefficient function
def my_map(function, array):
    # create a container for the results
    output = []

    # loop over each element
    for element in array:
        
        # add the intermediate result
        output.append(function(element))
    
    # return the now-filled container
    return output
my_map(bar, foo)
[np.int64(0),
 np.int64(1),
 np.int64(4),
 np.int64(9),
 np.int64(16),
 np.int64(25)]
  • Python provides a built-in map function (no import needed):
list(map(bar, foo))
[np.int64(0),
 np.int64(1),
 np.int64(4),
 np.int64(9),
 np.int64(16),
 np.int64(25)]
  • The built-in map function is much faster than mine (it’s implemented in C), so of course you should use that one! 😂

Using joblib for parallel computing

  • Each step of our map call is independent, so it is perfect for parallelism
  • joblib makes this easy. Two things to know:
    • Parallel(n_jobs=k) runs k tasks at the same time
    • delayed(f) wraps f so joblib can schedule it
  • Combine them with a generator expression (see right)
  • Install with pip install joblib
  • n_jobs=-1 uses all available CPU cores automatically
  • joblib handles the process creation, data transfer, and result collection behind the scenes
  • It works best when each task takes at least a few milliseconds; for very cheap operations, the overhead of spawning processes can be larger than the speedup
  • Using our bar function and foo array from before:
# Install joblib if you haven't done so yet
# !pip install joblib

# Import joblib functions 
from joblib import Parallel, delayed

results = Parallel(n_jobs=6)(
    delayed(bar)(x) for x in foo
)
results
[np.int64(0),
 np.int64(1),
 np.int64(4),
 np.int64(9),
 np.int64(16),
 np.int64(25)]
  • joblib starts up to 6 worker processes and applies bar to different elements of foo at the same time
  • The results are the same as before, but the computation runs in parallel
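As a quick, illustrative sketch of the n_jobs=-1 bullet above (assuming joblib is installed; bar is redefined here so the snippet is self-contained):

```python
import os
from joblib import Parallel, delayed

def bar(x):
    return x * x

# Number of cores n_jobs=-1 would use on this machine
print(os.cpu_count())

# n_jobs=-1 lets joblib claim all available cores for you
results = Parallel(n_jobs=-1)(delayed(bar)(x) for x in range(6))
```
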

Serial vs parallel execution

  • calculation runs 10 heavy operations on 10 million numbers
  • Each call is fully independent: embarrassingly parallel
  • We time three things with %timeit:
    • one call (baseline)
    • four calls in serial
    • four calls in parallel
  • One call:
def calculation(size=10000000):
    # Create a large array and perform operations
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

# Single run
%timeit calculation()
450 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Four calls, serial:
# Sequential runs (4 times)
%timeit [calculation() for _ in range(4)]
1.78 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Four calls, parallel:
%%timeit
# Parallel runs (4 times)
Parallel(n_jobs=4)(
    delayed(calculation)() for _ in range(4)
)
589 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Not exactly 4× faster (there is some overhead) but the speedup is real

Big O notation

Big O notation: why it matters for parallel computing

  • Big O notation describes how runtime grows as the input grows
  • O(1): constant, does not change with input size (e.g., reading one array element)
  • O(n): linear, doubles when input doubles (e.g., a single loop)
  • O(n²): quadratic, quadruples when input doubles (e.g., nested loops)
  • Our calculation function called on n inputs is O(n)
    • 2 calls take ~2× as long as 1 call, 100 calls take ~100× as long
Code
import matplotlib.pyplot as plt
import numpy as np

# Simulate processing times
num_images = np.array([1, 10, 50, 100, 200])
sequential_time = num_images * 2  # 2 seconds per image
parallel_time = (num_images * 2) / 4  # 4 cores, ideal speedup

plt.figure(figsize=(8, 5))
plt.plot(num_images, sequential_time, 'o-', label='Serial O(n)', linewidth=2, markersize=8)
plt.plot(num_images, parallel_time, 's-', label='Parallel O(n/4)', linewidth=2, markersize=8)
plt.xlabel('Number of Images', fontsize=12)
plt.ylabel('Time (seconds)', fontsize=12)
plt.title('O(n) Scaling: Serial vs Parallel', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()
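To make the growth rates concrete, you can count operations instead of timing them. This is a hypothetical sketch (not part of the lecture code): doubling n doubles the linear count but quadruples the quadratic one.

```python
def linear_ops(n):
    # O(n): one pass over the input
    return sum(1 for _ in range(n))

def quadratic_ops(n):
    # O(n^2): a nested pass over every pair of elements
    return sum(1 for _ in range(n) for _ in range(n))

print(linear_ops(1000), linear_ops(2000))        # 1000 2000
print(quadratic_ops(1000), quadratic_ops(2000))  # 1000000 4000000
```
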

What parallelism does to complexity

  • With p cores, an O(n) problem becomes O(n/p + overhead)
    • 4 cores: theoretically 4× faster, in practice slightly less due to overhead
  • Parallelism never changes the Big O class, only the constant
    • O(n²) stays O(n²); you just reach “too slow” later
  • Best candidates: embarrassingly parallel O(n) tasks
    • Running n independent calculations
    • Applying the same function to n inputs
    • No shared state, no dependencies between tasks
  • Poor candidates for parallelism:
# O(n²): touches every pair of elements; parallelism only shrinks the constant
def pairwise_sum(data):
    n = len(data)
    results = []
    for i in range(n):
        for j in range(n):
            results.append(data[i] + data[j])
    return results

data = list(range(1000))
# Don't run: ~1 million operations
# pairwise_sum(data)
  • Other bad candidates: steps that depend on previous results, algorithms with shared state, memory-bound problems
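The "steps that depend on previous results" case can be sketched with a running total (a hypothetical example, not from the lecture): each iteration needs the previous one, so the loop cannot be split across workers.

```python
def running_total(data):
    # Each step depends on the result of the previous step,
    # so the iterations cannot run at the same time
    total = 0
    out = []
    for x in data:
        total += x
        out.append(total)
    return out

running_total([1, 2, 3, 4])  # [1, 3, 6, 10]
```
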

Try it yourself! 🧠

  • Install joblib and NumPy if you haven’t done so yet
    • !pip install joblib numpy
  • Compare the time of the serial and parallel versions of the following function:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
  • Then try the parallel version:
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
  • What did you find? Is the parallel version faster?
  • Appendix 01

Dask

Dask

  • Dask is a parallel computing library for Python
  • It gives you larger-than-memory versions of NumPy arrays and Pandas dataframes
  • Write familiar code; Dask handles the parallelism
  • Two main components:
    • Task scheduling: splits your work into a graph of small tasks and runs them in parallel
    • Big data collections: arrays, dataframes, and lists that look like NumPy/Pandas but can hold more data than your RAM
  • Design goals:
    • Familiar: same NumPy/Pandas API
    • Native: pure Python
    • Fast: low overhead on small and large data

Dask arrays

  • Let’s import Dask and see how it works
import dask.dataframe as dd
import dask.array as da
  • Dask arrays are a parallel version of NumPy arrays
  • They split a large array into chunks, each of which is a regular NumPy array
  • Dask processes chunks in parallel across your CPU cores
  • You choose the chunk size when creating the array (here: 100×100 blocks)
  • The API is the same as NumPy, so a.sum(), a.mean(), slicing, etc. all work
  • Let’s create a Dask array from a large NumPy array; a becomes a lazy wrapper around the original data, split into 100×100 chunks:
data = np.random.normal(size=100000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a
            Array        Chunk
Bytes       781.25 kiB   78.12 kiB
Shape       (200, 500)   (100, 100)
Dask graph  10 chunks in 1 graph layer
Data type   float64 numpy.ndarray

Dask arrays

  • Dask arrays are lazy: they do not compute anything until you call .compute()
  • This lets you build up a full computation graph (an execution plan) before any work runs
  • Dask can then optimise that graph and minimise memory use before executing
  • For example, let’s slice the Dask array a to get the first 10 rows of the 6th column
a[:10, 5] # first 10 rows of the 6th column
            Array    Chunk
Bytes       80 B     80 B
Shape       (10,)    (10,)
Dask graph  1 chunks in 2 graph layers
Data type   float64 numpy.ndarray
  • We use the .compute() method to compute the result
a[:10, 5].compute()
array([ 0.55988339,  0.59055209, -0.98667922,  0.30260127, -0.31109429,
       -0.22151789, -1.72475807,  0.98976483,  0.05664527, -0.1717961 ])

Dask arrays

  • Dask Array supports many common NumPy operations including:
  • Basic arithmetic and scalar mathematics (+, *, exp, log, etc)
  • Reductions along axes (sum(), mean(), std())
  • Tensor operations and matrix multiplication (tensordot)
  • Array slicing and basic indexing
  • Axis reordering and transposition
a.sum().compute()
np.float64(483.0683227278738)
  • Let’s compare a similar operation in NumPy:
size = 100000000
np_arr = np.random.random(size)
%timeit np_result = np.sqrt(np_arr) + np.sin(np_arr)
507 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • And in Dask:
da_arr = da.random.random(size, chunks='auto')
%timeit (da.sqrt(da_arr) + da.sin(da_arr)).compute()
377 ms ± 83.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dask dataframes

  • Dask also provides a parallelised version of Pandas dataframes
  • They are composed of many Pandas dataframes, split along the index
  • Let’s jump into an example:
import dask
df = dask.datasets.timeseries()
  • This is a small dataset of about 240 MB
  • Unlike Pandas, Dask DataFrames are also lazy
  • No data is printed here; instead it is replaced by ellipses (...)
df
Dask DataFrame Structure:
name id x y
npartitions=30
2000-01-01 string int64 float64 float64
2000-01-02 ... ... ... ...
... ... ... ... ...
2000-01-30 ... ... ... ...
2000-01-31 ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
  • Nonetheless, the column names and dtypes are known
df.dtypes
name    string[pyarrow]
id                int64
x               float64
y               float64
dtype: object

Dask dataframes

  • Dask DataFrames support a large subset of the Pandas API
  • Some operations will automatically display the data
import pandas as pd

pd.options.display.precision = 2
pd.options.display.max_rows = 10

df.head()
name id x y
timestamp
2000-01-01 00:00:00 Oliver 968 -0.29 -0.19
2000-01-01 00:00:01 Ray 1017 0.68 0.75
2000-01-01 00:00:02 George 968 0.39 -0.86
2000-01-01 00:00:03 Zelda 1020 -0.74 -0.97
2000-01-01 00:00:04 Laura 995 -0.99 -0.02
  • Here we filter rows where y > 0, then compute the standard deviation of x per group:
df2 = df[df.y > 0]
df3 = df2.groupby("name").x.std()
df3
Dask Series Structure:
npartitions=1
    float64
        ...
Dask Name: getitem, 8 expressions
Expr=(((Filter(frame=ArrowStringConversion(frame=Timeseries(827f5a6)), predicate=ArrowStringConversion(frame=Timeseries(827f5a6))['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
  • Note that df3 is still not shown
  • We can use the .compute() method to display the result
df3.compute()
name
Bob         0.58
Dan         0.58
Edith       0.58
George      0.58
Hannah      0.58
            ... 
Patricia    0.58
Tim         0.58
Ursula      0.58
Victor      0.58
Xavier      0.58
Name: x, Length: 26, dtype: float64

Dask dataframes

  • Aggregations are also supported
  • Here we compute the sum of x and the maximum of y, grouped by name
df4 = df.groupby("name").aggregate({"x": "sum", "y": "max"})
df4.compute()
x y
name
Alice -70.86 1.0
Sarah 348.15 1.0
Ingrid 174.17 1.0
Patricia 68.24 1.0
Kevin 37.70 1.0
... ... ...
Dan -50.53 1.0
George 18.53 1.0
Oliver -363.98 1.0
Edith -343.47 1.0
Xavier 10.60 1.0

26 rows × 2 columns

  • If you have enough RAM for the whole dataset, you can keep it in memory with .persist()
  • Once persisted, future computations on that data skip the re-loading step
df5 = df4.persist()
df5.head()
x y
name
Alice -70.86 1.0
Sarah 348.15 1.0
Ingrid 174.17 1.0
Patricia 68.24 1.0
Kevin 37.70 1.0

Combining Dask and Pandas

  • Dask and Pandas work together naturally
  • Use Dask for the heavy lifting (large data, parallelism), then call .compute() to get a regular Pandas DataFrame for final steps
  • This is the typical real-world pattern:
    1. Load and filter with Dask (fast, parallel)
    2. Aggregate to a small result
    3. Call .compute() to convert to Pandas
    4. Use Pandas for plotting, formatting, or export
  • In short: Dask handles what doesn’t fit in memory, Pandas handles everything else
  • Most Dask code looks almost identical to Pandas, so switching between them is easy
import dask
import dask.dataframe as dd

df = dask.datasets.timeseries()

# Step 1-2: filter and aggregate with Dask
summary = (
    df[df.x > 0]
    .groupby("name")
    .agg({"x": "mean", "y": "std"})
)

# Step 3: bring to Pandas
pdf = summary.compute()

# Step 4: use Pandas normally
pdf.sort_values("x", ascending=False).head(5)
x y
name
George 0.5 0.58
Charlie 0.5 0.58
Zelda 0.5 0.58
Yvonne 0.5 0.58
Frank 0.5 0.58

Try it yourself! 🧠

  • Install dask if you haven’t done so yet
    • !pip install dask
  • Find the right chunk size!
  • Create a Dask array with 10 million random numbers (or fewer if you have memory constraints)
  • Vary the chunk size and time the following operation:
  • Calculate mean(sqrt(x^2)) on the Dask array
  • See the code below for an example with three different chunk sizes. Which one worked best for you? Why do you think that is?

import numpy as np
import dask.array as da

size = 10_000_000

# Dask with SMALL chunks
da_data_small = da.random.random(size, chunks=100_000)  # 100 chunks
%timeit da.sqrt(da_data_small**2).mean().compute()

# Dask with MEDIUM chunks
da_data_medium = da.random.random(size, chunks=2_000_000)  # 5 chunks
%timeit da.sqrt(da_data_medium**2).mean().compute()

# Dask with LARGE chunks
da_data_large = da.random.random(size, chunks=5_000_000)  # Only 2 chunks
%timeit da.sqrt(da_data_large**2).mean().compute()

SQL on DataFrames with DuckDB

  • DuckDB lets you run SQL queries on Pandas DataFrames
  • No server, no setup: just pip install duckdb
  • It queries your DataFrame variables by name: the Python variable pdf becomes the SQL table pdf
  • Also reads .parquet and .csv files directly in SQL
import duckdb
import dask

# Create a sample Pandas DataFrame
df = dask.datasets.timeseries()
pdf = df.head(1000)

# Query it with SQL — no registration needed
duckdb.sql("SELECT AVG(x) AS mean_x FROM pdf")
┌───────────────────────┐
│        mean_x         │
│        double         │
├───────────────────────┤
│ -0.016252195829227004 │
└───────────────────────┘
  • You can write full SQL queries with GROUP BY, joins, subqueries, etc.
duckdb.sql("""
    SELECT name,
           AVG(x)   AS mean_x,
           COUNT(*)  AS n
    FROM pdf
    GROUP BY name
    ORDER BY mean_x DESC
    LIMIT 5
""")
┌─────────┬─────────────────────┬───────┐
│  name   │       mean_x        │   n   │
│ varchar │       double        │ int64 │
├─────────┼─────────────────────┼───────┤
│ Yvonne  │  0.0973837018818252 │    39 │
│ Hannah  │ 0.09098737819584524 │    36 │
│ Ray     │ 0.08676713725913592 │    37 │
│ Tim     │ 0.08060252152376055 │    40 │
│ Norbert │ 0.05389034574545834 │    38 │
└─────────┴─────────────────────┴───────┘
  • DuckDB can also query files without loading them first:
# Read a Parquet file directly in SQL
duckdb.sql("SELECT * FROM 'data/sales.parquet'")

Read and write data with Dask

Reading and writing data

  • .csv is very common in data science (and for good reason)
  • Pandas reads .csv well, but it loads the entire file into memory
  • For large files, that can mean several gigabytes of RAM
  • Dask provides a much more efficient way to read and write .csv files
  • Let’s split our dataset into daily files:
df = dask.datasets.timeseries()
df
Dask DataFrame Structure:
name id x y
npartitions=30
2000-01-01 string int64 float64 float64
2000-01-02 ... ... ... ...
... ... ... ... ...
2000-01-30 ... ... ... ...
2000-01-31 ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    return str(datetime.date(2000, 1, 1)
               + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name);

Reading and writing data

  • We now have many CSV files in our data directory, one for each day in January 2000
  • We can read all of them as one logical dataframe using dd.read_csv
  • Dask reads the files in parallel and only loads what it needs
df = dd.read_csv('data/2000-*-*.csv')
df
Dask DataFrame Structure:
timestamp name id x y
npartitions=30
string string int64 float64 float64
... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ...
... ... ... ... ...
Dask Name: to_string_dtype, 2 expressions
  • Let’s do a simple computation
%timeit df.groupby('name').x.mean().compute()
1.07 s ± 66.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Reading and writing data

Parquet files

  • Although .csv files are nice, newer formats like Parquet are gaining popularity
  • Data are stored by column rather than by row, so you can query specific columns without reading the whole file
  • Files are typically 75% smaller than equivalent CSV
df.to_parquet('data/2000-01.parquet',
              engine='pyarrow')
  • Now we can read the parquet file (note we can select specific columns)
df = dd.read_parquet('data/2000-01.parquet',
                     columns=['name', 'x'],
                     engine='pyarrow')
df
Dask DataFrame Structure:
name x
npartitions=30
string float64
... ...
... ... ...
... ...
... ...
Dask Name: read_parquet, 1 expression
  • The same computation, now on Parquet:
%timeit df.groupby('name').x.mean().compute()
100 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why Parquet?

  • CSV is row-based: reading one column means scanning every row
  • Parquet is column-based: it reads only the columns you ask for
Feature              CSV         Parquet
Storage              Row-based   Column-based
Compression          None        Snappy/gzip
Column selection     Reads all   Reads only needed
Data types           Text only   Typed (int, float, date)
File size (1M rows)  ~100 MB     ~25 MB
import pandas as pd

# CSV: reads everything, infers types
df = pd.read_csv("sales.csv")

# Parquet: reads only what you need
df = pd.read_parquet("sales.parquet",
                     columns=["date", "revenue"])
  • Use Parquet when you have many columns but query only a few, files are larger than ~100 MB, or you share data across Python, R, and Spark
  • CSV is still fine for small files, one-off exports, or sharing with non-technical users

Dask delayed

Dask delayed

  • Sometimes you don’t want to use an entire Dask DataFrame or Dask Array
  • You may want to parallelise a single function, for instance, or a small part of your code
  • There is a way to do this with Dask: the dask.delayed function
def calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

%timeit [calculation() for _ in range(4)]
1.84 s ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • We just need to add the @dask.delayed decorator to the function
@dask.delayed
def delayed_calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

results = []
for _ in range(5):
    results.append(delayed_calculation())

# Compute all results at once
%timeit final_results = dask.compute(*results)
771 ms ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Even running five calls instead of four, the delayed version is much faster, and the function body did not change at all
  • Just remember to call dask.compute() (or .compute() on a single delayed object) at the end

Dask delayed

  • We can even visualise the computation graph
@dask.delayed
def generate_data(size):
    return np.random.rand(size)

@dask.delayed
def transform_data(data):
    return np.sqrt(data) + np.sin(data)

@dask.delayed
def aggregate_data(data):
    return {
        'mean': np.mean(data),
        'std': np.std(data),
        'max': np.max(data)
    }

# Compare execution
sizes = [1000000, 2000000, 3000000]

# Dask execution
dask_results = []
for size in sizes:
    data = generate_data(size)
    transformed = transform_data(data)
    stats = aggregate_data(transformed)
    dask_results.append(stats)

%timeit dask.compute(*dask_results)
33.5 ms ± 888 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Requires graphviz: pip install graphviz
dask.visualize(*dask_results)

Best practices

Best practices

  • Parallel computing is not always the right tool. Tips from the Dask docs:
  • Start small: NumPy or Pandas may already have a fast function. Switching to Parquet alone can be a big speedup
  • Sample first: do you really need all those TBs of data?
  • Load with Dask from the start: easier to scale up than to scale down
  • Call .compute() and .persist() sparingly: each call triggers execution, so batch your work
  • Experiment with chunk sizes: try a few, or use chunks='auto'
  • Break computations into many pieces: more pieces means more parallelism
  • Use .parquet over .csv for large datasets
  • More tips: Dask best practices
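As a sketch of the "batch your work" tip (assuming dask and numpy are installed; heavy is a hypothetical stand-in for any expensive delayed function):

```python
import dask
import numpy as np

@dask.delayed
def heavy(size):
    # A stand-in for an expensive, independent computation
    arr = np.random.rand(size)
    return np.mean(np.sqrt(arr))

tasks = [heavy(1_000_000) for _ in range(4)]

# Avoid: four separate graphs, four separate executions
# results = [t.compute() for t in tasks]

# Prefer: one call builds one shared graph, tasks run in parallel
results = dask.compute(*tasks)
```
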

And that’s all for today! 🎉

See you next time! 😊

Appendix 01

  • Here is the solution to the exercise:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
47.5 ms ± 668 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
1.84 s ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Expected result: the parallel version is slower here. square(x) is so fast that joblib’s process-spawning overhead dominates. Parallel computing pays off when each task is heavy, not trivial.
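One hedged follow-up (not part of the original solution): hand each worker one large chunk instead of one number, so the per-task work outweighs joblib’s overhead. np.array_split and square_chunk are my own illustrative choices here.

```python
import numpy as np
from joblib import Parallel, delayed

def square_chunk(chunk):
    # One task now squares hundreds of thousands of numbers at once
    return chunk ** 2

numbers = np.arange(1_000_000)
chunks = np.array_split(numbers, 4)  # one chunk per worker

parts = Parallel(n_jobs=4)(delayed(square_chunk)(c) for c in chunks)
result = np.concatenate(parts)
```
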

Back to exercise

Appendix 02

  • Create an array with 10 million random numbers and calculate: mean(sqrt(x^2))
  • Try different chunk sizes and see which one works best
size = 10000000

# Dask with SMALL chunks 
da_data_small = da.random.random(size, chunks=100000)
%timeit da.sqrt(da_data_small**2).mean().compute()
40.9 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Dask with MEDIUM chunks 
da_data_medium = da.random.random(size, chunks=2000000)
%timeit da.sqrt(da_data_medium**2).mean().compute()
17.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Dask with LARGE chunks 
da_data_large = da.random.random(size, chunks=5000000)
%timeit da.sqrt(da_data_large**2).mean().compute()
27.7 ms ± 787 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Dask with AUTO chunks 
da_data_auto = da.random.random(size, chunks='auto')
%timeit da.sqrt(da_data_auto**2).mean().compute()
49.4 ms ± 736 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Back to exercise