QTM 350 - Data Science Computing

Lecture 21 - Parallel Computing

Danilo Freire

Emory University

13 November, 2024

Hello, my friends! 😊

Today’s agenda 📅

Lecture outline

  • Serial vs Parallel Algorithms
  • Python implementations of parallelism
    • Single node
    • Multi-node
  • Joblib and Dask for parallel computing
  • Tools for further exploration

Serial vs Parallel Algorithms

Serial Execution

  • Typical programs execute their lines sequentially:
# Import packages
import numpy as np

# Define an array of numbers
foo = np.array([0, 1, 2, 3, 4, 5])

# Define a function that squares numbers
def bar(x):
    return x * x

# Loop over each element and perform an action on it
for element in foo:
    # Print the result of bar for this element
    print(bar(element))
0
1
4
9
16
25

The map function

  • A key tool that we will utilise later is called map
  • This lets us apply a function to each element in a list or array:
# (Very) inefficient way to define a map function
def my_map(function, array):
    # create a container for the results
    output = []

    # loop over each element
    for element in array:
        
        # add the intermediate result to the container
        output.append(function(element))
    
    # return the now-filled container
    return output
my_map(bar, foo)
[0, 1, 4, 9, 16, 25]
  • Python helpfully provides a map function in the standard library:
list(map(bar, foo))
[0, 1, 4, 9, 16, 25]
  • The built-in map function is much more flexible and fully featured than ours, so of course you should use that one! 😂
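  • For instance, the built-in map can walk over several iterables at once, something our toy version cannot do; a quick illustration:
# map applies the function element-wise across the two lists
list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))
[11, 22, 33]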

Using joblib for parallel computing

  • In the example we showed before, no step of the map call depends on the other steps
  • Rather than waiting for the function to loop over each value, we could create multiple instances of the function bar and apply it to each value simultaneously
  • There are several methods to achieve this, but we will use joblib for this purpose
  • The Parallel function from joblib is used to parallelise the task across as many jobs as we want
  • The n_jobs parameter specifies the number of jobs to run in parallel
  • The delayed function wraps bar so that each call becomes a lazy task, executed only once the parallel workers are ready
  • The results variable will contain the output of the parallel computation
  • Using our bar function and foo array from before:
from joblib import Parallel, delayed

results = Parallel(n_jobs=6)(delayed(bar)(x) for x in foo)
results
[0, 1, 4, 9, 16, 25]
  • What joblib is doing here is creating up to 6 worker processes and applying bar to different elements of the foo array at the same time
  • As you can see, the results are the same as before
  • The difference is that the computation is now done in parallel
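  • As an aside, joblib can also run the jobs in threads instead of separate processes and report progress while it works; a minimal sketch:
# Same computation, using threads instead of worker processes and printing progress
# (prefer='threads' and verbose are standard joblib.Parallel arguments)
Parallel(n_jobs=6, prefer='threads', verbose=10)(delayed(bar)(x) for x in foo)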

Serial vs parallel execution

  • Let’s see another example of the difference between serial and parallel execution

  • We will use the %timeit magic command to measure the time it takes to run a function

  • Serial:

def calculation(size=10000000):
    # Create a large array and perform operations
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

# Single run
%timeit calculation()
451 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Now let’s run the same function 4 times:
# Sequential runs (4 times)
%timeit [calculation() for _ in range(4)]
1.79 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Now let’s see the parallel version:
# Parallel runs (4 times)
%timeit Parallel(n_jobs=4)(delayed(calculation)() for _ in range(4))
596 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • As you can see, the parallel version is much faster than the serial version
  • It is not exactly 4 times faster because there is some overhead in creating the parallel processes
  • But the difference is still significant
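  • A practical note: you can check how many cores your machine has with os.cpu_count(), and n_jobs=-1 tells joblib to use all of them; a quick sketch:
import os

# How many CPU cores are available on this machine?
print(os.cpu_count())

# n_jobs=-1 asks joblib to use every available core
Parallel(n_jobs=-1)(delayed(calculation)() for _ in range(4))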

Another example

Processing multiple input files

  • Say we have a number of input files, like .jpg images, that we want to perform the same actions on, like rotate by 180 degrees and convert to a different format
  • We can define a function that takes a file as input and performs these actions, then map it on a list of files
# Import python image library functions
from PIL import Image

from matplotlib.pyplot import imshow
%matplotlib inline
  • Let’s read an image file and display it:
im = Image.open('./data/kings_cross.jpg')
# Display image
im

Parallel processing of images

  • Rotate the image by 180 degrees
im.rotate(angle=180)

  • Let’s define a function that takes a file name as input, opens the file, rotates it upside down, and then saves the output as a PDF:
def image_flipper(file_name):
    # extract the base file name
    base_name = file_name[0:-4]
    
    # opens the file
    im = Image.open( file_name )

    # rotates by 180deg
    im_flipped = im.rotate(angle=180)
    
    # Saves a PDF output with a new file name
    im_flipped.save(base_name + "_flipped.pdf", format='PDF')

    return base_name + "_flipped.pdf"

Parallel processing of images

  • Let’s see the images we have in the data folder:
import glob

file_list = glob.glob('./data/*jpg')

for f in file_list:
    print(f)
./data/kings_cross.jpg
./data/charing_cross.jpg
./data/victoria.jpg
./data/waterloo.jpg
./data/euston.jpg
./data/fenchurch.jpg
./data/st_pancras.jpg
./data/london_bridge.jpg
./data/liverpool_street.jpg
./data/paddington.jpg
  • Now let’s apply the image_flipper function to each file in the list:
Parallel(n_jobs=4)(delayed(image_flipper)(f) for f in file_list)
['./data/kings_cross_flipped.pdf',
 './data/charing_cross_flipped.pdf',
 './data/victoria_flipped.pdf',
 './data/waterloo_flipped.pdf',
 './data/euston_flipped.pdf',
 './data/fenchurch_flipped.pdf',
 './data/st_pancras_flipped.pdf',
 './data/london_bridge_flipped.pdf',
 './data/liverpool_street_flipped.pdf',
 './data/paddington_flipped.pdf']

Some take-aways

  • Parallel computing can be much faster than serial computing
  • These problems are essentially independent and share no information between them
  • The joblib module makes it simple to run these steps together with a single command
  • This workflow is limited to running on a single computer (or compute node), since there is no mechanism for communicating with other machines

Try it yourself! 🧠

  • Install joblib and numpy if you haven’t done so yet
    • !pip install joblib numpy
  • Compare the time of the serial and parallel versions of the following function:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
  • Then try the parallel version:
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
  • What did you find? Is the parallel version faster?
  • Appendix 01

Dask

Dask

  • Dask is a flexible parallel computing library for analytic computing
  • It is composed of two components:
    • Dynamic task scheduling optimised for computation
    • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments
  • Dask emphasises the following virtues:
    • Familiar: Provides parallelised NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects
    • Native: Enables distributed computing in pure Python
    • Fast: Operates with low overhead, both in small data and large data settings
  • Let’s import Dask and see how it works
import dask.dataframe as dd
import dask.array as da
  • Dask arrays are a parallel and distributed version of NumPy arrays
  • They are composed of many NumPy arrays, split along one or more dimensions
data = np.random.normal(size=100000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a
            Array                       Chunk
Bytes       781.25 kiB                  78.12 kiB
Shape       (200, 500)                  (100, 100)
Dask graph  10 chunks in 1 graph layer
Data type   float64                     numpy.ndarray

Dask arrays

  • Dask arrays support most of the NumPy functions
  • They are lazy, that is, they do not compute anything until you ask for it
a[:10, 5] # first 10 elements of the 6th column
            Array     Chunk
Bytes       80 B      80 B
Shape       (10,)     (10,)
Dask graph  1 chunks in 2 graph layers
Data type   float64   numpy.ndarray
  • We use the .compute() method to compute the result
a[:10, 5].compute()
array([-1.01717283, -0.38495348,  0.4420003 ,  0.42915814, -0.22395892,
       -0.17129497,  0.84975211, -0.45828641, -1.58461724,  0.12297606])

Dask arrays

  • Dask Array supports many common NumPy operations (a few of these are sketched below), including:
    • Basic arithmetic and scalar mathematics (+, *, exp, log, etc.)
    • Reductions along axes (sum(), mean(), std())
    • Tensor operations and matrix multiplication (tensordot)
    • Array slicing and basic indexing
    • Axis reordering and transposition
a.sum().compute()
505.74604200460675
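  • Here is a small sketch of a few of the operations listed above, using the array a defined earlier:
# Mean of each column (a reduction along axis 0)
col_means = a.mean(axis=0)

# Transpose and matrix multiplication (gives a 200 x 200 array)
gram = a @ a.T

# Slicing works just as in NumPy
block = a[:50, 100:200]

# Nothing has been computed yet; trigger the work explicitly
col_means.compute().shape, gram.compute().shape, block.compute().shape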
  • Let’s compare a similar operation in NumPy:
size = 100000000
np_arr = np.random.random(size)
%timeit np_result = np.sqrt(np_arr) + np.sin(np_arr)
542 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • And in Dask:
da_arr = da.random.random(size, chunks='auto')
%timeit da_result = da.sqrt(da_arr) + da.sin(da_arr)
878 μs ± 22.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • Remember that Dask is lazy: this timing only measures building the task graph, since we never called .compute() here

Dask dataframes

  • Dask also provides a parallelised version of Pandas dataframes
  • They are composed of many Pandas dataframes, split along the index
  • Let’s jump into an example:
import dask
df = dask.datasets.timeseries()
  • This is a small dataset of about 240 MB
  • Unlike Pandas, Dask DataFrames are lazy, meaning that data is only loaded when it is needed for a computation
  • No data is printed here; instead, it is replaced by ellipses (...)
df
Dask DataFrame Structure:
                  name     id        x        y
npartitions=30
2000-01-01      object  int64  float64  float64
2000-01-02         ...    ...      ...      ...
...                ...    ...      ...      ...
2000-01-30         ...    ...      ...      ...
2000-01-31         ...    ...      ...      ...
Dask Name: make-timeseries, 1 expression
  • Nonetheless, the column names and dtypes are known
df.dtypes
name     object
id        int64
x       float64
y       float64
dtype: object

Dask dataframes

  • Dask DataFrames support a large subset of the Pandas API
  • Some operations will automatically display the data
import pandas as pd

pd.options.display.precision = 2
pd.options.display.max_rows = 10

df.head()
                        name   id          x      y
timestamp
2000-01-01 00:00:00    Zelda  993   3.24e-01  -0.15
2000-01-01 00:00:01  Norbert  968  -6.41e-01   0.49
2000-01-01 00:00:02    Jerry  990   6.81e-01  -0.19
2000-01-01 00:00:03   Ingrid  951  -5.18e-01  -0.30
2000-01-01 00:00:04    Quinn  974   2.28e-03  -0.82
  • This example shows how to slice the data based on a condition and then determine the standard deviation of the data in the x column
df2 = df[df.y > 0]
df3 = df2.groupby("name").x.std()
df3
Dask Series Structure:
npartitions=1
    float64
        ...
Dask Name: getitem, 7 expressions
Expr=(((Filter(frame=FromMap(7729135), predicate=FromMap(7729135)['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
  • Note that df3 is still not shown
  • We can use the .compute() method to display the result
df3.compute()
name
Bob         0.58
Dan         0.58
Edith       0.58
George      0.58
Hannah      0.58
            ... 
Patricia    0.58
Tim         0.58
Ursula      0.58
Victor      0.58
Xavier      0.58
Name: x, Length: 26, dtype: float64

Dask dataframes

  • Aggregations are also supported
  • Here we aggregate the sum of the x column and the maximum of the y column by the name column
df4 = df.groupby("name").aggregate({"x": "sum", "y": "max"})
df4.compute()
                x    y
name
Ingrid     -74.28  1.0
Sarah     -547.34  1.0
Alice     -255.63  1.0
Patricia   218.95  1.0
Kevin       43.66  1.0
...           ...  ...
Tim        371.42  1.0
Edith      396.76  1.0
Ursula    -277.10  1.0
Oliver    -227.67  1.0
Dan         83.08  1.0

26 rows × 2 columns

  • If you have enough RAM for your dataset, you can persist the data in memory
  • We use the .persist() method to do this, and then the data will be available for future computations
df5 = df4.persist()
df5.head()
                x    y
name
Ingrid     -74.28  1.0
Sarah     -547.34  1.0
Alice     -255.63  1.0
Patricia   218.95  1.0
Kevin       43.66  1.0

Time series example

  • Since the Dask dataframe we are using has a time series structure, we can use the resample method to aggregate the data by a time period
  • Let’s resample the data by 1 hour and calculate the mean of the x and y columns
df[["x", "y"]].resample("1h").mean().head(3)
                             x          y
timestamp
2000-01-01 00:00:00  -5.44e-03  -6.49e-03
2000-01-01 01:00:00   3.84e-03  -8.69e-03
2000-01-01 02:00:00   1.28e-02  -9.08e-03
  • We can also use the rolling method to calculate a rolling mean of the data
df[["x", "y"]].rolling(window="24h").mean().head()
                         x      y
timestamp
2000-01-01 00:00:00   0.32  -0.15
2000-01-01 00:00:01  -0.16   0.17
2000-01-01 00:00:02   0.12   0.05
2000-01-01 00:00:03  -0.04  -0.03
2000-01-01 00:00:04  -0.03  -0.19

Try it yourself! 🧠

  • Install dask if you haven’t done so yet
    • !pip install dask
  • Create a Dask array with 1 million random numbers and calculate the mean of the square root of the array
  • Compare the time it takes to run the same operation in NumPy
  • Appendix 02

Dask and SQL

  • dask-sql is a library that allows you to run SQL queries on Dask DataFrames
  • It is built on top of Apache Calcite, a SQL interpreter
  • It is still under development, but it already has many features
  • You can install it with pip install dask-sql
  • A dask_sql.Context is the Python equivalent of a SQL database
  • It serves as an interface to register all tables and functions used in SQL queries, as well as execute the queries themselves
  • A single Context is created and used for the duration of a Python script or notebook
from dask_sql import Context
c = Context()

Dask and SQL

  • Once a Context has been created, there are many ways to register tables in it
  • The simplest way to do this is through the create_table method
  • Supported input types include Pandas DataFrames, Dask DataFrames, remote datasets (like on Amazon S3), and more (a short sketch with a Pandas DataFrame appears at the end of this slide)
# Register and persist a dask table
from dask.datasets import timeseries
ddf = timeseries()
c.create_table("dask", ddf, persist=True)
  • Now we can run SQL queries on the timeseries table
c.sql("SELECT AVG(x) FROM dask").compute()
   AVG(dask.x)
0     4.81e-04
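  • As mentioned above, create_table also accepts plain Pandas DataFrames; a minimal sketch (the scores table is just for illustration):
import pandas as pd

# Register a small Pandas DataFrame as a SQL table
scores = pd.DataFrame({"student": ["ana", "bo", "cy"], "grade": [90, 85, 78]})
c.create_table("scores", scores)

# Query it like any other registered table
c.sql("SELECT AVG(grade) FROM scores").compute()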

Dask, SQL and Pandas

  • You can of course combine all three libraries!
# Perform a multi-column sort
res = c.sql("""
    SELECT * FROM dask ORDER BY name ASC, id DESC, x ASC
""")

# Now do some follow-up groupby aggregations
res.groupby("name").agg({"x": "sum", "y": "mean"}).compute()
                x          y
name
Alice     -261.45   1.84e-03
Ingrid    -117.48   5.67e-04
Kevin     -198.47  -8.56e-04
Patricia    47.66   2.99e-03
Sarah      111.36  -3.23e-03
...           ...        ...
Ray        461.83   3.02e-04
Tim        346.05  -5.88e-04
Ursula    -186.78   2.86e-03
Wendy      -16.66   1.18e-03
Xavier    -213.37  -7.37e-04

26 rows × 2 columns

Read and write data with Dask

Reading and writing data

  • .csv is very common in data science (and for good reasons)
  • Pandas reads and writes .csv files very well, but it is not the best option for large files
  • It may need several gigabytes of memory to read a large file, as it reads the entire file into memory
  • Dask provides a much more efficient way to read and write .csv files
  • Let’s split our dataset
df = dask.datasets.timeseries()
df
Dask DataFrame Structure:
                  name     id        x        y
npartitions=30
2000-01-01      object  int64  float64  float64
2000-01-02         ...    ...      ...      ...
...                ...    ...      ...      ...
2000-01-30         ...    ...      ...      ...
2000-01-31         ...    ...      ...      ...
Dask Name: make-timeseries, 1 expression
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """ Provide date for filename given index

    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name);

Reading and writing data

  • We now have many CSV files in our data directory, one for each day in the month of January 2000
  • Each CSV file holds timeseries data for that day
  • We can read all of them as one logical dataframe using the dd.read_csv function
df = dd.read_csv('data/2000-*-*.csv')
df
Dask DataFrame Structure:
               timestamp    name     id        x        y
npartitions=30
                  object  object  int64  float64  float64
                     ...     ...    ...      ...      ...
...                  ...     ...    ...      ...      ...
                     ...     ...    ...      ...      ...
                     ...     ...    ...      ...      ...
Dask Name: read_csv, 1 expression
  • Let’s do a simple computation
%timeit df.groupby('name').x.mean().compute()
1.34 s ± 451 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Reading and writing data

Parquet files

  • Although .csv files are nice, new formats like Parquet are gaining popularity
  • Data are stored by column rather than by row, allowing for efficient querying of specific columns without reading entire rows
  • It can be used with various programming languages and data processing frameworks
  • It can achieve up to a 75% reduction in file size compared to uncompressed formats
df.to_parquet('data/2000-01.parquet', engine='pyarrow')
  • Now we can read the parquet file
df = dd.read_parquet('data/2000-01.parquet', columns=['name', 'x'], engine='pyarrow')
df
Dask DataFrame Structure:
                   name        x
npartitions=30
                 object  float64
                    ...      ...
...                 ...      ...
                    ...      ...
                    ...      ...
Dask Name: read_parquet, 1 expression
  • Let’s do the same computation as before
%timeit df.groupby('name').x.mean().compute()
163 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Dask delayed

Dask delayed

  • Sometimes you don’t want to use an entire Dask DataFrame or Dask Array
  • You may want to parallelise a single function, for instance, or a small part of your code
  • There is a way to do this with Dask: the dask.delayed function
def calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

%timeit [calculation() for _ in range(4)]
1.81 s ± 9.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • We just need to add the @dask.delayed decorator to the function
@dask.delayed
def delayed_calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

results = []
for _ in range(5):
    results.append(delayed_calculation())

# Compute all results at once
%timeit final_results = dask.compute(*results)
803 ms ± 64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • As you can see, the results are notably faster, and we barely had to change the code at all!
  • Just remember to call dask.compute() (or the .compute() method) at the end

Dask delayed

  • We can even visualise the computation graph
@dask.delayed
def generate_data(size):
    return np.random.rand(size)

@dask.delayed
def transform_data(data):
    return np.sqrt(data) + np.sin(data)

@dask.delayed
def aggregate_data(data):
    return {
        'mean': np.mean(data),
        'std': np.std(data),
        'max': np.max(data)
    }

# Compare execution
sizes = [1000000, 2000000, 3000000]

# Dask execution
dask_results = []
for size in sizes:
    data = generate_data(size)
    transformed = transform_data(data)
    stats = aggregate_data(transformed)
    dask_results.append(stats)

%timeit dask.compute(*dask_results)
37.9 ms ± 349 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
dask.visualize(*dask_results)

Best practices

Best practices

  • Parallel computing can be a powerful tool, but it is not always the best solution
  • So here are some tips from the creators of Dask themselves (a short sketch illustrating a couple of them follows this list):
  • Start small: NumPy, pandas, Scikit-learn may have faster functions for what you’re trying to do. Sometimes just by changing the data format to Parquet you can get a huge speedup
  • Sampling: If you have a large dataset, try working with a sample of the data first. Do you really need all those TBs of data?
  • Load data with Dask: If you have a large dataset, load it with Dask from the start. It’s much easier to scale up than to scale down
  • Use .compute and .persist sparingly: These functions can be expensive, so use them only when you need to
  • Break up computations into many pieces: This will allow Dask to parallelise the computation faster
  • More tips can be found here
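  • A rough sketch of a couple of these tips in practice (the Parquet path is reused from the earlier slides; adjust it to your own data):
import dask.dataframe as dd

# Load the data with Dask from the start (Parquet rather than CSV)
df = dd.read_parquet('data/2000-01.parquet')

# Prototype on a small sample before touching the full dataset
sample = df.sample(frac=0.01).compute()

# Build the whole pipeline lazily and call .compute() only once, at the end
result = df.groupby('name').x.mean().compute()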

And that’s all for today! 🎉

See you next time! 😊

Appendix 01

  • Here is the solution to the exercise:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
60.3 ms ± 683 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
2.39 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Back to exercise

Appendix 02

  • Create a Dask array with 1 million random numbers and calculate the mean of the square root of the array
  • Compare the time it takes to run the same operation in NumPy
size = 1000000
np_arr = np.random.random(size)
%timeit np_result = np.sqrt(np_arr).mean()
451 μs ± 60.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
da_arr = da.random.random(size, chunks=100000)
%timeit da_result = da.sqrt(da_arr).mean().compute()
3.72 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Back to exercise