QTM 350 - Data Science Computing

Lecture 21 - Parallel Computing

Danilo Freire

Emory University

13 November, 2024

Hello, my friends! 😊

Today’s agenda 📅

Lecture outline

  • Serial vs Parallel Algorithms
  • Python implementations of parallelism
    • Single node
    • Multi-node
  • Joblib and Dask for parallel computing
  • Tools for further exploration

Serial vs Parallel Algorithms

Serial Execution

  • Typical programs execute their lines sequentially:
# Import packages
import numpy as np

# Define an array of numbers
foo = np.array([0, 1, 2, 3, 4, 5])

# Define a function that squares numbers
def bar(x):
    return x * x

# Loop over each element and perform an action on it
for element in foo:
    # Print the result of bar for this element
    print(bar(element))
0
1
4
9
16
25

The map function

  • A key tool that we will utilise later is called map
  • This lets us apply a function to each element in a list or array:
# (Very) inefficient way to define a map function
def my_map(function, array):
    # create a container for the results
    output = []

    # loop over each element
    for element in array:
        
        # add the intermediate result to the container
        output.append(function(element))
    
    # return the now-filled container
    return output
my_map(bar, foo)
[0, 1, 4, 9, 16, 25]
  • Python helpfully provides a map function in the standard library:
list(map(bar, foo))
[0, 1, 4, 9, 16, 25]
  • The built-in map function is much more flexible and fully featured than ours, so of course you should use that one! 😂
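  • For instance, the built-in map can walk over several iterables at once, something our toy version cannot do; a quick illustration:
# map applies the function element-wise across the two lists
list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))
[11, 22, 33]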

Using joblib for parallel computing

  • In the example we showed before, no step of the map call depends on the other steps
  • Rather than waiting for the function to loop over each value, we could create multiple instances of the function bar and apply it to each value simultaneously
  • There are several methods to achieve this, but we will use joblib for this purpose
  • The Parallel function from joblib is used to parallelise the task across as many jobs as we want
  • The n_jobs parameter specifies the number of jobs to run in parallel
  • The delayed function wraps bar so that each call becomes a lazy task, executed only once the parallel workers are ready
  • The results variable will contain the output of the parallel computation
  • Using our bar function and foo array from before:
from joblib import Parallel, delayed

results = Parallel(n_jobs=6)(delayed(bar)(x) for x in foo)
results
[0, 1, 4, 9, 16, 25]
  • What joblib is doing here is creating up to 6 worker processes and applying bar to different elements of the foo array at the same time
  • As you can see, the results are the same as before
  • The difference is that the computation is now done in parallel
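  • As an aside, joblib can also run the jobs in threads instead of separate processes and report progress while it works; a minimal sketch:
# Same computation, using threads instead of worker processes and printing progress
# (prefer='threads' and verbose are standard joblib.Parallel arguments)
Parallel(n_jobs=6, prefer='threads', verbose=10)(delayed(bar)(x) for x in foo)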

Serial vs parallel execution

  • Let’s see another example of the difference between serial and parallel execution

  • We will use the %timeit magic command to measure the time it takes to run a function

  • Serial:

def calculation(size=10000000):
    # Create a large array and perform operations
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

# Single run
%timeit calculation()
451 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Now let’s run the same function 4 times:
# Sequential runs (4 times)
%timeit [calculation() for _ in range(4)]
1.79 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Now let’s see the parallel version:
# Parallel runs (4 times)
%timeit Parallel(n_jobs=4)(delayed(calculation)() for _ in range(4))
596 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • As you can see, the parallel version is much faster than the serial version
  • It is not exactly 4 times faster because there is some overhead in creating the parallel processes
  • But the difference is still significant
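  • A practical note: you can check how many cores your machine has with os.cpu_count(), and n_jobs=-1 tells joblib to use all of them; a quick sketch:
import os

# How many CPU cores are available on this machine?
print(os.cpu_count())

# n_jobs=-1 asks joblib to use every available core
Parallel(n_jobs=-1)(delayed(calculation)() for _ in range(4))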

Another example

Processing multiple input files

  • Say we have a number of input files, like .jpg images, that we want to perform the same actions on, like rotate by 180 degrees and convert to a different format
  • We can define a function that takes a file as input and performs these actions, then map it on a list of files
# Import python image library functions
from PIL import Image

from matplotlib.pyplot import imshow
%matplotlib inline
  • Let’s read an image file and display it:
im = Image.open('./data/kings_cross.jpg')
# Display image
im

Parallel processing of images

  • Rotate the image by 180 degrees
im.rotate(angle=180)

  • Let’s define a function that takes a file name as input, opens the file, rotates it upside down, and then saves the output as a PDF:
def image_flipper(file_name):
    # extract the base file name
    base_name = file_name[0:-4]
    
    # opens the file
    im = Image.open( file_name )

    # rotates by 180deg
    im_flipped = im.rotate(angle=180)
    
    # Saves a PDF output with a new file name
    im_flipped.save(base_name + "_flipped.pdf", format='PDF')

    return base_name + "_flipped.pdf"

Parallel processing of images

  • Let’s see the images we have in the data folder:
import glob

file_list = glob.glob('./data/*jpg')

for f in file_list:
    print(f)
./data/kings_cross.jpg
./data/charing_cross.jpg
./data/victoria.jpg
./data/waterloo.jpg
./data/euston.jpg
./data/fenchurch.jpg
./data/st_pancras.jpg
./data/london_bridge.jpg
./data/liverpool_street.jpg
./data/paddington.jpg
  • Now let’s apply the image_flipper function to each file in the list:
Parallel(n_jobs=4)(delayed(image_flipper)(f) for f in file_list)
['./data/kings_cross_flipped.pdf',
 './data/charing_cross_flipped.pdf',
 './data/victoria_flipped.pdf',
 './data/waterloo_flipped.pdf',
 './data/euston_flipped.pdf',
 './data/fenchurch_flipped.pdf',
 './data/st_pancras_flipped.pdf',
 './data/london_bridge_flipped.pdf',
 './data/liverpool_street_flipped.pdf',
 './data/paddington_flipped.pdf']

Some take-aways

  • Parallel computing can be much faster than serial computing
  • These problems are essentially independent and share no information between them
  • The joblib module makes it simple to run these steps together with a single command
  • This workflow is limited to running on a single computer (or compute node), since there is no mechanism for communicating with other machines

Try it yourself! 🧠

  • Install joblib and numpy if you haven’t done so yet
    • !pip install joblib numpy
  • Compare the time of the serial and parallel versions of the following function:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
  • Then try the parallel version:
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
  • What did you find? Is the parallel version faster?
  • Appendix 01

Dask

Dask

  • Dask is a flexible parallel computing library for analytic computing
  • It is composed of two components:
    • Dynamic task scheduling optimised for computation
    • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments
  • Dask emphasises the following virtues:
    • Familiar: Provides parallelised NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects
    • Native: Enables distributed computing in pure Python
    • Fast: Operates with low overhead, both in small data and large data settings
  • Let’s import Dask and see how it works
import dask.dataframe as dd
import dask.array as da
  • Dask arrays are a parallel and distributed version of NumPy arrays
  • They are composed of many NumPy arrays, split along one or more dimensions
data = np.random.normal(size=100000).reshape(200, 500)
a = da.from_array(data, chunks=(100, 100))
a
            Array                       Chunk
Bytes       781.25 kiB                  78.12 kiB
Shape       (200, 500)                  (100, 100)
Dask graph  10 chunks in 1 graph layer
Data type   float64                     numpy.ndarray

Dask arrays

  • Dask arrays support most of the NumPy functions
  • They are lazy, that is, they do not compute anything until you ask for it
a[:10, 5] # first 10 elements of the 6th column
            Array     Chunk
Bytes       80 B      80 B
Shape       (10,)     (10,)
Dask graph  1 chunks in 2 graph layers
Data type   float64   numpy.ndarray
  • We use the .compute() method to compute the result
a[:10, 5].compute()
array([-1.01717283, -0.38495348,  0.4420003 ,  0.42915814, -0.22395892,
       -0.17129497,  0.84975211, -0.45828641, -1.58461724,  0.12297606])

Dask arrays

  • Dask Array supports many common NumPy operations (a few of these are sketched below), including:
    • Basic arithmetic and scalar mathematics (+, *, exp, log, etc.)
    • Reductions along axes (sum(), mean(), std())
    • Tensor operations and matrix multiplication (tensordot)
    • Array slicing and basic indexing
    • Axis reordering and transposition
a.sum().compute()
505.74604200460675
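  • Here is a small sketch of a few of the operations listed above, using the array a defined earlier:
# Mean of each column (a reduction along axis 0)
col_means = a.mean(axis=0)

# Transpose and matrix multiplication (gives a 200 x 200 array)
gram = a @ a.T

# Slicing works just as in NumPy
block = a[:50, 100:200]

# Nothing has been computed yet; trigger the work explicitly
col_means.compute().shape, gram.compute().shape, block.compute().shape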
  • Let’s compare a similar operation in NumPy:
size = 100000000
np_arr = np.random.random(size)
%timeit np_result = np.sqrt(np_arr) + np.sin(np_arr)
542 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • And in Dask:
da_arr = da.random.random(size, chunks='auto')
%timeit da_result = da.sqrt(da_arr) + da.sin(da_arr)
878 μs ± 22.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
  • Remember that Dask is lazy: this timing only measures building the task graph, since we never called .compute() here

Dask dataframes

  • Dask also provides a parallelised version of Pandas dataframes
  • They are composed of many Pandas dataframes, split along the index
  • Let’s jump into an example:
import dask
df = dask.datasets.timeseries()
  • This is a small dataset of about 240 MB
  • Unlike Pandas, Dask DataFrames are lazy, meaning that data is only loaded when it is needed for a computation
  • No data is printed here; instead, it is replaced by ellipses (...)
df
Dask DataFrame Structure:
                  name     id        x        y
npartitions=30
2000-01-01      object  int64  float64  float64
2000-01-02         ...    ...      ...      ...
...                ...    ...      ...      ...
2000-01-30         ...    ...      ...      ...
2000-01-31         ...    ...      ...      ...
Dask Name: make-timeseries, 1 expression
  • Nonetheless, the column names and dtypes are known
df.dtypes
name     object
id        int64
x       float64
y       float64
dtype: object

Dask dataframes

  • Dask DataFrames support a large subset of the Pandas API
  • Some operations will automatically display the data
import pandas as pd

pd.options.display.precision = 2
pd.options.display.max_rows = 10

df.head()
                        name   id          x      y
timestamp
2000-01-01 00:00:00    Zelda  993   3.24e-01  -0.15
2000-01-01 00:00:01  Norbert  968  -6.41e-01   0.49
2000-01-01 00:00:02    Jerry  990   6.81e-01  -0.19
2000-01-01 00:00:03   Ingrid  951  -5.18e-01  -0.30
2000-01-01 00:00:04    Quinn  974   2.28e-03  -0.82
  • This example shows how to slice the data based on a condition and then determine the standard deviation of the data in the x column
df2 = df[df.y > 0]
df3 = df2.groupby("name").x.std()
df3
Dask Series Structure:
npartitions=1
    float64
        ...
Dask Name: getitem, 7 expressions
Expr=(((Filter(frame=FromMap(7729135), predicate=FromMap(7729135)['y'] > 0))[['name', 'x']]).std(ddof=1, numeric_only=False, split_out=None, observed=False))['x']
  • Note that df3 is still not shown
  • We can use the .compute() method to display the result
df3.compute()
name
Bob         0.58
Dan         0.58
Edith       0.58
George      0.58
Hannah      0.58
            ... 
Patricia    0.58
Tim         0.58
Ursula      0.58
Victor      0.58
Xavier      0.58
Name: x, Length: 26, dtype: float64

Dask dataframes

  • Aggregations are also supported
  • Here we aggregate the sum of the x column and the maximum of the y column by the name column
df4 = df.groupby("name").aggregate({"x": "sum", "y": "max"})
df4.compute()
                x    y
name
Ingrid     -74.28  1.0
Sarah     -547.34  1.0
Alice     -255.63  1.0
Patricia   218.95  1.0
Kevin       43.66  1.0
...           ...  ...
Tim        371.42  1.0
Edith      396.76  1.0
Ursula    -277.10  1.0
Oliver    -227.67  1.0
Dan         83.08  1.0

26 rows × 2 columns

  • If you have enough RAM for your dataset, you can persist the data in memory
  • We use the .persist() method to do this, and then the data will be available for future computations
df5 = df4.persist()
df5.head()
                x    y
name
Ingrid     -74.28  1.0
Sarah     -547.34  1.0
Alice     -255.63  1.0
Patricia   218.95  1.0
Kevin       43.66  1.0

Time series example

  • Since the Dask dataframe we are using has a time series structure, we can use the resample method to aggregate the data by a time period
  • Let’s resample the data by 1 hour and calculate the mean of the x and y columns
df[["x", "y"]].resample("1h").mean().head(3)
                             x          y
timestamp
2000-01-01 00:00:00  -5.44e-03  -6.49e-03
2000-01-01 01:00:00   3.84e-03  -8.69e-03
2000-01-01 02:00:00   1.28e-02  -9.08e-03
  • We can also use the rolling method to calculate a rolling mean of the data
df[["x", "y"]].rolling(window="24h").mean().head()
                         x      y
timestamp
2000-01-01 00:00:00   0.32  -0.15
2000-01-01 00:00:01  -0.16   0.17
2000-01-01 00:00:02   0.12   0.05
2000-01-01 00:00:03  -0.04  -0.03
2000-01-01 00:00:04  -0.03  -0.19

Try it yourself! 🧠

  • Install dask if you haven’t done so yet
    • !pip install dask
  • Create a Dask array with 1 million random numbers and calculate the mean of the square root of the array
  • Compare the time it takes to run the same operation in NumPy
  • Appendix 02

Dask and SQL

  • dask-sql is a library that allows you to run SQL queries on Dask DataFrames
  • It is built on top of Apache Calcite, a SQL interpreter
  • It is still under development, but it already has many features
  • You can install it with pip install dask-sql
  • A dask_sql.Context is the Python equivalent of a SQL database
  • It serves as an interface to register all tables and functions used in SQL queries, as well as execute the queries themselves
  • A single Context is created and used for the duration of a Python script or notebook
from dask_sql import Context
c = Context()

Dask and SQL

  • Once a Context has been created, there are many ways to register tables in it
  • The simplest way to do this is through the create_table method
  • Supported input types include Pandas DataFrames, Dask DataFrames, remote datasets (like on Amazon S3), and more (a short sketch with a Pandas DataFrame appears at the end of this slide)
# Register and persist a dask table
from dask.datasets import timeseries
ddf = timeseries()
c.create_table("dask", ddf, persist=True)
  • Now we can run SQL queries on the timeseries table
c.sql("SELECT AVG(x) FROM dask").compute()
   AVG(dask.x)
0     4.81e-04
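  • As mentioned above, create_table also accepts plain Pandas DataFrames; a minimal sketch (the scores table is just for illustration):
import pandas as pd

# Register a small Pandas DataFrame as a SQL table
scores = pd.DataFrame({"student": ["ana", "bo", "cy"], "grade": [90, 85, 78]})
c.create_table("scores", scores)

# Query it like any other registered table
c.sql("SELECT AVG(grade) FROM scores").compute()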

Dask, SQL and Pandas

  • You can of course combine all three libraries!
# Perform a multi-column sort
res = c.sql("""
    SELECT * FROM dask ORDER BY name ASC, id DESC, x ASC
""")

# Now do some follow-up groupby aggregations
res.groupby("name").agg({"x": "sum", "y": "mean"}).compute()
                x          y
name
Alice     -261.45   1.84e-03
Ingrid    -117.48   5.67e-04
Kevin     -198.47  -8.56e-04
Patricia    47.66   2.99e-03
Sarah      111.36  -3.23e-03
...           ...        ...
Ray        461.83   3.02e-04
Tim        346.05  -5.88e-04
Ursula    -186.78   2.86e-03
Wendy      -16.66   1.18e-03
Xavier    -213.37  -7.37e-04

26 rows × 2 columns

Read and write data with Dask

Reading and writing data

  • .csv is very common in data science (and for good reasons)
  • Pandas reads and writes .csv files very well, but it is not the best option for large files
  • It may need several gigabytes of memory to read a large file, as it reads the entire file into memory
  • Dask provides a much more efficient way to read and write .csv files
  • Let’s split our dataset
df = dask.datasets.timeseries()
df
Dask DataFrame Structure:
                  name     id        x        y
npartitions=30
2000-01-01      object  int64  float64  float64
2000-01-02         ...    ...      ...      ...
...                ...    ...      ...      ...
2000-01-30         ...    ...      ...      ...
2000-01-31         ...    ...      ...      ...
Dask Name: make-timeseries, 1 expression
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """ Provide date for filename given index

    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))

df.to_csv('data/*.csv', name_function=name);

Reading and writing data

  • We now have many CSV files in our data directory, one for each day in the month of January 2000
  • Each CSV file holds timeseries data for that day
  • We can read all of them as one logical dataframe using the dd.read_csv function
df = dd.read_csv('data/2000-*-*.csv')
df
Dask DataFrame Structure:
               timestamp    name     id        x        y
npartitions=30
                  object  object  int64  float64  float64
                     ...     ...    ...      ...      ...
...                  ...     ...    ...      ...      ...
                     ...     ...    ...      ...      ...
                     ...     ...    ...      ...      ...
Dask Name: read_csv, 1 expression
  • Let’s do a simple computation
%timeit df.groupby('name').x.mean().compute()
1.34 s ± 451 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Reading and writing data

Parquet files

  • Although .csv files are nice, new formats like Parquet are gaining popularity
  • Data are stored by column rather than by row, allowing for efficient querying of specific columns without reading entire rows
  • It can be used with various programming languages and data processing frameworks
  • It can achieve up to a 75% reduction in file size compared to uncompressed formats
df.to_parquet('data/2000-01.parquet', engine='pyarrow')
  • Now we can read the parquet file
df = dd.read_parquet('data/2000-01.parquet', columns=['name', 'x'], engine='pyarrow')
df
Dask DataFrame Structure:
                   name        x
npartitions=30
                 object  float64
                    ...      ...
...                 ...      ...
                    ...      ...
                    ...      ...
Dask Name: read_parquet, 1 expression
  • Let’s do the same computation as before
%timeit df.groupby('name').x.mean().compute()
163 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Dask delayed

Dask delayed

  • Sometimes you don’t want to use an entire Dask DataFrame or Dask Array
  • You may want to parallelise a single function, for instance, or a small part of your code
  • There is a way to do this with Dask: the dask.delayed function
def calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

%timeit [calculation() for _ in range(4)]
1.81 s ± 9.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • We just need to add the @dask.delayed decorator to the function
@dask.delayed
def delayed_calculation(size=10000000):
    arr = np.random.rand(size)
    for _ in range(10):
        arr = np.sqrt(arr) + np.sin(arr)
    return np.mean(arr)

results = []
for _ in range(5):
    results.append(delayed_calculation())

# Compute all results at once
%timeit final_results = dask.compute(*results)
803 ms ± 64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • As you can see, the results are notably faster, and we barely had to change the code at all!
  • Just remember to call dask.compute() (or the .compute() method) at the end

Dask delayed

  • We can even visualise the computation graph
@dask.delayed
def generate_data(size):
    return np.random.rand(size)

@dask.delayed
def transform_data(data):
    return np.sqrt(data) + np.sin(data)

@dask.delayed
def aggregate_data(data):
    return {
        'mean': np.mean(data),
        'std': np.std(data),
        'max': np.max(data)
    }

# Compare execution
sizes = [1000000, 2000000, 3000000]

# Dask execution
dask_results = []
for size in sizes:
    data = generate_data(size)
    transformed = transform_data(data)
    stats = aggregate_data(transformed)
    dask_results.append(stats)

%timeit dask.compute(*dask_results)
37.9 ms ± 349 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
dask.visualize(*dask_results)

Best practices

Best practices

  • Parallel computing can be a powerful tool, but it is not always the best solution
  • So here are some tips from the creators of Dask themselves (a short sketch illustrating a couple of them follows this list):
  • Start small: NumPy, pandas, Scikit-learn may have faster functions for what you’re trying to do. Sometimes just by changing the data format to Parquet you can get a huge speedup
  • Sampling: If you have a large dataset, try working with a sample of the data first. Do you really need all those TBs of data?
  • Load data with Dask: If you have a large dataset, load it with Dask from the start. It’s much easier to scale up than to scale down
  • Use .compute and .persist sparingly: These functions can be expensive, so use them only when you need to
  • Break up computations into many pieces: This will allow Dask to parallelise the computation faster
  • More tips can be found here
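  • A rough sketch of a couple of these tips in practice (the Parquet path is reused from the earlier slides; adjust it to your own data):
import dask.dataframe as dd

# Load the data with Dask from the start (Parquet rather than CSV)
df = dd.read_parquet('data/2000-01.parquet')

# Prototype on a small sample before touching the full dataset
sample = df.sample(frac=0.01).compute()

# Build the whole pipeline lazily and call .compute() only once, at the end
result = df.groupby('name').x.mean().compute()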

And that’s all for today! 🎉

See you next time! 😊

Appendix 01

  • Here is the solution to the exercise:
def square(x):
    return x**2

# Create a large array to process
numbers = np.arange(1000000)

# Sequential version
%timeit [square(x) for x in numbers]
60.3 ms ± 683 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
from joblib import Parallel, delayed

# Parallel version
%timeit Parallel(n_jobs=4)(delayed(square)(x) for x in numbers)
2.39 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Back to exercise

Appendix 02

  • Create a Dask array with 1 million random numbers and calculate the mean of the square root of the array
  • Compare the time it takes to run the same operation in NumPy
size = 1000000
np_arr = np.random.random(size)
%timeit np_result = np.sqrt(np_arr).mean()
451 μs ± 60.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
da_arr = da.random.random(size, chunks=100000)
%timeit da_result = da.sqrt(da_arr).mean().compute()
3.72 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Back to exercise