Lecture 32 - Parallel Computing & Docker Practice
In this session, you will apply your knowledge of parallel computing and containerisation. The goal is to move from writing simple serial code to creating scalable, parallel computations with Dask, and finally, packaging your entire workflow into a portable, reproducible Docker container.
To ensure our work is reproducible, we must first define and create a consistent Python environment. This practice requires Python 3.10 and several specific package versions.
Write the single terminal command to create a new conda environment named qtm350-parallel that includes python=3.10 and the following packages:

- dask-sql=2024.5.0
- dask=2024.4.1
- ipykernel=6.29.3
- joblib=1.3.2
- numpy=1.26.4
- pandas=2.2.1
After creating it, remember to activate it and select it as the kernel for this notebook.
You can either paste the bash/zsh command below, or run it from Python using the subprocess library.
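A minimal sketch of such a command (the conda-forge channel is an assumption; dask-sql is typically distributed there):

```bash
# Create the environment with all pinned versions in one command
conda create -n qtm350-parallel -c conda-forge python=3.10 \
    dask-sql=2024.5.0 dask=2024.4.1 ipykernel=6.29.3 \
    joblib=1.3.2 numpy=1.26.4 pandas=2.2.1 -y

# Then activate it and select it as the notebook kernel
conda activate qtm350-parallel
```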
joblib

Imagine you have a list of 100 data records to process. Each record takes a small but non-trivial amount of time. Your task is to simulate this using joblib to see the benefits of parallelisation.
Task:

- Write a function process_record(record_id) that simulates work by printing which record it’s processing, then sleeps for 0.1 seconds (using time.sleep(0.1)) and returns record_id * 2.
- Create a list of record_ids from 0 to 99.
- Use joblib.Parallel to run this function on all record IDs using 4 cores (n_jobs=4).
- Compare the runtime against a plain serial for loop. You can use the %time magic for this. A sketch of the parallel version follows the list.
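A minimal sketch (the serial baseline is shown as a commented-out list comprehension over the same function):

```python
import time
from joblib import Parallel, delayed

def process_record(record_id):
    # Simulate a small, non-trivial unit of work
    print(f"Processing record {record_id}")
    time.sleep(0.1)
    return record_id * 2

record_ids = list(range(100))

# Serial baseline: roughly 100 * 0.1 = 10 seconds
# results_serial = [process_record(rid) for rid in record_ids]

# Parallel version on 4 cores: roughly a quarter of the serial time
results = Parallel(n_jobs=4)(delayed(process_record)(rid) for rid in record_ids)
```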
Dask is perfect for linear algebra on very large matrices.
Task:

- Create a Dask array X of random numbers with a shape of (20000, 5000).
- Compute the matrix multiplication of the transpose of X with X (i.e., X.T @ X). See the sketch below.
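A sketch of one way to do this (the chunk size is an arbitrary assumption):

```python
import dask.array as da

# 20000 x 5000 array of random numbers, split into lazy chunks
X = da.random.random((20000, 5000), chunks=(5000, 5000))

# Build the task graph for X.T @ X, then trigger the computation
result = (X.T @ X).compute()
print(result.shape)  # (5000, 5000)
```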
This task focuses on a fundamental groupby aggregation on a larger-than-memory dataset. We will create a large sample employee dataset and calculate the average salary for each department.
```python
import pandas as pd
import numpy as np
import dask.dataframe as dd

# Create a large sample pandas DataFrame
n_rows = 1_000_000
departments = ['Engineering', 'HR', 'Sales', 'Marketing']
data = {
    'department': np.random.choice(departments, n_rows),
    'salary': np.random.randint(50000, 150000, n_rows),
    'years_of_service': np.random.randint(1, 20, n_rows)
}
pdf = pd.DataFrame(data)

# Convert to a Dask DataFrame
df = dd.from_pandas(pdf, npartitions=4)

# Your code here
```
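One way to fill in the blank, as a sketch using the standard Dask groupby-mean pattern:

```python
# Average salary per department; .compute() materialises the result
avg_salary = df.groupby('department')['salary'].mean().compute()
print(avg_salary)
```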
This task uses the same employee dataset but introduces SQL for more expressive querying.
Task:
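As a sketch of what such a query could look like with dask-sql, assuming the goal mirrors the previous exercise (average salary per department); the table name employees is a hypothetical choice:

```python
from dask_sql import Context

# Register the Dask DataFrame as a SQL-queryable table
c = Context()
c.create_table('employees', df)

# The same aggregation as before, expressed in SQL
result = c.sql("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
""").compute()
print(result)
```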
A “Hello, World!” is good, but a real script has dependencies. Let’s create a Dockerfile for a Python script that uses pandas.
Your task: Write a Dockerfile that:

- Uses the python:3.10-slim base image.
- Uses pip to install pandas.
- Sets the working directory to /data.
- Copies generate_report.py into the container.
- Runs the script with CMD.

A sketch follows the list.
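One possible Dockerfile (a minimal sketch, following the requirements above):

```dockerfile
# Start from a slim Python 3.10 image
FROM python:3.10-slim

# Install the script's only dependency
RUN pip install pandas

# Work inside /data and copy the script there
WORKDIR /data
COPY generate_report.py .

# Run the report when the container starts
CMD ["python", "generate_report.py"]
```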
To test, create a local file generate_report.py that imports pandas, creates a simple DataFrame, and prints it. Sample code is provided below.
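A minimal version might look like this (the column names are placeholders):

```python
import pandas as pd

# A tiny DataFrame to confirm pandas works inside the container
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})
print(df)
```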
Using an environment.yml file is standard for reproducible environments. We will build a Docker image using this best practice.
Task: Write a Dockerfile that:

- Uses the continuumio/miniconda3 base image.
- Copies the environment.yml file.
- Uses conda to create the environment from the file.
- Starts a bash shell within the activated environment.

A sketch appears after the list.
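A sketch under the assumption that environment.yml names the environment qtm350-parallel, as in the first exercise:

```dockerfile
FROM continuumio/miniconda3

WORKDIR /app
COPY environment.yml .

# Build the environment described in environment.yml
RUN conda env create -f environment.yml

# Have interactive bash shells activate the environment automatically
# (assumes the environment is named qtm350-parallel)
RUN conda init bash && \
    echo "conda activate qtm350-parallel" >> ~/.bashrc

# Drop into a bash shell; .bashrc handles activation when run with -it
CMD ["/bin/bash"]
```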