QTM 350 - Data Science Computing

Lecture 22 - Parallelising Data Analysis with Dask and AutoML

Danilo Freire

Emory University

18 October, 2024

Hello again! 😊
How’s everything?

Brief recap of last class 📚

Parallel computing

  • Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously
  • Python has several libraries that allow you to parallelise your code
    • We discussed the joblib library, which is a simple and effective way to parallelise your code
    • But we spent most of our time discussing dask, which is currently the de facto standard for parallel computing in Python
  • We also saw that parallel computing is not a panacea: it can be hard to implement and may not always lead to performance improvements
  • But when it works, it works great! 🚀

Today’s agenda 📅

Lecture outline

  • Today we will continue our discussion on parallel computing
  • We will focus on parallelising data analysis tasks with Dask ML
  • More specifically, we will discuss how to parallelise the training of machine learning models, and how to use automated machine learning (AutoML) tools to speed up the process
  • We will also learn about Dask Clusters, which allow you to scale your computations across multiple machines (or just one!)

Dask Clients and Clusters 🌐

What is a Dask Cluster?

Workers and schedulers

  • A Dask Cluster is a collection of Dask workers that can be used to parallelise your computations

  • In plain English, a Dask Cluster is a group of computational engines (cores, GPUs, servers, etc) that work together to solve a problem

  • Workers provide two functions:

    • Compute tasks as directed by the scheduler, which acts as the brain of the cluster
    • Store and serve computed results to other workers or clients
  • Workers are a big part of why lazy evaluation speeds up computations: intermediate results stay in worker memory and are only moved when another task or the client needs them

  • A simple example of workers interacting with a scheduler can help explain how lazy evaluation works (a futures-based code sketch follows the dialogue):

Scheduler -> Eve: Compute a <- multiply(3, 5)!
Eve -> Scheduler: I've computed a and am holding on to it! 
Scheduler -> Frank: Compute b <- add(a, 7)!
Frank: You will need a. Eve has a.
Frank -> Eve: Please send me a.
Eve -> Frank: Sure. a is 15!
Frank -> Scheduler: I've computed b and am holding on to it!
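  • The same exchange can be sketched with Dask’s futures interface. This is a minimal illustration (it assumes a running local Client and uses placeholder multiply and add functions):
from dask.distributed import Client

client = Client()  # starts a local scheduler and workers

def multiply(x, y):
    return x * y

def add(x, y):
    return x + y

# The scheduler assigns each task to a worker; results stay in worker memory
a = client.submit(multiply, 3, 5)  # a worker ("Eve") computes and holds a = 15
b = client.submit(add, a, 7)       # another worker ("Frank") fetches a and computes b = 22

print(b.result())                  # only now is the result pulled back to the client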
  • How do workers know what to do?
    • The scheduler assigns tasks to workers based on their availability
    • Workers can be on the same machine or on different machines
    • Workers can be CPUs or GPUs
  • Dask workers store their results in a local Python dictionary and execute tasks in a thread pool (via the concurrent.futures module), keeping the scheduler informed of what they hold
  • These processes can automatically restart and scale up without any intervention from the user
  • The optimal number of workers depends on the size of the data and the complexity of the computations, but often the default configuration is sufficient
  • In an adaptive cluster, you set the minimum and maximum number of workers and let the cluster add and remove workers as needed (see the sketch below)
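  • A rough sketch of an adaptive local cluster (the worker counts here are arbitrary):
from dask.distributed import LocalCluster, Client

# Start a local cluster and let it scale between 1 and 4 workers based on load
cluster = LocalCluster(n_workers=1)
cluster.adapt(minimum=1, maximum=4)
client = Client(cluster)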

Setting up a Dask Client

  • Using dask.distributed requires that you set up a Client
  • This should be the first thing you do if you intend to use dask.distributed in your analysis
  • It offers low latency, data locality, and data sharing between the workers, and it is easy to set up
from dask.distributed import LocalCluster, Client

# Specify a different port for the dashboard
cluster = LocalCluster(dashboard_address=':8789') 
client = Client(cluster)

# Print the dashboard link
print(f"Dask dashboard is available at: {cluster.dashboard_link}")
Dask dashboard is available at: http://127.0.0.1:8789/status
  • The Client object provides a way to interact with the cluster, submit tasks, and monitor the progress of computations
  • It is also a context manager, so you can use it in a with statement to ensure that the cluster is closed when you are done with it (a sketch follows below)
  • You will see a screen like this in your browser:

Dask Client Dashboard
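  • Since the Client is a context manager, a pattern like this (a minimal sketch) shuts down both the client and the cluster automatically when the block ends:
from dask.distributed import LocalCluster, Client
import dask.array as da

with LocalCluster(dashboard_address=':8789') as cluster, Client(cluster) as client:
    x = da.random.random((1000, 1000), chunks=(100, 100))
    print(x.sum().compute())
# The client and the cluster are closed here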

Testing the Dask Client

  • You can test the Dask Client by running a simple computation
import dask.array as da

# Create a random array
x = da.random.RandomState(42).random((10000, 10000), chunks=(1000, 1000))
x
            Array                Chunk
Bytes       762.94 MiB           7.63 MiB
Shape       (10000, 10000)       (1000, 1000)
Dask graph  100 chunks in 1 graph layer
Data type   float64 numpy.ndarray
# Perform a simple computation
y = (x + x.T).sum()

# Compute the result
y.compute()
np.float64(99987830.48502485)
  • The RandomState object is used to set a random seed, so the results are reproducible
  • We can inspect the client dashboard to see how the computation was distributed across the workers


Dask ML 🤖

Dimensions of Scale

Addressing the Challenges

Challenge 1: Scaling Model Size

  • Model size: the number of parameters in a model
  • If your models become more complex, you need more computational resources to train them
  • Under this scaling challenge, tasks like model training, prediction, and evaluation will (eventually) complete; they just take too long
  • You’ve become compute bound
  • You can continue to use your current algorithms and libraries (pandas, numpy, scikit-learn, etc), but you need to scale them up

Challenge 2: Scaling Data Size

  • Data size: the number of samples in your dataset
  • There are cases where datasets grow larger than RAM
  • Under this scaling challenge, even loading the data into numpy or pandas becomes impossible
  • You’ve become memory bound
  • In this case, you can use an on-disk format such as Parquet and chunked structures such as Dask DataFrames, together with algorithms that can work on the data one chunk at a time (see the sketch below)
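  • A minimal sketch of this memory-bound workflow (the file path and column names are hypothetical):
import dask.dataframe as dd

# Lazily read a Parquet dataset that may be larger than RAM (hypothetical path)
df = dd.read_parquet('data/transactions.parquet')

# Operations build a task graph; only .compute() materialises the (small) result
mean_amount = df.groupby('user_id')['amount'].mean().compute()
print(mean_amount.head())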

What is Dask ML?

  • Dask ML is a scalable machine learning library built on top of Dask
  • It provides parallel implementations of many popular machine learning algorithms and integrates with libraries such as scikit-learn, XGBoost, LightGBM, TensorFlow, and PyTorch
  • Dask ML is built on top of Dask Arrays and DataFrames, allowing you to scale your machine learning workflows to large datasets
  • You can use Dask ML to do many tasks, such as model selection, model evaluation, and, most importantly, hyperparameter tuning
  • And you can also use it together with automated machine learning (AutoML) tools to speed up the process
  • The main advantage of using Dask ML + AutoML together is that you can quickly move from data preprocessing to model evaluation without having to worry about the details of the machine learning algorithms
  • In a sense, this is a tool to democratise machine learning
  • First, let’s install Dask ML:
!pip install dask-ml
  • And then import the necessary modules, mainly scikit-learn:
import time
import numpy as np
from dask.distributed import LocalCluster, Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
  • The load_digits dataset is a well-known dataset in machine learning, containing 1797 8x8 pixel images of handwritten digits
  • The main task here is to predict the digit from the image

Dask ML and scikit-learn

  • Let’s estimate the model with a simple randomised hyperparameter search
  • param_space: A list of settings to try out for the model
  • C: Controls how much to punish mistakes
  • gamma: Defines how far the influence of a single example reaches
  • tol: Tells the model when to stop trying to improve
  • class_weight: Options for handling imbalanced data
  • This is pretty standard scikit-learn code, but with a twist: we are using joblib.parallel_backend('dask') to parallelise the search
  • This will distribute the search across the workers in the Dask cluster (and we don’t have to worry about it!)
  • The RandomizedSearchCV object will try out 50 different combinations of hyperparameters and return the best one
# Load the digits dataset
digits = load_digits()   

# Define the parameter space to search through
param_space = {
    'C': np.logspace(-6, 6, 13),      
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'], 
}

# Create the model
model = SVC()

search = RandomizedSearchCV(
    model,
    param_space,
    cv=3,
    n_iter=50,
    verbose=10
)

# Perform the search using Dask
start_time = time.time()
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time

# Print the best parameters
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print("Time taken: {:.2f} seconds".format(elapsed_time))
Best parameters found:  {'tol': 0.0001, 'gamma': 0.0001, 'class_weight': None, 'C': 10.0}
Best score:  0.9565943238731217
Best estimator:  SVC(C=10.0, gamma=0.0001, tol=0.0001)
Time taken: 2.68 seconds

Now let’s tackle the problems we discussed earlier…

Neither compute nor memory constrained

  • The model was trained in 2.68 seconds, which is pretty fast
  • The dataset is small, so we are not memory constrained
  • The model is not complex, so we are not compute constrained
  • So in this case we only used dask to parallelise the search, but we could have used scikit-learn alone
  • But what if we had a larger dataset or a more complex model?

Memory constrained, but not compute constrained

  • Here we have a case where the dataset is too large to fit in memory, but there is enough compute power to train the model
  • It makes sense to use Parquet and Dask DataFrames to load the data in chunks, but this may not be enough
  • A cool solution is dask_ml’s incremental search, which trains models on chunks of data
  • It starts training the model with many hyperparameter combinations on a small amount of data, and then only continues training those models that seem to be performing well
  • The command is dask_ml.model_selection.IncrementalSearchCV() (a sketch follows this list)
  • There are some new variations of this method that are worth checking out. Here is the documentation
  • This strategy is based on scikit-learn’s partial_fit method. More information here
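  • A minimal sketch of IncrementalSearchCV on a small synthetic dataset (the sizes and hyperparameter ranges are illustrative):
import numpy as np
from dask_ml.datasets import make_classification
from dask_ml.model_selection import IncrementalSearchCV
from sklearn.linear_model import SGDClassifier

# A chunked dataset; in practice it would be too large to fit in RAM
X, y = make_classification(n_samples=100_000, n_features=20, chunks=10_000)

model = SGDClassifier(tol=1e-3, penalty='elasticnet')
params = {'alpha': np.logspace(-2, 1, num=100),
          'l1_ratio': np.linspace(0, 1, num=100)}

# Trains many candidates on a little data via partial_fit,
# then keeps training only the most promising ones
search = IncrementalSearchCV(model, params, random_state=0)
search.fit(X, y, classes=[0, 1])
print(search.best_params_, search.best_score_)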

Let’s continue to tackle the problems we discussed earlier

Compute constrained, but not memory constrained

  • Here we have a case where the model is too complex to train in a reasonable amount of time
  • Or the models require specialised hardware like GPU
  • The best class for this case is HyperbandSearchCV, a hyperparameter search based on the Hyperband algorithm
  • In a nutshell, this algorithm is easy to use, has strong mathematical motivation and performs remarkably well
from dask_ml.model_selection import HyperbandSearchCV
from dask_ml.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(chunks=20)
est = SGDClassifier(tol=1e-3)
param_dist = {'alpha': np.logspace(-4, 0, num=1000),
              'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
              'average': [True, False]}

start_time = time.time()
search = HyperbandSearchCV(est, param_dist)
search.fit(X, y, classes=np.unique(y))
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print(f"Time taken: {elapsed_time:.2f} seconds")
Best parameters found:  {'loss': 'modified_huber', 'average': False, 'alpha': 0.07427982482564918}
Best score:  0.85
Best estimator:  SGDClassifier(alpha=0.07427982482564918, loss='modified_huber')
Time taken: 1.62 seconds

Compute and memory constrained

  • This is the worst-case scenario, where you have a large dataset and a complex model
  • In this case, you can use a combination of the strategies we discussed earlier
    • Use Dask DataFrames to load the data in chunks
    • Use the HyperbandSearchCV class to search for the best hyperparameters (a combined sketch follows this list)
  • Apart from this, you can always use cloud computing services like AWS, GCP, or Azure…
  • … or pray to the machine learning gods 😂
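  • For illustration, a rough sketch that combines the two ideas (the file path and column names are hypothetical):
import numpy as np
import dask.dataframe as dd
from dask_ml.model_selection import HyperbandSearchCV
from sklearn.linear_model import SGDClassifier

# Load a larger-than-memory dataset lazily and convert it to chunked Dask arrays
df = dd.read_parquet('data/large_dataset.parquet')
X = df.drop(columns='label').to_dask_array(lengths=True)
y = df['label'].to_dask_array(lengths=True)

# Hyperband searches the hyperparameters while training on chunks via partial_fit
est = SGDClassifier(tol=1e-3)
params = {'alpha': np.logspace(-4, 0, num=100)}
search = HyperbandSearchCV(est, params)
search.fit(X, y, classes=[0, 1])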

Dask with AutoML 🤖

What is AutoML?

  • In the last part of this lecture, we will discuss automated machine learning (AutoML)
  • AutoML automates the process of applying machine learning to real-world problems
  • The main goal, which I personally support, is to move quickly from data preprocessing to model evaluation without having to worry about the details of the machine learning algorithms
  • In my experience, the faster we can get the model running, the more time we have to think about the problem and the data
  • AutoML tools can be used to automate the process of hyperparameter tuning, model selection, feature engineering, and model evaluation
  • There are several tools available for AutoML, such as Auto-sklearn, H2O.ai, and TPOT
  • Here we will use TPOT, which is a Python library that automatically creates and optimises machine learning pipelines using genetic programming
  • TPOT is built on top of scikit-learn and uses a similar syntax
  • There are several articles about how to optimise AutoML algorithms and quickly find the best model for your data, so I won’t go into too much detail here (but you can check out this article for more information)
  • First, let’s install TPOT:
!pip install tpot
  • I’m using Python 3.9.6 for this example, so your mileage may vary

Using TPOT

  • Let’s see an example using the load_digits dataset
import time
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Create the TPOTClassifier object
start_time = time.time()
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)

# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time

# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")
Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=3, p=2, weights=distance)
0.9822222222222222
Time taken: 57.38 seconds

Using TPOT with Dask

  • Now let’s see an example using Dask
  • It is really easy to use Dask with TPOT
  • I mean it! Just add use_dask=True to the TPOTClassifier object and you are good to go 😊
import time
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset
digits = load_digits()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Create the TPOTClassifier object
start_time = time.time()
tpot = TPOTClassifier(generations=5, population_size=20, 
                      verbosity=2, random_state=42, use_dask=True)

# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time

# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")
Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=3, p=2, weights=distance)
0.9822222222222222
Time taken: 40.40 seconds

Time series forecasting with Prophet

  • Let’s see another example using the Prophet library
  • Prophet is a forecasting tool that is open source and maintained by Facebook
  • It is designed for analysing time series data that display patterns on different time scales
  • It is particularly good for data that has multiple seasonality with changing trends
  • The library is built on top of pystan, which is a Python interface to Stan, a probabilistic programming language
  • First, let’s install the library:
!pip install prophet
  • Large datasets are not the only type of scaling challenge teams run into
  • In this example we will focus on a different type of scaling challenge: model complexity
  • In the words of Sean Taylor and Ben Letham, the authors of the Prophet library:
    • In most realistic settings, a large number of forecasts will be created, necessitating efficient, automated means of evaluating and comparing them, as well as detecting when they are likely to be performing poorly. When hundreds or even thousands of forecasts are made, it becomes important to let machines do the hard work of model evaluation and comparison while efficiently using human feedback to fix performance problems.

Using Prophet

  • We will use Prophet and Dask together to parallelise the diagnostics stage of the research
  • This does not attempt to parallelise the training of the model itself (which is actually quite fast to begin with)
import pandas as pd
from prophet import Prophet
df = pd.read_csv(
    'https://raw.githubusercontent.com/facebook/prophet/master/examples/example_wp_log_peyton_manning.csv',
    parse_dates=['ds']
)
df.head()
           ds         y
0  2007-12-10  9.590761
1  2007-12-11  8.519590
2  2007-12-12  8.183677
3  2007-12-13  8.072467
4  2007-12-14  7.893572

Using Prophet

  • Let’s plot the data and fit the model
  • No Dask here, just Prophet
df.plot(x='ds', y='y')

m = Prophet(daily_seasonality=False)
m.fit(df)

  • And we can make a forecast. Again, no Dask here
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
m.plot(forecast)

Using Prophet with Dask

  • Now let’s use Dask to parallelise the diagnostics stage of the research
  • Prophet includes a prophet.diagnostics.cross_validation function, which uses simulated historical forecasts to provide some idea of a model’s quality
  • This is done by selecting cutoff points in the history, and for each of them fitting the model using data only up to that cutoff point
  • We can then compare the forecasted values to the actual values

Using Prophet with Dask

  • Let’s then use Dask to parallelise the diagnostics stage of the research
import prophet.diagnostics
df_cv = prophet.diagnostics.cross_validation(
    m, initial="730 days", period="180 days", horizon="365 days",
    parallel="dask"
)
df_cv.head()

Conclusion 🎉

Summary

  • Today we discussed how to parallelise data analysis tasks with Dask and AutoML
  • We learned about Dask Clusters and how to set up a Dask Client
  • We discussed the types of problems data scientists face when scaling their models and datasets
    • No constraints
    • Compute constrained
    • Memory constrained
    • Compute and memory constrained
  • We also saw how AutoML tools like TPOT can be used to speed up the process of model selection and hyperparameter tuning
  • And we discussed how to use Prophet and Dask together to parallelise the diagnostics stage of research
  • I hope you enjoyed the lecture and learned something new today! 😊

Next class

  • Next class we will discuss how to deal with environments and, mainly, how to use containers to manage your projects
  • We will discuss the basics of Docker and how to use it to create and manage containers
  • We will also discuss how to use Docker to create reproducible environments for your projects
  • I hope to see you there! 🚀

Thank you! 🙏

See you next time! 🚀