Lecture 22 - Parallelising Data Analysis with Dask and AutoML
18 October, 2024
The joblib library is a simple and effective way to parallelise your code.
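As a minimal sketch of joblib in action (the function and inputs are purely illustrative):

import math
from joblib import Parallel, delayed

# Evaluate math.sqrt on each input, spread across 4 worker processes
results = Parallel(n_jobs=4)(delayed(math.sqrt)(i) for i in range(10))
print(results)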
dask is currently the de facto standard for parallel computing in Python. A Dask Cluster is a collection of Dask workers that can be used to parallelise your computations.
In plain English, a Dask Cluster is a group of computational engines (cores, GPUs, servers, etc.) that work together to solve a problem.
Workers provide two functions: they compute tasks as directed by the scheduler, and they store and serve the computed results to other workers or clients.
Workers are the reason why lazy evaluation speeds up computations: once the task graph has been built, the scheduler can run its independent tasks on many workers in parallel.
A simple example of workers interacting with a scheduler can help explain how lazy evaluation works:
Scheduler -> Eve: Compute a <- multiply(3, 5)!
Eve -> Scheduler: I've computed a and am holding on to it!
Scheduler -> Frank: Compute b <- add(a, 7)!
Frank: You will need a. Eve has a.
Frank -> Eve: Please send me a.
Eve -> Frank: Sure. a is 15!
Frank -> Scheduler: I've computed b and am holding on to it!
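A rough sketch of the same exchange in code, assuming a local dask.distributed cluster (the worker count is an illustrative assumption; the numbers mirror the dialogue above):

from operator import add, mul
from dask.distributed import Client

client = Client(n_workers=2)   # a local cluster: one scheduler, two workers

a = client.submit(mul, 3, 5)   # the scheduler hands multiply(3, 5) to one worker
b = client.submit(add, a, 7)   # the worker computing b fetches a from its peer
print(b.result())              # 22; results stay on the workers until requested

client.close()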
The dask.distributed interface mirrors the standard concurrent.futures library. Using dask.distributed requires that you set up a Client before you use dask.distributed in your analysis. The Client object provides a way to interact with the cluster, submit tasks, and monitor the progress of computations. It is good practice to use a with statement to ensure that the cluster is closed when you are done with it.
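A minimal sketch of that pattern (the cluster size is an illustrative assumption):

from dask.distributed import Client

# The local cluster and its workers are shut down automatically when the block exits
with Client(n_workers=4, threads_per_worker=1) as client:
    future = client.submit(sum, range(100))
    print(future.result())   # 4950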
import dask.array as da
# Create a random array
x = da.random.RandomState(42).random((10000, 10000), chunks=(1000, 1000))
x
np.float64(99987830.48502485)
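The array above is lazy: building x only records a task graph, and nothing runs until a concrete result is requested. A reduction returns a plain NumPy value once .compute() is called; a minimal sketch (the expression here is illustrative, not necessarily the one that produced the value shown above):

# Only now is the task graph executed, chunk by chunk, on the workers
total = x.sum().compute()
print(total)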
The RandomState object is used to set a seed number.
The load_digits dataset is a well-known dataset in machine learning, containing 1797 8x8 pixel images of handwritten digits.
param_space: A list of settings to try out for the model
- C: Controls how much to punish mistakes
- gamma: Defines how far the influence of a single example reaches
- tol: Tells the model when to stop trying to improve
- class_weight: Options for handling imbalanced data
We use joblib.parallel_backend('dask') to parallelise the search.
The RandomizedSearchCV object will try out 50 different combinations of hyperparameters and return the best one.

import time
import joblib
import numpy as np
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# The 'dask' joblib backend needs a running dask.distributed Client
client = Client()

# Load the digits dataset
digits = load_digits()
# Define the parameter space to search through
param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}
# Create the model
model = SVC()
search = RandomizedSearchCV(
    model,
    param_space,
    cv=3,
    n_iter=50,
    verbose=10
)
# Perform the search using Dask
start_time = time.time()
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
# Print the best parameters
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print("Time taken: {:.2f} seconds".format(elapsed_time))
dask_ml provides the Incremental class, which can train models on chunks of data (a short sketch of it appears after the example below). dask_ml.model_selection.IncrementalSearchCV() searches for hyperparameters on models that implement the partial_fit method. More information here.
X and y will take up about 16 GB of memory. We use the make_classification function from dask_ml.datasets to create the dataset.

import time
import numpy as np
from dask_ml.datasets import make_classification
X, y = make_classification(n_samples=100000000, n_features=20,
                           chunks=100000, random_state=0)
# Create the model
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(tol=1e-3, penalty='elasticnet', random_state=0)
# Parameters we want to search through
params = {'alpha': np.logspace(-2, 1, num=1000),
          'l1_ratio': np.linspace(0, 1, num=1000),
          'average': [True, False]}
# Perform the search (IncrementalSearchCV needs the distributed scheduler,
# i.e. a running dask.distributed Client)
from dask_ml.model_selection import IncrementalSearchCV
search = IncrementalSearchCV(model, params, random_state=0)
start_time = time.time()
search.fit(X, y, classes=[0, 1])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
# Print the best parameters, best score, and the time taken
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print(f"Time taken: {elapsed_time:.2f} seconds")
dask_ml also provides HyperbandSearchCV, a hyperparameter search based on the Hyperband algorithm.

import time
import numpy as np
from dask_ml.model_selection import HyperbandSearchCV
from dask_ml.datasets import make_classification
from sklearn.linear_model import SGDClassifier
X, y = make_classification(chunks=20)
est = SGDClassifier(tol=1e-3)
param_dist = {'alpha': np.logspace(-4, 0, num=1000),
              'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
              'average': [True, False]}
start_time = time.time()
# Like IncrementalSearchCV, HyperbandSearchCV expects a running dask.distributed Client
search = HyperbandSearchCV(est, param_dist)
search.fit(X, y, classes=np.unique(y))
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print(f"Time taken: {elapsed_time:.2f} seconds")
We use the HyperbandSearchCV class to search for the best hyperparameters.
We now turn to automated machine learning with TPOT, using the load_digits dataset.

import time
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Load the digits dataset
digits = load_digits()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
# Create the TPOTClassifier object
start_time = time.time()
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")
To run TPOT on a Dask cluster, just pass use_dask=True to the TPOTClassifier object and you are good to go 😊

import time
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Load the digits dataset
digits = load_digits()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
# Create the TPOTClassifier object
start_time = time.time()
# use_dask=True lets TPOT evaluate its candidate pipelines on the Dask cluster
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42, use_dask=True)
# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")
The Prophet library
Prophet is a forecasting tool that is open source and maintained by Facebook. It is built on pystan, which is a Python interface to Stan, a probabilistic programming language.
Some features of the Prophet library:
- the prophet.diagnostics.cross_validation function, which uses simulated historical forecasts to provide some idea of a model's quality
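A minimal sketch of fitting a Prophet model and running cross_validation, assuming a synthetic daily series (the data, window sizes, and seasonality are purely illustrative):

import numpy as np
import pandas as pd
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# Prophet expects a dataframe with columns 'ds' (dates) and 'y' (values)
dates = pd.date_range("2020-01-01", periods=730, freq="D")
values = 10 + np.sin(2 * np.pi * dates.dayofweek / 7) \
         + np.random.RandomState(0).normal(0, 0.1, len(dates))
df = pd.DataFrame({"ds": dates, "y": values})

m = Prophet()
m.fit(df)

# Simulated historical forecasts: train on an initial window, forecast a horizon,
# slide forward by 'period', and repeat.
# cross_validation also accepts parallel='processes' or parallel='dask'
# (the latter requires a running dask.distributed Client).
df_cv = cross_validation(m, initial="365 days", period="90 days", horizon="30 days")
print(performance_metrics(df_cv).head())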