Lecture 22 - Parallelising Data Analysis with Dask and AutoML
The joblib library is a simple and effective way to parallelise your code (a short sketch is shown below)
A Dask Cluster is a collection of Dask workers that can be used to parallelise your computations
In plain English, a Dask Cluster is a group of computational engines (cores, GPUs, servers, etc.) that work together to solve a problem
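As a quick illustration of the joblib pattern, here is a minimal sketch (not from the lecture; the square function and worker count are just illustrative assumptions):
from joblib import Parallel, delayed

def square(x):
    return x ** 2

# Parallel runs the delayed calls across four worker processes
results = Parallel(n_jobs=4)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]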
Workers provide two functions: they compute tasks as directed by the scheduler, and they store and serve computed results to other workers or clients
Workers are the reason why lazy evaluation speeds up computations
A simple example of workers interacting with a scheduler can help explain how lazy evaluation works:
Scheduler -> Eve: Compute a <- multiply(3, 5)!
Eve -> Scheduler: I've computed a and am holding on to it!
Scheduler -> Frank: Compute b <- add(a, 7)!
Frank: You will need a. Eve has a.
Frank -> Eve: Please send me a.
Eve -> Frank: Sure. a is 15!
Frank -> Scheduler: I've computed b and am holding on to it!
dask.distributed extends Python's concurrent.futures interface
Once you create a Client, the distributed scheduler becomes the default for Dask, and it works great 👍
Using dask.distributed requires that you set up a Client before you start your analysis
The Client object provides a way to interact with the cluster, submit tasks, and monitor the progress of computations
Use a with statement to ensure that the cluster is closed when you are done with it (a minimal sketch appears just before the first code example below)
The RandomState object is used to set a seed number (so that results are reproducible)
The load_digits dataset is a well-known dataset in machine learning, containing 1797 8x8 pixel images of handwritten digits
param_space: a list of settings to try out for the model
C: controls how much to punish mistakes (regularisation, smaller values = more regularisation) to prevent overfitting
np.logspace(-6, 6, 13) will create a list of 13 values between \(10^{-6}\) and \(10^6\) (!)
gamma: defines how far the influence of a single example reaches (smaller values = model is less sensitive to the data)
np.logspace(-8, 8, 17) will create a list of 17 values between \(10^{-8}\) and \(10^8\) (!!)
tol: tells the model when to stop trying to improve
class_weight: options for handling imbalanced data
Use joblib.parallel_backend('dask') to parallelise the search
The RandomizedSearchCV object will try out 50 different combinations of hyperparameters and return the best one
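Here is a minimal sketch of setting up a Client inside a with statement; the LocalCluster, the worker count, and the multiply/add tasks are illustrative assumptions that mirror the Eve/Frank dialogue above, not part of the lecture code:
from operator import add, mul
from dask.distributed import Client, LocalCluster

# The with statement guarantees the cluster and client are closed afterwards
with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    # submit() returns a future immediately; the scheduler decides which worker runs it
    a = client.submit(mul, 3, 5)   # a worker computes a = 15 and holds on to it
    b = client.submit(add, a, 7)   # another worker fetches a and computes b = 22
    print(b.result())              # only now is the result pulled back: 22
With a client running, the hyperparameter search below can use the Dask joblib backend.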
import time
import joblib
import numpy as np
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# Start a local Dask cluster so the 'dask' joblib backend has workers to use
# (assumed setup; connect to your own cluster here if you have one)
client = Client()

# Load the digits dataset
digits = load_digits()
# Define the parameter space to search through
param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}
# Create the model
model = SVC()
search = RandomizedSearchCV(
    model,
    param_space,
    cv=3,
    n_iter=50,
    verbose=10
)
# Perform the search using Dask
start_time = time.time()
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
# Print the best parameters, score, estimator, and the time taken
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print("Time taken: {:.2f} seconds".format(elapsed_time))dask_ml Incremental class, which can train models on chunks of datadask_ml.model_selection.IncrementalSearchCV()partial_fit method. More information hereX and y will take up about 16 GB of memorymake_classification function from dask_ml.datasets to create the datasetimport time
import time
import numpy as np
from dask.distributed import Client
from dask_ml.datasets import make_classification

# IncrementalSearchCV needs the distributed scheduler, so start a local client
# (assumed setup; connect to your own cluster here if you have one)
client = Client()

# Create a large classification dataset as a chunked Dask array
X, y = make_classification(n_samples=100000000, n_features=20,
                           chunks=100000, random_state=0)
# Create the model
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(tol=1e-3, penalty='elasticnet', random_state=0)
# Parameters we want to search through
params = {'alpha': np.logspace(-2, 1, num=1000),
          'l1_ratio': np.linspace(0, 1, num=1000),
          'average': [True, False]}
# Perform the search
from dask_ml.model_selection import IncrementalSearchCV
search = IncrementalSearchCV(model, params, random_state=0)
start_time = time.time()
search.fit(X, y, classes=[0, 1])
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
# Print the best parameters, best score, and the time taken
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print(f"Time taken: {elapsed_time:.2f} seconds")HyperbandSearchCV, which is a hyperparameter search algorithm that is based on the Hyperband algorithmfrom dask_ml.model_selection import HyperbandSearchCV
from dask_ml.datasets import make_classification
from sklearn.linear_model import SGDClassifier
# HyperbandSearchCV also needs the distributed scheduler; start a local client
# (assumed setup; connect to your own cluster here if you have one)
client = Client()

# Create a small example dataset as a chunked Dask array
X, y = make_classification(chunks=20)
est = SGDClassifier(tol=1e-3)
param_dist = {'alpha': np.logspace(-4, 0, num=1000),
              'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
              'average': [True, False]}
start_time = time.time()
search = HyperbandSearchCV(est, param_dist)
search.fit(X, y, classes=np.unique(y))
end_time = time.time()
# Calculate the elapsed time
elapsed_time = end_time - start_time
print("Best parameters found: ", search.best_params_)
print("Best score: ", search.best_score_)
print("Best estimator: ", search.best_estimator_)
print(f"Time taken: {elapsed_time:.2f} seconds")HyperbandSearchCV class to search for the best hyperparametersload_digits datasetimport time
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Load the digits dataset
digits = load_digits()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
# Create the TPOTClassifier object
start_time = time.time()
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")use_dask=True to the TPOTClassifier object and you are good to go 😊import time
from tpot import TPOTClassifier
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Load the digits dataset
digits = load_digits()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
# use_dask=True needs a running Dask client/cluster; start a local one
# (assumed setup; connect to your own cluster here if you have one)
client = Client()

# Create the TPOTClassifier object
start_time = time.time()
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42, use_dask=True)
# Fit the model
tpot.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
# Print the score
print(tpot.score(X_test, y_test))
print(f"Time taken: {elapsed_time:.2f} seconds")Prophet libraryProphet is a forecasting tool that is open source and maintained by FacebookPyStan, which is a Python interface to Stan, a probabilistic programming languageProphet library:
To evaluate forecasts, the Prophet library provides the prophet.diagnostics.cross_validation function, which uses simulated historical forecasts to provide some idea of a model's quality
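A minimal sketch of how this fits together (the input file name and the initial/period/horizon values are assumptions, not from the lecture):
import pandas as pd
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# Prophet expects a dataframe with a 'ds' (date) column and a 'y' (value) column
df = pd.read_csv('example_timeseries.csv')  # hypothetical input file

m = Prophet()
m.fit(df)

# Simulated historical forecasts: train on an initial window, then repeatedly
# forecast a fixed horizon from a series of cutoff dates
df_cv = cross_validation(m, initial='730 days', period='180 days', horizon='365 days')
print(performance_metrics(df_cv).head())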