In addition to the default features (array, dataframe, bag etc.) offered by Dask, there are additional domain-specific modules. We are looking at two of them here: machine learning and image processing.
Just like dask-array is a port of Numpy to Dask, dask_ml is a port from sckit-learn to Dask. Scikit-learn is currently probably the most popular machine-learning package in Python. Dask offers a subset of function available in scikit-learn using the same syntax. Let's see an example. The calculation is only for the purpose of illustration and is not realistic.
We look again at the taxi dataset:
from dask.distributed import Client
client = Client("tcp://127.0.0.1:49550")
client
import dask.dataframe as dd
taxi = dd.read_csv('../Data/Chicago_taxi/chicago_taxi_trips_2016_01.csv', dtype={'taxi_id': 'float64'})
taxi
We are working here on a trivial question and checking the relation between the fare and the trip time (in seconds). We only keep those two variables:
taxi = taxi[['trip_seconds', 'fare']]
taxi = taxi.dropna()
Then we can easily split out dataset into train and test sets:
train, test = taxi.random_split([0.8, 0.2])
We pick a linear regression model from dask_ml:
from dask_ml.linear_model import LinearRegression
We create a linear fit object as we would do it in scikit-learn:
linfit = LinearRegression(fit_intercept=False)
And we call the fit()
method like for any other scikit-ml method:
linfit_model = linfit.fit(train[['trip_seconds']].values,train[['fare']].values)
We can then predict values for our test set:
pred = linfit_model.predict(test[['trip_seconds']].values)
And we see that the model takes in dask object and also generates a dask object. Hence we can do prediction for large datasets! Finally we can look at our fit:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
plt.plot(test.trip_seconds.compute().values, test.fare.compute().values,'o',alpha = 0.1)
plt.plot(np.arange(0,20000), linfit_model.coef_[0]*np.arange(0,20000))
ax.set_ylim([0,200])
We have seen before that we could wrap an image importer and other image processing functions into delay()
calls. However Dask offers a built-in set of functions to deal with images. We are going to illistrate this through an example.
We load a series of images:
from dask_image import imread, ndfilters
import dask
We have a series of images in a folder. We want to analyze all of them and create a large dask array:
images = imread.imread('/Users/gw18g940/OneDrive - Universitaet Bern/Courses/DaskCourse/Butterflies/CAM01798*.JPG')
images
Then we only keep a single channel and downscale the image by slicing the array:
im_downscale = images[:,::4,::4,0]
im_downscale
Then we filter each image using a guassian filter implemented in dask-image:
im_filtered = ndfilters.gaussian_filter(im_downscale, sigma=(0,2,2))
We recover both the original and filtere image for comparison. Note that this is not something that one would typically do as it loads all data into RAM:
result = dask.compute(images, im_filtered)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2,1, figsize=(10,10))
ax[0].imshow(result[0][0,:,:],cmap = 'gray')
ax[1].imshow(result[1][0,:,:],cmap = 'gray')