QTM 350 - Data Science Computing

Lecture 24 - Docker for Data Science

Danilo Freire

Emory University

25 November, 2024

Hello, everyone! 😊

Brief recap of last class 📚

Reproducible workflows, virtual environments, and containers

  • Reproducibility is the ability to recreate the results of a computational analysis
  • There is currently a reproducibility crisis in science
  • Dependency management is the process of specifying and installing the software dependencies of a project
  • Virtual environments are isolated Python environments that allow you to install packages without affecting the system Python installation
    • conda, pipenv, virtualenv
  • Containers are lightweight, standalone, executable packages of software that include everything needed to run an application
    • Docker is the most widely used container platform

Today’s lecture 📋

Docker for Data Science

  • Last time, we saw how to create a simple container with Docker
  • We used a short Dockerfile to build the container, downloaded a Python image, and installed a few Python packages in it
  • We then ran the container and executed a Python script, and finally pushed the image to Docker Hub
  • Today, we will see other Docker features that are useful for data science and build more complex containers
  • We will also discuss how to use Docker in a production environment, and how to deploy containers to the cloud

Docker for Data Science 🐳

A container for all tools we covered in this course

  • Today we will have a hands-on session where we will build a container with all the tools we have covered in this course (!) 🤓
  • That includes a bash shell, a git client, a Python interpreter, a Jupyter notebook server, a Quarto document processor, and a SQL database
  • We will build the container using a Dockerfile, and then run it locally
  • The container will be based on the official Ubuntu image, and we will install all the necessary packages in it
  • We will also see how to mount volumes in the container to persist data
  • As we saw in the last lecture, there are many Docker images for data science, such as Jupyter, RStudio, and PostgreSQL
  • There are other ways to build and run containers, such as Docker Compose, which defines services in a single configuration file (see the sketch after this list)
  • But to keep things simple, we will only use a single container today and run it with the docker run command
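  • As a quick illustration, a hypothetical docker-compose.yml for the image we build today could look like this:
# docker-compose.yml (hypothetical sketch)
services:
  qtm350:
    build: .            # build from the Dockerfile in this directory
    ports:
      - "8888:8888"     # publish the JupyterLab port
    volumes:
      - .:/workspace    # persist notebooks on the host
  • Running docker compose up would then build and start the container in one step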

Let’s get started! 🚀

Docker Desktop and Ubuntu

  • First, make sure you have Docker Desktop installed on your computer. It has everything you need to build, run, and share containers
  • If you are using Windows, you will need to enable WSL 2 to run Linux containers
  • We will use Ubuntu 24.04 as the base image for our container
  • Ubuntu is one of the most popular Linux distributions and is widely used in data science, machine learning, and AI
  • For instance, if you are using Google Colab or Amazon AWS, you are probably running Ubuntu
  • The good thing about Ubuntu is that almost all commands we have covered in this course work the same way in Ubuntu, as it is also a Unix-like system like macOS (it uses bash as the default shell)
  • But you won’t even feel you are interacting with Ubuntu, as you will be inside a container
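  • To check that everything is working, you can run these standard Docker CLI commands in a terminal:
# Verify the Docker installation and that the daemon is running
docker --version
docker info

# Optionally, pull the base image ahead of time
docker pull ubuntu:24.04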

Anatomy of a Dockerfile

FROM image:tag

  • As we saw in the last lecture, a Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image
  • The Dockerfile is a recipe for building a container, and the first line of the Dockerfile specifies the base image
  • The base image is the starting point for the container, and it can be any image available on Docker Hub
  • We will start our Dockerfile with FROM, followed by the base image we want to use
# Use the official Ubuntu image as the base image
FROM ubuntu:24.04
  • The : character is used to specify the version of the image. In this case, we are using Ubuntu 24.04
  • You can see all tags available for the Ubuntu image on Docker Hub: https://hub.docker.com/_/ubuntu/tags
  • Okay, that’s the easy part. Now let’s install all the tools we need in the container! 🛠️

Anatomy of a Dockerfile

RUN command

  • The RUN instruction executes any commands in a new layer on top of the current image and commits the results
  • Yes, commits in Docker are like commits in Git, but instead of saving changes to a file system, they save changes to an image
  • Ubuntu has a package manager called apt, which we can use to install software packages. The only commands we will need to use are apt-get update (only once) and apt-get install <package>
    • For example, to install git, we would run apt-get update && apt-get install -y git
  • To keep the image smaller, we will also run apt-get clean and rm -rf /var/lib/apt/lists/* after installing the packages
    • This will remove the package cache and clean up the image
# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
  • So far, so good! 😃
  • You can see that we are installing specific versions of the packages, which is a good practice to ensure reproducibility
  • On Ubuntu, you can find the available versions of a package at https://packages.ubuntu.com/
  • Make sure to match the version of the operating system with the version of the package
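  • If you prefer the command line, one alternative is to query apt inside a throwaway Ubuntu container (a quick sketch using standard apt commands):
# List the candidate versions of git available on ubuntu:24.04
docker run --rm ubuntu:24.04 bash -c \
    "apt-get update -qq && apt-cache madison git"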

Result of the RUN instruction

Hey Danilo, wait a minute…🤔

Isn’t that a lot of work to install all these packages?

  • Yes, it is!
  • But I am purposely making it harder than it should be 😅
  • What makes it harder for us is that we are starting from a really bare-bones image (Ubuntu) and installing everything from scratch
  • And I don’t have Ubuntu installed on my computer, so I can’t just run apt list --installed to see which packages are installed on my system and just copy them to the Dockerfile
  • In this case, it did involve some research to find the right package names and versions
  • But once you have the Dockerfile, you can reuse it as many times as you want
  • You could also have saved some time by using a Python image as the base image
  • The Python image already comes with Python and pip installed, so you would only need to install the other packages
  • Thus, my suggestion here is, if you are building a container for a specific purpose, start with an image that is closer to what you need
  • For instance, you can download a Jupyter image with a full data science stack with one line of code, as shown in the example below
  • But for educational purposes, I think it is good to start from scratch and understand how things work under the hood. Sorry! 😅
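  • For example, this single command pulls the Jupyter project's full data science stack from Docker Hub:
# Pull a ready-made data science image maintained by the Jupyter project
docker pull jupyter/datascience-notebook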

Back to the Dockerfile

Installing additional packages

  • After installing the system packages, we will install the additional packages we need
  • We will install, for example, a few Python libraries with pip3, such as numpy, pandas, jupyterlab, dask, and matplotlib
  • We will use the RUN instructions again
  • You can either see the versions you already have with pip show <package> | grep Version or pip freeze > requirements.txt and then copy the versions from the file
  • Recent Ubuntu releases mark the system Python as externally managed (PEP 668), so we need to create a virtual environment to install Python packages with pip
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2
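  • As an aside, a common alternative (a sketch, assuming a requirements.txt file sits next to the Dockerfile) is to pin the versions in a file and install from it:
# Alternative: copy a pinned requirements file and install from it
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt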

Installing Quarto with wget

  • We will also install Quarto, but this time we will use wget to download the binary
  • Why? Because Quarto is not available on pip or apt, so we need to download it from the official website: https://quarto.org/docs/get-started/
  • wget is a command-line utility that allows you to download files from the web
  • wget ships with most Linux distributions, including Ubuntu, and can also be installed on macOS (e.g., via Homebrew) and even on Windows
  • Once we download the .deb file (which is the package format for Ubuntu), we can install it with apt-get install <package> (like we did with the other packages)
  • We can download any file from the web with wget, as long as we have the URL
# Download and install Quarto (arm64 build; use the amd64 .deb on Intel/AMD machines)
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb
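  • Once the image is built (we will name it qtm350-container below), you can sanity-check the installation with Quarto's built-in diagnostics:
# Verify the Quarto installation inside the image
docker run --rm qtm350-container quarto check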

We are almost there! 🏁

Running the Jupyter notebook server

  • Finally, we will run the Jupyter notebook server in the container
  • We will have access to the Jupyter notebook server on port 8888, so we will need to expose this port with the EXPOSE instruction
  • What is cool is that the new JupyterLab version comes with an embedded terminal, so we can run bash inside the JupyterLab interface and have access to all the tools we installed in the container (like git, sqlite3, and Quarto) 😉
  • Let’s also create a working directory for the Jupyter notebook server, so we can later mount a host folder there and persist the notebooks outside the container
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose the Jupyter notebook server port
EXPOSE 8888

# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]

The final Dockerfile

Ready to run the container! 🐳

# Base image
FROM ubuntu:24.04

# Update and install system dependencies
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    curl=8.5.0-2ubuntu10.5 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2

# Install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb

# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose port for JupyterLab 
EXPOSE 8888

# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]

Building the container

  • Now that we have the Dockerfile, we can build the container with the docker build command
docker build -t qtm350-container .
  • The -t flag is used to tag the image with a name, in this case qtm350-container
  • The . at the end of the command specifies the build context, which is the current directory
  • Now we just need to wait for the image to be built and then run it with the docker run command
  • Let’s see how it goes! 🤞
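  • Once the build finishes, you can confirm the image is available locally:
# List the freshly built image
docker images qtm350-container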

Running the container

  • To run the container, we will use the docker run command
  • We will use the -p flag to map port 8888 of the container to port 8888 of the host machine
  • We will also use the -v flag to mount a volume in the container, so we can persist the notebooks outside it
docker run -p 8888:8888 -v $(pwd):/workspace qtm350-container
  • The -v flag is used to mount the current directory ($(pwd)) to the /workspace directory in the container
  • This way, we can save the notebooks outside the container and access them even after the container is stopped
  • Now we just need to open a web browser and go to http://localhost:8888 to access the JupyterLab interface 🚀
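  • Note that JupyterLab prints an access URL with a login token in the terminal. If you prefer running in the background, a variant (the container name qtm350 is arbitrary) is:
# Run detached, with a name so the container is easy to manage
docker run -d --name qtm350 -p 8888:8888 -v $(pwd):/workspace qtm350-container

# Retrieve the URL with the login token from the logs
docker logs qtm350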

Adding some metadata to the container

  • We can personalise the container by adding some metadata to it
  • We can use the LABEL instruction to add metadata to the container, such as the author, the version, and the description
  • The older MAINTAINER instruction is deprecated; the recommended approach is a maintainer label, as in the example below
  • This information will be displayed when we run the docker inspect command
# Metadata
LABEL version="1.0" \
      description="Container with all tools covered in QTM 350" \
      maintainer="Danilo Freire <danilo.freire@emory.edu>" \
      license="MIT"
  • We can also add a COPY instruction to copy files from the host machine to the container
# Copy files from host to container
COPY . /workspace
  • We won’t need to do that here, but it may be useful in other situations where you need to copy scripts or datasets to the container
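  • To read the metadata back, you can query the image with docker inspect (the Go-template flag below is standard Docker CLI syntax):
# Show only the labels of the image
docker inspect --format '{{json .Config.Labels}}' qtm350-container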

How to manage the container

  • To stop the container, you can press Ctrl+C in the terminal where the container is running
  • You can also run the docker ps command to see the list of running containers and then run the docker stop command with the container ID
  • You can also remove the container with the docker rm command and the image with the docker rmi command
  • Or you can just check Docker Desktop and stop the container from there, as well as remove or restart it
  • To tag and push the image to Docker Hub, you can use the docker tag and docker push commands (log in first with docker login)
docker tag qtm350-container danilofreire/qtm350-container
docker push danilofreire/qtm350-container
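  • For reference, the full stop-and-clean-up workflow looks like this (the container ID is a placeholder you get from docker ps):
docker ps                   # list running containers
docker stop <container-id>  # stop a running container
docker rm <container-id>    # remove the stopped container
docker rmi qtm350-container # remove the image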

Summary of the Dockerfile

  • We have seen how to build a container with all the tools we covered in this course
  • We started with the FROM instruction to specify the base image, then used the RUN instruction to install the system packages and the Python libraries
  • We also used the ENV instruction to set the PATH environment variable, the EXPOSE instruction to expose the port for the Jupyter notebook server, and the CMD instruction to start the Jupyter notebook server
  • We also added some metadata to the container with the LABEL instructions
  • We then built the container with the docker build command and ran it with the docker run command
  • We also saw how to manage the container, stop it, remove it, and push it to Docker Hub

And that’s it for today! 🎉

Happy Thanksgiving! 🦃