QTM 350 - Data Science Computing

Lecture 24 - Docker for Data Science

Danilo Freire

Emory University

Hello, everyone! 😊

Brief recap of last class 📚

Reproducible workflows, virtual environments, and containers

  • Reproducibility is the ability to recreate the results of a computational analysis
  • There is currently a reproducibility crisis in science
  • Dependency management is the process of specifying and installing the software dependencies of a project
  • Virtual environments are isolated Python environments that allow you to install packages without affecting the system Python installation
    • conda, pipenv, virtualenv
  • Containers are lightweight, standalone, executable packages of software that include everything needed to run an application
    • Docker is the most widely used container platform

Today’s lecture 📋

Docker for Data Science

  • Last time, we saw how to create a simple container with Docker
  • The example included a short Dockerfile to build the container, which included a Python image, and a few Python packages in it
  • We then saw how to run the container and execute a Python script, and how to upload the container to Docker Hub
  • Today, we will see other Docker features that are useful for data science and build more complex containers
  • We will also discuss how to use Docker in a production environment, and how to deploy containers to the cloud

Docker for Data Science 🐳

A container for all tools we covered in this course

  • Today we will have a hands-on session where we will build a container with all the tools we have covered in this course (!) 🤓
  • That includes a bash shell, a git client, a Python interpreter, a Jupyter notebook server, a Quarto document processor, a SQL database, and Jupyter tools
  • We will build the container using a Dockerfile, and then run it locally
  • The container will be based on the official Ubuntu image (the same we used when creating an AWS instance), and we will install all the necessary packages in it
  • We will explore how to mount volumes in the container to persist data
  • We will also discuss how to manage dependencies effectively within the container

A container for all tools we covered in this course

  • As we saw in the last lecture, there are many Docker images for data science, such as Jupyter, RStudio, and SQLite
  • There are many ways to build (multiple) containers. For example:
  • But to keep things simple, we will use a single container today and run it with the docker run command

Let’s get started! 🚀

Docker Desktop and Ubuntu

  • First, make sure you have Docker Desktop installed on your computer. It has everything you need to build, run, and share containers
  • If you are using Windows, you may need to enable WSL 2 to run Linux containers
  • We will use Ubuntu 24.04 as the base image for our container
  • As you know by now, Ubuntu is the most popular Linux distribution and is widely used in data science and machine learning
  • But you won’t even feel you are interacting with Ubuntu, as you will be using it inside a container

Downloading Docker Desktop

Downloading Docker Desktop

Creating an account on Docker Hub

  • If you don’t have a Docker Hub account yet, please create one here: https://hub.docker.com/signup
  • Docker has several types of accounts, but the free account is enough for academic purposes
  • You can use Docker Hub to store your images and share them with others
  • If you click on Sign in on the Docker Desktop application or on the website, it will open a web browser and ask you to log in
  • Please enter your credentials and log in

Anatomy of a Dockerfile revisited

FROM image:tag

  • As we saw in the last lecture, a Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image
  • The Dockerfile is a recipe for building a container
  • The base image is the starting point for the container, and it can be any image available on Docker Hub
  • So please create a text file called Dockerfile (without any extension) in your directory
  • Any editor will do to create the file! 😉
  • You will need about 2GB of free space to build the container, so make sure you have enough space on your computer
  • We start will start our Dockerfile with FROM, followed by the base image we want to use
# Use the official Ubuntu image as the base image
FROM ubuntu:24.04
  • The : character is used to specify the version of the image. In this case, we are using Ubuntu 24.04
  • You can see all tags available for the Ubuntu image here
  • Okay, that’s the easy part. Now let’s install all the tools we need in the container! 🛠️

LABEL command

  • The LABEL instruction adds metadata to an image
  • This is useful for documentation purposes and to provide information about the image
  • You can add any key-value pairs you want, but it is a good practice to include the version, description, maintainer, and license
  • You can also add a MAINTAINER instruction to specify the maintainer of the image
  • For instance, you can add the following lines to your Dockerfile
# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in QTM 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"
  • This will add the metadata to the image, and you can see it when you run the docker inspect command

RUN command

  • The RUN instruction executes any commands in a new layer on top of the current image and commits the results
  • Yes, commits in Docker are like commits in Git, but instead of saving changes to a file system, they save changes to an image
  • Remember apt? We will use it again to install software packages (as we did with AWS)
  • The only commands we will need to use are apt-get update (only once) and apt-get install <package>
    • For example, to install git, we would run apt-get update && apt-get install git -y
  • To keep the image smaller, we will also run apt-get clean and rm -rf /var/lib/apt/lists/* after installing the packages
    • This will remove the package cache and clean up the image
  • Let’s add some packages to the Dockerfile 🐳

RUN command

# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.2 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-1ubuntu0.5 \
    python3.12-venv=3.12.3-1ubuntu0.5 \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \ 
    apt-get clean && rm -rf /var/lib/apt/lists/*
  • So far, so good! 😃
  • You can see that we are installing specific versions of the packages, which is a good practice to ensure reproducibility
  • On Ubuntu, you can find the available versions of a package at https://packages.ubuntu.com/
  • Make sure to match the version of the operating system with the version of the package (24.04 LTS or noble in this case)

Setting the default shell to bash

  • By default, Docker uses sh as the default shell, but we want to use bash instead
  • We can set the default shell to bash by adding the following line to the Dockerfile
# Set the default shell to bash
SHELL ["/bin/bash", "-c"]
  • The command specifies the shell to use when running commands in the container, and the -c option tells bash to run the command and then exit
  • This way, we can use bash commands in the Dockerfile without having to specify the shell every time

Hey Danilo, wait a minute…🤔

Isn’t that a lot of work?

  • Yes, it is! 😅
  • I am purposely making it harder than it should be!
  • But this is the most flexible way to build a container
  • We are starting from a really bare-bones image (Ubuntu) and installing everything from scratch
  • And I (we?) don’t have Ubuntu installed on my computer, so I can’t just run apt list --installed to see which packages are installed on my system and just copy them to the Dockerfile
  • In this case, it did involve some research to find the right package names and versions
  • But once you have the Dockerfile, you can reuse it as many times as you want

Reusing the Dockerfile

  • You could also have saved some time by using a Python image as the base image
  • The Python image already comes with Python and pip installed, so you would only need to install the other packages
  • Thus, my suggestion here is, if you are building a container for a specific purpose, start with an image that is closer to what you need
  • For instance, you can download a Jupyter image with a full data science stack with one line of code
  • But it wouldn’t be so fun, would it? 😅

Back to the Dockerfile

Installing additional packages

  • After installing the system packages, we have to install the additional packages we need
  • We will use the RUN instructions to install a few Python libraries with pip3, such as numpy, pandas, jupyterlab, dask, and matplotlib
  • You can either see the versions you already have on your computer with pip show <package> | grep Version or pip freeze > requirements.txt and then copy the versions from the file
  • Ubuntu requires us to create a virtual environment to install Python packages, so we will do that too
  • ENV PATH="/opt/venv/bin:$PATH" prepends the directory/opt/venv/binto the beginning of the existingPATH` environment variable within the Docker image
  • This allows us to use the Python interpreter and packages installed in the virtual environment without needing to specify the full path every time
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2

Installing Quarto with wget

  • We will also install Quarto, but this time we will use wget to download the binary
  • Why? Because Quarto is not available on pip or apt, so we need to download it from the official website: https://quarto.org/docs/get-started/
  • wget is a command-line utility that allows you to download files from the web
  • It runs on Unix-like systems, so it is available on Ubuntu, macOS, and even on Windows
  • Once we download the .deb file (which is the package format for Ubuntu), we can install it with apt-get install <package> (like we did with the other packages)
  • We can download any file from the web with wget, as long as we have the URL
# Download and install Quarto
# Install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.43/quarto-1.6.43-linux-amd64.deb && \
    apt-get install -y ./quarto-1.6.43-linux-amd64.deb && \
    rm ./quarto-1.6.43-linux-amd64.deb

We are almost there! 🏁

Running the Jupyter notebook server

  • Finally, we will run the Jupyter notebook server in the container
  • We will have access to the Jupyter notebook server on port 8888, so we will need to expose this port with the EXPOSE instruction
  • What is cool is that the new JupyterLab version comes with an embedded terminal, so we can run bash inside the JupyterLab interface and have access to all the tools we installed in the container (like git, sqlite3, and Quarto) 😉
  • Let’s also create a working directory for the Jupyter notebook server, so we can persist the notebooks outside the container
# Base image
FROM ubuntu:24.04

# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in QTM 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"

# Update and install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends\
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.2 \
    sqlite3=3.45.1-1ubuntu2 \
    libsqlite3-0=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    nano=7.2-2ubuntu0.1 \
    python3.12=3.12.3-1ubuntu0.5 \
    python3.12-venv=3.12.3-1ubuntu0.5 \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \ 
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Set default shell to Bash
SHELL ["/bin/bash", "-c"]

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2

# Install Quarto
RUN apt-get update && apt-get install -y --no-install-recommends wget ca-certificates && \
    # Download the specific Quarto deb file
    wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    # Install the local deb file (NOTICE the "./" prefix)
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    # Clean up the downloaded file and apt cache
    rm quarto-1.6.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose port for JupyterLab 
EXPOSE 8888

# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]

# Run the Docker container
# docker build -t qtm350-container .
# docker run -it --rm -p 8888:8888 -v $(pwd):/workspace qtm350-container

Building the container

  • Now that we have the Dockerfile, we can build the container with the docker build command
docker build -t qtm350-container .
  • The -t flag is used to tag the image with a name, in this case qtm350-container
  • The . at the end of the command specifies the build context, which is the current directory
  • Now we just need to wait for the image to be built and then run it with the docker run command
  • Let’s see how it goes! 🤞

Oh, no! 😱

  • We got an error when building the container! 😕
[...]

4.585 Some packages could not be installed. This may mean that you have
4.585 requested an impossible situation or if you are using the unstable
4.585 distribution that some required packages have not yet been created
4.585 or been moved out of Incoming.
4.585 The following information may help to resolve the situation:
4.585 
4.585 The following packages have unmet dependencies:
4.650  sqlite3 : Depends: libsqlite3-0 (= 3.45.1-1ubuntu2) but 3.45.1-1ubuntu2.1 is to be installed
4.651 E: Unable to correct problems, you have held broken packages.
  • The error message indicates that there are unmet dependencies for the sqlite3 package!

Fixing the error 🔧

  • But that’s fine, we know how to fix it! 😎
  • We can resolve the unmet dependencies by ensuring the correct versions of the packages are installed
  • So we go to https://packages.ubuntu.com/, search for libsqlite3-0 and check the available versions
  • We can see that the version 3.45.1-1ubuntu2 is available, so we can just add it to the apt-get install command
  • Let’s update the Dockerfile and try again
    libsqlite3-0=3.45.1-1ubuntu2 \
  • Now we can build the container again with the same command as before
docker build -t qtm350-container .
  • This time it should work! 🤞

Success! 🎉

Running the container

  • To run the container, we will use the docker run command
  • We will also use the -p flag to map the port 8888 of the container to the port 8888 of the host machine
  • We will also use the -v flag to mount a volume in the container, so we can persist the notebooks outside the container
docker run -p 8888:8888 -v $(pwd):/workspace qtm350-container
  • The -v flag is used to mount the current directory ($(pwd)) to the /workspace directory in the container
  • This way, we can save the notebooks outside the container and access them even after the container is stopped
  • Now we just need to open a web browser and go to http://127.0.0.1:8888 to access the JupyterLab interface 🚀
  • Jupyter will also generate a token for you to access the interface, so you will need to copy and paste the token from the terminal to the web browser (or click on the link)

Running the container

Accessing the JupyterLab interface

How to manage the container

  • To stop the container, you can press Ctrl+C in the terminal where the container is running
  • You can also run the docker ps command to see the list of running containers and then run the docker stop command with the container ID
  • You can also remove the container with the docker rm command and the image with the docker rmi command
  • Or you can just check Docker Desktop and stop the container from there, as well as remove or restart it
  • To tag and push the container to Docker Hub, you can use the docker tag and docker push commands
docker tag qtm350-container danilofreire/qtm350-container
docker push danilofreire/qtm350-container

Summary of the Dockerfile

  • We have seen how to build a container with all the tools we covered in this course
  • We started with the FROM instruction to specify the base image, then used the RUN instruction to install the system packages and the Python libraries
  • We also used the ENV instruction to set the PATH environment variable, the EXPOSE instruction to expose the port for the Jupyter notebook server, and the CMD instruction to start the Jupyter notebook server
  • We also added some metadata to the container with the LABEL instructions
  • We then built the container with the docker build command and ran it with the docker run command
  • We also saw how to manage the container, stop it, remove it, and push it to Docker Hub

And that’s it for today! 🎉