QTM 350 - Data Science Computing

Lecture 24 - Docker for Data Science

Danilo Freire

Emory University

25 November, 2024

Hello, everyone! 😊

Brief recap of last class 📚

Reproducible workflows, virtual environments, and containers

  • Reproducibility is the ability to recreate the results of a computational analysis
  • There is currently a reproducibility crisis in science
  • Dependency management is the process of specifying and installing the software dependencies of a project
  • Virtual environments are isolated Python environments that allow you to install packages without affecting the system Python installation
    • conda, pipenv, virtualenv
  • Containers are lightweight, standalone, executable packages of software that include everything needed to run an application
    • Docker is the most widely used container platform

Today’s lecture 📋

Docker for Data Science

  • Last time, we saw how to create a simple container with Docker
  • We used a short Dockerfile to build the container, downloaded a Python image, and installed a few Python packages in it
  • We then ran the container and executed a Python script, and finally pushed the image to Docker Hub
  • Today, we will see other Docker features that are useful for data science and build more complex containers
  • We will also discuss how to use Docker in a production environment, and how to deploy containers to the cloud

Docker for Data Science 🐳

A container for all tools we covered in this course

  • Today we will have a hands-on session where we will build a container with all the tools we have covered in this course (!) 🤓
  • That includes a bash shell, a git client, a Python interpreter, a Jupyter notebook server, a Quarto document processor, and a SQL database
  • We will build the container using a Dockerfile, and then run it locally
  • The container will be based on the official Ubuntu image, and we will install all the necessary packages in it
  • We will also see how to mount volumes in the container to persist data
  • As we saw in the last lecture, there are many Docker images for data science, such as Jupyter, RStudio, and PostgreSQL
  • There are other ways to build and run containers, such as Docker Compose, which defines services in a single configuration file (see the sketch after this list)
  • But to keep things simple, we will only use a single container today and run it with the docker run command
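  • As a quick illustration, a hypothetical docker-compose.yml for the image we build today could look like this:
# docker-compose.yml (hypothetical sketch)
services:
  qtm350:
    build: .            # build from the Dockerfile in this directory
    ports:
      - "8888:8888"     # publish the JupyterLab port
    volumes:
      - .:/workspace    # persist notebooks on the host
  • Running docker compose up would then build and start the container in one step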

Let’s get started! 🚀

Docker Desktop and Ubuntu

  • First, make sure you have Docker Desktop installed on your computer. It has everything you need to build, run, and share containers
  • If you are using Windows, you will need to enable WSL 2 to run Linux containers
  • We will use Ubuntu 24.04 as the base image for our container
  • Ubuntu is one of the most popular Linux distributions and is widely used in data science, machine learning, and AI
  • For instance, if you are using Google Colab or Amazon AWS, you are probably running Ubuntu
  • The good thing about Ubuntu is that almost all commands we have covered in this course work the same way in Ubuntu, as it is also a Unix-like system like macOS (it uses bash as the default shell)
  • But you won’t even feel you are interacting with Ubuntu, as you will be inside a container
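  • To check that everything is working, you can run these standard Docker CLI commands in a terminal:
# Verify the Docker installation and that the daemon is running
docker --version
docker info

# Optionally, pull the base image ahead of time
docker pull ubuntu:24.04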

Anatomy of a Dockerfile

FROM image:tag

  • As we saw in the last lecture, a Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image
  • The Dockerfile is a recipe for building a container, and the first line of the Dockerfile specifies the base image
  • The base image is the starting point for the container, and it can be any image available on Docker Hub
  • We will start our Dockerfile with FROM, followed by the base image we want to use
# Use the official Ubuntu image as the base image
FROM ubuntu:24.04
  • The : character is used to specify the version of the image. In this case, we are using Ubuntu 24.04
  • You can see all tags available for the Ubuntu image on Docker Hub: https://hub.docker.com/_/ubuntu/tags
  • Okay, that’s the easy part. Now let’s install all the tools we need in the container! 🛠️

Anatomy of a Dockerfile

RUN command

  • The RUN instruction executes any commands in a new layer on top of the current image and commits the results
  • Yes, commits in Docker are like commits in Git, but instead of saving changes to a file system, they save changes to an image
  • Ubuntu has a package manager called apt, which we can use to install software packages. The only commands we will need to use are apt-get update (only once) and apt-get install <package>
    • For example, to install git, we would run apt-get update && apt-get install -y git
  • To keep the image smaller, we will also run apt-get clean and rm -rf /var/lib/apt/lists/* after installing the packages
    • This will remove the package cache and clean up the image
# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
  • So far, so good! 😃
  • You can see that we are installing specific versions of the packages, which is a good practice to ensure reproducibility
  • On Ubuntu, you can find the available versions of a package at https://packages.ubuntu.com/
  • Make sure to match the version of the operating system with the version of the package
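  • If you prefer the command line, one alternative is to query apt inside a throwaway Ubuntu container (a quick sketch using standard apt commands):
# List the candidate versions of git available on ubuntu:24.04
docker run --rm ubuntu:24.04 bash -c \
    "apt-get update -qq && apt-cache madison git"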

Result of the RUN instruction

Hey Danilo, wait a minute…🤔

Isn’t that a lot of work to install all these packages?

  • Yes, it is!
  • But I am purposely making it harder than it should be 😅
  • What makes it harder for us is that we are starting from a really bare-bones image (Ubuntu) and installing everything from scratch
  • And I don’t have Ubuntu installed on my computer, so I can’t just run apt list --installed to see which packages are installed on my system and just copy them to the Dockerfile
  • In this case, it did involve some research to find the right package names and versions
  • But once you have the Dockerfile, you can reuse it as many times as you want
  • You could also have saved some time by using a Python image as the base image
  • The Python image already comes with Python and pip installed, so you would only need to install the other packages
  • Thus, my suggestion here is, if you are building a container for a specific purpose, start with an image that is closer to what you need
  • For instance, you can download a Jupyter image with a full data science stack with one line of code, as shown in the example below
  • But for educational purposes, I think it is good to start from scratch and understand how things work under the hood. Sorry! 😅
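  • For example, this single command pulls the Jupyter project's full data science stack from Docker Hub:
# Pull a ready-made data science image maintained by the Jupyter project
docker pull jupyter/datascience-notebook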

Back to the Dockerfile

Installing additional packages

  • After installing the system packages, we will install the additional packages we need
  • We will install, for example, a few Python libraries with pip3, such as numpy, pandas, jupyterlab, dask, and matplotlib
  • We will use the RUN instructions again
  • You can either see the versions you already have with pip show <package> | grep Version or pip freeze > requirements.txt and then copy the versions from the file
  • Recent Ubuntu releases mark the system Python as externally managed (PEP 668), so we need to create a virtual environment to install Python packages with pip
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2
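  • As an aside, a common alternative (a sketch, assuming a requirements.txt file sits next to the Dockerfile) is to pin the versions in a file and install from it:
# Alternative: copy a pinned requirements file and install from it
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt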

Installing Quarto with wget

  • We will also install Quarto, but this time we will use wget to download the binary
  • Why? Because Quarto is not available on pip or apt, so we need to download it from the official website: https://quarto.org/docs/get-started/
  • wget is a command-line utility that allows you to download files from the web
  • wget ships with most Linux distributions, including Ubuntu, and can also be installed on macOS (e.g., via Homebrew) and even on Windows
  • Once we download the .deb file (which is the package format for Ubuntu), we can install it with apt-get install <package> (like we did with the other packages)
  • We can download any file from the web with wget, as long as we have the URL
# Download and install Quarto (arm64 build; use the amd64 .deb on Intel/AMD machines)
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb
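  • Once the image is built (we will name it qtm350-container below), you can sanity-check the installation with Quarto's built-in diagnostics:
# Verify the Quarto installation inside the image
docker run --rm qtm350-container quarto check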

We are almost there! 🏁

Running the Jupyter notebook server

  • Finally, we will run the Jupyter notebook server in the container
  • We will have access to the Jupyter notebook server on port 8888, so we will need to expose this port with the EXPOSE instruction
  • What is cool is that the new JupyterLab version comes with an embedded terminal, so we can run bash inside the JupyterLab interface and have access to all the tools we installed in the container (like git, sqlite3, and Quarto) 😉
  • Let’s also create a working directory for the Jupyter notebook server, so we can later mount a host folder there and persist the notebooks outside the container
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose the Jupyter notebook server port
EXPOSE 8888

# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]

The final Dockerfile

Ready to run the container! 🐳

# Base image
FROM ubuntu:24.04

# Update and install system dependencies
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    curl=8.5.0-2ubuntu10.5 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.11.2 matplotlib==3.9.2

# Install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb

# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose port for JupyterLab 
EXPOSE 8888

# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]

Building the container

  • Now that we have the Dockerfile, we can build the container with the docker build command
docker build -t qtm350-container .
  • The -t flag is used to tag the image with a name, in this case qtm350-container
  • The . at the end of the command specifies the build context, which is the current directory
  • Now we just need to wait for the image to be built and then run it with the docker run command
  • Let’s see how it goes! 🤞
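  • Once the build finishes, you can confirm the image is available locally:
# List the freshly built image
docker images qtm350-container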

Running the container

  • To run the container, we will use the docker run command
  • We will use the -p flag to map port 8888 of the container to port 8888 of the host machine
  • We will also use the -v flag to mount a volume in the container, so we can persist the notebooks outside it
docker run -p 8888:8888 -v $(pwd):/workspace qtm350-container
  • The -v flag is used to mount the current directory ($(pwd)) to the /workspace directory in the container
  • This way, we can save the notebooks outside the container and access them even after the container is stopped
  • Now we just need to open a web browser and go to http://localhost:8888 to access the JupyterLab interface 🚀
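  • Note that JupyterLab prints an access URL with a login token in the terminal. If you prefer running in the background, a variant (the container name qtm350 is arbitrary) is:
# Run detached, with a name so the container is easy to manage
docker run -d --name qtm350 -p 8888:8888 -v $(pwd):/workspace qtm350-container

# Retrieve the URL with the login token from the logs
docker logs qtm350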

Adding some metadata to the container

  • We can personalise the container by adding some metadata to it
  • We can use the LABEL instruction to add metadata to the container, such as the author, the version, and the description
  • The older MAINTAINER instruction is deprecated; the recommended approach is a maintainer label, as in the example below
  • This information will be displayed when we run the docker inspect command
# Metadata
LABEL version="1.0" \
      description="Container with all tools covered in QTM 350" \
      maintainer="Danilo Freire <danilo.freire@emory.edu>" \
      license="MIT"
  • We can also add a COPY instruction to copy files from the host machine to the container
# Copy files from host to container
COPY . /workspace
  • We won’t need to do that here, but it may be useful in other situations where you need to copy scripts or datasets to the container
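  • To read the metadata back, you can query the image with docker inspect (the Go-template flag below is standard Docker CLI syntax):
# Show only the labels of the image
docker inspect --format '{{json .Config.Labels}}' qtm350-container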

How to manage the container

  • To stop the container, you can press Ctrl+C in the terminal where the container is running
  • You can also run the docker ps command to see the list of running containers and then run the docker stop command with the container ID
  • You can also remove the container with the docker rm command and the image with the docker rmi command
  • Or you can just check Docker Desktop and stop the container from there, as well as remove or restart it
  • To tag and push the image to Docker Hub, you can use the docker tag and docker push commands (log in first with docker login)
docker tag qtm350-container danilofreire/qtm350-container
docker push danilofreire/qtm350-container
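  • For reference, the full stop-and-clean-up workflow looks like this (the container ID is a placeholder you get from docker ps):
docker ps                   # list running containers
docker stop <container-id>  # stop a running container
docker rm <container-id>    # remove the stopped container
docker rmi qtm350-container # remove the image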

Summary of the Dockerfile

  • We have seen how to build a container with all the tools we covered in this course
  • We started with the FROM instruction to specify the base image, then used the RUN instruction to install the system packages and the Python libraries
  • We also used the ENV instruction to set the PATH environment variable, the EXPOSE instruction to expose the port for the Jupyter notebook server, and the CMD instruction to start the Jupyter notebook server
  • We also added some metadata to the container with the LABEL instructions
  • We then built the container with the docker build command and ran it with the docker run command
  • We also saw how to manage the container, stop it, remove it, and push it to Docker Hub

And that’s it for today! 🎉

Happy Thanksgiving! 🦃