DATASCI 350 - Data Science Computing

Lecture 24 - Docker for Data Science

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello, everyone! 😊

Brief recap of last class 📚

Reproducible workflows, virtual environments, and containers

  • Reproducibility: recreating the results of a computational analysis
  • Dependency management: specifying and installing the software a project needs
  • Virtual environments: isolated Python setups that don’t affect the system installation
    • Popular tools: conda and uv
  • Containers: standalone packages with everything needed to run an application
    • Docker is the most widely used container platform

Today’s lecture 📋

Docker for Data Science

  • Last time: wrote a Dockerfile with a Python image, ran a container, and pushed it to Docker Hub
  • Today: build a container from scratch with all course tools: bash, git, Python, Jupyter, Quarto, and SQL
  • Dockerfile instructions: FROM, LABEL, RUN, SHELL, ENV, EXPOSE, CMD
  • Installing packages: apt (system), pip (Python), wget (Quarto)
  • Debugging dependency errors, mounting volumes, and managing containers
  • Deploying containers to AWS
  • If time allows: Q&A about your projects (more time next class too)

Torvalds on vibe coding then…

https://www.theregister.com/2025/11/18/linus_torvalds_vibe_coding/

Torvalds on vibe coding now!

https://x.com/linuxfoundation/status/2041583296044528122

Docker for Data Science 🐳

A container for all tools we covered in this course

  • Today we will build a container with all the tools we covered in this course (!) 🤓
  • That includes bash, git, Python, Jupyter, Quarto, and SQL
  • We will write a Dockerfile and run the container locally
  • The base image is the official Ubuntu image (same as our AWS instance), and we will install all packages on top of it
  • We will also explore volume mounts to persist data and how to manage dependencies inside the container

A container for all tools we covered in this course

  • There are many ready-made Docker images for data science: Jupyter, RStudio, SQLite
  • There are also several ways to build and run containers, from plain docker run to multi-container tools such as Docker Compose
  • To keep things simple, we will use a single container today and run it with docker run

Let’s get started! 🚀

Docker Desktop and Ubuntu

  • Make sure Docker Desktop is installed on your computer. It has everything you need to build, run, and share containers
  • Windows users: you may need to enable WSL to run Linux containers. You should have done that already!
  • We will use Ubuntu 24.04 LTS as the base image
  • Ubuntu is the most popular Linux distribution in data science, but you won’t even notice you are using it, since it runs inside the container

Downloading Docker Desktop

Creating an account on Docker Hub

  • If you don’t have a Docker Hub account yet, please create one here: https://hub.docker.com/signup
  • Docker has several types of accounts, but the free account is enough for academic purposes
  • You can use Docker Hub to store your images and share them with others
  • If you click on Sign in on the Docker Desktop application or on the website, it will open a web browser and ask you to log in
  • Please enter your credentials and log in

Anatomy of a Dockerfile revisited

FROM image:tag

  • A Dockerfile is a text file with all the commands needed to assemble an image; think of it as a recipe for building a container
  • The base image is the starting point, and it can be any image on Docker Hub
  • Create a text file called Dockerfile (no extension) in your directory. Any editor will do! 😉
  • You will need about 2 GB of free space to build the container
  • We start the Dockerfile with FROM, followed by the base image
# Use the official Ubuntu image as the base image
FROM ubuntu:24.04

LABEL command

  • The LABEL instruction adds metadata to an image
  • This is useful for documentation purposes and to provide information about the image
  • You can add any key-value pairs you want, but it is a good practice to include the version, description, maintainer, and license
  • The older MAINTAINER instruction is deprecated; LABEL maintainer="..." is the recommended way to record the image maintainer
  • For instance, you can add the following lines to your Dockerfile
# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in DATASCI 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"
  • This will add the metadata to the image, and you can see it when you run the docker inspect command

RUN command

  • RUN executes commands in a new layer on top of the current image
    • Like a Git commit, but for the image instead of the file system
  • We use apt-get update (once) then apt-get install <package> to add software, just like on AWS
    • Example: apt-get update && apt-get install git -y
  • Useful flags:
    • --no-install-recommends: skip unnecessary packages
    • -y: auto-answer “yes” to prompts
  • Clean up after installing to keep the image small:
    • apt-get clean && rm -rf /var/lib/apt/lists/*
  • Let’s add some packages to the Dockerfile 🐳

RUN command

# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.3 \
    sqlite3=3.45.1-1ubuntu2.5 \
    libsqlite3-0=3.45.1-1ubuntu2.5 \
    wget=1.21.4-1ubuntu4.1 \
    python3.12=3.12.3-1ubuntu0.12 \
    python3.12-venv=3.12.3-1ubuntu0.12 \
    python3-pip=24.0+dfsg-1ubuntu1.3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
  • So far, so good! 😎
  • We are installing bash, git, sqlite3, Python 3.12, and pip in the container, similar to what we did on AWS
  • We pin specific versions of each package to ensure reproducibility
  • Available versions: https://packages.ubuntu.com/
    • Match the OS version (24.04 LTS / noble) with the package version

Setting the default shell to bash

  • Docker uses sh by default, but we want bash instead
  • sh (the Bourne shell) is a simpler shell that doesn’t support all of bash’s features
# Set the default shell to bash
SHELL ["/bin/bash", "-c"]
  • SHELL sets bash as the default for all subsequent RUN commands
  • -c stands for “command”: it tells bash to read the next argument as a command string (e.g., bash -c "echo hello")
  • This way, we can use bash commands without specifying the shell every time
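You can see what -c does in any terminal (a quick illustration, not part of the Dockerfile):

```shell
# bash -c treats the next argument as a command string to execute
bash -c 'echo "hello from bash"'
```

This is exactly how Docker invokes each RUN command once SHELL ["/bin/bash", "-c"] is set.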

Hey Danilo, wait a minute…🤔

Isn’t that a lot of work?

  • Yes, it is! 😅
  • I am purposely making this harder than it needs to be!
  • But starting from a bare-bones image (Ubuntu) and installing everything from scratch is the most flexible approach
  • I (we?) don’t have Ubuntu installed locally, so I can’t just run apt list --installed and copy the packages to the Dockerfile
  • It did involve some research to find the right package names and versions
  • But once you have the Dockerfile, you can reuse it as many times as you want

Reusing the Dockerfile

  • You could save time by using a Python image as the base, which already comes with Python and pip
  • My suggestion: if you are building a container for a specific purpose, start with an image that is closer to what you need
  • You can even download a Jupyter image with a full data science stack in one line
  • But it wouldn’t be so fun, would it? 😅

Back to the Dockerfile

Installing Python packages with pip

  • We will use RUN to install Python libraries (numpy, pandas, jupyterlab, dask, matplotlib) with pip3
  • You can check which versions you already have with pip freeze > requirements.txt and copy them
  • Ubuntu 24.04 marks the system Python as externally managed (PEP 668), so pip refuses to install packages system-wide; we create a virtual environment first
  • python3 -m venv /opt/venv creates a virtual environment in /opt/venv
  • ENV PATH="/opt/venv/bin:$PATH" prepends /opt/venv/bin to PATH, so we can use the virtual environment’s Python and packages without specifying the full path
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python libraries in virtual environment
RUN pip install httpx==0.27.2 numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.5.0 matplotlib==3.9.2
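An alternative worth knowing (not what we do today): keep the pinned versions in a requirements.txt next to the Dockerfile and COPY it into the image, so the package list lives in one reusable file. A sketch, assuming such a file exists:

```dockerfile
# Hypothetical alternative: install from a pinned requirements file
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
```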

Installing Quarto with wget

  • Quarto is not on pip or apt, so we download the binary directly with wget
  • We download a .deb file, which is Ubuntu’s package format
    • The ./ prefix tells apt-get to install the local file, not search the repositories
  • After installation, we clean up to keep the image small:
    • rm deletes the downloaded .deb file
    • apt-get clean removes the package cache
    • rm -rf /var/lib/apt/lists/* removes the list of available packages
  • We use the arm64 version here (Apple Silicon Macs). Windows and Intel Mac users: replace arm64 with amd64 in the URL
# Download and install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.9.37/quarto-1.9.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.9.37-linux-arm64.deb && \
    rm ./quarto-1.9.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

Creating a working directory

WORKDIR instruction

  • We need a place inside the container to store our notebooks and data files
  • mkdir -p creates the directory (and any parent directories if needed)
  • WORKDIR sets the default directory for all subsequent commands (RUN, CMD, COPY, etc.)
    • It also becomes the directory you land in when you open a terminal in the container
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace
  • Later, we will mount a volume to /workspace so that files persist even after the container stops
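You can try mkdir -p in any terminal (a throwaway example using a temporary directory):

```shell
# -p creates intermediate directories and doesn't fail if they already exist
base="$(mktemp -d)"                    # fresh temporary directory
mkdir -p "$base/projects/notebooks"    # creates "projects" and "notebooks" in one go
ls "$base/projects"                    # prints: notebooks
```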

We are almost there! 🏁

EXPOSE and CMD

  • EXPOSE 8888 tells Docker that the container listens on port 8888 (the default for JupyterLab)
    • This is documentation, not enforcement: you still need -p 8888:8888 when running the container
  • CMD defines the command that runs when the container starts
    • Ours activates the virtual environment, then launches JupyterLab
    • --ip=0.0.0.0: listen on all network interfaces (not just localhost), so the host machine can reach it
    • --no-browser: don’t try to open a browser inside the container (there isn’t one!)
    • --allow-root: Docker containers run as root by default, and Jupyter refuses to start as root unless this flag is passed
  • JupyterLab includes a built-in terminal, so you can run bash, git, sqlite3, and quarto directly from the browser 😉
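In the Dockerfile, those two bullets become just a few lines:

```dockerfile
# Expose port for JupyterLab
EXPOSE 8888

# Start JupyterLab
CMD . /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```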

And here it is, the complete Dockerfile! 🎉

# Base image
FROM ubuntu:24.04

# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in DATASCI 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"

# Update and install system dependencies
# We will see why we need to install libsqlite3-0 later
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.3 \
    sqlite3=3.45.1-1ubuntu2.5 \
    libsqlite3-0=3.45.1-1ubuntu2.5 \
    wget=1.21.4-1ubuntu4.1 \
    nano=7.2-2ubuntu0.1 \
    python3.12=3.12.3-1ubuntu0.12 \
    python3.12-venv=3.12.3-1ubuntu0.12 \
    python3-pip=24.0+dfsg-1ubuntu1.3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Set default shell to Bash
SHELL ["/bin/bash", "-c"]

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies in virtual environment
RUN pip install httpx==0.27.2 numpy==1.26.4 pandas==2.2.2 \
                jupyterlab==4.2.5 ipykernel==6.29.5 \
                dask==2024.5.0 matplotlib==3.9.2

# Install Quarto and clean up
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.9.37/quarto-1.9.37-linux-arm64.deb && \
    # Install the local deb file (notice the "./" prefix)
    apt-get install -y ./quarto-1.9.37-linux-arm64.deb && \
    rm quarto-1.9.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace

# Expose port for JupyterLab 
EXPOSE 8888

# Start JupyterLab
CMD . /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

Building the container

  • Now that we have the Dockerfile, we can build the container with the docker build command
docker build -t datasci350-container .
  • The -t flag is used to tag the image with a name, in this case datasci350-container
  • The . at the end of the command specifies the build context, which is the current directory
  • Now we just need to wait for the image to be built and then run it with the docker run command
  • Let’s see how it goes! 🤞

Oh, no! 😱

  • We got an error when building the container! 😕
[...]

4.585 Some packages could not be installed. This may mean that you have
4.585 requested an impossible situation or if you are using the unstable
4.585 distribution that some required packages have not yet been created
4.585 or been moved out of Incoming.
4.585 The following information may help to resolve the situation:
4.585 
4.585 The following packages have unmet dependencies:
4.650  sqlite3 : Depends: libsqlite3-0 (= 3.45.1-1ubuntu2) but 3.45.1-1ubuntu2.1 is to be installed
4.651 E: Unable to correct problems, you have held broken packages.
  • The error message indicates that there are unmet dependencies for the sqlite3 package!

Fixing the error 🔧

  • But that’s fine, we know how to fix it! 😎
  • We can resolve the unmet dependencies by ensuring the correct versions of the packages are installed
  • So we go to https://packages.ubuntu.com/, search for libsqlite3-0 and check the available versions
  • We can see that version 3.45.1-1ubuntu2 is available, so we add it to the apt-get install command
  • Let’s update the Dockerfile and try again
libsqlite3-0=3.45.1-1ubuntu2 \
  • Now we can build the container again with the same command as before
docker build -t datasci350-container .
  • This time it should work! 🤞

Success! 🎉

Running the container

  • We use docker run with -p to map port 8888 (container) to port 8888 (host) and -v to mount a volume
docker run -p 8888:8888 -v $(pwd):/workspace datasci350-container
  • -v mounts the current directory ($(pwd)) to /workspace in the container, so notebooks persist even after the container stops
  • Open http://127.0.0.1:8888 in a web browser to access JupyterLab 🚀
  • Jupyter generates a token for access; copy it from the terminal or click the link
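Once JupyterLab is up, a quick smoke test in a notebook confirms the SQL side works. This uses only the standard library, so it runs in any Python 3 (the table and values here are just an example):

```python
# Minimal sanity check: Python's built-in sqlite3 bindings inside the container
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE grades (student TEXT, score REAL)")
conn.executemany("INSERT INTO grades VALUES (?, ?)",
                 [("ana", 91.0), ("ben", 84.5)])
avg = conn.execute("SELECT AVG(score) FROM grades").fetchone()[0]
print(avg)  # prints 87.75
conn.close()
```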

Running the container

Accessing the JupyterLab interface

How to manage the container

  • Press Ctrl+C in the terminal to stop the container, or use docker ps to list running containers and docker stop <id> to stop one
  • Remove a container with docker rm and an image with docker rmi
  • You can also manage containers from Docker Desktop (stop, remove, restart)
  • To tag and push to Docker Hub, use docker tag and docker push
docker tag datasci350-container danilofreire/datasci350-container
docker push danilofreire/datasci350-container

Cleaning up

  • docker ps -a lists all containers (running and stopped). ps = “process status”; -a includes stopped ones
  • docker stop <id> stops a running container
  • docker rm <id> removes a container; docker rmi datasci350-container removes the image
  • docker system prune removes all stopped containers, unused networks, dangling images, and build cache (-a also removes all unused images). Warning: this is aggressive!
docker ps -a
docker stop <container_id_or_name>
docker rm <container_id_or_name>
docker rmi datasci350-container
# Optional: docker system prune -a

Running Docker containers on AWS

Running Docker containers on AWS

An easy way to run our container on an EC2 instance

# On your local machine: upload the Docker image to Docker Hub
docker tag datasci350-container danilofreire/datasci350-container
docker push danilofreire/datasci350-container

# On your EC2 instance: install Docker, start the Docker service, and add your user to the docker group
sudo apt update && \
    sudo apt install -y docker.io && \
    sudo systemctl start docker && \
    sudo usermod -aG docker $USER
# Log out and back in (or run `newgrp docker`) so the group change takes effect

# Pull the Docker image from Docker Hub
docker pull danilofreire/datasci350-container

# Run the Docker container
docker run -p 8888:8888 -v $(pwd):/workspace danilofreire/datasci350-container

# Access JupyterLab via web browser
http://<EC2_PUBLIC_IP>:8888/lab?token=<TOKEN>

Summary of the Dockerfile

  • We built a container with all the tools covered in this course: bash, git, Python, Jupyter, Quarto, and SQLite
  • FROM set Ubuntu 24.04 as the base image
  • LABEL added metadata (version, description, maintainer, license)
  • RUN + apt-get installed system packages with pinned versions
  • SHELL switched the default shell from sh to bash
  • RUN + pip installed Python libraries inside a virtual environment
  • ENV prepended the virtual environment to PATH
  • wget downloaded and installed Quarto from a .deb file
  • EXPOSE opened port 8888 for JupyterLab
  • CMD launched the JupyterLab server when the container starts
  • We built with docker build, ran with docker run -p -v, and pushed to Docker Hub
  • We also saw how to debug dependency errors, manage containers, and deploy to AWS

And that’s it for today! 🎉