Lecture 24 - Docker for Data Science
25 November, 2024
conda, pipenv, virtualenv
bash shell, a git client, a Python interpreter, a Jupyter notebook server, a Quarto document processor, and a SQL database
docker run command
bash as the default shell
FROM image:tag
FROM, followed by the base image we want to use: ubuntu:24.04
The : character is used to specify the version of the image. In this case, we are using Ubuntu 24.04
RUN command
The RUN instruction executes any commands in a new layer on top of the current image and commits the results
Ubuntu's package manager is apt, which we can use to install software packages. The only commands we will need to use are apt-get update (only once) and apt-get install <package>
To install git, we would run apt-get update && apt-get install -y git
We run apt-get clean and rm -rf /var/lib/apt/lists/* after installing the packages
# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
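One way to find the exact version strings to pin is to ask apt inside a throwaway container built from the same base image. The following is only a sketch, assuming Docker is installed and the daemon is running; the helper names `candidate_version` and `pin_line` are hypothetical and not part of the lecture material.

```shell
# Hypothetical helper: print the version apt would install for a package
# in a fresh ubuntu:24.04 container (requires a running Docker daemon).
candidate_version() {
    docker run --rm ubuntu:24.04 bash -c \
        "apt-get update -qq >/dev/null && apt-cache policy '$1'" |
        awk '/Candidate:/ {print $2}'
}

# Hypothetical helper: format a pinned "package=version \" line
# in the style of the RUN instruction above.
pin_line() {
    printf '%s=%s \\\n' "$1" "$2"
}
```

For example, `pin_line git "$(candidate_version git)"` prints a line that can be pasted into the RUN instruction. Pinned versions can disappear from the Ubuntu archive over time, so the lookup may need to be repeated when the image is rebuilt.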
RUN instruction
I run apt list --installed to see which packages are installed on my system and just copy them to the Dockerfile
pip installed, so you would only need to install the other packages
We install Python libraries such as numpy, pandas, jupyterlab, dask, and matplotlib with pip3
RUN instructions again
To find the versions, use pip show <package> | grep Version or pip freeze > requirements.txt and then copy the versions from the file
wget
wget to download the binary
pip or apt, so we need to download it from the official website: https://quarto.org/docs/get-started/
wget is a command-line utility that allows you to download files from the web
Since the download is a .deb file (which is the package format for Ubuntu), we can install it with apt-get install <package> (like we did with the other packages)
wget, as long as we have the URL
JupyterLab runs on port 8888, so we will need to expose this port with the EXPOSE instruction
We can open bash inside the JupyterLab interface and have access to all the tools we installed in the container (like git, sqlite3, and Quarto) 😉
# Base image
FROM ubuntu:24.04
# Update and install system dependencies
RUN apt-get update && apt-get install -y \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.1 \
    curl=8.5.0-2ubuntu10.5 \
    sqlite3=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    python3=3.12.3-0ubuntu2 \
    python3.12-venv \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Create the virtual environment and put it first on PATH
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install Python dependencies in virtual environment
RUN pip install numpy==1.26.4 pandas==2.2.2 \
    jupyterlab==4.2.5 ipykernel==6.29.5 \
    dask==2024.11.2 matplotlib==3.9.2
# Install Quarto (arm64 build; on x86_64 hosts use the linux-amd64 .deb instead)
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace
# Expose port for JupyterLab
EXPOSE 8888
# Start JupyterLab
CMD ["sh", "-c", ". /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root"]
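The Dockerfile above downloads the arm64 build of Quarto, which matches ARM machines but not x86_64 ones. As a rough sketch, the download URL could be derived from the machine type instead of being hard-coded. The function names `quarto_arch` and `quarto_deb_url` are hypothetical, and the sketch assumes the amd64 release file follows the same naming pattern as the arm64 file used above.

```shell
QUARTO_VERSION="1.6.37"   # same version as in the Dockerfile above

# Hypothetical helper: map `uname -m` output to the suffix used in the .deb name.
quarto_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: build the release URL for a given machine type.
quarto_deb_url() {
    arch="$(quarto_arch "$1")" || return 1
    echo "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-${arch}.deb"
}

# Print the URL for the machine this script runs on.
quarto_deb_url "$(uname -m)"
```

The printed URL can then be passed to wget, exactly as in the RUN instruction that installs Quarto.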
docker build command
The -t flag is used to tag the image with a name, in this case qtm350-container
The . at the end of the command specifies the build context, which is the current directory
docker run command
We use the -p flag to map port 8888 of the container to port 8888 of the host machine
We use the -v flag to mount a volume in the container, so we can persist the notebooks outside the container
The -v flag is used to mount the current directory ($(pwd)) to the /workspace directory in the container
We can use the LABEL instruction to add metadata to the container, such as the author, the version, and the description
MAINTAINER instruction to specify the maintainer of the container (deprecated in favor of a maintainer LABEL)
docker inspect command
# Metadata
LABEL version="1.0" \
    description="Container with all tools covered in QTM 350" \
    maintainer="Danilo Freire <danilo.freire@emory.edu>" \
    license="MIT"
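Once the image is built, the labels can be read back with docker inspect. This is a minimal sketch, assuming the image was tagged qtm350-container as in this lecture; the wrapper name `show_labels` is hypothetical.

```shell
IMAGE_NAME="qtm350-container"              # tag used in this lecture
LABELS_FORMAT='{{json .Config.Labels}}'    # Go template that selects only the labels

# Hypothetical wrapper: print the labels of an image as JSON.
# Defaults to IMAGE_NAME when no argument is given.
show_labels() {
    docker inspect --format "$LABELS_FORMAT" "${1:-$IMAGE_NAME}"
}
```

Calling `show_labels` should print a JSON object containing the version, description, maintainer, and license keys set by the LABEL instruction, possibly alongside labels inherited from the base image.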
We can use the COPY instruction to copy files from the host machine to the container
To stop the container, press Ctrl+C in the terminal where the container is running
Alternatively, use the docker ps command to see the list of running containers and then run the docker stop command with the container ID
We can remove the container with the docker rm command and the image with the docker rmi command
We can share the image with the docker tag and docker push commands
We used the FROM instruction to specify the base image, then used the RUN instruction to install the system packages and the Python libraries
We used the ENV instruction to set the PATH environment variable, the EXPOSE instruction to expose the port for the Jupyter notebook server, and the CMD instruction to start the Jupyter notebook server
We added metadata with LABEL instructions
We built the image with the docker build command and ran it with the docker run command
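The workflow recapped above can be sketched as a handful of shell functions. This is an illustration under stated assumptions, not a definitive script: the function names are hypothetical, `REGISTRY_USER` is a placeholder for a registry account name, and the `1.0` tag simply mirrors the version label.

```shell
IMAGE_NAME="qtm350-container"     # tag used in this lecture
IMAGE_VERSION="1.0"               # mirrors the version label (assumption)
REGISTRY_USER="your-username"     # placeholder, replace with a real account
PORT=8888                         # JupyterLab port exposed in the Dockerfile

# Build the image from the Dockerfile in the current directory.
build_image() {
    docker build -t "$IMAGE_NAME" .
}

# Run the container, publishing the port and mounting the current directory.
run_container() {
    docker run -p "${PORT}:${PORT}" -v "$(pwd):/workspace" "$IMAGE_NAME"
}

# Stop and remove a container by ID (see `docker ps`), then remove the image.
remove_all() {
    docker stop "$1" && docker rm "$1" && docker rmi "$IMAGE_NAME"
}

# Tag the image for a registry account and push it.
publish_image() {
    docker tag "$IMAGE_NAME" "${REGISTRY_USER}/${IMAGE_NAME}:${IMAGE_VERSION}" &&
        docker push "${REGISTRY_USER}/${IMAGE_NAME}:${IMAGE_VERSION}"
}
```

A typical session would call `build_image`, then `run_container`, and later `remove_all <container-id>` with the ID shown by `docker ps -a`.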