Lecture 24 - Docker for Data Science
conda and uv
Tools in the container: bash, git, Python, Jupyter, Quarto, and SQL
Dockerfile instructions: FROM, LABEL, RUN, SHELL, ENV, EXPOSE, CMD
Installation methods: apt (system), pip (Python), wget (Quarto)
bash, git, Python, Quarto, SQL, and Jupyter tools
docker run
Sign in through the Docker Desktop application or on the website; it will open a web browser and ask you to log in
FROM image:tag
Create a file named Dockerfile (no extension) in your directory. Any editor will do! 😉
The file begins with FROM, followed by the base image
The tag after the colon specifies the image version. Here, we use Ubuntu 24.04
LABEL command
The LABEL instruction adds metadata to an image
It replaces the deprecated MAINTAINER instruction for specifying the maintainer of the image
Labels can be viewed with the docker inspect command
RUN command
RUN executes commands in a new layer on top of the current image
Use apt-get update (once), then apt-get install <package> to add software, just like on AWS
Example: apt-get update && apt-get install git -y
--no-install-recommends: skip unnecessary packages
-y: auto-answer “yes” to prompts
apt-get clean && rm -rf /var/lib/apt/lists/*: remove the package cache and the package lists after installing
RUN command
# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.3 \
    sqlite3=3.45.1-1ubuntu2.5 \
    libsqlite3-0=3.45.1-1ubuntu2.5 \
    wget=1.21.4-1ubuntu4.1 \
    python3.12=3.12.3-1ubuntu0.12 \
    python3.12-venv=3.12.3-1ubuntu0.12 \
    python3-pip=24.0+dfsg-1ubuntu1.3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
This installs bash, git, sqlite3, Python 3.12, and pip in the container, similar to what we did on AWS
Look up each package at https://packages.ubuntu.com/ for our release (24.04 LTS / noble) to find the package version
bash
Docker uses sh by default, but we want bash instead
sh (Bourne shell) is a simpler shell that doesn’t support all the features of bash
SHELL sets bash as the default for all subsequent RUN commands
-c stands for “command”: it tells bash to read the next argument as a command string (e.g., bash -c "echo hello")
This lets us run bash commands without specifying the shell every time
To pin versions, run apt list --installed and copy the packages to the Dockerfile
pip
We use RUN to install Python libraries (numpy, pandas, jupyterlab, dask, matplotlib) with pip3
To pin versions, run pip freeze > requirements.txt and copy them
python3 -m venv /opt/venv creates a virtual environment in /opt/venv
ENV PATH="/opt/venv/bin:$PATH" prepends /opt/venv/bin to PATH, so we can use the virtual environment’s Python and packages without specifying the full path
wget
We do not install Quarto with pip or apt, so we download the binary directly with wget
The download is a .deb file, which is Ubuntu’s package format
The ./ prefix tells apt-get to install the local file, not search the repositories
rm deletes the downloaded .deb file
apt-get clean removes the package cache
rm -rf /var/lib/apt/lists/* removes the list of available packages
We use the arm64 version here (Apple Silicon Macs). Windows and Intel Mac users: replace arm64 with amd64 in the URL
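The arm64/amd64 swap described above can also be automated. The following is a minimal sketch, not part of the lecture's Dockerfile: the helper name quarto_arch and the QUARTO_VERSION variable are assumptions introduced for illustration. It maps the machine type reported by uname -m to the suffix used in the Quarto .deb file names and prints the matching download URL.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the lecture): translate a machine type,
# as reported by `uname -m`, into the suffix used in Quarto's .deb file names.
quarto_arch() {
  case "$1" in
    aarch64|arm64) echo "arm64" ;;
    x86_64|amd64)  echo "amd64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

QUARTO_VERSION="1.9.37"  # the version used in the lecture's Dockerfile
# Print the download URL only when the current machine type is recognised
if ARCH="$(quarto_arch "$(uname -m)")"; then
  echo "https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-${ARCH}.deb"
fi
```

Inside a container on an Apple Silicon Mac, uname -m typically reports aarch64, so the sketch prints the arm64 URL. On most Windows and Intel machines it reports x86_64 and prints the amd64 URL.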
# Install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.9.37/quarto-1.9.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.9.37-linux-arm64.deb && \
    rm ./quarto-1.9.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
WORKDIR instruction
mkdir -p creates the directory (and any parent directories if needed)
WORKDIR sets the default directory for all subsequent commands (RUN, CMD, COPY, etc.)
We mount a volume at /workspace so that files persist even after the container stops
EXPOSE and CMD
EXPOSE 8888 tells Docker that the container listens on port 8888 (the default for JupyterLab)
EXPOSE alone does not publish the port: we still need -p 8888:8888 when running the container
CMD defines the command that runs when the container starts
--ip=0.0.0.0: listen on all network interfaces (not just localhost), so the host machine can reach it
--no-browser: don’t try to open a browser inside the container (there isn’t one!)
--allow-root: Docker containers run as root by default, and JupyterLab requires this flag to accept that
You can then use bash, git, sqlite3, and quarto directly from the browser 😉
# Base image
FROM ubuntu:24.04
# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in DATASCI 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"
# Update and install system dependencies
# We will see why we need to install libsqlite3-0 later
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.3 \
    sqlite3=3.45.1-1ubuntu2.5 \
    libsqlite3-0=3.45.1-1ubuntu2.5 \
    wget=1.21.4-1ubuntu4.1 \
    nano=7.2-2ubuntu0.1 \
    python3.12=3.12.3-1ubuntu0.12 \
    python3.12-venv=3.12.3-1ubuntu0.12 \
    python3-pip=24.0+dfsg-1ubuntu1.3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Set default shell to Bash
SHELL ["/bin/bash", "-c"]
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install Python dependencies in virtual environment
RUN pip install httpx==0.27.2 numpy==1.26.4 pandas==2.2.2 \
    jupyterlab==4.2.5 ipykernel==6.29.5 \
    dask==2024.5.0 matplotlib==3.9.2
# Install Quarto from the local .deb file (notice the "./" prefix) and clean up
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.9.37/quarto-1.9.37-linux-arm64.deb && \
    apt-get install -y ./quarto-1.9.37-linux-arm64.deb && \
    rm quarto-1.9.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace
# Expose port for JupyterLab
EXPOSE 8888
# Start JupyterLab
CMD . /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
docker build command
Build the image with docker build -t datasci350-container .
The -t flag is used to tag the image with a name, in this case datasci350-container
The . at the end of the command specifies the build context, which is the current directory
docker run command
[...]
4.585 Some packages could not be installed. This may mean that you have
4.585 requested an impossible situation or if you are using the unstable
4.585 distribution that some required packages have not yet been created
4.585 or been moved out of Incoming.
4.585 The following information may help to resolve the situation:
4.585
4.585 The following packages have unmet dependencies:
4.650 sqlite3 : Depends: libsqlite3-0 (= 3.45.1-1ubuntu2) but 3.45.1-1ubuntu2.1 is to be installed
4.651 E: Unable to correct problems, you have held broken packages.
The error comes from the sqlite3 package!
Search for libsqlite3-0 and check the available versions
Version 3.45.1-1ubuntu2 is available, so we can just add it to the apt-get install command
Use docker run with -p to map port 8888 (container) to port 8888 (host) and -v to mount a volume
-v mounts the current directory ($(pwd)) to /workspace in the container, so notebooks persist even after the container stops
Press Ctrl+C in the terminal to stop the container, or use docker ps to list running containers and docker stop <id> to stop one
Remove a container with docker rm and an image with docker rmi
docker tag and docker push
docker ps -a lists all containers (running and stopped). ps = “process status”; -a includes stopped ones
docker stop <id> stops a running container
docker rm <id> removes a container; docker rmi datasci350-container removes the image
docker system prune removes all stopped containers, unused networks, dangling images, and unused build cache (-a removes all unused images too). Warning: this is aggressive!
# On your local machine: upload the Docker image to Docker Hub
docker tag datasci350-container danilofreire/datasci350-container
docker push danilofreire/datasci350-container
# On your EC2 instance: install Docker, start the Docker service, and add your user to the docker group
sudo apt update && \
sudo apt install -y docker.io && \
sudo systemctl start docker && \
sudo usermod -aG docker $USER
# Log out and back in (or run: newgrp docker) so the new group membership takes effect
# Pull the Docker image from Docker Hub
docker pull danilofreire/datasci350-container
# Run the Docker container
docker run -p 8888:8888 -v "$(pwd)":/workspace danilofreire/datasci350-container
# Access JupyterLab via web browser
http://<EC2_PUBLIC_IP>:8888/lab?token=<TOKEN>
We built a container with bash, git, Python, Jupyter, Quarto, and SQLite
FROM set Ubuntu 24.04 as the base image
LABEL added metadata (version, description, maintainer, license)
RUN + apt-get installed system packages with pinned versions
SHELL switched the default shell from sh to bash
RUN + pip installed Python libraries inside a virtual environment
ENV prepended the virtual environment to PATH
wget downloaded Quarto as a .deb file, which apt-get then installed
EXPOSE documented port 8888 for JupyterLab (published with -p at run time)
CMD started the JupyterLab server
We built the image with docker build, ran it with docker run -p -v, and pushed it to Docker Hub
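The build-and-run workflow summarized above can be sketched as a small script. This is an illustration under stated assumptions rather than the lecture's own tooling: the run helper and the DRY_RUN variable are hypothetical names. By default the script only prints each docker command, so it is safe to try on a machine without Docker.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not from the lecture): print each command, and
# execute it only when DRY_RUN is explicitly set to 0.
IMAGE="datasci350-container"   # image name used with docker build -t
DRY_RUN="${DRY_RUN:-1}"

run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Build the image from the Dockerfile in the current directory
run docker build -t "$IMAGE" .
# Start JupyterLab, publishing port 8888 and mounting the current directory
run docker run -p 8888:8888 -v "$(pwd)":/workspace "$IMAGE"
# List all containers, including stopped ones
run docker ps -a
```

Running the script with DRY_RUN=0 executes the commands for real. In that case docker run stays in the foreground until JupyterLab is stopped with Ctrl+C, and docker ps -a runs afterwards.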