Lecture 24 - Docker for Data Science
In previous lectures, we managed Python environments with tools such as conda, pipenv, and virtualenv. In this lecture, we go one step further and build a Docker image that bundles a bash shell, a git client, a Python interpreter, a Jupyter notebook server, a Quarto document processor, a SQL database, and Jupyter tools, and then start it with the docker run command.
First, sign in on the Docker Desktop application or on the website; it will open a web browser and ask you to log in.

A Dockerfile is a text file named Dockerfile (without any extension) in your directory. It starts with the FROM instruction, followed by the base image we want to use, in the form FROM image:tag. The : character is used to specify the version of the image; in this case, we are using Ubuntu 24.04.

Next comes the LABEL command. The LABEL instruction adds metadata to an image, such as its version, description, and maintainer. (The older MAINTAINER instruction, which was used to specify the maintainer of the image, is deprecated in favour of a maintainer label.) You can view an image's labels with the docker inspect command.

Then we use the RUN command. The RUN instruction executes any commands in a new layer on top of the current image and commits the results. Remember apt? We will use it again to install software packages (as we did with AWS): run apt-get update (only once), then apt-get install <package>.
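As a quick sketch (assuming the image name qtm350-container, which we will build later in this lecture), this is how you could read those labels back once the image exists:

# Show all metadata for an image, including the labels set in the Dockerfile
docker inspect qtm350-container

# Print only the labels, using a Go template to filter the output
docker inspect --format '{{json .Config.Labels}}' qtm350-container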
For example, to install git, we would run apt-get update && apt-get install git -y. We add the --no-install-recommends flag to avoid installing unnecessary packages, and the -y flag is used to automatically answer yes to any prompts. It is also good practice to run apt-get clean and rm -rf /var/lib/apt/lists/* after installing the packages, which removes the package cache and keeps the image small.
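The next block pins exact package versions. As a sketch of one way to discover which versions are available (not necessarily the workflow used in class), you can query apt from an Ubuntu 24.04 machine or container:

# Refresh the package lists first
apt-get update

# Show the installed and candidate versions of a package
apt-cache policy git

# List every version available in the configured repositories
apt list -a git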
Putting this together in a single RUN command:

# Update and install dependencies
# Versions: https://packages.ubuntu.com/
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.3 \
    sqlite3=3.45.1-1ubuntu2 \
    libsqlite3-0=3.45.1-1ubuntu2 \
    python3.12=3.12.3-1ubuntu0.8 \
    python3.12-venv=3.12.3-1ubuntu0.8 \
    python3-pip=24.0+dfsg-1ubuntu1.3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

The pinned versions must match the Ubuntu release you are using (24.04 LTS, or noble, in this case); you can look them up at https://packages.ubuntu.com/. A handy shortcut is to run apt list --installed on a machine that already has the tools and copy the versions into the Dockerfile. If your base image already comes with Python and pip installed, you would only need to install the other packages.

Docker uses sh as the default shell for RUN commands, but we want to use bash instead. We can switch to bash by adding the following line to the Dockerfile: SHELL ["/bin/bash", "-c"]. The SHELL instruction sets bash as the default shell for subsequent RUN commands. The -c flag is required because it tells bash to read commands from a string (text), not from a file or standard input. This way, we can write bash commands in the Dockerfile without having to specify the shell every time.

Now for the Python setup with pip. We first create a virtual environment and then add RUN instructions to install a few Python libraries with pip3, such as numpy, pandas, jupyterlab, dask, and matplotlib; to pin their versions, you can run pip freeze > requirements.txt and copy the versions from the file. The -m flag runs the venv module as a script and creates the virtual environment in the /opt/venv directory, and ENV PATH="/opt/venv/bin:$PATH" prepends the directory /opt/venv/bin to the beginning of the existing PATH environment variable within the Docker image, so its python and pip are found first.

Finally, we install Quarto with wget. Quarto is not available via pip or apt, so we need to download it from the official website: https://quarto.org/docs/get-started/. wget is a command-line utility that allows you to download files from the web. Since Quarto ships as a .deb file (which is the package format for Ubuntu), we can install it with apt-get install <package> (like we did with the other packages) once we have downloaded it with wget, as long as we have the URL. The ./ prefix ensures that apt-get installs the local file, not a package from the repositories, and rm just removes the downloaded file after installation to keep the image size smaller.

JupyterLab runs on port 8888, so we will need to expose this port with the EXPOSE instruction. Once the container is running, you can even open bash inside the JupyterLab interface and have access to all the tools we installed in the container (like git, sqlite3, and Quarto) 😉

Here is the complete Dockerfile:

# Base image
FROM ubuntu:24.04
# Metadata
LABEL version="1.0"
LABEL description="Container with the tools covered in QTM 350"
LABEL maintainer="Danilo Freire <danilo.freire@emory.edu>"
LABEL license="MIT"
# Update and install system dependencies
# We will see why we need to install libsqlite3-0 later
RUN apt-get update && apt-get install -y --no-install-recommends \
    bash=5.2.21-2ubuntu4 \
    git=1:2.43.0-1ubuntu7.2 \
    sqlite3=3.45.1-1ubuntu2 \
    libsqlite3-0=3.45.1-1ubuntu2 \
    wget=1.21.4-1ubuntu4.1 \
    nano=7.2-2ubuntu0.1 \
    python3.12=3.12.3-1ubuntu0.5 \
    python3.12-venv=3.12.3-1ubuntu0.5 \
    python3-pip=24.0+dfsg-1ubuntu1.1 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Set default shell to Bash
SHELL ["/bin/bash", "-c"]
# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Install Python dependencies in virtual environment
RUN pip install httpx==0.27.2 numpy==1.26.4 pandas==2.2.2 \
    jupyterlab==4.2.5 ipykernel==6.29.5 \
    dask==2024.5.0 matplotlib==3.9.2
# Install Quarto
RUN wget https://github.com/quarto-dev/quarto-cli/releases/download/v1.6.37/quarto-1.6.37-linux-arm64.deb && \
    # Install the local deb file (notice the "./" prefix)
    apt-get install -y ./quarto-1.6.37-linux-arm64.deb && \
    rm quarto-1.6.37-linux-arm64.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Create a directory for saving files
RUN mkdir -p /workspace
WORKDIR /workspace
# Expose port for JupyterLab
EXPOSE 8888
# Start JupyterLab
CMD . /opt/venv/bin/activate && jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

We build the image with the docker build command, for example docker build -t qtm350-container . (the -t flag is used to tag the image with a name, in this case qtm350-container, and the . at the end of the command specifies the build context, which is the current directory). After that, we can start the container with the docker run command; a consolidated command summary appears at the end of this lecture. However, when I first built the image, apt-get complained:

[...]
4.585 Some packages could not be installed. This may mean that you have
4.585 requested an impossible situation or if you are using the unstable
4.585 distribution that some required packages have not yet been created
4.585 or been moved out of Incoming.
4.585 The following information may help to resolve the situation:
4.585
4.585 The following packages have unmet dependencies:
4.650 sqlite3 : Depends: libsqlite3-0 (= 3.45.1-1ubuntu2) but 3.45.1-1ubuntu2.1 is to be installed
4.651 E: Unable to correct problems, you have held broken packages.

The culprit is the sqlite3 package! It depends on an exact version of libsqlite3-0, so we search for libsqlite3-0 and check the available versions (for example on https://packages.ubuntu.com/ or with apt-cache policy, as shown earlier). Version 3.45.1-1ubuntu2 is available, so we can just add libsqlite3-0=3.45.1-1ubuntu2 to the apt-get install command, which is why it appears in the Dockerfile above.

With the image built, we start it with the docker run command: docker run -p 8888:8888 -v $(pwd):/workspace qtm350-container. The -p flag maps port 8888 of the container to port 8888 of the host machine, and the -v flag mounts a volume in the container so we can persist the notebooks outside the container; here it mounts the current directory ($(pwd)) to the /workspace directory in the container.

To stop the container, press Ctrl+C in the terminal where it is running, or use the docker ps command to see the list of running containers and then run the docker stop command with the container ID. You can remove a container with the docker rm command and the image with the docker rmi command, and you can share the image on Docker Hub with the docker tag and docker push commands.

In more detail: run docker ps -a to get the container ID or name (ps is short for "process status", and the -a flag shows all containers, including those that are stopped), then docker stop <container_id_or_name>, docker rm <container_id_or_name>, and docker rmi qtm350-container. You can also run docker system prune to remove all stopped containers, unused networks, dangling images, and the build cache (use the -a flag to remove all unused images too). Warning: this is a more aggressive cleanup command!

Finally, let us run the container on AWS. We push the image to Docker Hub, pull it on an EC2 instance, and open JupyterLab in the browser:

# On your local machine: upload the Docker image to Docker Hub
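# Note (assumption, not from the lecture): if this machine is not yet authenticated
# with Docker Hub, log in first with: docker login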
docker tag qtm350-container danilofreire/qtm350-container
docker push danilofreire/qtm350-container
# On your EC2 instance: install Docker, start the Docker service, and add your user to the docker group
sudo apt update && \
sudo apt install -y docker.io && \
sudo systemctl start docker && \
sudo usermod -aG docker $USER
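# Note: the group change only takes effect in a new login session
# (log out and back in, or run "newgrp docker")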
# Pull the Docker image from Docker Hub
docker pull danilofreire/qtm350-container
# Run the Docker container
docker run -p 8888:8888 -v $(pwd):/workspace danilofreire/qtm350-container
# Access JupyterLab via web browser
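# The <TOKEN> value is printed in JupyterLab's startup output; if needed,
# retrieve it with: docker logs <container_id_or_name>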
http://<EC2_PUBLIC_IP>:8888/lab?token=<TOKEN>

To sum up: we used the FROM instruction to specify the base image, then used the RUN instruction to install the system packages and the Python libraries. We used the ENV instruction to set the PATH environment variable, the EXPOSE instruction to expose the port for the Jupyter notebook server, and the CMD instruction to start the Jupyter notebook server. We also added metadata with the LABEL instructions. Finally, we built the image with the docker build command and ran it with the docker run command.
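For reference, here is a sketch of the full local workflow with the commands discussed above (the image and container names follow the ones used in this lecture):

# Build the image from the Dockerfile in the current directory
docker build -t qtm350-container .

# Run it, mapping port 8888 and mounting the current directory as /workspace
docker run -p 8888:8888 -v $(pwd):/workspace qtm350-container

# In another terminal: list containers (-a includes stopped ones)
docker ps -a

# Stop and remove a container, then remove the image
docker stop <container_id_or_name>
docker rm <container_id_or_name>
docker rmi qtm350-container

# Aggressive cleanup: stopped containers, unused networks, dangling images, build cache
docker system prune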