QTM 350 - Data Science Computing

Lecture 23 - Dependency Management, Virtual Environments, and Containers

Danilo Freire

Emory University

20 November, 2024

Hello again! 😊
How’s everything?

Brief recap of last class 📚

Parallel computing with Dask

  • We discussed how to parallelise the training of machine learning models
  • How to use automated machine learning (AutoML) tools to speed up the process
  • We also discussed how to implement different methods to search for the best hyperparameters using TPOT, scikit-learn, and Dask
  • Finally, we learnt about Dask Clusters, which allow us to scale our computations across multiple cores or machines

Today’s agenda 📅

Lecture outline

  • Today we will talk about a different topic: how to make sure your results are reproducible
  • We will discuss the importance of dependency management, virtual environments, and containers
  • Replication has been a recurring theme in this course, and today we will learn how to make it easier
  • That is the main reason why we use the command line, git, Quarto, Jupyter, and other tools
  • So today we will discuss some of the best practices to ensure computational reproducibility
  • We will also discuss how to use containers to make your code portable, reproducible, and scalable
  • Let’s get started! 🚀

Dependency management 📦

Congratulations! 🎉

  • You now have a project!
  • Your code works great, it runs pretty fast thanks to Dask, your Quarto reports are beautiful, and your analyses (all done in the command line) are stored in a well-documented GitHub repository
  • Are you done? 🤔
  • Not quite! 😅
  • What if you need to run your code on a different machine? Or share it with a colleague? Or run it again in a few months?
  • You need to make sure your code will run in the future, and that’s where dependency management comes in

Why do we need dependency management? 🤔

The problem

  • As we have seen in this course (and in many others), libraries and packages change constantly
  • New versions are released, old versions are deprecated, and not all operating systems have the required libraries installed to run your code
  • Even extremely simple code can break from one version of a library to the next
  • For example, this code, written in Python 2.7:
print "Hello, world!"
  • will not work in Python 3.x
print "Hello world!"
  File "/var/folders/96/r1yycxlj28958p1cdynhbyzw0000gn/T/Rtmpa0OGSM/chunk-code-b08d2b78904b.txt", line 1
    print "Hello world!"
                       ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Hello world!")?
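  • In Python 3, print is a function, so the parentheses are required; the same one-liner works once rewritten (run here from the shell, just to confirm):
python3 -c 'print("Hello, world!")'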

Some definitions

  • Dependency management is the process of ensuring that your code will run in the future
  • Dependencies are the external components necessary to run your code, such as libraries, packages, and software
  • Packages have a name, a type, a version, relevant files for the package’s functionality, and potentially dependencies on other packages
  • A package registry (such as CRAN, PyPI, or Conda) is a directory of packages that stores metadata about each of them
  • Dependency management tools help you manage your dependencies, such as pip and conda

The reproducibility crisis 🚨

  • We have already seen how important it is to ensure that your results are reproducible
  • The reproducibility crisis is a term used to describe the inability of researchers to replicate the results of a study
  • This affects computer science, statistics, and many other fields
  • Apart from statistical issues such as p-hacking, publication bias, and low statistical power, one of the main reasons for the reproducibility crisis is the lack of proper documentation and dependency management
  • While the statistical problems are a bit more complex, the latter can be easily solved with tools that we already have at our disposal
  • According to a recent Nature survey, more than 70% of scholars have tried and failed to reproduce another scientist’s research, and more than half have failed to reproduce their own research 😳
  • 90% of researchers believe that there is a reproducibility crisis in science

The reproducibility trade-off 🔄

  • How far should we go to ensure that our results are reproducible?
  • Due diligence starts with declaring dependencies
  • You can enforce your declared dependencies with a package/environment manager such as conda or pip
  • Packaging dependencies uses tools like renv (for R) or pipenv (for Python)
  • Online environments can be created for your work (in a relatively user-friendly way), such as Code Ocean ($$$) or Binder (free, but with several limitations)
  • Containers are awesome, and container tools like Docker and Singularity can be used to package your code and dependencies in a portable way
  • Which one should you use? It depends on your needs, but the more integrated your workflow is, the better!

How to declare dependencies? 📜

  • At the very least, you should declare (in your README.md) how your project works for you
    • What language, what version?
    • What packages/libraries do you load?
    • What OS do you use? (Does it work on your collaborator’s system?)
    • What data do you use?
    • What are the steps to reproduce your results?
  • However, it is better to use a package manager to declare your dependencies
  • A single file describing the necessary dependencies, which can be used to install all dependencies in one step
  • Store the file in the repository root (main folder)
  • This depends on the environment/package manager you want to use:
    • For conda (Python and R): generate an environment.yml file
    • For pip (Python only): generate a requirements.txt file
      • Generate it with pip freeze > requirements.txt
      • Install the declared dependencies with pip install -r requirements.txt
  • Let’s see some examples

Virtual environments 🌐

Example: environment.yml file

  • Conda environments allow you to create isolated environments with specific versions of packages
  • Each environment can have its own dependencies, and you can switch between them easily
  • As they are isolated from each other, you can have different versions of the same package in different environments, and you can share the environment file with others
  • They are probably the most user-friendly way to manage dependencies in Python, and are widely used in data science
  • Here’s how to create one:
# create a new environment with Python 3.12 and activate it
conda create --name qtm350 python=3.12 -y
conda activate qtm350

# install the packages we need
conda install pandas numpy matplotlib -y

# add a small test script inside the environment's folder
cd $(conda info --base)/envs/qtm350
mkdir -p scripts
cd scripts
echo "print('Hello, world!')" > hello.py
  • To create an environment.yml file, you can use the following command:
conda env export --name qtm350 --file environment.yml
  • This will create a file with the following content:
name: qtm350
channels:
  - defaults
dependencies:
  - blas=1.0=openblas
  - bottleneck=1.4.2=py312ha86b861_0
  - brotli=1.0.9=h80987f9_8
  - brotli-bin=1.0.9=h80987f9_8
  - bzip2=1.0.8=h80987f9_6
  - ca-certificates=2024.9.24=hca03da5_0
  - contourpy=1.2.0=py312h48ca7d4_0
  - cycler=0.11.0=pyhd3eb1b0_0
  - expat=2.6.3=h313beb8_0
  - fonttools=4.51.0=py312h80987f9_0
  - freetype=2.12.1=h1192e45_0
  - jpeg=9e=h80987f9_3
  - kiwisolver=1.4.4=py312h313beb8_0
  - lcms2=2.12=hba8e193_0
  - lerc=3.0=hc377ac9_0
  - libbrotlicommon=1.0.9=h80987f9_8
  - libbrotlidec=1.0.9=h80987f9_8
  - libbrotlienc=1.0.9=h80987f9_8
  - libcxx=14.0.6=h848a8c0_0
  - libdeflate=1.17=h80987f9_1
  - libffi=3.4.4=hca03da5_1
  - libgfortran=5.0.0=11_3_0_hca03da5_28
  - libgfortran5=11.3.0=h009349e_28
  - libopenblas=0.3.21=h269037a_0
  - libpng=1.6.39=h80987f9_0
  - libtiff=4.5.1=h313beb8_0
  - libwebp-base=1.3.2=h80987f9_1
  - llvm-openmp=14.0.6=hc6e5704_0
  - lz4-c=1.9.4=h313beb8_1
  - matplotlib=3.9.2=py312hca03da5_0
  - matplotlib-base=3.9.2=py312h2df2da3_0
  - ncurses=6.4=h313beb8_0
  - numexpr=2.10.1=py312h5d9532f_0
  - numpy=1.26.4=py312h7f4fdc5_0
  - numpy-base=1.26.4=py312he047099_0
  - openjpeg=2.5.2=h54b8e55_0
  - openssl=3.0.15=h80987f9_0
  - packaging=24.1=py312hca03da5_0
  - pandas=2.2.2=py312hd77ebd4_0
  - pillow=11.0.0=py312hfaf4e14_0
  - pip=24.2=py312hca03da5_0
  - pyparsing=3.2.0=py312hca03da5_0
  - python=3.12.7=h99e199e_0
  - python-dateutil=2.9.0post0=py312hca03da5_2
  - python-tzdata=2023.3=pyhd3eb1b0_0
  - pytz=2024.1=py312hca03da5_0
  - readline=8.2=h1a28f6b_0
  - setuptools=75.1.0=py312hca03da5_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.45.3=h80987f9_0
  - tk=8.6.14=h6ba3021_0
  - tornado=6.4.1=py312h80987f9_0
  - tzdata=2024b=h04d1e81_0
  - unicodedata2=15.1.0=py312h80987f9_0
  - wheel=0.44.0=py312hca03da5_0
  - xz=5.4.6=h80987f9_1
  - zlib=1.2.13=h18a0788_1
  - zstd=1.5.6=hfb09047_0
prefix: /opt/miniconda3/envs/qtm350
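  • One caveat: the exported file pins exact build strings (and a machine-specific prefix line), which are tied to the platform the environment was created on. If you only want the packages you explicitly asked for, conda can export a shorter, more portable file:
conda env export --from-history --name qtm350 --file environment.yml
  • The --from-history flag records just the packages you explicitly installed, letting conda resolve platform-appropriate versions on other systems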

Example: environment.yml file

  • So far, so good! But what if you want to share your environment with someone else?
  • You can do that by simply sharing the environment.yml file with them (or by uploading it to your repository!)
  • They can then create the same environment on their machine by running the following command:
conda env create --file environment.yml

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate qtm350
#
# To deactivate an active environment, use
#
#     $ conda deactivate

  • This will create a new environment called qtm350 with the same packages and versions as the original environment
  • They can then activate the environment and run the code
conda activate qtm350
python scripts/hello.py
  • To delete the environment, they can run:
conda deactivate
conda env remove --name qtm350
  • And you’re done! 🎉

Example: requirements.txt file

  • If you are using pip instead of conda, you can generate a requirements.txt file with the following command:
pip freeze > requirements.txt
  • This will create a file with the following content:
Bottleneck @ file:///private/var/folders/nz/j7p8yfhx1mv_0grj5xl4650h0000gp/T/abs_55txi4fy1u/croot/bottleneck_1731058642212/work
contourpy @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/contourpy_1701814001737/work
cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work
fonttools @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_60c8ux4mkl/croot/fonttools_1713551354374/work
kiwisolver @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/kiwisolver_1699239145780/work
matplotlib==3.9.2
numexpr @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_b3kvvt6tc6/croot/numexpr_1730215947700/work
numpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a51i_mbs7m/croot/numpy_and_numpy_base_1708638620867/work/dist/numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl#sha256=37afb6b734a197702d848df93bd67c10b52f6467d56e518950d84b6b1c949d27
packaging @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_81ri4yfpjw/croot/packaging_1720101866878/work
pandas @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_b53hgou29t/croot/pandas_1718308972393/work/dist/pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl#sha256=1956b71d1baac8b370fd9deac6100aadefda112447dca816a81ecbf3ea4eb3e6
pillow @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_92egn12how/croot/pillow_1731594702114/work
pyparsing @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_65qfw6vkxg/croot/pyparsing_1731445528142/work
python-dateutil @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_66ud1l42_h/croot/python-dateutil_1716495741162/work
pytz @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4b76c83ik/croot/pytz_1713974318928/work
setuptools==75.1.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
tornado @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4w03z48br/croot/tornado_1718740114858/work
tzdata @ file:///croot/python-tzdata_1690578112552/work
unicodedata2 @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a3epjto7gs/croot/unicodedata2_1713212955584/work
wheel==0.44.0
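  • Note that most entries above point to local file:// paths, an artifact of running pip freeze inside a conda environment; that exact file may not install cleanly on another machine. In that situation, a cleaner export is:
pip list --format=freeze > requirements.txt
  • This writes plain name==version lines instead of local paths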
  • To install the dependencies, you can run:
pip install -r requirements.txt
  • And that’s all there is to it! 😊
  • It works in a similar way to conda, and it is also widely used in the Python community
  • However, it is not as user-friendly as conda, and it does not manage environments as well: by itself, pip only installs packages into whatever environment is currently active (see the venv sketch below)
  • You can also use pipenv to manage your dependencies
  • Let’s see how it works in the next slide 🤓
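  • Before we move on: if you want isolation with plain pip, Python’s built-in venv module creates lightweight virtual environments. A minimal sketch:
python -m venv .venv                # create an isolated environment in .venv
source .venv/bin/activate           # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt     # install the declared dependencies into it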

pipenv 🐍

  • Pipenv is a package manager for Python that combines pip and virtualenv into a single tool
  • It automatically creates and manages a virtual environment for your projects, and it adds/removes packages from your Pipfile as you install/uninstall them
  • It works in a similar way to conda, so if you use one there is little reason to use the other
  • To install pipenv, you can run:
pip install pipenv
  • You can create a new environment by running:
pipenv install
  • To create a project folder and install some packages into it, type the following:
mkdir qtm350
cd qtm350
pipenv install numpy pandas
  • Now we just need to activate the environment:
pipenv shell
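  • You can also run a single command inside the environment without activating a shell first:
pipenv run python --version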
  • To generate a Pipfile.lock file, you can run:
pipenv lock
  • This will create a file with the following content:
{
    "_meta": {
        "hash": {
            "sha256": "4be66337fb4213c1154dc8bcb3176002ff54dbce0fbf6307684078957c86bda1"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.12"
        },
.....
  • To reproduce the environment on another machine, share the Pipfile and Pipfile.lock files and run:
pipenv install --ignore-pipfile
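  • And if they later want to remove the environment (mirroring conda env remove), they can run the following from the project folder:
pipenv --rm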
  • And that’s all you need to know about pipenv! 😊

Try it yourself! 🧪

  • Create a new conda environment called qtm350 (or any other name) with the following packages: numpy, pandas, and matplotlib
  • Use the commands shown in the previous slides
  • Create a new environment.yml file with the environment you just created
  • Let me know if you have any questions! 😊
  • The solution is in Appendix 01

Containers 🚢

The challenge

[Figure sequence: the “matrix from hell” of applications and frameworks versus hardware and environments; cargo transport pre-1960 as an analogous matrix; intermodal shipping containers as the solution; Docker as a container for your code, eliminating the matrix from hell]

What are software containers?

  • Containers are a way to package software in a format that can run on any system
  • They are similar to virtual machines (VMs), but they are more lightweight and portable
  • A virtual machine runs a full operating system, while a container runs only the necessary libraries and dependencies
  • They are also one step up from virtual environments, as they package the entire computer environment, not just the dependencies and code
  • That means that you can run your code on any system that has Docker installed, regardless of the operating system or the hardware
  • Talk about reproducibility! 😊
  • Containers are created using containerisation tools such as Docker or Singularity
  • They are usually just a stripped-down version of a Linux operating system with the necessary libraries and dependencies
  • Why Linux? Because it is open-source, and it is the most widely used operating system in the world (really!)
  • The Linux distribution of choice is usually Ubuntu or Alpine Linux, which is even more lightweight
  • Containers are usually stored in a container registry such as Docker Hub, so you can share them with others too

What is Docker? 🐳

  • Docker is the leading containerisation platform, especially in industry
  • Other software like Singularity is used in more niche applications, such as high-performance computing, but has a smaller user base
  • The main entities in Docker are containers, images, and registries
  • A container, as we have just seen, is an executable package of software that includes everything needed to run an application
  • An image is a snapshot of a container. Images are created with the build command, and they will produce a container when started with run. More about the distinction here
  • A registry is a collection of repositories from which you can pull images
  • Docker has very similar syntax to Git and Linux, so if you are familiar with the command line tools for them then most of Docker should seem somewhat natural (though you should still read the docs!)
  • To install Docker, you can follow the instructions here
  • Docker Desktop is the easiest way to get started with Docker on Windows and Mac, and it includes everything you need to run Docker on your machine
  • It is a good idea to create an account too, as you can use it to store your images in Docker Hub
  • After you have installed Docker, you can run the following command to check if it is working:
docker run hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
478afc919002: Download complete 
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
docker -v

Docker version 27.3.1, build ce12230

[Figures: the Docker Desktop interface and the Docker architecture]

Let’s create our first container 🐳

  • Docker can seem a little daunting at first
  • We need to create and configure a Dockerfile to build an image; this is a plain-text recipe that will build from scratch everything you need to recreate a project
  • It is just a text file that you can put under version control
  • Then we build a Docker image from the Dockerfile
  • However, in most cases you do not need to start from scratch
  • Docker Hub has thousands of images that you can use as a base for your project, and it is very likely that someone has already created an image similar to the one you need
  • For example, as of today there are 6,852 images of data science tools in Docker Hub
  • But we will build one today, just to see how it works 😊
  • Our next class will be a hands-on session on Docker, so we will have plenty of time to practice how to build containers from scratch or from existing images 😉

Dockerfile

  • A Dockerfile is a text file that contains all the commands a user could call on the command line to assemble an image
  • Using docker build, we can create an automated build that executes several command-line instructions in succession
  • Let’s first create a directory called docker and a file called Dockerfile inside it
mkdir docker
cd docker
touch Dockerfile
  • We will use an official Python image as a base and install some packages on top of it (see https://hub.docker.com/_/python)
  • Then we will write code that installs some Python packages and runs a Python script
  • The syntax is quite straightforward:
    • FROM <image_name>:<tag> is the base image, RUN is the command to run, and CMD is the command to run when the container starts
  • We will use only those three commands for now
  • The full list of commands is available here
# Use an official Python runtime as a parent image
FROM python:3

# Install libraries
RUN pip install numpy pandas matplotlib

# Add a script
RUN echo "print('Hello, QTM350!')" > hello.py

# Run the script
CMD ["python", "hello.py"]
  • We build the image with the following command (the . tells Docker to use the current directory as the build context):
docker build -t qtm350-example .
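  • To confirm that the build worked, you can list your local images:
docker images qtm350-example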

Let’s see what it looks like

Looks pretty good! 👍

Let’s run the container! 🏃‍♂️

Testing the container

  • To run the container, we can use the following command:
docker run qtm350-example

Hello, QTM350!
  • Woo-hoo! 🎉 😂
  • We have successfully created a container with Python, numpy, pandas, and matplotlib installed, and we have run a Python script

Upload to Docker Hub

  • If you want to share your image with others, you can upload it to Docker Hub
  • First, you need to log in to Docker Hub:
docker login
  • Then you can tag your image with your Docker Hub username:
docker tag qtm350-example danilofreire/qtm350-example:latest
  • And finally, you can push the image to Docker Hub:
docker push danilofreire/qtm350-example:latest

And here it is:

Link: https://hub.docker.com/r/danilofreire/qtm350-example

Docker pull

  • Now that the image is on Docker Hub, anyone can pull it and run it on their machine
  • To do that, they just need to run the following command:
docker pull danilofreire/qtm350-example:latest
  • And then they can run the container with:
docker run danilofreire/qtm350-example:latest
  • And that’s it! 😊
  • That’s how easy it is to share your work with others using Docker
  • You can also use Docker to run your code on a server, or to create a reproducible environment for your work (see the interactive-shell sketch below)
  • Although our example here is extremely simple, you can build anything you can imagine with Docker, from a simple web server to a complex data science pipeline and beyond
  • Hopefully, one day researchers will require that all code is shared in a Docker container 🤓
  • … and now you know how to do it!
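  • One last trick: you can open an interactive shell inside the container to poke around, instead of running its default command. A quick sketch, assuming the base image ships with bash (the official Debian-based python:3 image does):
docker run --rm -it danilofreire/qtm350-example:latest bash
  • The --rm flag removes the container when you exit, and -it gives you an interactive terminal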

Summary

  • Dependency management is more important than people think
  • It is the first step towards reproducibility and transparency
  • You can use conda, pip, or pipenv to manage your dependencies in Python
  • Docker offers a more comprehensive solution to the reproducibility crisis
  • None of them are particularly difficult to use, and they can save you a lot of time and headaches in the future
  • In the next class, we will have a hands-on session on Docker, so you can practice building containers and running them on your machine

And that’s all for today! 🎉

See you next time! 🚀

Appendix 01

  • Here is the solution to Exercise 01:
conda create --name qtm350 python=3.12 -y
conda activate qtm350
conda install numpy pandas matplotlib -y
conda env export --name qtm350 --file environment.yml

Back to Exercise 01