QTM 350 - Data Science Computing

Lecture 23 - Dependency Management, Virtual Environments, and Containers

Danilo Freire

Emory University

20 November, 2024

Hello again! 😊
How’s everything?

Brief recap of last class 📚

Parallel computing with Dask

  • We discussed how to parallelise the training of machine learning models
  • How to use automated machine learning (AutoML) tools to speed up the process
  • We also discussed how to implement different methods to search for the best hyperparameters using TPOT, scikit-learn, and Dask
  • Finally, we learnt about Dask Clusters, which allow us to scale our computations across multiple cores or machines

Today’s agenda 📅

Lecture outline

  • Today we will talk about a different topic: how to make sure your results are reproducible
  • We will discuss the importance of dependency management, virtual environments, and containers
  • Replication has been a recurring theme in this course, and today we will learn how to make it easier
  • That is the main reason why we use the command line, git, Quarto, Jupyter, and other tools
  • So today we will discuss some of the best practices to ensure computational reproducibility
  • We will also discuss how to use containers to make your code portable, reproducible, and scalable
  • Let’s get started! 🚀

Dependency management 📦

Congratulations! 🎉

  • You now have a project!
  • Your code works great, it runs pretty fast thanks to Dask, your Quarto reports are beautiful, and your analyses (all done in the command line) are stored in a well-documented GitHub repository
  • Are you done? 🤔
  • Not quite! 😅
  • What if you need to run your code on a different machine? Or share it with a colleague? Or run it again in a few months?
  • You need to make sure your code will run in the future, and that’s where dependency management comes in

Why do we need dependency management? 🤔

The problem

  • As we have seen in this course (and in many others), libraries and packages change constantly
  • New versions are released, old versions are deprecated, and not all operating systems have the required libraries installed to run your code
  • Even extremely simple code can break from one version of a library to the next
  • For example, this code, written in Python 2.7:
print "Hello, world!"
  • will not work in Python 3.x
print "Hello world!"
  File "/var/folders/96/r1yycxlj28958p1cdynhbyzw0000gn/T/Rtmpa0OGSM/chunk-code-b08d2b78904b.txt", line 1
    print "Hello world!"
                       ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Hello world!")?
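  • In Python 3, print is a function, so the parentheses are required; the same one-liner works once rewritten (run here from the shell, just to confirm):
python3 -c 'print("Hello, world!")'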

Some definitions

  • Dependency management is the process of ensuring that your code will run in the future
  • Dependencies are the external components necessary to run your code, such as libraries, packages, and software
  • Packages have a name, a type, a version, relevant files for the package’s functionality, and potentially dependencies on other packages
  • A package registry (such as CRAN, PyPI, or Conda) is a directory of packages that stores metadata about each of them
  • Dependency management tools help you manage your dependencies, such as pip and conda

The reproducibility crisis 🚨

  • We have already seen how important it is to ensure that your results are reproducible
  • The reproducibility crisis is a term used to describe the inability of researchers to replicate the results of a study
  • This affects computer science, statistics, and many other fields
  • Apart from statistical issues such as p-hacking, publication bias, and low statistical power, one of the main reasons for the reproducibility crisis is the lack of proper documentation and dependency management
  • While the statistical problems are a bit more complex, the latter can be easily solved with tools that we already have at our disposal
  • According to a recent Nature survey, more than 70% of scholars have tried and failed to reproduce another scientist’s research, and more than half have failed to reproduce their own research 😳
  • 90% of researchers believe that there is a reproducibility crisis in science

The reproducibility trade-off 🔄

  • How far should we go to ensure that our results are reproducible?
  • Due diligence starts with declaring dependencies
  • You can enforce your declared dependencies with a package/environment manager such as conda or pip
  • Packaging dependencies uses tools like renv (for R) or pipenv (for Python)
  • Online environments can be created for your work (in a relatively user-friendly way), such as Code Ocean ($$$) or Binder (free, but with several limitations)
  • Containers are awesome, and container tools like Docker and Singularity can be used to package your code and dependencies in a portable way
  • Which one should you use? It depends on your needs, but the more integrated your workflow is, the better!

How to declare dependencies? 📜

  • At the very least, you should declare (in your README.md) how your project works for you
    • What language, what version?
    • What packages/libraries do you load?
    • What OS do you use? (Does it work on your collaborator’s system?)
    • What data do you use?
    • What are the steps to reproduce your results?
  • However, it is better to use a package manager to declare your dependencies
  • A single file describing the necessary dependencies, which can be used to install all dependencies in one step
  • Store the file in the repository root (main folder)
  • This depends on the environment/package manager you want to use:
    • For conda (Python and R): generate an environment.yml file
    • For pip (Python only): generate a requirements.txt file
      • Generate it with pip freeze > requirements.txt
      • Install the declared dependencies with pip install -r requirements.txt
  • Let’s see some examples

Virtual environments 🌐

Example: environment.yml file

  • Conda environments allow you to create isolated environments with specific versions of packages
  • Each environment can have its own dependencies, and you can switch between them easily
  • As they are isolated from each other, you can have different versions of the same package in different environments, and you can share the environment file with others
  • They are probably the most user-friendly way to manage dependencies in Python, and are widely used in data science
  • Here’s how to create one:
# create a new environment with Python 3.12 and activate it
conda create --name qtm350 python=3.12 -y
conda activate qtm350

# install the packages we need
conda install pandas numpy matplotlib -y

# add a small test script inside the environment's folder
cd $(conda info --base)/envs/qtm350
mkdir -p scripts
cd scripts
echo "print('Hello, world!')" > hello.py
  • To create an environment.yml file, you can use the following command:
conda env export --name qtm350 --file environment.yml
  • This will create a file with the following content:
name: qtm350
channels:
  - defaults
dependencies:
  - blas=1.0=openblas
  - bottleneck=1.4.2=py312ha86b861_0
  - brotli=1.0.9=h80987f9_8
  - brotli-bin=1.0.9=h80987f9_8
  - bzip2=1.0.8=h80987f9_6
  - ca-certificates=2024.9.24=hca03da5_0
  - contourpy=1.2.0=py312h48ca7d4_0
  - cycler=0.11.0=pyhd3eb1b0_0
  - expat=2.6.3=h313beb8_0
  - fonttools=4.51.0=py312h80987f9_0
  - freetype=2.12.1=h1192e45_0
  - jpeg=9e=h80987f9_3
  - kiwisolver=1.4.4=py312h313beb8_0
  - lcms2=2.12=hba8e193_0
  - lerc=3.0=hc377ac9_0
  - libbrotlicommon=1.0.9=h80987f9_8
  - libbrotlidec=1.0.9=h80987f9_8
  - libbrotlienc=1.0.9=h80987f9_8
  - libcxx=14.0.6=h848a8c0_0
  - libdeflate=1.17=h80987f9_1
  - libffi=3.4.4=hca03da5_1
  - libgfortran=5.0.0=11_3_0_hca03da5_28
  - libgfortran5=11.3.0=h009349e_28
  - libopenblas=0.3.21=h269037a_0
  - libpng=1.6.39=h80987f9_0
  - libtiff=4.5.1=h313beb8_0
  - libwebp-base=1.3.2=h80987f9_1
  - llvm-openmp=14.0.6=hc6e5704_0
  - lz4-c=1.9.4=h313beb8_1
  - matplotlib=3.9.2=py312hca03da5_0
  - matplotlib-base=3.9.2=py312h2df2da3_0
  - ncurses=6.4=h313beb8_0
  - numexpr=2.10.1=py312h5d9532f_0
  - numpy=1.26.4=py312h7f4fdc5_0
  - numpy-base=1.26.4=py312he047099_0
  - openjpeg=2.5.2=h54b8e55_0
  - openssl=3.0.15=h80987f9_0
  - packaging=24.1=py312hca03da5_0
  - pandas=2.2.2=py312hd77ebd4_0
  - pillow=11.0.0=py312hfaf4e14_0
  - pip=24.2=py312hca03da5_0
  - pyparsing=3.2.0=py312hca03da5_0
  - python=3.12.7=h99e199e_0
  - python-dateutil=2.9.0post0=py312hca03da5_2
  - python-tzdata=2023.3=pyhd3eb1b0_0
  - pytz=2024.1=py312hca03da5_0
  - readline=8.2=h1a28f6b_0
  - setuptools=75.1.0=py312hca03da5_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.45.3=h80987f9_0
  - tk=8.6.14=h6ba3021_0
  - tornado=6.4.1=py312h80987f9_0
  - tzdata=2024b=h04d1e81_0
  - unicodedata2=15.1.0=py312h80987f9_0
  - wheel=0.44.0=py312hca03da5_0
  - xz=5.4.6=h80987f9_1
  - zlib=1.2.13=h18a0788_1
  - zstd=1.5.6=hfb09047_0
prefix: /opt/miniconda3/envs/qtm350
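  • One caveat: the exported file pins exact build strings (and a machine-specific prefix line), which are tied to the platform the environment was created on. If you only want the packages you explicitly asked for, conda can export a shorter, more portable file:
conda env export --from-history --name qtm350 --file environment.yml
  • The --from-history flag records just the packages you explicitly installed, letting conda resolve platform-appropriate versions on other systems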

Example: environment.yml file

  • So far, so good! But what if you want to share your environment with someone else?
  • You can do that by simply sharing the environment.yml file with them (or by uploading it to your repository!)
  • They can then create the same environment on their machine by running the following command:
conda env create --file environment.yml

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate qtm350
#
# To deactivate an active environment, use
#
#     $ conda deactivate

  • This will create a new environment called qtm350 with the same packages and versions as the original environment
  • They can then activate the environment and run the code
conda activate qtm350
python scripts/hello.py
  • To delete the environment, they can run:
conda deactivate
conda env remove --name qtm350
  • And you’re done! 🎉

Example: requirements.txt file

  • If you are using pip instead of conda, you can generate a requirements.txt file with the following command:
pip freeze > requirements.txt
  • This will create a file with the following content:
Bottleneck @ file:///private/var/folders/nz/j7p8yfhx1mv_0grj5xl4650h0000gp/T/abs_55txi4fy1u/croot/bottleneck_1731058642212/work
contourpy @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/contourpy_1701814001737/work
cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work
fonttools @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_60c8ux4mkl/croot/fonttools_1713551354374/work
kiwisolver @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/kiwisolver_1699239145780/work
matplotlib==3.9.2
numexpr @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_b3kvvt6tc6/croot/numexpr_1730215947700/work
numpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a51i_mbs7m/croot/numpy_and_numpy_base_1708638620867/work/dist/numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl#sha256=37afb6b734a197702d848df93bd67c10b52f6467d56e518950d84b6b1c949d27
packaging @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_81ri4yfpjw/croot/packaging_1720101866878/work
pandas @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_b53hgou29t/croot/pandas_1718308972393/work/dist/pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl#sha256=1956b71d1baac8b370fd9deac6100aadefda112447dca816a81ecbf3ea4eb3e6
pillow @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_92egn12how/croot/pillow_1731594702114/work
pyparsing @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_65qfw6vkxg/croot/pyparsing_1731445528142/work
python-dateutil @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_66ud1l42_h/croot/python-dateutil_1716495741162/work
pytz @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4b76c83ik/croot/pytz_1713974318928/work
setuptools==75.1.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
tornado @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4w03z48br/croot/tornado_1718740114858/work
tzdata @ file:///croot/python-tzdata_1690578112552/work
unicodedata2 @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a3epjto7gs/croot/unicodedata2_1713212955584/work
wheel==0.44.0
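  • Note that most entries above point to local file:// paths, an artifact of running pip freeze inside a conda environment; that exact file may not install cleanly on another machine. In that situation, a cleaner export is:
pip list --format=freeze > requirements.txt
  • This writes plain name==version lines instead of local paths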
  • To install the dependencies, you can run:
pip install -r requirements.txt
  • And that’s all there is to it! 😊
  • It works in a similar way to conda, and it is also widely used in the Python community
  • However, it is not as user-friendly as conda, and it does not manage environments as well: by itself, pip only installs packages into whatever environment is currently active (see the venv sketch below)
  • You can also use pipenv to manage your dependencies
  • Let’s see how it works in the next slide 🤓
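  • Before we move on: if you want isolation with plain pip, Python’s built-in venv module creates lightweight virtual environments. A minimal sketch:
python -m venv .venv                # create an isolated environment in .venv
source .venv/bin/activate           # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt     # install the declared dependencies into it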

pipenv 🐍

  • Pipenv is a package manager for Python that combines pip and virtualenv into a single tool
  • It automatically creates and manages a virtual environment for your projects, and it adds/removes packages from your Pipfile as you install/uninstall them
  • It works in a similar way to conda, so if you use one there is little reason to use the other
  • To install pipenv, you can run:
pip install pipenv
  • You can create a new environment by running:
pipenv install
  • To create a project folder and install some packages into it, type the following:
mkdir qtm350
cd qtm350
pipenv install numpy pandas
  • Now we just need to activate the environment:
pipenv shell
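  • You can also run a single command inside the environment without activating a shell first:
pipenv run python --version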
  • To generate a Pipfile.lock file, you can run:
pipenv lock
  • This will create a file with the following content:
{
    "_meta": {
        "hash": {
            "sha256": "4be66337fb4213c1154dc8bcb3176002ff54dbce0fbf6307684078957c86bda1"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.12"
        },
.....
  • To reproduce the environment on another machine, share the Pipfile and Pipfile.lock files and run:
pipenv install --ignore-pipfile
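  • And if they later want to remove the environment (mirroring conda env remove), they can run the following from the project folder:
pipenv --rm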
  • And that’s all you need to know about pipenv! 😊

Try it yourself! 🧪

  • Create a new conda environment called qtm350 (or any other name) with the following packages: numpy, pandas, and matplotlib
  • Use the commands shown in the previous slides
  • Create a new environment.yml file with the environment you just created
  • Let me know if you have any questions! 😊
  • The solution is in Appendix 01

Containers 🚢

The challenge

[Figure sequence: the “matrix from hell” of applications and frameworks versus hardware and environments; cargo transport pre-1960 as an analogous matrix; intermodal shipping containers as the solution; Docker as a container for your code, eliminating the matrix from hell]

What are software containers?

  • Containers are a way to package software in a format that can run on any system
  • They are similar to virtual machines (VMs), but they are more lightweight and portable
  • A virtual machine runs a full operating system, while a container runs only the necessary libraries and dependencies
  • They are also one step up from virtual environments, as they package the entire computer environment, not just the dependencies and code
  • That means that you can run your code on any system that has Docker installed, regardless of the operating system or the hardware
  • Talk about reproducibility! 😊
  • Containers are created using containerisation tools such as Docker or Singularity
  • They are usually just a stripped-down version of a Linux operating system with the necessary libraries and dependencies
  • Why Linux? Because it is open-source, and it is the most widely used operating system in the world (really!)
  • The Linux distribution of choice is usually Ubuntu or Alpine Linux, which is even more lightweight
  • Containers are usually stored in a container registry such as Docker Hub, so you can share them with others too

What is Docker? 🐳

  • Docker is the leading containerisation platform, especially in industry
  • Other software like Singularity is used in more niche applications, such as high-performance computing, but has a smaller user base
  • The main entities in Docker are containers, images, and registries
  • A container, as we have just seen, is an executable package of software that includes everything needed to run an application
  • An image is a snapshot of a container. Images are created with the build command, and they will produce a container when started with run. More about the distinction here
  • A registry is a collection of repositories from which you can pull images
  • Docker has very similar syntax to Git and Linux, so if you are familiar with the command line tools for them then most of Docker should seem somewhat natural (though you should still read the docs!)
  • To install Docker, you can follow the instructions here
  • Docker Desktop is the easiest way to get started with Docker on Windows and Mac, and it includes everything you need to run Docker on your machine
  • It is a good idea to create an account too, as you can use it to store your images in Docker Hub
  • After you have installed Docker, you can run the following command to check if it is working:
docker run hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
478afc919002: Download complete 
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
docker -v

Docker version 27.3.1, build ce12230

[Figures: the Docker Desktop interface and the Docker architecture]

Let’s create our first container 🐳

  • Docker can seem a little daunting at first
  • We need to create and configure a Dockerfile to build an image; this is a plain-text recipe that will build from scratch everything you need to recreate a project
  • It is just a text file that you can put under version control
  • Then we build a Docker image from the Dockerfile
  • However, in most cases you do not need to start from scratch
  • Docker Hub has thousands of images that you can use as a base for your project, and it is very likely that someone has already created an image similar to the one you need
  • For example, as of today there are 6,852 images of data science tools in Docker Hub
  • But we will build one today, just to see how it works 😊
  • Our next class will be a hands-on session on Docker, so we will have plenty of time to practice how to build containers from scratch or from existing images 😉

Dockerfile

  • A Dockerfile is a text file that contains all the commands a user could call on the command line to assemble an image
  • Using docker build, we can create an automated build that executes several command-line instructions in succession
  • Let’s first create a directory called docker and a file called Dockerfile inside it
mkdir docker
cd docker
touch Dockerfile
  • We will use an official Python image as a base and install some packages on top of it (see https://hub.docker.com/_/python)
  • Then we will write code that installs some Python packages and runs a Python script
  • The syntax is quite straightforward:
    • FROM <image_name>:<tag> is the base image, RUN is the command to run, and CMD is the command to run when the container starts
  • We will use only those three commands for now
  • The full list of commands is available here
# Use an official Python runtime as a parent image
FROM python:3

# Install libraries
RUN pip install numpy pandas matplotlib

# Add a script
RUN echo "print('Hello, QTM350!')" > hello.py

# Run the script
CMD ["python", "hello.py"]
  • We build the image with the following command (the . tells Docker to use the current directory as the build context):
docker build -t qtm350-example .
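  • To confirm that the build worked, you can list your local images:
docker images qtm350-example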

Let’s see what it looks like

Looks pretty good! 👍

Let’s run the container! 🏃‍♂️

Testing the container

  • To run the container, we can use the following command:
docker run qtm350-example

Hello, QTM350!
  • Woo-hoo! 🎉 😂
  • We have successfully created a container with Python, numpy, pandas, and matplotlib installed, and we have run a Python script

Upload to Docker Hub

  • If you want to share your image with others, you can upload it to Docker Hub
  • First, you need to log in to Docker Hub:
docker login
  • Then you can tag your image with your Docker Hub username:
docker tag qtm350-example danilofreire/qtm350-example:latest
  • And finally, you can push the image to Docker Hub:
docker push danilofreire/qtm350-example:latest

And here it is:

Link: https://hub.docker.com/r/danilofreire/qtm350-example

Docker pull

  • Now that the image is on Docker Hub, anyone can pull it and run it on their machine
  • To do that, they just need to run the following command:
docker pull danilofreire/qtm350-example:latest
  • And then they can run the container with:
docker run danilofreire/qtm350-example:latest
  • And that’s it! 😊
  • That’s how easy it is to share your work with others using Docker
  • You can also use Docker to run your code on a server, or to create a reproducible environment for your work (see the interactive-shell sketch below)
  • Although our example here is extremely simple, you can build anything you can imagine with Docker, from a simple web server to a complex data science pipeline and beyond
  • Hopefully, one day researchers will require that all code is shared in a Docker container 🤓
  • … and now you know how to do it!
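  • One last trick: you can open an interactive shell inside the container to poke around, instead of running its default command. A quick sketch, assuming the base image ships with bash (the official Debian-based python:3 image does):
docker run --rm -it danilofreire/qtm350-example:latest bash
  • The --rm flag removes the container when you exit, and -it gives you an interactive terminal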

Summary

  • Dependency management is more important than people think
  • It is the first step towards reproducibility and transparency
  • You can use conda, pip, or pipenv to manage your dependencies in Python
  • Docker offers a more comprehensive solution to the reproducibility crisis
  • None of them are particularly difficult to use, and they can save you a lot of time and headaches in the future
  • In the next class, we will have a hands-on session on Docker, so you can practice building containers and running them on your machine

And that’s all for today! 🎉

See you next time! 🚀

Appendix 01

  • Here is the solution to Exercise 01:
conda create --name qtm350 python=3.12 -y
conda activate qtm350
conda install numpy pandas matplotlib -y
conda env export --name qtm350 --file environment.yml

Back to Exercise 01