DATASCI 350 - Data Science Computing

Lecture 23 - Dependency Management, Virtual Environments, and Containers

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello again! 😊

Brief recap of last class 📚

Parallelising data analysis with Dask and AutoML

  • Set up local Dask Clusters with workers, schedulers, and dashboards
  • Workers compute tasks and store results; the scheduler coordinates them
  • Used Dask ML for distributed machine learning and model training
  • Parallelised hyperparameter tuning with IncrementalSearchCV, HyperbandSearchCV, and TPOT
  • Handled async/sync compatibility issues and cached computations
  • Measured speedup and efficiency gains from parallel execution

Today’s agenda 📅

Lecture outline

  • New topic today: how do we make sure our results are reproducible?
  • Replication has been a recurring theme in this course, and that is why we use the command line, git, Quarto, and Jupyter
  • Today we cover dependency management, virtual environments, and containers
  • We will also learn how to use Docker to make code portable and reproducible
  • Let’s get started! 🚀

Dependency management 📦

Congratulations! 🎉

  • You now have a project!
  • Your code works great, it runs pretty fast thanks to Dask, your Quarto reports are beautiful, and your analyses (all done in the command line) are stored in a well-documented GitHub repository 😁
  • Are you done? 🤔
  • Not quite! 😅
  • What if you need to run your code on a different machine? Or share it with a colleague? Or run it again in a few months?
  • You need to make sure your code will run in the future, and that’s where dependency management comes in

Why do we need dependency management? 🤔

The problem

  • Libraries and packages change constantly: new versions appear, old ones are deprecated, and different OSes ship different libraries
  • Even simple code can break between versions. This Python 2.7 code:
print "Hello, world!"
  • does not work in Python 3.x:
print "Hello, world!"
  File "<stdin>", line 1
    print "Hello, world!"
                         ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Hello, world!")?
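Because the interpreter version itself is a dependency, scripts can check it explicitly and fail fast with a readable message instead of a cryptic SyntaxError. A minimal sketch (the `require_python` helper is illustrative, not a standard function):

```python
import sys

def require_python(minimum=(3, 0)):
    """Raise a clear error if the running interpreter is older than `minimum`."""
    if sys.version_info < minimum:
        raise RuntimeError("This script requires Python %d.%d or newer" % minimum)

require_python()            # fails fast on an old interpreter
print("Hello, world!")
```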

Some definitions

  • Dependency management: making sure your code will run in the future
  • Dependencies: external libraries, packages, and software your code needs
  • Packages: have a name, version, and possibly their own dependencies
  • Package registry: a directory that stores packages and metadata, e.g., CRAN, PyPI, Conda
  • Dependency management tools: pip, conda, and others help you track and install dependencies

The reproducibility crisis 🚨

  • Reproducibility crisis: many researchers cannot replicate published results
  • Affects CS, statistics, psychology, medicine, and other fields
  • Some causes are statistical (p-hacking, publication bias, low power)
  • But poor documentation and dependency management are just as common
  • The good news: the computational side can be solved with tools we already have
  • A Nature survey found that 70%+ of scholars failed to reproduce others’ work
  • Over half could not reproduce their own results 😳

The reproducibility trade-off

  • How far should we go to ensure reproducibility? It depends on your needs
  • At minimum, declare your dependencies so others know what you used
  • Manage them with a package/environment manager (conda, pip, uv)
  • Package them with tools like renv (R) or uv (Python) for version pinning
  • Host them online with Code Ocean ($$$) or Binder (free but limited)
  • Go further with containers (Docker, Singularity) for full portability
  • The more integrated your workflow, the better!

How to declare dependencies?

  • Say you have an analysis using Python, NumPy, pandas, and matplotlib. Ask yourself:
    • What packages do I need?
    • Will this work on my collaborator’s OS?
    • Can someone else reproduce my results?
  • Use a package manager to declare your dependencies in a single file
  • Store it in the repository root so collaborators can install everything in one step
  • Which file depends on your package manager:
    • conda (Python and R): environment.yml
    • pip (Python only): requirements.txt
    • uv (Python only): pyproject.toml + uv.lock
  • Let’s see some examples

Virtual environments 🌐

Conda: creating environments

  • Conda environments let you create isolated environments with specific versions of packages
  • Each environment has its own dependencies, and you can switch between them easily
  • You can have different versions of the same package in different environments
  • Probably the most user-friendly way to manage dependencies in Python
  • Here’s how to create one:
conda create --name datasci350 python=3.12 -y
conda activate datasci350
conda install numpy matplotlib pandas -y

# Or in one line:
conda create --name datasci350 python=3.12 numpy matplotlib pandas -y

# Check current environment and installed packages
conda info --envs && conda list
  • You can also create files inside the environment directory, e.g., a scripts folder (but note that such files are not exported and will not be shared with others)
cd $(conda info --base)/envs/datasci350
mkdir scripts
echo "print('Hello, world!')" > scripts/hello.py

Conda: environment.yml file

  • To create an environment.yml file, run:
conda env export --name datasci350 --file ~/Desktop/environment.yml
  • This will create a file with the following content:
name: datasci350
channels:
  - defaults
dependencies:
  - blas=1.0=openblas
  - bottleneck=1.4.2=py312ha86b861_0
  - brotli=1.0.9=h80987f9_8
  - brotli-bin=1.0.9=h80987f9_8
  - bzip2=1.0.8=h80987f9_6
  - ca-certificates=2024.9.24=hca03da5_0
  - contourpy=1.2.0=py312h48ca7d4_0
  - cycler=0.11.0=pyhd3eb1b0_0
  - expat=2.6.3=h313beb8_0
  - fonttools=4.51.0=py312h80987f9_0
  - freetype=2.12.1=h1192e45_0
  - jpeg=9e=h80987f9_3
  - kiwisolver=1.4.4=py312h313beb8_0
  - lcms2=2.12=hba8e193_0
  - lerc=3.0=hc377ac9_0
  - libbrotlicommon=1.0.9=h80987f9_8
  - libbrotlidec=1.0.9=h80987f9_8
  - libbrotlienc=1.0.9=h80987f9_8
  - libcxx=14.0.6=h848a8c0_0
  - libdeflate=1.17=h80987f9_1
  - libffi=3.4.4=hca03da5_1
  - libgfortran=5.0.0=11_3_0_hca03da5_28
  - libgfortran5=11.3.0=h009349e_28
  - libopenblas=0.3.21=h269037a_0
  - libpng=1.6.39=h80987f9_0
  - libtiff=4.5.1=h313beb8_0
  - libwebp-base=1.3.2=h80987f9_1
  - llvm-openmp=14.0.6=hc6e5704_0
  - lz4-c=1.9.4=h313beb8_1
  - matplotlib=3.9.2=py312hca03da5_0
  - matplotlib-base=3.9.2=py312h2df2da3_0
  - ncurses=6.4=h313beb8_0
  - numexpr=2.10.1=py312h5d9532f_0
  - numpy=1.26.4=py312h7f4fdc5_0
  - numpy-base=1.26.4=py312he047099_0
  - openjpeg=2.5.2=h54b8e55_0
  - openssl=3.0.15=h80987f9_0
  - packaging=24.1=py312hca03da5_0
  - pandas=2.2.2=py312hd77ebd4_0
  - pillow=11.0.0=py312hfaf4e14_0
  - pip=24.2=py312hca03da5_0
  - pyparsing=3.2.0=py312hca03da5_0
  - python=3.12.7=h99e199e_0
  - python-dateutil=2.9.0post0=py312hca03da5_2
  - python-tzdata=2023.3=pyhd3eb1b0_0
  - pytz=2024.1=py312hca03da5_0
  - readline=8.2=h1a28f6b_0
  - setuptools=75.1.0=py312hca03da5_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.45.3=h80987f9_0
  - tk=8.6.14=h6ba3021_0
  - tornado=6.4.1=py312h80987f9_0
  - tzdata=2024b=h04d1e81_0
  - unicodedata2=15.1.0=py312h80987f9_0
  - wheel=0.44.0=py312hca03da5_0
  - xz=5.4.6=h80987f9_1
  - zlib=1.2.13=h18a0788_1
  - zstd=1.5.6=hfb09047_0
prefix: /opt/miniconda3/envs/datasci350

What are build strings?

  • A conda build is a specific compiled binary of a package
  • The same version number can be compiled differently for different platforms, Python versions, or feature sets
  • A full conda package specification:
numpy=1.21.5=py39h12345_0
└──┬─┘└─┬──┘ └────┬───┘└┬┘
   │    │         │     │
   │    │         │     └─ Build number (0)
   │    │         └─────── Build string (py39h12345)
   │    └───────────────── Version (1.21.5)
   └────────────────────── Package name (numpy)
  • The --no-builds flag strips build info, making environment files portable across platforms
  • The build string (py39h12345) encodes metadata:
    • Python version (py39, py310)
    • Architecture (linux_64, osx_arm64)
    • Compiler (gcc9, clang)
    • Features (nomkl, cuda)
  • Why do builds exist?
    • Software is compiled differently per OS
    • Packages target specific Python versions
    • Builds can enable or disable optional features
    • Different compilers produce different binaries
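Since the full spec is just name=version=build, it is easy to take apart programmatically. A quick illustrative parser (`parse_conda_spec` is a made-up helper, not part of conda):

```python
def parse_conda_spec(spec):
    """Split a full conda package spec into its three parts."""
    name, version, build = spec.split("=")
    # The trailing _N of the build field is the build number.
    build_string, _, build_number = build.rpartition("_")
    return {"name": name, "version": version,
            "build_string": build_string, "build_number": int(build_number)}

print(parse_conda_spec("numpy=1.21.5=py39h12345_0"))
# → {'name': 'numpy', 'version': '1.21.5', 'build_string': 'py39h12345', 'build_number': 0}
```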

Conda: environment.yml file

  • To share your environment, upload the environment.yml file to your repository
  • Others can recreate the same environment by running:
conda env create --file environment.yml

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate datasci350
#
# To deactivate an active environment, use
#
#     $ conda deactivate

  • This creates a datasci350 environment with the same packages and versions
  • Activate the environment and run your code:
conda activate datasci350
python scripts/hello.py
  • To delete the environment:
conda deactivate
conda env remove --name datasci350
  • And you’re done! 🎉

Pip: requirements.txt file

  • If you are using pip instead of conda, you can generate a requirements.txt file with the following command:
pip freeze > requirements.txt
  • This will create a file with the following content:
Bottleneck @ file:///private/var/folders/nz/j7p8yfhx1mv_0grj5xl4650h0000gp/T/abs_55txi4fy1u/croot/bottleneck_1731058642212/work
contourpy @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/contourpy_1701814001737/work
cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work
fonttools @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_60c8ux4mkl/croot/fonttools_1713551354374/work
kiwisolver @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/kiwisolver_1699239145780/work
matplotlib==3.9.2
numexpr @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_b3kvvt6tc6/croot/numexpr_1730215947700/work
numpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a51i_mbs7m/croot/numpy_and_numpy_base_1708638620867/work/dist/numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl#sha256=37afb6b734a197702d848df93bd67c10b52f6467d56e518950d84b6b1c949d27
packaging @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_81ri4yfpjw/croot/packaging_1720101866878/work
pandas @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_b53hgou29t/croot/pandas_1718308972393/work/dist/pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl#sha256=1956b71d1baac8b370fd9deac6100aadefda112447dca816a81ecbf3ea4eb3e6
pillow @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_92egn12how/croot/pillow_1731594702114/work
pyparsing @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_65qfw6vkxg/croot/pyparsing_1731445528142/work
python-dateutil @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_66ud1l42_h/croot/python-dateutil_1716495741162/work
pytz @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4b76c83ik/croot/pytz_1713974318928/work
setuptools==75.1.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
tornado @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4w03z48br/croot/tornado_1718740114858/work
tzdata @ file:///croot/python-tzdata_1690578112552/work
unicodedata2 @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a3epjto7gs/croot/unicodedata2_1713212955584/work
wheel==0.44.0

Making requirements.txt more portable

  • However, pip freeze often includes local file paths that won’t work on other machines
  • To create a truly portable requirements.txt, we need to remove these local paths
  • The easiest way is to use grep to filter out lines containing file://
# Remove local file paths from requirements.txt
pip freeze | grep -v 'file://' > clean-requirements.txt
  • What this command does:
    • pip freeze: Lists all installed packages
    • grep -v 'file://': Excludes lines containing file://
    • > clean-requirements.txt: Saves the result to a new file
  • Before:
Bottleneck @ file:///private/var/folders/.../bottleneck_1731058642212/work
contourpy @ file:///Users/builder/.../contourpy_1701814001737/work
matplotlib==3.9.2
numpy @ file:///tmp/build/.../numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl
pandas==2.2.2
  • After (portable):
matplotlib==3.9.2
pandas==2.2.2
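One caveat: grep -v removes those packages from the file entirely. An alternative sketch in Python (illustrative, not a pip feature) keeps each package name and drops only the machine-specific path — note that the version pin is lost in the process:

```python
def clean_requirements(lines):
    """Replace `pkg @ file://...` entries with the bare package name;
    leave normal `pkg==version` pins untouched."""
    cleaned = []
    for line in lines:
        if " @ file://" in line:
            cleaned.append(line.split(" @ ")[0])  # keep only the name
        else:
            cleaned.append(line)
    return cleaned

reqs = ["Bottleneck @ file:///private/var/folders/abc/work",
        "matplotlib==3.9.2"]
print(clean_requirements(reqs))  # → ['Bottleneck', 'matplotlib==3.9.2']
```

If you need exact versions as well, `pip list --format=freeze` prints plain `pkg==version` pins without the local paths.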

Pip: requirements.txt file

  • To install the dependencies, you can run:
pip install -r clean-requirements.txt  # -r reads package names from the file
  • And that’s all there is to it! 😊
  • It works in a similar way to conda, and it is also widely used in the Python community
  • It is recommended to create a virtual environment (e.g., with venv or virtualenv) before installing packages with pip to avoid conflicts with system packages
  • However, it is less user-friendly than conda, and it does not manage environments at all: it only installs packages into whichever environment is currently active
  • You can also use uv to manage your dependencies
  • Let’s see how it works in the next slide 🤓

uv: a fast Python package manager

  • uv is a modern, very fast Python package and project manager built by Astral (the creators of Ruff)
  • Written in Rust, it replaces pip, pip-tools, pipenv, poetry, virtualenv, and pyenv in a single tool!
  • 10-100x faster than pip for installing packages 🏎️
  • Installs Python itself, so you don’t need a separate Python installer
  • Works on macOS, Linux, and Windows
# Install uv (macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Check installation
uv --version
  • Create a new project with uv init:
# Create a new project
uv init datasci350-project
cd datasci350-project

# Add dependencies
uv add numpy pandas matplotlib

# Run a script (uv handles the environment)
uv run python script.py

# Add development dependencies
uv add --dev pytest ruff
  • uv init creates a pyproject.toml file with your project metadata and dependencies
  • uv add installs packages and updates the lock file (uv.lock) automatically
  • --dev flag adds packages to a separate development group (e.g., for testing and linting)
  • uv run executes commands inside the project environment without needing to activate it

uv: project files and sharing

  • A uv project has two key files:
    • pyproject.toml: your direct dependencies (human-readable)
    • uv.lock: exact versions for reproducibility (auto-generated)
  • TOML (Tom’s Obvious, Minimal Language) is a simple config format with key-value pairs and [sections], similar to YAML. It was created by Tom Preston-Werner (co-founder of GitHub)
  • Example pyproject.toml:
[project]
name = "datasci350-project"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26.0",
    "pandas>=2.2.0",
    "matplotlib>=3.9.0",
]

[dependency-groups]
dev = ["pytest>=8.0", "ruff>=0.4"]
  • To share your environment with others:
  1. Push pyproject.toml and uv.lock to your repository
  2. Your collaborator runs uv sync, and they are done! 🎉
# Clone the repo and sync dependencies
git clone https://github.com/user/project.git
cd project
uv sync

# Run the project
uv run python analysis.py
  • uv can also manage Python versions directly:
# Install a specific Python version
uv python install 3.12

# Pin Python version for the project
uv python pin 3.12

Containers 🚢

The challenge

  • Before containers, deployment was a “matrix from hell”: every application had to be made to work on every environment it might run in
  • Cargo transport faced the same matrix before 1960: every type of goods had to be handled separately for every mode of transport
  • The solution was the intermodal container: one standard box that works on ships, trains, and trucks alike
  • Docker is a container for your code: one standard unit that runs anywhere, eliminating the matrix from hell

What are software containers?

  • Containers package software so it can run on any system
  • Similar to virtual machines (VMs), but lighter: a VM runs a full OS, while a container includes only the libraries needed for the application
  • One step up from virtual environments: containers package the entire computer environment, not just dependencies and code
  • Your code runs on any system with Docker installed, regardless of OS or hardware

What is Docker? 🐳

  • Docker is the leading containerisation platform
  • Three main concepts:
    • Image: a read-only template (like a recipe) that contains the OS, libraries, code, and configuration. You build images from a Dockerfile
    • Container: a running instance of an image. You can run many containers from the same image
    • Registry: a collection of repositories from which you pull images (e.g., Docker Hub)
  • Think of it this way: an image is a class, a container is an object
  • The syntax is similar to Git and Linux, so it should feel familiar
docker info
docker run hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
478afc919002: Download complete 
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
docker -v

Docker version 28.5.2, build ecc6942

Docker Desktop

Docker architecture

Let’s create our first container 🐳

  • A Dockerfile is a plain-text recipe that builds everything you need to recreate a project
  • It is just a text file under version control, usually called Dockerfile (no extension)
  • You rarely need to start from scratch: Docker Hub has thousands of base images you can build on
  • For example, there are ~9,000 images of data science tools on Docker Hub
  • We will build one today just to see how it works 😊
  • Next class will be a hands-on Docker session 😉

Dockerfile

  • A Dockerfile lists all the commands needed to assemble an image
  • docker build runs these instructions in sequence, layer by layer
  • Each instruction (e.g., RUN, COPY) adds a new layer; Docker caches layers so unchanged steps are not re-run on rebuild
  • Let’s create a directory and a Dockerfile:
mkdir docker
cd docker
touch Dockerfile
  • FROM sets the base image, RUN executes commands during the build, and CMD runs when the container starts (see the official Dockerfile reference for the full list of instructions)
# Use an official Python runtime as 
# a parent image
FROM python:3

# Install libraries
RUN pip install numpy pandas matplotlib

# Add a script
RUN echo "print('Hello, DATASCI350!')" > hello.py

# Run the script
CMD ["python", "hello.py"]
  • We build the image with the following command (-t is for “tag”, which names the image datasci350-example, and . is the current directory):
docker build -t datasci350-example .

Let’s see how it looks

Looks pretty good! 👍


Let’s run and share the container! 🏃‍♂️

  • To run the container, we can use the following command:
docker run datasci350-example

Hello, DATASCI350!
  • Woo-hoo! 🎉 😂
  • We have successfully created a container with Python, numpy, pandas, and matplotlib installed, and we have run a Python script
  • If you want to share your image with others, you can upload it to Docker Hub: https://hub.docker.com/
  • First, you need to log in
docker login
  • Then you can tag your image with your Docker Hub username:
docker tag datasci350-example danilofreire/datasci350-example:latest
  • And finally, you can push the image to Docker Hub:
docker push danilofreire/datasci350-example:latest

And here it is:

Link: https://hub.docker.com/r/danilofreire/datasci350-example

Docker pull

  • Now that the image is on Docker Hub, anyone can pull it and run it on their machine
  • To do that, they just need to run the following command:
docker pull danilofreire/datasci350-example:latest
  • And then they can run the container with:
docker run danilofreire/datasci350-example:latest
  • And that’s it! 😊
  • That’s how easy it is to share your work with others using Docker
  • You can also use Docker to run your code on a server, or to create a reproducible environment for your work
  • Although our example here is extremely simple, you can build anything you can imagine with Docker!
  • Hopefully, one day researchers will require that all code is shared in a Docker container 🤓
  • … and now you know how to do it!

Summary

  • Dependency management is more important than people think
  • It is the first step towards reproducibility and transparency
  • You can use conda, pip, or uv to manage your dependencies in Python
  • Docker offers a more comprehensive solution to the reproducibility crisis
  • None of them are difficult to use, and they can save you a lot of time and headaches in the future 😅
  • In the next class, we will have a hands-on session on Docker, so you can practice building containers and running them on your machine

And that’s all for today! 🎉

See you next time! 🚀