DATASCI 350 - Data Science Computing

Lecture 23 - Dependency Management, Virtual Environments, and Containers

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello again! 😊

Brief recap of last class 📚

Parallelising data analysis with Dask and AutoML

  • Set up local Dask Clusters with workers, schedulers, and dashboards
  • Workers compute tasks and store results; the scheduler coordinates them
  • Used Dask ML for distributed machine learning and model training
  • Parallelised hyperparameter tuning with IncrementalSearchCV, HyperbandSearchCV, and TPOT
  • Handled async/sync compatibility issues and cached computations
  • Measured speedup and efficiency gains from parallel execution

Today’s agenda 📅

Lecture outline

  • New topic today: how do we make sure our results are reproducible?
  • Replication has been a recurring theme in this course, and that is why we use the command line, git, Quarto, and Jupyter
  • Today we cover dependency management, virtual environments, and containers
  • We will also learn how to use Docker to make code portable and reproducible
  • Let’s get started! 🚀

Dependency management 📦

Congratulations! 🎉

  • You now have a project!
  • Your code works great, it runs pretty fast thanks to Dask, your Quarto reports are beautiful, and your analyses (all done in the command line) are stored in a well-documented GitHub repository 😁
  • Are you done? 🤔
  • Not quite! 😅
  • What if you need to run your code on a different machine? Or share it with a colleague? Or run it again in a few months?
  • You need to make sure your code will run in the future, and that’s where dependency management comes in

Why do we need dependency management? 🤔

The problem

  • Libraries and packages change constantly: new versions appear, old ones are deprecated, and different OSes ship different libraries
  • Even simple code can break between versions. This Python 2.7 code:
print "Hello, world!"
  • does not work in Python 3.x:
print "Hello, world!"
  File "<stdin>", line 1
    print "Hello, world!"
                         ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Hello, world!")?
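Because the interpreter version itself is a dependency, scripts can check it explicitly and fail fast with a readable message instead of a cryptic SyntaxError. A minimal sketch (the `require_python` helper is illustrative, not a standard function):

```python
import sys

def require_python(minimum=(3, 0)):
    """Raise a clear error if the running interpreter is older than `minimum`."""
    if sys.version_info < minimum:
        raise RuntimeError("This script requires Python %d.%d or newer" % minimum)

require_python()            # fails fast on an old interpreter
print("Hello, world!")
```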

Some definitions

  • Dependency management: making sure your code will run in the future
  • Dependencies: external libraries, packages, and software your code needs
  • Packages: have a name, version, and possibly their own dependencies
  • Package registry: a directory that stores packages and metadata, e.g., CRAN, PyPI, Conda
  • Dependency management tools: pip, conda, and others help you track and install dependencies

The reproducibility crisis 🚨

  • Reproducibility crisis: many researchers cannot replicate published results
  • Affects CS, statistics, psychology, medicine, and other fields
  • Some causes are statistical (p-hacking, publication bias, low power)
  • But poor documentation and dependency management are just as common
  • The good news: the computational side can be solved with tools we already have
  • A Nature survey found that 70%+ of scholars failed to reproduce others’ work
  • Over half could not reproduce their own results 😳

The reproducibility trade-off

  • How far should we go to ensure reproducibility? It depends on your needs
  • At minimum, declare your dependencies so others know what you used
  • Manage them with a package/environment manager (conda, pip, uv)
  • Package them with tools like renv (R) or uv (Python) for version pinning
  • Host them online with Code Ocean ($$$) or Binder (free but limited)
  • Go further with containers (Docker, Singularity) for full portability
  • The more integrated your workflow, the better!

How to declare dependencies?

  • Say you have an analysis using Python, NumPy, pandas, and matplotlib. Ask yourself:
    • What packages do I need?
    • Will this work on my collaborator’s OS?
    • Can someone else reproduce my results?
  • Use a package manager to declare your dependencies in a single file
  • Store it in the repository root so collaborators can install everything in one step
  • Which file depends on your package manager:
    • conda (Python and R): environment.yml
    • pip (Python only): requirements.txt
    • uv (Python only): pyproject.toml + uv.lock
  • Let’s see some examples

Virtual environments 🌐

Conda: creating environments

  • Conda environments let you create isolated environments with specific versions of packages
  • Each environment has its own dependencies, and you can switch between them easily
  • You can have different versions of the same package in different environments
  • Probably the most user-friendly way to manage dependencies in Python
  • Here’s how to create one:
conda create --name datasci350 python=3.12 -y
conda activate datasci350
conda install numpy matplotlib pandas -y

# Or in one line:
conda create --name datasci350 python=3.12 numpy matplotlib pandas -y

# Check current environment and installed packages
conda info --envs && conda list
  • You can also create files inside the environment directory, e.g., a scripts folder (but note that such files are not exported and will not be shared with others)
cd $(conda info --base)/envs/datasci350
mkdir scripts
echo "print('Hello, world!')" > scripts/hello.py

Conda: environment.yml file

  • To create an environment.yml file, run:
conda env export --name datasci350 --file ~/Desktop/environment.yml
  • This will create a file with the following content:
name: datasci350
channels:
  - defaults
dependencies:
  - blas=1.0=openblas
  - bottleneck=1.4.2=py312ha86b861_0
  - brotli=1.0.9=h80987f9_8
  - brotli-bin=1.0.9=h80987f9_8
  - bzip2=1.0.8=h80987f9_6
  - ca-certificates=2024.9.24=hca03da5_0
  - contourpy=1.2.0=py312h48ca7d4_0
  - cycler=0.11.0=pyhd3eb1b0_0
  - expat=2.6.3=h313beb8_0
  - fonttools=4.51.0=py312h80987f9_0
  - freetype=2.12.1=h1192e45_0
  - jpeg=9e=h80987f9_3
  - kiwisolver=1.4.4=py312h313beb8_0
  - lcms2=2.12=hba8e193_0
  - lerc=3.0=hc377ac9_0
  - libbrotlicommon=1.0.9=h80987f9_8
  - libbrotlidec=1.0.9=h80987f9_8
  - libbrotlienc=1.0.9=h80987f9_8
  - libcxx=14.0.6=h848a8c0_0
  - libdeflate=1.17=h80987f9_1
  - libffi=3.4.4=hca03da5_1
  - libgfortran=5.0.0=11_3_0_hca03da5_28
  - libgfortran5=11.3.0=h009349e_28
  - libopenblas=0.3.21=h269037a_0
  - libpng=1.6.39=h80987f9_0
  - libtiff=4.5.1=h313beb8_0
  - libwebp-base=1.3.2=h80987f9_1
  - llvm-openmp=14.0.6=hc6e5704_0
  - lz4-c=1.9.4=h313beb8_1
  - matplotlib=3.9.2=py312hca03da5_0
  - matplotlib-base=3.9.2=py312h2df2da3_0
  - ncurses=6.4=h313beb8_0
  - numexpr=2.10.1=py312h5d9532f_0
  - numpy=1.26.4=py312h7f4fdc5_0
  - numpy-base=1.26.4=py312he047099_0
  - openjpeg=2.5.2=h54b8e55_0
  - openssl=3.0.15=h80987f9_0
  - packaging=24.1=py312hca03da5_0
  - pandas=2.2.2=py312hd77ebd4_0
  - pillow=11.0.0=py312hfaf4e14_0
  - pip=24.2=py312hca03da5_0
  - pyparsing=3.2.0=py312hca03da5_0
  - python=3.12.7=h99e199e_0
  - python-dateutil=2.9.0post0=py312hca03da5_2
  - python-tzdata=2023.3=pyhd3eb1b0_0
  - pytz=2024.1=py312hca03da5_0
  - readline=8.2=h1a28f6b_0
  - setuptools=75.1.0=py312hca03da5_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.45.3=h80987f9_0
  - tk=8.6.14=h6ba3021_0
  - tornado=6.4.1=py312h80987f9_0
  - tzdata=2024b=h04d1e81_0
  - unicodedata2=15.1.0=py312h80987f9_0
  - wheel=0.44.0=py312hca03da5_0
  - xz=5.4.6=h80987f9_1
  - zlib=1.2.13=h18a0788_1
  - zstd=1.5.6=hfb09047_0
prefix: /opt/miniconda3/envs/datasci350

What are build strings?

  • A conda build is a specific compiled binary of a package
  • The same version number can be compiled differently for different platforms, Python versions, or feature sets
  • A full conda package specification:
numpy=1.21.5=py39h12345_0
└──┬─┘└─┬──┘ └────┬───┘└┬┘
   │    │         │     │
   │    │         │     └─ Build number (0)
   │    │         └─────── Build string (py39h12345)
   │    └───────────────── Version (1.21.5)
   └────────────────────── Package name (numpy)
  • The --no-builds flag strips build info, making environment files portable across platforms
  • The build string (py39h12345) encodes metadata:
    • Python version (py39, py310)
    • Architecture (linux_64, osx_arm64)
    • Compiler (gcc9, clang)
    • Features (nomkl, cuda)
  • Why do builds exist?
    • Software is compiled differently per OS
    • Packages target specific Python versions
    • Builds can enable or disable optional features
    • Different compilers produce different binaries
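Since the full spec is just name=version=build, it is easy to take apart programmatically. A quick illustrative parser (`parse_conda_spec` is a made-up helper, not part of conda):

```python
def parse_conda_spec(spec):
    """Split a full conda package spec into its three parts."""
    name, version, build = spec.split("=")
    # The trailing _N of the build field is the build number.
    build_string, _, build_number = build.rpartition("_")
    return {"name": name, "version": version,
            "build_string": build_string, "build_number": int(build_number)}

print(parse_conda_spec("numpy=1.21.5=py39h12345_0"))
# → {'name': 'numpy', 'version': '1.21.5', 'build_string': 'py39h12345', 'build_number': 0}
```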

Conda: environment.yml file

  • To share your environment, upload the environment.yml file to your repository
  • Others can recreate the same environment by running:
conda env create --file environment.yml

Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate datasci350
#
# To deactivate an active environment, use
#
#     $ conda deactivate

  • This creates a datasci350 environment with the same packages and versions
  • Activate the environment and run your code:
conda activate datasci350
python scripts/hello.py
  • To delete the environment:
conda deactivate
conda env remove --name datasci350
  • And you’re done! 🎉

Pip: requirements.txt file

  • If you are using pip instead of conda, you can generate a requirements.txt file with the following command:
pip freeze > requirements.txt
  • This will create a file with the following content:
Bottleneck @ file:///private/var/folders/nz/j7p8yfhx1mv_0grj5xl4650h0000gp/T/abs_55txi4fy1u/croot/bottleneck_1731058642212/work
contourpy @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/contourpy_1701814001737/work
cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work
fonttools @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_60c8ux4mkl/croot/fonttools_1713551354374/work
kiwisolver @ file:///Users/builder/cbouss/perseverance-python-buildout/croot/kiwisolver_1699239145780/work
matplotlib==3.9.2
numexpr @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_b3kvvt6tc6/croot/numexpr_1730215947700/work
numpy @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a51i_mbs7m/croot/numpy_and_numpy_base_1708638620867/work/dist/numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl#sha256=37afb6b734a197702d848df93bd67c10b52f6467d56e518950d84b6b1c949d27
packaging @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_81ri4yfpjw/croot/packaging_1720101866878/work
pandas @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_b53hgou29t/croot/pandas_1718308972393/work/dist/pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl#sha256=1956b71d1baac8b370fd9deac6100aadefda112447dca816a81ecbf3ea4eb3e6
pillow @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_92egn12how/croot/pillow_1731594702114/work
pyparsing @ file:///private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_65qfw6vkxg/croot/pyparsing_1731445528142/work
python-dateutil @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_66ud1l42_h/croot/python-dateutil_1716495741162/work
pytz @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4b76c83ik/croot/pytz_1713974318928/work
setuptools==75.1.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
tornado @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a4w03z48br/croot/tornado_1718740114858/work
tzdata @ file:///croot/python-tzdata_1690578112552/work
unicodedata2 @ file:///private/var/folders/k1/30mswbxs7r1g6zwn8y4fyt500000gp/T/abs_a3epjto7gs/croot/unicodedata2_1713212955584/work
wheel==0.44.0

Making requirements.txt more portable

  • However, pip freeze often includes local file paths that won’t work on other machines
  • To create a truly portable requirements.txt, we need to remove these local paths
  • The easiest way is to use grep to filter out lines containing file://
# Remove local file paths from requirements.txt
pip freeze | grep -v 'file://' > clean-requirements.txt
  • What this command does:
    • pip freeze: Lists all installed packages
    • grep -v 'file://': Excludes lines containing file://
    • > clean-requirements.txt: Saves the result to a new file
  • Before:
Bottleneck @ file:///private/var/folders/.../bottleneck_1731058642212/work
contourpy @ file:///Users/builder/.../contourpy_1701814001737/work
matplotlib==3.9.2
numpy @ file:///tmp/build/.../numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl
pandas==2.2.2
  • After (portable):
matplotlib==3.9.2
pandas==2.2.2
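One caveat: grep -v removes those packages from the file entirely. An alternative sketch in Python (illustrative, not a pip feature) keeps each package name and drops only the machine-specific path — note that the version pin is lost in the process:

```python
def clean_requirements(lines):
    """Replace `pkg @ file://...` entries with the bare package name;
    leave normal `pkg==version` pins untouched."""
    cleaned = []
    for line in lines:
        if " @ file://" in line:
            cleaned.append(line.split(" @ ")[0])  # keep only the name
        else:
            cleaned.append(line)
    return cleaned

reqs = ["Bottleneck @ file:///private/var/folders/abc/work",
        "matplotlib==3.9.2"]
print(clean_requirements(reqs))  # → ['Bottleneck', 'matplotlib==3.9.2']
```

If you need exact versions as well, `pip list --format=freeze` prints plain `pkg==version` pins without the local paths.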

Pip: requirements.txt file

  • To install the dependencies, you can run:
pip install -r clean-requirements.txt  # -r reads package names from the file
  • And that’s all there is to it! 😊
  • It works in a similar way to conda, and it is also widely used in the Python community
  • It is recommended to create a virtual environment (e.g., with venv or virtualenv) before installing packages with pip to avoid conflicts with system packages
  • However, it is less user-friendly than conda, and it does not manage environments at all: it only installs packages into whichever environment is currently active
  • You can also use uv to manage your dependencies
  • Let’s see how it works in the next slide 🤓

uv: a fast Python package manager

  • uv is a modern, very fast Python package and project manager built by Astral (the creators of Ruff)
  • Written in Rust, it replaces pip, pip-tools, pipenv, poetry, virtualenv, and pyenv in a single tool!
  • 10-100x faster than pip for installing packages 🏎️
  • Installs Python itself, so you don’t need a separate Python installer
  • Works on macOS, Linux, and Windows
# Install uv (macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Check installation
uv --version
  • Create a new project with uv init:
# Create a new project
uv init datasci350-project
cd datasci350-project

# Add dependencies
uv add numpy pandas matplotlib

# Run a script (uv handles the environment)
uv run python script.py

# Add development dependencies
uv add --dev pytest ruff
  • uv init creates a pyproject.toml file with your project metadata and dependencies
  • uv add installs packages and updates the lock file (uv.lock) automatically
  • --dev flag adds packages to a separate development group (e.g., for testing and linting)
  • uv run executes commands inside the project environment without needing to activate it

uv: project files and sharing

  • A uv project has two key files:
    • pyproject.toml: your direct dependencies (human-readable)
    • uv.lock: exact versions for reproducibility (auto-generated)
  • TOML (Tom’s Obvious, Minimal Language) is a simple config format with key-value pairs and [sections], similar to YAML. It was created by Tom Preston-Werner (co-founder of GitHub)
  • Example pyproject.toml:
[project]
name = "datasci350-project"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26.0",
    "pandas>=2.2.0",
    "matplotlib>=3.9.0",
]

[dependency-groups]
dev = ["pytest>=8.0", "ruff>=0.4"]
  • To share your environment with others:
  1. Push pyproject.toml and uv.lock to your repository
  2. Your collaborator runs uv sync, and they are done! 🎉
# Clone the repo and sync dependencies
git clone https://github.com/user/project.git
cd project
uv sync

# Run the project
uv run python analysis.py
  • uv can also manage Python versions directly:
# Install a specific Python version
uv python install 3.12

# Pin Python version for the project
uv python pin 3.12

Containers 🚢

The challenge

  • Before containers, deployment was a “matrix from hell”: every application had to be made to work on every environment it might run in
  • Cargo transport faced the same matrix before 1960: every type of goods had to be handled separately for every mode of transport
  • The solution was the intermodal container: one standard box that works on ships, trains, and trucks alike
  • Docker is a container for your code: one standard unit that runs anywhere, eliminating the matrix from hell

What are software containers?

  • Containers package software so it can run on any system
  • Similar to virtual machines (VMs), but lighter: a VM runs a full OS, while a container includes only the libraries needed for the application
  • One step up from virtual environments: containers package the entire computer environment, not just dependencies and code
  • Your code runs on any system with Docker installed, regardless of OS or hardware

What is Docker? 🐳

  • Docker is the leading containerisation platform
  • Three main concepts:
    • Image: a read-only template (like a recipe) that contains the OS, libraries, code, and configuration. You build images from a Dockerfile
    • Container: a running instance of an image. You can run many containers from the same image
    • Registry: a collection of repositories from which you pull images (e.g., Docker Hub)
  • Think of it this way: an image is a class, a container is an object
  • The syntax is similar to Git and Linux, so it should feel familiar
docker info
docker run hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
478afc919002: Download complete 
Digest: sha256:305243c734571da2d100c8c8b3c3167a098cab6049c9a5b066b6021a60fcb966
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
docker -v

Docker version 28.5.2, build ecc6942

Docker Desktop

Docker architecture

Let’s create our first container 🐳

  • A Dockerfile is a plain-text recipe that builds everything you need to recreate a project
  • It is just a text file under version control, usually called Dockerfile (no extension)
  • You rarely need to start from scratch: Docker Hub has thousands of base images you can build on
  • For example, there are ~9,000 images of data science tools on Docker Hub
  • We will build one today just to see how it works 😊
  • Next class will be a hands-on Docker session 😉

Dockerfile

  • A Dockerfile lists all the commands needed to assemble an image
  • docker build runs these instructions in sequence, layer by layer
  • Each instruction (e.g., RUN, COPY) adds a new layer; Docker caches layers so unchanged steps are not re-run on rebuild
  • Let’s create a directory and a Dockerfile:
mkdir docker
cd docker
touch Dockerfile
  • FROM sets the base image, RUN executes commands during the build, and CMD runs when the container starts (see the official Dockerfile reference for the full list of instructions)
# Use an official Python runtime as 
# a parent image
FROM python:3

# Install libraries
RUN pip install numpy pandas matplotlib

# Add a script
RUN echo "print('Hello, DATASCI350!')" > hello.py

# Run the script
CMD ["python", "hello.py"]
  • We build the image with the following command (-t is for “tag”, which names the image datasci350-example, and . is the current directory):
docker build -t datasci350-example .

Let’s see how it looks

Looks pretty good! 👍


Let’s run and share the container! 🏃‍♂️

  • To run the container, we can use the following command:
docker run datasci350-example

Hello, DATASCI350!
  • Woo-hoo! 🎉 😂
  • We have successfully created a container with Python, numpy, pandas, and matplotlib installed, and we have run a Python script
  • If you want to share your image with others, you can upload it to Docker Hub: https://hub.docker.com/
  • First, you need to log in
docker login
  • Then you can tag your image with your Docker Hub username:
docker tag datasci350-example danilofreire/datasci350-example:latest
  • And finally, you can push the image to Docker Hub:
docker push danilofreire/datasci350-example:latest

And here it is:

Link: https://hub.docker.com/r/danilofreire/datasci350-example

Docker pull

  • Now that the image is on Docker Hub, anyone can pull it and run it on their machine
  • To do that, they just need to run the following command:
docker pull danilofreire/datasci350-example:latest
  • And then they can run the container with:
docker run danilofreire/datasci350-example:latest
  • And that’s it! 😊
  • That’s how easy it is to share your work with others using Docker
  • You can also use Docker to run your code on a server, or to create a reproducible environment for your work
  • Although our example here is extremely simple, you can build anything you can imagine with Docker!
  • Hopefully, one day researchers will require that all code is shared in a Docker container 🤓
  • … and now you know how to do it!

Summary

  • Dependency management is more important than people think
  • It is the first step towards reproducibility and transparency
  • You can use conda, pip, or uv to manage your dependencies in Python
  • Docker offers a more comprehensive solution to the reproducibility crisis
  • None of them are difficult to use, and they can save you a lot of time and headaches in the future 😅
  • In the next class, we will have a hands-on session on Docker, so you can practice building containers and running them on your machine

And that’s all for today! 🎉

See you next time! 🚀