Building Reproducible Analytical Pipelines with R, Docker and Nix

Bruno Rodrigues

Intro: Who am I

Bruno Rodrigues, head of the statistics and data strategy departments at the Ministry of Research and Higher Education in Luxembourg

Slides available online at https://is.gd/raps_rladies_rome_2024

Code available at: https://github.com/b-rodrigues/rladies_rome_repro_2024

Goal of this workshop

  • Identify what must be managed for reproducibility
  • Learn about the following tools to make your projects reproducible:
    • {renv}, Docker and Nix
  • What we will not learn (but is very useful!):
    • Functional programming, Git, documenting, testing and packaging code, {targets}

Main reference for this workshop

Building Reproducible Analytical Pipelines with R, available at www.raps-with-r.dev

What I mean by reproducibility

  • Ability to recover exactly the same results from an analysis
  • Why would you want that?
  • Auditing purposes
  • Updating the data (the only impact on results should come from the data update)
  • Reproducibility as a cornerstone of science
  • (Work on an immutable dev environment)
  • “But if I have the original script and data, what’s the problem?”

Reproducibility is on a continuum (1/2)

Here are the 4 main things influencing an analysis’ reproducibility:

  • Version of R used
  • Versions of packages used
  • Operating system
  • Hardware

Reproducibility is on a continuum (2/2)

Source: Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27

The problem

Works on my machine!

We’ll ship your computer then.

Project start

  • Our project: housing in Luxembourg
  • Data to analyse: vente-maison-2010-2021.xlsx in the data folder
  • 2 scripts to analyse data (in the scripts/project_start folder):
    1. One to scrape the Excel file: save_data.R
    2. One to analyse the data: analysis.R

Project start - What’s wrong with these scripts?

  • These two scripts -> a script-based workflow
  • Just a long series of calls
  • No functions
    • difficult to re-use!
    • difficult to test!
    • difficult to parallelise!
    • lots of repetition (plots)
  • Usually we want a report, not just a script
  • No record of the package or R versions used

Turning our scripts reproducible

We need to answer these questions

  1. How easy would it be for someone else to rerun the analysis?
  2. How easy would it be to update the project?
  3. How easy would it be to reuse this code for another project?
  4. What guarantee do we have that the output is stable through time?

The easiest, cheapest thing you should do

  • Generate a list of the packages and the R version used, with {renv}

Recording packages and R version used

Create a renv.lock file in 2 steps!

  • Open an R session in the folder containing the scripts
  • Run renv::init() and check the folder for renv.lock

(renv::init() will take some time to run the first time)
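
In code, the two steps boil down to this (a minimal sketch):

# in an R session opened in the folder containing the scripts
install.packages("renv") # if {renv} is not installed yet
renv::init()             # sets up a project library and writes renv.lock
# later on, after adding or updating packages:
renv::snapshot()         # refreshes renv.lock to match the library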

renv.lock file

  • Open the renv.lock file
{
  "R": {
    "Version": "4.2.2",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://packagemanager.rstudio.com/all/latest"
      }
    ]
  },
  "Packages": {
    "MASS": {
      "Package": "MASS",
      "Version": "7.3-58.1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "762e1804143a332333c054759f89a706",
      "Requirements": []
    },
    "Matrix": {
      "Package": "Matrix",
      "Version": "1.5-1",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "539dc0c0c05636812f1080f473d2c177",
      "Requirements": [
        "lattice"
      ]
    },

    ***and many more packages***

Restoring a library using an renv.lock file

  • renv.lock file not just a record
  • Can be used to restore as well!
  • Go to scripts/renv_restore
  • Run renv::restore() (answer Y to activate the project when asked)
  • Will take some time to run (so maybe don’t do it now)… and it might not work!
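
In code (a minimal sketch, to be run from a fresh R session opened in scripts/renv_restore):

renv::restore() # reinstalls the exact package versions recorded in renv.lock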

{renv} conclusion

Shortcomings:

  1. Records, but does not restore the version of R
  2. Installation of old packages can fail (due to missing OS dependencies)

but… :

  1. Generating a renv.lock file is “free”
  2. Provides a blueprint for dockerizing our pipeline
  3. Creates a project-specific library (no interference with other projects)

Where are we in the continuum?

  • Package and R versions are recorded
  • Packages can be restored (but not always!)
  • But where’s the pipeline? (not covered here, but…)

Ensuring long-term reproducibility using Docker

Remember the problem: works on my machine?

It turns out we will solve the issue by shipping the whole computer, using Docker.

What is Docker

  • Docker is a containerisation tool that you install on your computer
  • Docker allows you to build images and run containers (a container is an instance of an image)
  • Docker images:
    1. contain all the software and code needed for your project
    2. are immutable (cannot be changed at run-time)
    3. can be shared on- and offline

A word of warning

  • Docker works best on Linux and macOS
  • Possible to run it on Windows, but you need to enable virtualisation options in the BIOS and set up WSL2
  • This intro will be as gentle as possible

“Hello, Docker!”

  • Start by creating a Dockerfile (see scripts/Docker/hello_docker/Dockerfile)
  • Dockerfile = a recipe for an image (see the sketch below)
  • Build the image: docker build -t hello .
  • Run a container: docker run --rm --name hello_container hello
  • --rm: remove the container after running
  • --name some_name: name your container some_name
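
For reference, such a minimal Dockerfile can be just a couple of lines (a sketch only; the actual file is in scripts/Docker/hello_docker):

# a minimal "hello world" image: start from Ubuntu and print a message
FROM ubuntu:22.04
CMD ["echo", "Hello, Docker!"]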

Without Docker

With Docker

Dockerizing a project (1/2)

  • At image build-time (see the Dockerfile sketch after this list):
    1. install R (or use an image that ships R)
    2. install packages (using our renv.lock file)
    3. copy all scripts to the image
    4. run the analysis using targets::tar_make()
  • At container run-time:
    1. copy the outputs of the analysis from the container to your computer
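
Put together, a Dockerfile following these build-time and run-time steps could look roughly like this (an illustrative sketch with assumed file names, not the exact file; the real Dockerfile is in scripts/Docker/dockerized_project):

# 1. start from a Rocker image that ships a fixed version of R
FROM rocker/r-ver:4.2.2

# (a real project would also install system prerequisites with apt-get here)

WORKDIR /home/housing

# 2. install the packages recorded in renv.lock
RUN R -e "install.packages('renv')"
COPY renv.lock renv.lock
RUN R -e "renv::restore()"

# 3. copy the scripts to the image
COPY _targets.R _targets.R
COPY analyse_data.Rmd analyse_data.Rmd
COPY functions functions

# 4. run the analysis at build-time
RUN R -e "targets::tar_make()"

# at run-time: move the rendered output to a folder mounted from the host
# (the output file name is assumed here)
CMD mv analyse_data.html shared_folder/analyse_data.html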

Dockerizing a project (2/2)

  • The built image can be shared, or only the Dockerfile (and users can then rebuild the image)
  • The outputs will always stay the same!

Build-time vs run-time

  • Important to understand the distinction
  • Build-time:
    1. builds the image: software, packages and dependencies get installed using RUN statements
    2. must ensure that correct versions get installed (no difference between building today and in 2 years)
  • Run-time:
    1. The last command, CMD, gets executed

The Rocker project

  • Possible to build new images from other images
  • The Rocker project provides many images with R, RStudio, Shiny, and other packages pre-installed
  • We will use the Rocker “r-ver” images, which are specifically made for reproducibility

Docker Hub

  • Images get automatically downloaded from Docker Hub
  • You can build an image and share it on Docker Hub (see here for an example)
  • It’s also possible to share images on another image registry, or without one at all

An example of a Dockerized project

Look at the Dockerfile here.

  • In your opinion, what does the first line do?
  • In your opinion, what are lines 3 to 24 doing? See ‘system prerequisites’ here
  • What do all the lines starting with RUN do?
  • What do all the lines starting with COPY do?
  • What does the very last line do?

Dockerizing our project (1/2)

  • The project is dockerized in scripts/Docker/dockerized_project
  • It contains:
    1. A Dockerfile
    2. A renv.lock file
    3. A _targets.R file (not discussed here)
    4. The source of our analysis: analyse_data.Rmd
    5. The required functions in the functions/ folder

Build the image: docker build -t housing_image .

Dockerizing our project (2/2)

  1. Run a container:
    1. First, create a shared folder on your computer
    2. Then, run this command, changing /path/to/shared_folder to the folder you created:
docker run --rm --name housing_container -v /path/to/shared_folder:/home/housing/shared_folder:rw housing_image
  2. Check the shared folder on your computer: the output is now there!

Docker: a panacea?

  • Docker is very useful and widely used
  • But the entry cost is high
  • Single point of failure (what happens if Docker gets bought, abandoned, etc?)
  • Docker doesn’t actually deal with reproducibility per se; we’re “abusing” it in a way

The Nix package manager

Package manager: tool to install and manage packages

Package: any piece of software (not just R packages)

A popular package manager: the Google Play Store

The Nix package manager

Reproducibility in the R ecosystem

  • Per-project environments not often used
  • Popular choice: {renv}, but deals with R packages only
  • Still need to take care of R itself
  • System-level dependencies as well!

A popular approach: Docker + {renv} (see Rocker project)

Nix deals with everything, with one single text file (called a Nix expression)!

A basic Nix expression (1/6)

let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
  system_packages = builtins.attrValues {
    inherit (pkgs) R ;
  };
in
  pkgs.mkShell {
    buildInputs = [ system_packages ];
    shellHook = "R --vanilla";
  }

There’s a lot to discuss here!

A basic Nix expression (2/6)

  • Written in the Nix language (not discussed)
  • Defines the repository to use (with a fixed revision)
  • Lists packages to install
  • Defines the output: a development shell

A basic Nix expression (3/6)

  • Software for Nix is defined as a mono-repository of tens of thousands of expressions on GitHub
  • GitHub: we can use any commit to pin package versions for reproducibility!
  • For example, the following commit installs R 4.3.1 and associated packages:
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};

A basic Nix expression (4/6)

  • system_packages: a variable that lists software to install
  • In this case, only R:
system_packages = builtins.attrValues {
  inherit (pkgs) R ;
};

A basic Nix expression (5/6)

  • Finally, we define a shell:
pkgs.mkShell {
  buildInputs = [ system_packages ];
  shellHook = "R --vanilla";
}
  • This shell will come with the software defined in system_packages (buildInputs)
  • And launch R --vanilla when started (shellHook)

A basic Nix expression (6/6)

  • Writing these expressions requires learning a new language
  • While incredibly powerful, if all we want are per-project reproducible dev shells…
  • …then {rix} will help!

Nix expressions

  • Nix expressions can be used to install software
  • But we will use them to build per-project development shells
  • We can include R and R packages, LaTeX packages, Quarto, Python, Julia…
  • Nix takes care of installing every dependency down to the compiler!

CRAN and Bioconductor

  • CRAN is the main repository of packages that extend the R language
  • As of writing, 20,000+ packages are available
  • Bioconductor: a repository with a focus on bioinformatics, adding 2,000+ more packages
  • Almost all of them are available through nixpkgs, in the rPackages set (see the sketch below)!
  • Find packages here
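
For example, a shell providing R and a few CRAN packages can be declared like this (a sketch in the style of the expression shown earlier; note that dots in package names become underscores, so data.table becomes data_table):

let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
  rpkgs = builtins.attrValues {
    # R packages come from the rPackages set; data.table -> data_table
    inherit (pkgs.rPackages) dplyr ggplot2 data_table;
  };
  system_packages = builtins.attrValues {
    inherit (pkgs) R;
  };
in
  pkgs.mkShell {
    buildInputs = [ rpkgs system_packages ];
  }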

rix: reproducible development environments with Nix (1/4)

  • {rix} (website) makes writing Nix expressions easy!
  • Simply use the provided rix() function:
library(rix)

rix(r_ver = "4.3.1",
    r_pkgs = c("dplyr", "ggplot2"),
    system_pkgs = NULL,
    git_pkgs = NULL,
    tex_pkgs = NULL,
    ide = "rstudio",
    # This shellHook is required to run Rstudio on Linux
    # you can ignore it on other systems
    shell_hook = "export QT_XCB_GL_INTEGRATION=none",
    project_path = ".")

rix: reproducible development environments with Nix (2/4)

  • List required R version and packages
  • Optionally: more system packages, packages hosted on Github, or LaTeX packages
  • Optionally: an IDE (Rstudio, Radian, VS Code or “other”)
  • Work interactively in an isolated environment!

rix: reproducible development environments with Nix (3/4)

  • rix::rix() generates a default.nix file
  • Build expressions using nix-build (in terminal) or rix::nix_build() from R
  • “Drop” into the development environment using nix-shell
  • Expressions can be generated even without Nix installed

rix: reproducible development environments with Nix (4/4)

  • Can install specific versions of packages (write "dplyr@1.0.0")
  • Can install packages hosted on GitHub (see the sketch below)
  • Many vignettes to get you started! See here
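
For example (a sketch only: the repository URL and commit below are placeholders, and the git_pkgs format follows the {rix} documentation):

library(rix)

rix(r_ver = "4.3.1",
    # pin dplyr to version 1.0.0, take ggplot2 as shipped with this R version
    r_pkgs = c("dplyr@1.0.0", "ggplot2"),
    # install a package straight from GitHub (placeholder repo and commit)
    git_pkgs = list(
      package_name = "housing",
      repo_url = "https://github.com/rap4all/housing/",
      commit = "put_a_full_commit_hash_here"
    ),
    ide = "other",
    project_path = ".")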

Let’s check out scripts/nix_expressions/rix_intro/

Non-interactive use

  • {rix} makes it easy to run pipelines in the right environment
  • (Little side note: the best tool to build pipelines in R is {targets})
  • See scripts/nix_expressions/nix_targets_pipeline
  • Can also run the pipeline like so:
cd /absolute/path/to/pipeline/ && nix-shell default.nix --run "Rscript -e 'targets::tar_make()'"
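
To give an idea of what such a pipeline looks like, here is a hypothetical, minimal _targets.R (the data file and targets are placeholders; the actual pipeline lives in scripts/nix_expressions/nix_targets_pipeline):

library(targets)

# packages that the targets below need
tar_option_set(packages = c("dplyr", "ggplot2"))

list(
  # placeholder targets, just to show the shape of a pipeline
  tar_target(housing_data, read.csv("data/housing.csv")),
  tar_target(clean_data, filter(housing_data, !is.na(price))),
  tar_target(price_plot, ggplot(clean_data, aes(year, price)) + geom_line())
)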

Nix and GitHub Actions: running pipelines

  • Possible to easily run a {targets} pipeline on GitHub Actions
  • Simply run rix::tar_nix_ga() to generate the required files
  • Commit and push, and watch the actions run!
  • See here.

Nix and GitHub Actions: writing papers

  • Easy collaboration on papers as well
  • See here
  • Just focus on writing!

Subshells

  • Also possible to evaluate single functions inside a “subshell”
  • Works whether your R itself was installed via Nix or not!
  • Very useful for using hard-to-install packages such as {arrow}
  • See scripts/nix_expressions/subshell and the sketch below
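
A rough sketch of how a subshell call looks with rix::with_nix() (the function body and file name are placeholders, and the default.nix in scripts/nix_expressions/subshell is assumed to provide {arrow}; see ?with_nix for the exact arguments):

library(rix)

out <- with_nix(
  # this function is evaluated inside the Nix shell defined by default.nix
  expr = function() {
    arrow::read_parquet("some_file.parquet") # placeholder file
  },
  program = "R",
  project_path = "scripts/nix_expressions/subshell"
)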

R packages release cycle

  • CRAN is updated daily, but it’s not reflected in nixpkgs
  • The rPackages set gets updated around new R releases (every 3 months or so)
  • What if more recent packages are required?
  • One solution: use the nixpkgs fork from our rstats-on-nix organisation!
  • See scripts/nix_expressions/bleeding

Conclusion

  • Very vast and complex topic!
  • At the very least, generate an renv.lock file
  • It’s always possible to rebuild a Docker image in the future (either by you, or by someone else!)
  • Consider using {targets}: not only good for reproducibility, but also an amazing tool all around
  • Long-term reproducibility requires Docker or Nix (better: both!), and some maintenance effort as well

The end

Contact me if you have questions:

  • bruno@brodrigues.co
  • Twitter: @brodriguesco
  • Mastodon: @brodriguesco@fosstodon.org
  • Blog: www.brodrigues.co
  • Book: www.raps-with-r.dev

Thank you!