QTM 350 - Data Science Computing

Lecture 08 - Reproducible Research with Quarto

Danilo Freire

Emory University

25 September, 2024

Welcome back! 😊

Some comments on quiz 01 📚

Quiz 01

What do you think of it?

  • How did you do? What did you find challenging, if anything?
  • Too long, too short, too easy, too hard?
  • What did you like about it (if anything 😅)?
  • Feedback is always welcome! 🙏

Some comments on quiz 01

  • The quiz was designed to be a little challenging, but not impossible
  • All but the bonus questions were based on the material covered in class (all commands were in the slides)
  • The main idea behind it was to simulate a real-world scenario where you have to use the CLI and git to collaborate with others
  • You now know how to interact with git and GitHub, and what you did on Monday covers the vast majority of the git work you’ll do in the workplace
  • Answers here

Checking git diff

Today’s plan 📅

  • Why is reproducible research so important in data science (and elsewhere too)?
  • The idea of literate programming and how it can help you write better code
  • An introduction to Quarto and how it can help you write reproducible research
  • Using Quarto to write reports in many formats (HTML, PDF, Word, etc.)
  • How to use Quarto to write code

Why is reproducible research so important? 🤔

The reproducibility crisis

  • Reproducibility is the ability of an entire experiment or study to be duplicated, either by the same researcher or by someone else working independently
  • The reproducibility crisis is a term used to describe the inability of researchers to replicate the results of a study using the same data and methods
  • The crisis is not limited to data science, but it’s particularly acute in fields like psychology and medicine
  • The crisis is caused by a combination of factors, including poor experimental design, publication bias, and the misuse of p-values (for example, p-hacking)
  • Here we will focus on the role of code and data in reproducibility, which is the most relevant to us

Amy Cuddy demonstrating her theory of “power posing”. Later studies failed to replicate its key effects, yet the TED Talk has 72 million views!

The reproducibility crisis in medicine

https://doi.org/10.1371/journal.pmed.0020124

The reproducibility crisis in psychology

https://www.science.org/doi/10.1126/science.aac4716

Why does it matter? We all write code, right? 🤔

Code is reproducible by default, isn’t it?

Think again…

https://www.nature.com/articles/s41597-022-01143-6
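
One common reason the same script gives different numbers on every run is unseeded randomness. The sketch below is only an illustration of that single cause (the function name `simulate_mean` and the seed value are my own choices, not something taken from the paper above):

```python
import random

def simulate_mean(n, seed=None):
    """Average of n standard-normal draws; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    draws = [rng.gauss(0, 1) for _ in range(n)]
    return sum(draws) / n

# Two unseeded runs will almost certainly disagree.
unseeded_a = simulate_mean(1000)
unseeded_b = simulate_mean(1000)

# Two runs with the same (hypothetical) seed always agree.
seeded_a = simulate_mean(1000, seed=350)
seeded_b = simulate_mean(1000, seed=350)

print("unseeded runs match:", unseeded_a == unseeded_b)
print("seeded runs match:  ", seeded_a == seeded_b)
```

Seeds are only one piece of the puzzle: package versions, file paths, and missing data files can break a script just as easily.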

How to make your code reproducible? 🧑🏻‍💻

Literate programming

  • Literate programming is a programming paradigm introduced by Donald Knuth in 1984
  • The idea is to treat code as a form of literature, where the code is interspersed with explanations and documentation
  • The goal is to make the code more readable and understandable to humans
  • Requires two processing steps: weaving and tangling
    • Weaving: converts the literate programme into a document
    • Tangling: extracts the code from the document
  • Literate programming improves code quality by requiring programmers to articulate their reasoning, which exposes flawed design choices
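
To make tangling concrete, here is a minimal sketch that pulls the code chunks out of a Markdown-style document. It assumes chunks are fenced with three backticks and a `{language}` header, as in the Quarto examples later in this lecture; the `tangle` function and the sample document are hypothetical, and real tools handle many more cases.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

# Captures the language and the body of chunks such as {python} or {r}.
CHUNK_PATTERN = re.compile(
    rf"^{FENCE}\{{(\w+)[^}}]*\}}\n(.*?)^{FENCE}\s*$",
    re.DOTALL | re.MULTILINE,
)

def tangle(document: str, language: str) -> str:
    """Return the code of every chunk written in `language`, in document order."""
    bodies = [
        body.rstrip("\n")
        for lang, body in CHUNK_PATTERN.findall(document)
        if lang == language
    ]
    return "\n".join(bodies)

document = "\n".join([
    "# Analysis",
    "",
    "We start by defining the sample.",
    "",
    f"{FENCE}{{python}}",
    "sample = [2, 4, 6]",
    FENCE,
    "",
    "Then we compute its mean.",
    "",
    f"{FENCE}{{python}}",
    "mean = sum(sample) / len(sample)",
    FENCE,
])

code = tangle(document, "python")
print(code)
```

Weaving goes the other way: it keeps the prose, runs the code, and places the results next to the explanations.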

(R)Markdown, Jupyter, and Quarto

  • LaTeX is a popular tool for writing scientific documents and is one of the first tools to support literate programming
  • However, LaTeX is not very user-friendly and has a steep learning curve (I know, I’ve been there 😅)
  • Markdown is a lightweight markup language with plain text formatting syntax
  • If you use R, you probably already know R Markdown
  • If you use Python, you probably already know Jupyter
  • Both are good tools, but they have some limitations

(R)Markdown, Jupyter, and Quarto

  • R Markdown, as the name suggests, is more focused on R
  • Jupyter works with many languages, but it’s not very good for writing reports or papers
  • Enter Quarto, a new tool that combines the best of both worlds
  • Quarto is language-agnostic, meaning you can use it with many languages, such as R, Python, Observable, and Julia
  • Quarto is also more flexible than R Markdown and Jupyter, as it allows you to write reports in many formats (HTML, PDF, Word, EPUB, etc.)
  • As you may have noticed, all my slides, the website, and the course PDFs are written in Quarto
  • Programmers are lazy: learn it once, use it everywhere!

Some Quarto examples

Company reports

Some Quarto examples

Interactive dashboards

https://jjallaire.github.io/gapminder-dashboard/

Some Quarto examples

Books

Some Quarto examples

Websites

And everything with the same code (and version control) 🤩

How does Quarto work? 📝

How does Quarto work?

%%{
  init: {
    "theme": "dark",
    "themeCSS": ".label foreignObject, .cluster-label foreignObject { font-size: 90%; overflow: visible; }"
  }
}%%
flowchart LR
  A1[qmd] --> C{"knitr<br>(R)"}
  A1[qmd] --> B{"Jupyter<br>(Python)"}
  A2[ipynb] --> B{"Jupyter<br>(Python)"}
  B --> D[md]
  C --> D[md]
  D --> E{Pandoc}
  E --> F[pdf]
  E --> G[docx]
  E --> H[html]
  E --> I[...]

  subgraph engine [Engine]
  B
  C
  end

  • Quarto files have the .qmd extension
  • You can write Python code in .qmd files, but you can also render Jupyter notebooks (.ipynb) in Quarto with some YAML configuration
  • Quarto files are converted to Markdown files, which are then converted to many formats using Pandoc
  • Pandoc is a universal document converter that can convert files from one format to another
  • Pandoc can convert Markdown files to PDF, Word, HTML, and many other formats
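
The engine step in the diagram can be sketched as a small decision function. This is a simplified model of the diagram, not Quarto's actual implementation: the `pick_engine` name, the rule order, and the `"markdown"` fallback for documents without code are assumptions for illustration.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

def pick_engine(filename: str, text: str = "") -> str:
    """Guess which engine box a source file would pass through in the diagram."""
    if Path(filename).suffix.lower() == ".ipynb":
        return "jupyter"
    # Collect the languages named in executable chunk headers such as {r}.
    languages = set(re.findall(r"^`{3}\{(\w+)", text, flags=re.MULTILINE))
    if "r" in languages:
        return "knitr"
    if languages:
        return "jupyter"
    return "markdown"  # plain prose: nothing to execute (assumed fallback)

r_doc = f"# Report\n\n{FENCE}{{r}}\nsummary(cars)\n{FENCE}\n"
py_doc = f"# Report\n\n{FENCE}{{python}}\nprint(1 + 1)\n{FENCE}\n"

print(pick_engine("report.qmd", r_doc))   # knitr
print(pick_engine("report.qmd", py_doc))  # jupyter
print(pick_engine("notebook.ipynb"))      # jupyter
```

Whichever engine runs the code, the result is Markdown, and Pandoc takes over from there.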

Downloading and installing Quarto

https://quarto.org/docs/download/

Downloading and installing Quarto

Alternatively, you can install Quarto with a package manager:

Bash (macOS, Homebrew)
brew install quarto

PowerShell (Windows, Chocolatey)
choco install quarto

Writing with your favourite editor

Quarto: a command line interface

Of course, there is a CLI for that!

quarto --help

Usage:   quarto
Version: 1.5.57

Description:

  Quarto CLI

Options:

  -h, --help     - Show this help.                            
  -V, --version  - Show the version number for this program.  

Commands:

  render     [input] [args...]     - Render files or projects to various document types.
  preview    [file] [args...]      - Render and preview a document or website project.  
  serve      [input]               - Serve a Shiny interactive document.                
  create     [type] [commands...]  - Create a Quarto project or extension               
  use        <type> [target]       - Automate document or project setup tasks.          
  add        <extension>           - Add an extension to this folder or project         
  update     [target...]           - Updates an extension or global dependency.         
  remove     [target...]           - Removes an extension.                              
  convert    <input>               - Convert documents to alternate representations.    
  pandoc     [args...]             - Run the version of Pandoc embedded within Quarto.  
  typst      [args...]             - Run the version of Typst embedded within Quarto.   
  run        [script] [args...]    - Run a TypeScript, R, Python, or Lua script.        
  install    [target...]           - Installs a global dependency (TinyTex or Chromium).
  uninstall  [tool]                - Removes an extension.                              
  tools                            - Display the status of Quarto installed dependencies
  publish    [provider] [path]     - Publish a document or project to a provider.       
  check      [target]              - Verify correct functioning of Quarto installation. 
  help       [command]             - Show this help or the help of a sub-command.       

Checking if Quarto is installed

quarto check
Quarto 1.5.57

[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.2.0: OK
      Dart Sass version 1.70.0: OK
      Deno version 1.41.0: OK
      Typst version 0.11.0: OK

[✓] Checking versions of quarto dependencies......OK

[✓] Checking Quarto installation......OK
      Version: 1.5.57
      Path: /Applications/quarto/bin


[✓] Checking tools....................OK
      TinyTeX: (external install)
      Chromium: (not installed)


[✓] Checking LaTeX....................OK
      Using: TinyTex
      Path: /Users/dafreir/Library/TinyTeX/bin/universal-darwin
      Version: 2024


[✓] Checking basic markdown render....OK

WARN: Specified QUARTO_PYTHON 'python3: aliased to /opt/miniconda3/bin/python' does not exist
WARN: No python binary found in specified QUARTO_PYTHON location 'python3: aliased to /opt/miniconda3/bin/python'

[✓] Checking Python 3 installation....OK
      Version: 3.12.2 (Conda)
      Path: /opt/miniconda3/bin/python
      Jupyter: 5.7.2
      Kernels: python3


[✓] Checking Jupyter engine render....OK


[✓] Checking R installation...........OK
      Version: 4.4.1
      Path: /Library/Frameworks/R.framework/Resources
      LibPaths:
        - /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
      knitr: 1.48
      rmarkdown: 2.28


[✓] Checking Knitr engine render......OK
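
If you want to run a similar check from a script, a minimal sketch could look like this. It relies only on the `--version` flag listed in the help output above; the `quarto_version` helper is a hypothetical name, and the function returns `None` when Quarto is not on your PATH.

```python
import shutil
import subprocess

def quarto_version():
    """Return the installed Quarto version string, or None if it cannot be found."""
    executable = shutil.which("quarto")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()

version = quarto_version()
print(version if version else "Quarto was not found on PATH")
```

This is handy at the top of a build script, so a missing installation fails early with a clear message.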

Creating a new Quarto project

quarto create project
 ? Type
  default
   website
   blog
   manuscript
   book
   confluence

Quarto projects: default

quarto create project default Default
demo/Default
├── Default.qmd
└── _quarto.yml

1 directory, 2 files

Quarto projects: default

  • _quarto.yml: project configuration file.
project:
  title: "Default"
  • Default.qmd: default Quarto document.
---
title: "Default"
---

## Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
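
A Quarto project is just a folder with a few plain-text files. To show there is no magic involved, this sketch recreates the two files listed above by hand in a temporary directory. The `scaffold_default_project` helper is hypothetical, and in practice you would simply run `quarto create project`.

```python
import tempfile
from pathlib import Path

def scaffold_default_project(parent: Path, name: str) -> Path:
    """Write a _quarto.yml and a <name>.qmd similar to the default project."""
    project = parent / name
    project.mkdir(parents=True, exist_ok=True)
    (project / "_quarto.yml").write_text(
        f'project:\n  title: "{name}"\n', encoding="utf-8"
    )
    (project / f"{name}.qmd").write_text(
        f'---\ntitle: "{name}"\n---\n\n## Quarto\n', encoding="utf-8"
    )
    return project

with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_default_project(Path(tmp), "Default")
    created = sorted(path.name for path in project.iterdir())
    config = (project / "_quarto.yml").read_text(encoding="utf-8")

print(created)  # ['Default.qmd', '_quarto.yml']
```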

Quarto projects: website

quarto create project website Website
demo/Website
├── _quarto.yml
├── about.qmd
├── index.qmd
└── styles.css

1 directory, 4 files

Quarto projects: blog

quarto create project blog Blog
demo/Blog
├── _quarto.yml
├── about.qmd
├── index.qmd
├── posts
│   ├── _metadata.yml
│   ├── post-with-code
│   │   ├── image.jpg
│   │   └── index.qmd
│   └── welcome
│       ├── index.qmd
│       └── thumbnail.jpg
├── profile.jpg
└── styles.css

4 directories, 10 files

Quarto projects: book

quarto create project book Book
demo/Book
├── _quarto.yml
├── cover.png
├── index.qmd
├── intro.qmd
├── references.bib
├── references.qmd
└── summary.qmd

1 directory, 7 files

Quarto projects: manuscript

quarto create project manuscript Manuscript
demo/Manuscript
├── _quarto.yml
├── index.qmd
└── references.bib

1 directory, 3 files

Quarto engines

knitr

---
title: "ggplot2 demo"
format:
  html:
    code-fold: true
---

## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.

```{r}
#| label: plot-penguins
#| echo: false
#| message: false
#| warning: false

library(tidyverse)
library(palmerpenguins)

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

Jupyter

---
title: "Palmer Penguins Demo"
format:
  html:
    code-fold: true
jupyter: python3
---


## Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.

```{python}
#| echo: false
#| warning: false

import pandas as pd
import seaborn as sns
from palmerpenguins import load_penguins
sns.set_style('whitegrid')

penguins = load_penguins()

# Trailing semicolons suppress the text output of each call in the notebook
g = sns.lmplot(x="flipper_length_mm",
               y="body_mass_g",
               hue="species",
               height=7,
               data=penguins,
               palette=['#FF8C00','#159090','#A034F0']);
g.set_xlabels('Flipper Length');
g.set_ylabels('Body Mass');
```

Quarto tooling

quarto render my-project
quarto render file.qmd
quarto render file.qmd --to pdf
quarto render file.qmd --to html
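
The same commands can be scripted when you want several formats from one source file. This is a minimal sketch using only the `render` command and `--to` flag shown above; `build_render_command` and `render_all` are hypothetical helper names, and `render_all` needs Quarto on your PATH.

```python
import shutil
import subprocess

def build_render_command(target, to=None):
    """Build the argument list for `quarto render`, optionally with --to."""
    command = ["quarto", "render", target]
    if to is not None:
        command += ["--to", to]
    return command

def render_all(target, formats):
    """Render one source file to several formats, one after the other."""
    if shutil.which("quarto") is None:
        raise RuntimeError("quarto was not found on PATH")
    for fmt in formats:
        subprocess.run(build_render_command(target, to=fmt), check=True)

print(build_render_command("file.qmd", to="pdf"))
# ['quarto', 'render', 'file.qmd', '--to', 'pdf']
```

For example, `render_all("file.qmd", ["html", "pdf", "docx"])` would produce three outputs from the same document.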

Writing using the Visual Editor

Contents of a Quarto document

  • A Quarto document contains three types of content: a YAML header, code chunks, and Markdown text
  • We will cover Markdown in more detail in the next lecture

YAML

  • Originally “Yet Another Markup Language”, now officially “YAML Ain’t Markup Language”
  • Metadata of your document
  • Demarcated by three dashes (---) on either end
  • Uses key-value pairs in the format key: value
  • We include the title, format, author, and other metadata in the YAML header
---
title: "Penguins, meet Quarto!"
format: html
editor: visual
---
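
To see how little is going on in that header, here is a simplified reader for flat `key: value` front matter written with the standard library only. It is a sketch under strong assumptions: it ignores nested keys, lists, and comments, and the `read_front_matter` name is hypothetical. A real project would use a proper YAML parser instead.

```python
def read_front_matter(document: str) -> dict:
    """Read flat key: value pairs between the opening and closing --- lines."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter: the header is over
        if ":" not in line or line.startswith((" ", "#")):
            continue  # skip nested or commented lines in this simplified reader
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip().strip('"')
    return metadata

document = """---
title: "Penguins, meet Quarto!"
format: html
editor: visual
---

## Meet the penguins
"""

metadata = read_front_matter(document)
print(metadata)
# {'title': 'Penguins, meet Quarto!', 'format': 'html', 'editor': 'visual'}
```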

Code chunks

  • Code chunks begin and end with three backticks (usually)
  • Each chunk names its programming language in curly braces after the opening backticks, e.g. {r} or {python}
  • Can include optional chunk options, in YAML style, identified by #| at the beginning of the line
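
The `#|` convention is easy to read by machine as well as by humans. As a rough illustration, this sketch collects the options from the lines of a chunk and decides whether the code would be shown. The `parse_chunk_options` helper is hypothetical, values are kept as plain strings, and treating a missing `echo` as true is an assumption of this sketch.

```python
def parse_chunk_options(chunk_lines):
    """Collect '#| key: value' lines from a chunk into a dictionary of strings."""
    options = {}
    for line in chunk_lines:
        if not line.startswith("#|"):
            continue  # ordinary code line, not an option
        key, _, value = line[2:].partition(":")
        options[key.strip()] = value.strip()
    return options

chunk = [
    "#| label: plot-penguins",
    "#| warning: false",
    "#| echo: false",
    "",
    "ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm))",
]

options = parse_chunk_options(chunk)
show_code = options.get("echo", "true") != "false"

print(options)
# {'label': 'plot-penguins', 'warning': 'false', 'echo': 'false'}
print("show code in output:", show_code)
# show code in output: False
```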

Code chunks

```{r}
#| label: plot-penguins
#| warning: false
#| echo: false

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```

Multiple figures

Put two plots side by side:

```{r}
#| label: side-plots
#| warning: false
#| echo: true
#| eval: true
#| layout-ncol: 2

ggplot(penguins, 
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species))

ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), 
                 alpha = 0.5, 
                 position = "identity")

```

Phew! That was a lot of information! 🤯

In summary…

  • Reproducibility is essential in data science (and elsewhere too)
  • Literate programming can help you write better code
  • Quarto is a new tool that combines the best of R Markdown and Jupyter
  • You can use it to write reports, books, websites, and more
  • We will learn how to write reports, presentations, and websites in the next lecture

Quarto resources

Have a great weekend! 🎉