ACE 592 - Lecture 1.2

Best practices: reproducibility, organization & version control

Diego S. Cardoso

University of Illinois Urbana-Champaign

Course Roadmap

Introduction to Scientific Computing
1. Motivation
2. Best practices
3. Intro to programming
Fundamentals of numerical methods
Systems of equations
Optimization
Function approximation
Structural estimation

Agenda

How to start and maintain computational research projects in a less “chaotic” way
- Why? Because it saves time, for your future self and for everyone else who needs to work with you now or in the future
We’ll review best practices from Gentzkow & Shapiro (2014)

Relevant sources for this lecture

Gentzkow & Shapiro (2014), “Code and Data for the Social Sciences”
Lecture notes for Grant McDermott’s Data Science for Economists (Oregon) and Ivan Rudik’s Dynamic Optimization (Cornell)

Reproducibility

Why am I doing this to you?

Reproducibility is increasingly important in the physical sciences
- And Econ is catching up
If you want to publish in AEA journals you need to have good practices
- Other journals are doing the same

Reproducibility

Why should we care?

Because journals are interested in that these days. Yes, but why?

Because it is a fundamental principle of the scientific method
- With enough instructions, someone else can verify your experimental results and claims
- It lends credibility to the scientific process

Reproducibility

The term can have different meanings and (quasi) synonyms depending on the field

We’ll borrow terminology from CS

Repeatability: same team, same experimental setup

Replicability: different team, same experimental setup

Reproducibility: different team, different experimental setup

Reproducibility

Reproducibility is the higher scientific goal
For the projects we develop, we look for replicability
We make it easy for someone else to run our analysis and get the same results

Automation

Automate everything that can be automated

Scripts are great because they

Save time for repeated tasks
Document step by step what you are doing
Can usually be reused/adapted for similar tasks in the same or other projects

Manual tasks (with files or in Excel) are difficult to replicate unless you take detailed notes at every step
Rule of thumb: if you are doing a lot of clicking for a repeated task, think about writing a script for it

Automation

Automate everything that can be automated

My advice: get comfortable with the basics of the UNIX shell

You can do so much with basic shell scripting
- You’ll most likely need it if your research needs High-Performance Computing (HPC)
You can learn that in a couple of hours with Software Carpentry lessons
Works in any OS
- Mac: shell environment built in
- Linux: you probably needed that already
- Windows: there’s WSL2 and it works great (really!)

Automation

Write a single script that executes all code from beginning to end

It’s the ideal goal for replication packages
Have a script like this from the beginning of the project and increment it as you go

But don’t obsess over it. Sometimes it can’t be done
- E.g.: you are running part of your code on an external server
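A minimal sketch of such a "run everything" script, written in Python purely for illustration. The function name `run_all` and the three placeholder steps are hypothetical; in a real project each step would call one of your own scripts.

```python
import subprocess
import sys


def run_all(steps, cwd="."):
    """Run each step (a list of command-line arguments) in order.

    Stops at the first failure, so a broken step never goes unnoticed.
    Returns the list of steps that completed successfully.
    """
    completed = []
    for step in steps:
        print("Running:", " ".join(step))
        # check=True raises CalledProcessError if the step exits with an error
        subprocess.run(step, cwd=cwd, check=True)
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Hypothetical pipeline: replace these with your project's own scripts
    pipeline = [
        [sys.executable, "-c", "print('1. build data set')"],
        [sys.executable, "-c", "print('2. estimate model')"],
        [sys.executable, "-c", "print('3. make tables and figures')"],
    ]
    run_all(pipeline)
```

Stopping at the first error is a design choice: later steps usually depend on earlier ones, so continuing would only produce misleading output.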

Automation + replicability

An extra rule: Keep track of your dependencies

We often use many external packages
Popular packages get updated often. Sometimes an update breaks things…
- The syntax you used doesn’t work anymore
- The function you called doesn’t exist anymore
Keep note (and backups) of the versions you use

There are many tools to help you with that
- Some are simpler and just track that info for you (e.g.: Julia project environments, R’s renv)
- Others create a stand-alone environment with everything you need to run (e.g.: Docker)
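As a rough illustration of "keeping note of the versions you use", here is a Python sketch that records installed package versions in a text file. It is not how Julia projects or renv work internally; the function names and the output file name are assumptions made for this example.

```python
import sys
from importlib import metadata


def record_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed are recorded as None instead of
    raising an error, so the record itself never breaks the pipeline.
    """
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def write_versions(versions, path):
    """Write one 'name==version' line per entry to a plain-text file."""
    with open(path, "w", encoding="utf-8") as f:
        for name, version in sorted(versions.items()):
            f.write(f"{name}=={version}\n")
```

For example, `write_versions(record_versions(["numpy", "pandas"]), "versions.txt")` would save a snapshot that can be committed alongside the code.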

Organization

Directories

Separate directories by function

Separate files into inputs and outputs

These are quite intuitive: they make it easy for you (your code) and anyone else to find things
Separating inputs, intermediate outputs (temp), and final outputs is crucial for storing the right things
- We normally don’t use version control for intermediate calculations

Directories

Have separate folders for

Code
Data
Output
- I also use separate sub-folders for tables, figures, & maps
Text (e.g. LaTeX files)

[Figure: example folder structure from a working paper]
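The folder layout above could be set up with a few lines of code. This Python sketch assumes lowercase folder names and the three output sub-folders mentioned on the slide; the function name `create_skeleton` is hypothetical.

```python
from pathlib import Path

# Folder names follow the slide; the exact spelling is an assumption
FOLDERS = [
    "code",
    "data",
    "output/tables",
    "output/figures",
    "output/maps",
    "text",
]


def create_skeleton(project_root, folders=FOLDERS):
    """Create the project folders under project_root (existing ones are kept)."""
    root = Path(project_root)
    created = []
    for folder in folders:
        path = root / folder
        # parents=True also creates 'output'; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```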

Directories

For long scripts that do many steps in sequence, like building a data set, it is also a good practice to number your scripts
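One detail worth illustrating: zero-padded numbers keep alphabetical order and execution order the same. The Python sketch below lists numbered scripts in order; the two-digit prefix convention and the function name are assumptions for this example.

```python
from pathlib import Path


def ordered_scripts(code_dir, pattern="[0-9][0-9]_*"):
    """Return numbered scripts in execution order.

    Zero-padded prefixes (01_, 02_, ..., 10_) make alphabetical order
    match numerical order; without padding, '10_' would sort before '2_'.
    """
    return sorted(p.name for p in Path(code_dir).glob(pattern))
```

With hypothetical files `01_import.jl`, `02_clean.jl`, and `10_figures.jl` in the folder, the function returns them in exactly that order and ignores files without a numeric prefix.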

Directories

Make directories portable

In other words: use relative paths. Or don’t hard-code your paths

Instead of

C:/User/me/my_research/data/my_data_file.csv

use a relative path like this

data/my_data_file.csv

Directories

It’s not always possible to have relative paths everywhere

In that case, you can define all the paths in one single place and read it from there every time

But there are many packages that take care of that for you (e.g., package here for R)
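A small sketch of "define all the paths in one single place", in Python for illustration. The function name `project_paths` and the dictionary keys are hypothetical; the file name comes from the example above.

```python
from pathlib import Path


def project_paths(root):
    """Define every project path once, relative to a single root folder."""
    root = Path(root)
    return {
        "root": root,
        "data": root / "data",
        "output": root / "output",
        # File name taken from the slide's example
        "my_data": root / "data" / "my_data_file.csv",
    }


# In a script, the root is typically the folder the code is run from,
# so nothing machine-specific (like C:/User/me/...) is hard-coded.
paths = project_paths(Path.cwd())
print(paths["my_data"].relative_to(paths["root"]).as_posix())
```

If the project moves to another computer, only the root changes; every other path follows from it.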

Project management

Before we continue…

By now you hopefully have

GitHub Desktop installed on your laptop
A GitHub account (and sent me your username)

After class, please:

Watch out for an invitation for our GitHub Classroom repository
Accept the invitation

Version Control: why bother?

Git

Git is a distributed version control system
- Imagine if Dropbox and the “Track changes” feature in MS Word had a baby. Git would be that baby

In fact, it’s even better than that because Git is optimized for the things that economists spend a lot of time working on (e.g. code)
It gives you an easy way to test experimental changes (e.g. new specifications, additional model states) and not have them mess with your main code

GitHub

Git \(\neq\) GitHub

GitHub hosts a bunch of online services we want when using Git
- Hosts a copy of your repository online
- Allows for people to suggest changes to your project
- Keeps track of team communication on tasks
- And even lets you host some related content (like these slides!)
- You can even program and run your code on GitHub Codespaces
It’s also the main location for non-base Julia (and R) packages to be stored and developed

The differences

Git is the software infrastructure for versioning and merging files

GitHub provides an online service to coordinate working with Git repositories

And adds some additional features for managing projects
- Stores the project on the cloud, allows for task management, creation of groups, etc

Why Git(Hub)?

Selfish reasons

The private benefits of having well-versioned code in case you need to go back to previous stages

Your directories will be super clean

Makes it MUCH easier to collaborate on projects

Why Git(Hub)?

Semi-altruistic reasons

The external benefits of open science, collaboration, etc

These external benefits also generate some downstream private reputational benefits
- You must be confident in your code to make it public
Can improve future social efficiency
- You commit to post future code (if you don’t, it’ll look shady)

Git basics

Everything on Git is stored in something called a repository or repo for short. This is the directory for a project

Local: a directory with a .git subdirectory that stores the history of changes to the repository
Remote: a website, e.g. see the GitHub repo for the Optim package in Julia

Git basics

Creating a new repo on GitHub

Let’s create a new repo

Easy from GitHub website: just click on that green New button from the launch page

Creating a new repo on GitHub

Next steps:

Choose a name
Choose a description
Choose whether the repo is public or private
Choose whether you want to add a README.md (yes), or a .gitignore or a LICENSE.md file (more next slide)

Creating a new repo on GitHub

Repos come with some common files in them

.gitignore: lists files/directories/extensions that Git shouldn’t track (raw data, restricted data, those weird LaTeX by-product files). This is usually a good idea
README.md: a Markdown file that is basically the welcome content on the repo’s GitHub page. You should generally initialize a repo with one of these
LICENSE.md: describes the license agreement for the repository
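To make the .gitignore idea concrete, here is a Python sketch that writes one. The patterns are examples only (the folder names `data/raw/` and `temp/` are assumptions about your layout), and the function name is hypothetical. In practice you can also just type these lines into the file by hand.

```python
from pathlib import Path

# Example patterns only; adjust them to your own project
IGNORE_PATTERNS = [
    "data/raw/",      # raw or restricted data kept out of the repo
    "temp/",          # intermediate outputs
    "*.aux",          # LaTeX by-product files
    "*.log",
    "*.synctex.gz",
]


def write_gitignore(repo_dir, patterns=IGNORE_PATTERNS):
    """Write one pattern per line to <repo_dir>/.gitignore and return its path."""
    path = Path(repo_dir) / ".gitignore"
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path
```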

Repo of Optim.jl again

Creating a new repo on GitHub

You can find the repo at https://github.com/dscardoso/ace592_example_repo

How do I get a repo on GitHub onto my computer?

Clone

To get the repository on your local machine you need to clone the repo

Key thing: this will link your local repository to the remote

You’ll be able to update your local when the remote is changed

Cloning

Click on

Code > Open with GitHub Desktop

You can also use the Git command line for that (we won’t cover it here)

Your turn!

Create and clone your own repository on GitHub and initialize it with a README.md file

Git workflow

Workspace: actual files on your computer
Repository/local: your saved local history of changes to the files in the repository
Remote/origin: remote repository on GitHub that allows for sharing across collaborators

Using Git

There are only a few basic Git operations you need to know for versioning solo economics research efficiently

Add/Stage: Add files & modifications to the index

Take a snapshot of the changes you want updated/saved in your local repository (i.e. your computer)

Commit: Record the changes to your local repository

This requires a short message to record what was done or changed

Using Git

Push: Send the changes you committed in the local repository to the remote repository (i.e. GitHub)

Pull: Take changes on the remote and integrate them with the local repository
- Technically two operations: fetch and merge
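GitHub Desktop runs Git commands behind its buttons. As a rough sketch of one add, commit, push cycle, the Python below builds the equivalent command-line calls. The helper names are hypothetical, and `run_commands` assumes Git is installed and the folder is a cloned repo.

```python
import shlex
import subprocess


def stage_commit_push(files, message):
    """Return the Git commands for one add -> commit -> push cycle."""
    return [
        ["git", "add", *files],            # stage: add changes to the index
        ["git", "commit", "-m", message],  # commit: record them locally
        ["git", "push"],                   # push: send commits to the remote
    ]


def run_commands(commands, repo_dir="."):
    """Run each command inside repo_dir, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)


commands = stage_commit_push(["README.md"], "First README.md edit.")
for cmd in commands:
    # shlex.join quotes the commit message the way a shell would need it
    print(shlex.join(cmd))
```

The script only prints the commands; calling `run_commands(commands, "path/to/your/repo")` would actually execute them.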

Git workflow: a sequence

Your turn!

In your own repository do the following:

Open README.md in some text editor and insert the following code: # Hello World!
Save README.md
Add the changes to README.md to the index
Commit the changes to your local repo with the message: “First README.md edit.”
Push the changes to your remote

Did the changes show up on your repo’s GitHub page?

Using Git: branching

Some more (but not very) advanced operations relate to branching and pull requests

Branching creates different, but parallel, versions of your code

If you want to test out a new feature of your model but don’t want to contaminate your main branch, create a new branch and add the feature there
If it works out, you can bring the changes back into main
If it doesn’t, just delete it

Using Git: branching

It is easy to create a branch on GitHub Desktop

Using Git: branching

And also to switch and merge your branch into the main

Your turn!

In your own repository do the following:

Create a new branch called test-branch
Edit README.md and add the following code: ## your_name_here
Save README.md
Add the changes to README.md to the index
Commit and push the changes with the message: “Test change to README.md.”
Switch to the main branch
Choose your test-branch to merge into the main branch
Push the changes to your remote
Check your repo’s GitHub page

Pull requests

The branch + merge we just did is the standard workflow if you are working alone
When you are working with collaborators, before merging it is best to announce first that you have finished the branch or completed a new feature
This is done with a pull request

In practice, this is also a way to group a bunch of commits into a single new feature and let others know about it

Pull requests

Once you have committed and pushed changes to a branch, you create a new pull request on GitHub Desktop…

Pull requests

…or on the GitHub website

Pull requests

Enter a description and assign any reviewers

Pull requests

Once you and your collaborators are happy with it, just go ahead and click Merge pull request

Pull requests

And you’re done!

Your turn!

In your own repository do the following:

Switch back to test-branch
Create a new file called new_feature.jl and write anything in it
Commit and push the changes with the message “Adding new feature”
Create a pull request and add “New feature” as a description

(Here is where you and collaborators would discuss/agree)

Go ahead and “Merge pull request”

Team up!

Find a partner for this next piece
One of you invites the other to collaborate on the project: GitHub repo > Settings > Collaborators > Add people

Team up!

If you were the one being invited, accept the invite, and clone the repo to your local

Now do the following:

Each of you: edit the # Hello World! line of code so that it says something else, different from your partner’s version
Commit the changes to your local
Have the repo creator push their changes
Have the collaborator push their changes

Can’t push changes when you aren’t updated

It turns out that the second person can’t push their local changes to the remote

The second person is pushing their history of changes

But the remote is already one commit ahead because of the first person, so the second person’s changes can’t be pushed

Update by pulling after you commit local changes

You need to pull the remote changes first. But when you pull, your commit and the remote’s commit both change the same line, and you get a merge conflict in README.md

Merge conflicts

This means there were differences between the remote and your local that conflicted

Sometimes there will be conflicts between two separate histories

E.g. if you and your collaborator edited the same chunk of code separately on your local repos

When you try to merge these histories (here, by pulling the remote changes into your local repo), Git will report a merge conflict

Merge conflicts

Good code editors (like Visual Studio Code) “understand” git and will show you nicely where the conflict is

Solving merge conflicts

<<<<<<< HEAD

Indicates the start of the conflicted code

=======

Separates the two different conflicting histories

>>>>>>> lots of numbers and letters

Indicates the end of the conflicted code and the hash (don’t worry about it) for the specific commit
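To see how the three markers delimit the two versions, here is a small Python sketch that splits a conflicted text into "yours" and "theirs". It handles a single conflict block only; the example content and the hash `1a2b3c4` are made up for illustration.

```python
def split_conflict(text):
    """Split text containing one Git conflict block into its two versions.

    Returns (ours, theirs): the lines between '<<<<<<<' and '=======',
    and the lines between '=======' and '>>>>>>>'.
    """
    ours, theirs = [], []
    section = None  # None means we are outside the conflict block
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            section = ours
        elif line.startswith("=======") and section is ours:
            section = theirs
        elif line.startswith(">>>>>>>"):
            section = None
        elif section is not None:
            section.append(line)
    return ours, theirs


# Hypothetical conflicted README.md content
example = """<<<<<<< HEAD
# Hello Illinois!
=======
# Hello Champaign!
>>>>>>> 1a2b3c4
"""
ours, theirs = split_conflict(example)
print("Your version:  ", ours)
print("Their version: ", theirs)
```

Resolving the conflict means keeping one version (or a mix of both) and deleting the three marker lines.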

Fixing a merge conflict

Merge conflicts can be fixed by directly editing the file. Then click Continue merge in GitHub Desktop. Fixed!

Back to Gentzkow and Shapiro’s rules

Version control

Store code and data under version control

Now you know how to do that with Git(Hub)
But I don’t recommend using it for large data sets
- Might actually be impossible because GitHub sets a strict size limit of 100 MB per file
- For large data, use Dropbox/OneDrive/Box (and symbolic links if you collaborate on them!)
- For restricted/confidential data: DEFINITELY don’t use any of the above
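A quick way to avoid hitting the size limit is to look for large files before committing. The Python sketch below is one possible check; the function name is hypothetical, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

LIMIT_BYTES = 100 * 1024 * 1024  # assumption: the 100 MB limit read as 100 MiB


def find_large_files(repo_dir, limit=LIMIT_BYTES):
    """Return paths (relative to repo_dir) of files at or above the size limit.

    Anything inside the .git folder is skipped.
    """
    root = Path(repo_dir)
    large = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts or not path.is_file():
            continue
        if path.stat().st_size >= limit:
            large.append(rel.as_posix())
    return sorted(large)
```

Any file this reports is a candidate for the .gitignore and for storage outside the repository.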

Run the whole directory before checking it back in

In other words: avoid committing a version with bugs or that breaks other code in your project

Management

Manage tasks with a task management system

E-mail is not a task management system

You can do that in GitHub!
- With the added benefit that you can easily link changes to tasks (we’ll see that in a bit)
But there are many other tools
- Some examples: Asana, Trello, Notion, and even Outlook tasks

Managing tasks and workflow with GitHub

GitHub is also very useful for task management in solo or group projects using issues and pull requests

Issues: task management for you and your collaborators

It should be able to completely replace email
- With the added benefit of organizing your discussions and decisions by topic

Let’s look at the issues for the Optim package in Julia

Issues

The issues tab reports a list of 56 open issues (286 closed, i.e., task or problem has been solved)
Each issue has its own title
Let’s look at one example of an issue

Issues

One person reported issues with the documentation of a function, which does not match the actual function
Someone else responded with some feedback

Issues

It is easy to create a new task or issue: from the issues tab, click the green New issue button, which takes you here

Issues

Then you can

Add a title
Add a description
Assign the task to a collaborator
Add labels/tags

Issues

The issue keeps track of the history of everything that’s happened to it

Issues

You can reference people with @ which brings up a dropdown menu of all collaborators on the project

Issues

You can also reference other issues if they’re related by using # which brings up a dropdown of all issues for your repository

Issues

Issues can also be referenced in your commits to your project by adding #issue_number_here to the commit message
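To illustrate the `#issue_number_here` convention, this Python sketch pulls the issue numbers out of a commit message. It only mimics the pattern; it is not how GitHub itself links commits to issues, and the example message is made up.

```python
import re


def referenced_issues(commit_message):
    """Return the issue numbers mentioned as #<number> in a commit message."""
    return [int(n) for n in re.findall(r"#(\d+)", commit_message)]


# Hypothetical commit message mentioning two issues
print(referenced_issues("Fix table labels, see #12 and #7"))
```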

Issues

Then those commits show up in your issue so you have a history of what code changes have been made

Issues

If you click on the commit, it takes you to the git diff which shows you any changes to files made in that commit

Other stuff on GitHub

GitHub keeps adding new features for project management
Three of the newest additions are
- Discussions: basically a messaging board for your repo. Threads can be created independent of issues
- Projects: let you create organization spaces with Kanban boards (To do/Doing/Done columns)
- Codespaces: let you create virtual machines with all your software that you can program and run from a browser
  - They give you free computing hours per month, but you gotta pay for more time or better computers

Git FAQ

FAQ

Q: When should I commit (and push) changes?

A: Early and often

It’s not quite as important as saving your work regularly, but it’s a close second
You should certainly push everything that you want your collaborators to see

FAQ

Q: Do I need branches if I am working on a solo project?

A: You don’t really need them, but they offer big advantages in maintaining a sane workflow

Experiment without any risk to the main project!

FAQ

Q: What’s the difference between cloning and forking a repo?

A: Cloning directly ties your local version to the original repo, while forking creates a copy on your GitHub (which you can then clone)

Cloning makes it easier to fetch updates (and is often the best choice for new GitHub users), but forking has advantages too.

FAQ

Q: What happens when something goes wrong?

A: Look for help

Command line-based fixes: ohshitgit.com
Also, search on Stack Exchange

FAQ

Q: What happens when something goes horribly wrong?

A: Burn it down and start again: http://happygitwithr.com/burn.html

This is a great advantage of Git’s distributed nature:
- If something goes horribly wrong, there’s usually an intact version somewhere else

Appendix

Submitting your work with GitHub Classroom

Submitting your work

To help you get familiarized with GitHub workflow, all your submitted work will be done on a repository
After you accept an invitation from GitHub Classroom, it will create a new private repository
- Only you(r team) and Diego will be able to see its content

GitHub will automatically create a pull request called Feedback
- This is where I’ll write any feedback I might have for your files
- Please do not close or merge this pull request. Leave it open for the rest of the semester

Submitting your work

Follow the link you received in a Canvas announcement. It’s an invitation to create a tutorial repository
This repository only has a README file with a summary of how GitHub works

Typically, problem sets will include a README file with the instructions to solve the problem
- When applicable, it will also include starter scripts to set up the problem environment

You can (and probably should) make as many commits as you want in that repository
I will grade the latest commit before the deadline
- I will appreciate it if you do not commit changes after the deadline. If you need an extension, please send me an email

Submitting your work

Your Feedback pull request page will look something like this after your first commit

Next class

We’ll start programming in the Julia language

Before next class, please: follow these instructions to install Julia and Visual Studio Code (VS Code) on your laptop

If you plan to use a different programming language (Python, R, or Matlab), this is the time to let me know!

We won’t cover it, but you might also be interested in learning more about Quarto and Jupyter for programming using code notebooks
- They allow you to write formatted text along with code blocks and render graphics in the same document
- You can learn more about Quarto here
- You can learn more about Jupyter here
