QTM 350 - Data Science Computing

Lecture 05 - Git and GitHub

Danilo Freire

Emory University

16 September, 2024

Nice to see you all again! 😊

Recap and lecture overview 📚

Recap of our last lecture

In our last class, we covered

  • Essential navigation commands: ls, cd, pwd
  • Useful directory shortcuts: ~ (home), . (current), .. (parent directory)
  • File manipulation commands: touch, mkdir, rm, rmdir
  • File content commands: cat, head, tail
  • Wildcards for efficient file matching: * (any characters), ? (any single character)
  • Batch operations with {}
  • Chaining commands with && and ||
  • Text manipulation tools:
    • wc, grep, sed

Today’s lecture

  • Learn how to think in terms of projects
  • Introduction to version control systems
  • What is Git and why use it?
  • How to add, commit, and push changes to a repository
  • GitHub and its role in collaborative projects
  • How to fork, clone, and create pull requests
  • How to resolve conflicts
  • Best practices for using Git and GitHub

Project management 📂

Taming chaos

In the data science workflow, there are two sorts of surprises and cognitive stress:

  1. Analytical (often good)
  2. Infrastructural (almost always bad)
  • Analytical surprise is when you learn something from or about the data.
  • Infrastructural surprise is when you discover that:
    • You can’t find what you did before.
    • The analysis code breaks.
    • The report doesn’t compile.
    • The collaborator can’t run your code.
  • Good project management lets you focus on the right kind of stress.

Keeping Future-you happy

  • It’s often tempting to set up a project assuming that you will be the only person working on it, e.g. as homework.
  • That’s almost never true.
  • Coauthors and collaborators happen to the best of us.
  • Even if not, there’s someone else who you always have to keep happy: Future-you.
  • Future-you is really the one you organise your projects for.
  • Most importantly, Future-you is the one who will enjoy the fruits of your data science labour, or have to fight through your chaos.
  • So, be kind to Future-you. Establish a good workflow. You’ll thank yourself later.

Project setup 👷🏻‍♀️👷🏽‍♂️

You should always think in terms of projects.

A project is a self-contained unit of data science work that can be

  • Shared (e.g., with collaborators)
  • Recreated by others (including Future-you)
  • Packaged (e.g., as a report)
  • Dumped (exported or archived)

A project contains:

  • Content, e.g., raw data, processed data, scripts, functions, documents and other output

  • Metadata, e.g., information about tools for running it (required libraries, compilers), version history

For R projects, for example:

  • Projects are folders/directories.
  • Metadata is the RStudio project (.Rproj) file (perhaps augmented with the output of renv for dependency management) plus the .git directory.

Project setup: Directories

Recommendations

  • One parent folder contains everything inside it.
  • You decide what goes in the project folder. The project dictates the structure.
  • Keep input separate from output. Definitely separate raw from processed data!
  • All internal paths are relative. Absolute paths are bad paths. Don’t feed functions with paths like "/Users/me/data/thing.csv".
  • Those paths will not work outside your computer (or maybe not even there, some days/weeks/months ahead).
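
A minimal sketch of these recommendations, run in a scratch directory. The project name my-analysis, the folder names and the file thing.csv are illustrative assumptions, not a required layout.

```shell
# Work in a throwaway location so none of your real files are touched.
cd "$(mktemp -d)"

# One parent folder holds everything (the name is hypothetical).
mkdir my-analysis
cd my-analysis

# Keep input separate from output, and raw data separate from processed data.
mkdir -p data/raw data/processed scripts output

# A small made-up raw data file.
printf 'id,value\n1,10\n2,20\n' > data/raw/thing.csv

# Relative paths: they keep working wherever the project folder is copied,
# as long as commands are run from the project root.
head -n 1 data/raw/thing.csv > output/header.txt
cat output/header.txt
```

Because no path starts with /Users/... or similar, a collaborator can copy the whole my-analysis folder and run the same commands unchanged.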

Project setup: Scripts

  • Scripts are the glue that holds your project together.
  • They should be readable and reproducible.
  • Names should only include letters and numbers with dashes - or underscores _ to separate words.
  • Use numbering to indicate the order in which files should be run:
    • 00-setup.py
    • 01-import-data.py
    • 02-preprocess-data.py
  • Write short, modular scripts. Every script serves a purpose in your pipeline.
  • Put the setup first (e.g., import statements in Python, or library() and source() in R).
  • Always comment more than you usually do.
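
One benefit of numbered file names is that a simple loop can run the whole pipeline in order. The sketch below uses tiny stand-in shell scripts (the .sh files are an assumption made so the example runs anywhere; in the course the scripts would be the .py files listed above).

```shell
# Work in a throwaway location.
cd "$(mktemp -d)"

# Create three stand-in pipeline scripts; each one only prints its own name.
for name in 00-setup 01-import-data 02-preprocess-data; do
  printf 'echo "running %s"\n' "$name" > "$name.sh"
done

# The glob expands in sorted order, so the numeric prefixes
# decide the order in which the scripts run.
for script in [0-9][0-9]-*.sh; do
  sh "$script"
done > run.log

cat run.log
```

The same loop would work for Python scripts by matching *.py and calling the Python interpreter instead of sh.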

Version control systems 🔄

What is version control?

  • Version control is a way to track changes to your files.
  • The history is usually saved as a series of snapshots and branches, which you can move back and forth between.
  • Version control allows you to view how a project has progressed over time.
  • It allows you to:
    • Distribute your file changes over time
    • Protect against data loss or damage by creating backup snapshots
    • Manage complex project structures (e.g., Linux)

Why version control?

More reasons to use version control

Have you ever…

  • Changed your code, realised it was a mistake and wanted to revert it?
  • Lost code or had a backup that was too old?
  • Wanted to see the difference between different versions of your code?
  • Wanted to review the history of some code?
  • Wanted to submit a change to someone else’s code?
  • Wanted to see how much work is being done, when, and by whom?

Version control can help with all of these.

Git and GitHub 🐙

What is Git?

  • Git is a distributed version control system.
  • Imagine if your Dropbox and the “Track changes” feature in MS Word had a baby.
  • In fact, it’s even better than that because Git is optimised for code.
  • There is a learning curve, but it’s worth it.
  • Being familiar with Git is taken for granted when you interact with other data scientists.
  • It is not the only version control software, but certainly the most popular one.
  • According to Stack Overflow’s 2022 Developer Survey, about 94% of developers use Git.

  • Git and GitHub are distinct things.
  • GitHub is an online hosting platform that allows you to host your code online.
  • It relies on Git and makes some of its functionality more accessible.
  • Also, it provides many more useful features to collaborate with others.
  • Just like we don’t need RStudio to run R code, we don’t need GitHub to use Git… But it will make our lives easier.

Git: some background

Where does Git come from?

  • Git was created in 2005 by Linux creator Linus Torvalds.
  • The initial motivation was to have a non-proprietary version control system to manage Linux kernel development.
  • Check out this (quite opinionated) talk by Linus Torvalds on Git two years after its creation.
  • There are many Git GUIs, giving you the option to use Git without the shell (often with reduced functionality). Popular choices are GitHub Desktop and the Git integration in VS Code.

GitHub: some background

Where does GitHub come from?

  • GitHub was created in 2008 and is now owned by Microsoft.
  • GitHub offers various subscription plans and has expanded its services beyond hosting Git-based version control.
  • GitHub is also widely used to document scientific projects, host websites, and more.
  • GitHub is a social network for developers. It’s like Facebook, but for code.
  • I’m a big fan of GitHub! My website and, as you know, this course are hosted there.

Second step: install Git

Again, Git is an independent piece of software. You need to have it installed on your machine to call it from the command line or VS Code.

Chances are that that’s already the case. Here’s how you can check using the command line:

which git

And here’s how you can check the version:

git --version

If you want to install (or update) Git on your Mac/Linux machine, I recommend using Homebrew, “the missing package manager for macOS (or Linux)”:

brew install git

To install/update Git for Windows, check out happygitwithr.com.
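
If you prefer a check that prints a clear message either way, here is a small sketch. It relies on command -v, a standard shell built-in that reports whether a program can be found; the variable name git_status is just a label chosen for this example.

```shell
# command -v succeeds only if the shell can find a program called git.
if command -v git >/dev/null 2>&1; then
  git_status="found: $(git --version)"
else
  git_status="not installed or not on the PATH"
fi

echo "Git $git_status"
```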

Third step: introduce yourself to Git

This is particularly important when you work with Git but without the GitHub overhead. The idea is to define how your commits are labelled. Others should easily identify your commits as coming from you.

Have you already introduced yourself to Git? Find out:

git config --list

Still have to introduce yourself? To that end, we set our user name and email address like this:

git config --global user.name 'danilofreire'
git config --global user.email 'danilo.freire@emory.edu'

The user name can be (but does not have to be) your GitHub user name. The email address should definitely be the one associated with your GitHub account.

Check out these setup instructions from Software Carpentry to learn about more configuration options.
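
The --global flag stores your identity for every repository on your machine. As a sketch, the same keys can also be set for one repository only by leaving out --global. The name and address below are placeholders, and the scratch repository exists only for this example.

```shell
# Create a throwaway repository to experiment in.
cd "$(mktemp -d)"
git init --quiet

# Without --global, the setting is written to this repository only.
git config user.name 'Example Student'        # placeholder name
git config user.email 'student@example.com'   # placeholder address

# Read single values back instead of scanning the full --list output.
git config user.name
git config user.email
```

A repository-level setting takes precedence over the global one inside that repository, which can be handy if you use different addresses for different projects.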

Git from the shell 🐚

Repositories

  • Repositories, usually called ‘repos’, store the full history and source control of a project.
  • They can either be hosted locally, or on a shared server, such as GitHub.
  • Many repositories are hosted on GitHub, while contributors make copies of the repository on their own machines and keep both in sync using the push/pull system.
  • Any repository stored somewhere other than locally is called a ‘remote repository’.

Repos vs Directories

  • Repositories are timelines of the entire project, including all changes made previously.
  • Directories, or ‘working directories’ are projects at their current state in time.
  • Any local directory interacting with a repository is technically a repository itself. However, it is better to call these directories ‘local repositories’, as they are instances of a remote repository.

Workflow Diagram

  • This diagram shows how the basic Git workflow works.
  • The staging area is the bundle of all the modifications to the project that are going to be committed.
  • A ‘commit’ is similar to taking a snapshot of the current state of the project, then storing it on a timeline.
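
To make the staging area concrete, here is a minimal sketch in a scratch repository. The file names and the identity are placeholders chosen for this example. Only the staged file ends up in the commit.

```shell
# Create a throwaway repository with a placeholder identity.
cd "$(mktemp -d)"
git init --quiet
git config user.name 'Example Student'
git config user.email 'student@example.com'

# Two new files in the working directory (hypothetical names).
touch analysis.py notes.txt

# Stage only one of them: the staging area holds what the next commit will contain.
git add analysis.py

# Short status: "A" marks a staged new file, "??" an untracked one.
git status --short

# The commit takes a snapshot of the staging area, not of the whole folder.
git commit --quiet -m "add analysis script"

# List the files recorded in the latest commit.
git ls-tree --name-only HEAD
```

After the commit, notes.txt is still sitting in the folder but is not part of the history until it is added and committed too.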

Main Git commands

  • git init: Initialises a new Git repository.
  • git clone: Copies an existing Git repository.
  • git add: Adds changes to the staging area.
  • git commit: Commits changes to the repository.
  • git push: Pushes changes to a remote repository.
  • git pull: Pulls changes from a remote repository.
  • git status: Shows the status of the working directory.
  • git log: Shows the commit history.
  • git branch: Shows the branches in the repository.
  • git checkout: Switches branches.

Git documentation

Hands-on: Git and GitHub 🤲🏽

Let’s create a repository together

  • Let’s create a new repository on GitHub from scratch!
  • We will create it on our local machine and push it to GitHub.
  • Next class, I will show you how to clone it and make changes.
  • We will also learn how to create branches, merge them, and resolve conflicts.

Let’s do it! 🚀

Creating a new repository

  1. I will create a new folder/directory on my computer: my_project
  2. Open the bash/zsh Terminal and go to the my_project directory
  3. I will copy the my_project directory path into a text document
  4. I will try to add this folder to the staging area.
mkdir my_project
cd my_project
git add .
  5. Error!
  6. We need to initialise the repository first. Do not do that in your home directory!
git init
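
The sketch below replays these steps in a scratch location and adds two checks that are not in the steps above: looking for the hidden .git folder, and asking Git directly whether the current folder belongs to a repository.

```shell
# Work in a throwaway location, not in your home directory.
cd "$(mktemp -d)"
mkdir my_project
cd my_project

# Before git init, Git commands fail because there is no repository here.
if git add . 2>/dev/null; then
  echo "unexpected: already inside a repository"
else
  echo "not a repository yet"
fi

git init --quiet

# git init creates a hidden .git folder that stores the history.
ls -d .git

# Prints "true" when the current folder belongs to a repository.
git rev-parse --is-inside-work-tree
```

Deleting the .git folder would turn my_project back into an ordinary directory and discard its history.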

Adding/Removing files from the repo

  1. Check the staging area status.
git status
  2. Let’s add a 01-cleaning-data.py file to the staging area. Then, check its status.
touch 01-cleaning-data.py
git add 01-cleaning-data.py
git status

Adding/Removing files from the repo

  1. Now let’s create more files and add them to the staging area.
touch 02-exploratory-data-analysis.py
touch 03-modelling.py
touch 04-visualisation.py
git add .
git status

Adding/Removing files from the repo

  1. In the directory, let’s delete the 04-visualisation.py file and check the status:
rm -f 04-visualisation.py
git status
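
A compact way to read this situation is the short status format, sketched here in a scratch repository. Each file gets two letters: the first describes the staging area, the second the working directory.

```shell
# Create a throwaway repository.
cd "$(mktemp -d)"
git init --quiet

touch 03-modelling.py 04-visualisation.py
git add .

# Delete one file from the folder after it has been staged.
rm -f 04-visualisation.py

# "A " = staged and still present in the folder.
# "AD" = staged, but deleted from the folder afterwards.
git status --short
```

The staged copy of 04-visualisation.py is still in the staging area, so a commit made at this point includes the file even though it is no longer in the folder.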

First commit

  1. (cont.) In the directory, let’s make our “initial commit”:
git commit -m "initial commit"

First Commit

  1. To check all commits in your repo:
git log
  • The most important things here are:
    • commit id;
    • date/time;
    • branch;
    • commit message;
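
The full git log output can get long. As a sketch, the options below show the same information in a more compact form. The repository and the identity are placeholders created only for this example.

```shell
# Create a throwaway repository with a placeholder identity and one commit.
cd "$(mktemp -d)"
git init --quiet
git config user.name 'Example Student'
git config user.email 'student@example.com'

touch 01-cleaning-data.py
git add .
git commit --quiet -m "initial commit"

# One line per commit: abbreviated commit id and message.
git log --oneline

# Pick individual fields for the latest commit:
# %H = full commit id, %an = author name, %ad = author date, %s = message.
git log -1 --format='id: %H%nauthor: %an%ndate: %ad%nmessage: %s'
```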

Git Checkout

  1. First, let’s make new commits in our repo:
touch 04-new-visualisation.py # new file
touch 05-comments.py # new file
git add . # add files to the staging area
git commit -m "adding files" # new commit
echo "Hello you" >> 05-comments.py # edit file
git add . # add files to the staging area
git commit -m "editing 05-comments.py" # new commit
rm -f 05-comments.py # delete file
git commit -a -m "deleting 05-comments.py" # new commit
git log  # let's check

Hands on! Git Checkout

  1. We use checkout to go back in time to a given commit. Let’s go back to the “initial commit” (replace the id below with the one that git log shows for your own initial commit):
git checkout 470636f38e409f4f322c48183e19633ebb550625
  • Check the folder!
  • 04-visualisation.py is back!
  • Important: doing this does not delete our commits. We just move back in time!
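
Copying a 40-character commit id by hand is error-prone. As a sketch, the id of the very first commit can also be looked up with git rev-list and stored in a variable. The scratch repository, the identity and the variable name first are assumptions made for this example.

```shell
# Create a throwaway repository with a placeholder identity.
cd "$(mktemp -d)"
git init --quiet
git config user.name 'Example Student'
git config user.email 'student@example.com'

# First commit: contains 04-visualisation.py.
touch 04-visualisation.py
git add .
git commit --quiet -m "initial commit"

# Second commit: removes the file again.
git rm --quiet 04-visualisation.py
git commit --quiet -m "delete visualisation script"

# --max-parents=0 keeps only commits without a parent, i.e. the first commit.
first="$(git rev-list --max-parents=0 HEAD)"

# Go back in time; --quiet hides the "detached HEAD" notice Git prints here.
git checkout --quiet "$first"

# The deleted file is back in the folder...
ls

# ...and both commits still exist in the history.
git log --oneline --all
```

The “detached HEAD” notice simply means that you are looking at an old snapshot rather than at the tip of a branch.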

Git Checkout

  1. To go “back to the future”, that is, to the most recent commit, we just need to check out the main branch again (it may be called master, depending on your Git configuration):
git checkout main
git log
  • Check the folder/directory again!
  • 04-visualisation.py is gone and 04-new-visualisation.py is back!
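
If you are not sure what your branch is called, a hedged alternative is git checkout -, where the dash means “the branch I was on before”. The sketch below shows this in a scratch repository with a placeholder identity.

```shell
# Create a throwaway repository with a placeholder identity.
cd "$(mktemp -d)"
git init --quiet
git config user.name 'Example Student'
git config user.email 'student@example.com'

# Two commits: create a file, then rename it.
touch 04-visualisation.py
git add .
git commit --quiet -m "initial commit"
git mv 04-visualisation.py 04-new-visualisation.py
git commit --quiet -m "rename visualisation script"

# Visit the first commit, as in the previous step.
git checkout --quiet "$(git rev-list --max-parents=0 HEAD)"

# "-" returns to the previous branch, whether it is called main or master.
git checkout --quiet -

# Confirm where we are: print the branch name and list the current files.
git rev-parse --abbrev-ref HEAD
ls
```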

Next steps 🙂

  • Phew! That was a lot of work! 😅
  • Next class we will see how we can create branches and merge them.
  • We will also learn how to resolve conflicts and push our changes to GitHub.
  • For now, let’s keep our my_project folder and we will use it in our next class.
  • You can try to create a repository on GitHub and push your changes there! 🤓
  • Questions?

Summary 📝

Today we learned about:

  • The importance of project management
  • The benefits of version control systems
  • What is Git and why use it
  • Why GitHub is a great tool for collaborative projects
  • The main Git commands
  • The difference between repositories and directories
  • How to add, remove, commit, and check changes in a repository

Questions? 🤓

Thank you very much and see you soon! 😊🙏🏻