Lecture Lab 7

Leon Eyrich Jessen, Daniel Romero

The nightmare of keeping track of file versions

  • How many of you have run into this situation…? Now imagine multiplying this by dozens of files in a common Bio Data Science project!
  • This is a nightmare to work with and terrible for reproducibility!

Version Control to the rescue

  • To avoid situations like that, we use a Version Control System (VCS)
  • i.e., a program that lets us store snapshots of our files that we can go back to at any point in time
  • git is by far the most popular Version Control tool for software projects in the world (>95% of all software projects)

  • git is only for code!
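
To make "snapshots that we can go back to" concrete, here is a minimal command-line sketch. It assumes git 2.23 or newer (for `git restore`) and works in a throwaway directory; the file name `settings.R`, its contents, and the demo identity are hypothetical. The individual commands are introduced in the workflow that follows.

```shell
# Work in a throwaway directory so no existing project is touched
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# Snapshot 1
echo "threshold <- 0.05" > settings.R
git add settings.R
git commit --quiet -m "Set significance threshold to 0.05"

# Snapshot 2
echo "threshold <- 0.01" > settings.R
git add settings.R
git commit --quiet -m "Lower significance threshold to 0.01"

# Look at the file as it was one snapshot ago, without changing anything
git show HEAD~1:settings.R

# Bring the older version of the file back into the working directory
git restore --source=HEAD~1 settings.R
cat settings.R
```

Both snapshots stay in the history, so nothing is lost by looking at or restoring an older version.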

The git workflow

  1. Initialize the git repository (only once per project)
  2. Write and modify your code as usual
  3. Add your changes to the staging area
  4. Review the staging area
  5. Commit your changes (save a version that you can return to at any time)
  6. Return to step 2

  • Commit messages should be informative
  • Ideally, every commit is a working version of the code
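
On the command line, the workflow above could look roughly like the sketch below. It is only an illustration: the file name `analysis.R`, the commit message, and the demo identity are made up, and everything happens in a throwaway directory.

```shell
# Work in a throwaway directory so no existing project is touched
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# 1. Initialize the git repository (only once per project)
git init --quiet

# Placeholder identity, set for this demo repository only
git config user.name "Demo User"
git config user.email "demo@example.com"

# 2. Write and modify your code as usual
echo 'print("Hello, Bio Data Science!")' > analysis.R

# 3. Add your changes to the staging area
git add analysis.R

# 4. Review the staging area
git status --short

# 5. Commit your changes with an informative message
git commit --quiet -m "Add first version of analysis script"

# Inspect the history of saved versions
git log --oneline
```

From here you would return to step 2: edit, add, review, commit.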

Centralized repositories

  • The most common way to use git is pairing it with a git repository hosting platform
  • Web services that will store your git repository and give you extra features on top of git

  • GitHub is the most popular, but there are many others:

  • git and GitHub are related but completely different things! Think of it as R vs. RStudio

Collaborative Bio Data Science using GitHub via RStudio

  • When doing assignments, you discovered that it was challenging to collaborate

  • Collaboration is key to success, also when coding!

  • You could:

    • Write your code in a Google Doc and copy/paste
    • Send code snippets in emails
    • Use a WhatsApp group to exchange code
    • Stare at the same screen and come up with 17 suggestions
    • …?
  • All of which would be a recipe for inevitable disaster!

Okay, so… How does your team “do” it?

GitHub is not only a backup, but a collaboration platform

Pulling and pushing!

When working with collaborators, we have to add a few extra steps to the git workflow…

Before

  1. Initialize the git repository (only once per project)
  2. Write and modify your code as usual
  3. Add your changes to the staging area
  4. Review the staging area
  5. Commit your changes (save a version that you can return to at any time)
  6. Return to step 2

After

  1. Initialize the git repository (only once per project)
  2. Write and modify your code as usual
  3. Add your changes to the staging area
  4. Review the staging area
  5. Commit your changes (save a version that you can return to at any time)
  6. Pull new changes from the central repo
  7. Push your changes to the central repo
  8. Return to step 2

After*

  1. Initialize the git repository (only once per project)
  2. Write and modify your code as usual
  3. Add your changes to the staging area
  4. Review the staging area
  5. Commit your changes (save a version that you can return to at any time)
  6. Pull new changes from the central repo
  7. [Occasionally] Solve merge conflicts
  8. Push your changes to the central repo
  9. Return to step 2
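
The pull and push steps can be tried out without a GitHub account. The sketch below uses a local bare repository as a stand-in for the central repo and two clones as stand-ins for two collaborators. The names `alice` and `bob`, the file names, and the commit messages are hypothetical. The two collaborators edit different files here, so no merge conflict arises.

```shell
# Throwaway directory holding a stand-in "central repo" and two clones
work_dir="$(mktemp -d)"
cd "$work_dir"
git init --quiet --bare central.git    # on GitHub this would be a URL

# Collaborator "alice" starts the project and pushes the first commit
git clone --quiet central.git alice
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
echo "x <- 1" > load_data.R
git add load_data.R
git commit --quiet -m "Add data loading script"
git push --quiet -u origin HEAD        # first push also sets the upstream
cd ..

# Collaborator "bob" clones the project and commits a new file locally
git clone --quiet central.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "y <- 2" > plot.R
git add plot.R
git commit --quiet -m "Add plotting script"
cd ..

# Meanwhile, alice commits and pushes another change
cd alice
echo "x <- 10" > load_data.R
git add load_data.R
git commit --quiet -m "Update x in data loading script"
git push --quiet
cd ..

# bob pulls first (merging alice's new commit into his work), then pushes
cd bob
git pull --quiet --no-rebase --no-edit
git push --quiet
git log --oneline --graph
cd ..
```

If bob tried to push before pulling, git would reject the push because the central repo contains a commit he does not have yet. That is why pulling comes before pushing in the list above.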

Branching, a way of developing in parallel

With branches, you can work on your own without disturbing or getting disturbed by other people, then merge at the end.
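
A minimal sketch of branching and merging on the command line, assuming git 2.23 or newer (for `git switch`). The branch name `add-plot`, the file name, and the commit messages are hypothetical.

```shell
# Throwaway repository with one initial commit
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "x <- 1" > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Remember the name of the starting branch (often "main" or "master")
main_branch="$(git branch --show-current)"

# Create a new branch and switch to it
git switch --quiet -c add-plot

# Work on the branch without touching the starting branch
echo "plot(x)" >> analysis.R
git add analysis.R
git commit --quiet -m "Plot x"

# Switch back and merge the finished work
git switch --quiet "$main_branch"
git merge --quiet add-plot
git log --oneline
```

Until the merge, the starting branch is unaffected by anything committed on `add-plot`.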

Pull requests: contributing to other people’s software

A GitHub feature for contributing to other people’s code: work on your own independent branch, then ask the maintainers to pull your code into the main branch

Final notes

  • git is a powerful tool for reproducible data science
  • We won’t do it today, but git is most powerful from the Terminal
  • We will intentionally run you through the common issues with git. Don’t get discouraged!

But really, the best thing is to…

  • Try it out for yourself
  • FOLLOWING TODAY’S EXERCISES POINT-BY-POINT IS SUPERCALIFRAGILISTICEXPIALIDOCIOUS!!!
  • Break and then exercises