Lecture Lab 7
Leon Eyrich Jessen, Daniel Romero
The nightmare of keeping track of file versions
- How many of you have run into this situation…? Now imagine multiplying this by dozens of files in a common Bio Data Science project!
- This is a nightmare to work with and terrible for reproducibility!
Version Control to the rescue
- To avoid situations like that, we use a Version Control System (VCS)
- i.e., a program that lets us store snapshots of our files that we can go back to at any point in time
git is by far the most popular Version Control tool for software projects in the world (>95% of all software projects)
The git workflow
1. Initialize the git repository (only once per project)
2. Write and modify your code as usual
3. Add your changes to the staging area
4. Review the staging area
5. Commit your changes (save a version which you can return to at any time)
6. Return to step 2
- Commit messages should be informative
- Ideally, every commit is a working version of the code
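In the exercises we use git through RStudio, but the same steps can be sketched in the Terminal. This is a minimal sketch, not the course's official procedure: the file name `analysis.R`, the user name and email, and the commit message are placeholders chosen for illustration.

```shell
set -eu

# Work in a throwaway directory so nothing on your machine is touched.
project_dir="$(mktemp -d)"
cd "$project_dir"

# Step 1: initialize the git repository (only once per project).
git init --quiet
# Identity for this repository only; name and email are placeholders.
git config user.name "Example Student"
git config user.email "student@example.com"

# Step 2: write and modify your code as usual (hypothetical file).
echo 'print("Hello, Bio Data Science!")' > analysis.R

# Step 3: add your changes to the staging area.
git add analysis.R

# Step 4: review the staging area.
git status --short

# Step 5: commit your changes with an informative message.
git commit --quiet -m "Add first analysis script"

# The history now holds one snapshot we can return to.
git log --oneline
```

From here you would go back to step 2: edit, add, review, commit.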
Centralized repositories
- The most common way to use git is pairing it with a git repository hosting platform
- Web services that will store your git repository and give you extra features on top of git
- GitHub is the most popular, but there are many others
git and GitHub are related but completely different things! Think of them as R vs. RStudio
Collaborative Bio Data Science using GitHub via RStudio
When doing assignments, you discovered that it was challenging to collaborate
Collaboration is key to success, also when coding!
You could:
- Write your code in a Google Doc and copy/paste
- Send code snippets in emails
- Use a WhatsApp group to exchange code
- Stare at the same screen and come up with 17 suggestions
- …?
All of which would be a recipe for inevitable disaster!
Okay, so… How does your team “do” it?
GitHub is not only a backup, but a collaboration platform
Pulling and pushing!
When working with collaborators, we have to add a few extra steps to the git workflow…
Before
1. Initialize the git repository (only once per project)
2. Write and modify your code as usual
3. Add your changes to the staging area
4. Review the staging area
5. Commit your changes (save a version which you can return to at any time)
6. Return to step 2
After*
1. Initialize the git repository (only once per project)
2. Write and modify your code as usual
3. Add your changes to the staging area
4. Review the staging area
5. Commit your changes (save a version which you can return to at any time)
6. Pull new changes from the central repo
7. [Occasionally] Resolve merge conflicts
8. Push your changes to the central repo
9. Return to step 2
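The pull and push steps can be tried out without a GitHub account. In this minimal sketch a local bare repository stands in for the central repo, and two hypothetical collaborators, `alice` and `bob`, each work in their own clone. The names, the file `analysis.R`, the branch name `main` and the commit messages are all assumptions made for illustration. No merge conflict arises here, because only one person changes the file at a time.

```shell
set -eu

sandbox="$(mktemp -d)"
cd "$sandbox"

# A bare repository plays the role of the central repo (e.g. on GitHub).
git init --quiet --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/main

# Alice clones the (still empty) central repo and makes the first commit.
git clone --quiet central.git alice
cd alice
git symbolic-ref HEAD refs/heads/main   # use "main" whatever git's default is
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo 'data <- read.csv("data.csv")' > analysis.R
git add analysis.R
git commit --quiet -m "Load the data"
git push --quiet origin main            # push to the central repo
cd ..

# Bob clones the central repo and gets Alice's commit.
git clone --quiet central.git bob
cd bob
git config user.name "Bob Example"
git config user.email "bob@example.com"
echo 'summary(data)' >> analysis.R
git add analysis.R
git commit --quiet -m "Summarise the data"
git pull --quiet origin main            # pull first: nothing new this time
git push --quiet origin main            # then push
cd ..

# Alice pulls and receives Bob's commit.
cd alice
git pull --quiet origin main
git log --oneline
cd ..
```

Pulling before pushing matters because the central repo rejects a push when it contains commits that your local copy does not yet have.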
Branching, a way of developing in parallel
With branches, you can work on your own without disturbing or getting disturbed by other people, then merge at the end.
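A minimal Terminal sketch of working on a branch and merging at the end. The branch name `add-plot`, the file `analysis.R`, the identity and the commit messages are hypothetical; in the exercises the same thing is done through RStudio.

```shell
set -eu

repo_dir="$(mktemp -d)"
cd "$repo_dir"

git init --quiet
git symbolic-ref HEAD refs/heads/main   # name the first branch "main"
git config user.name "Example Student"
git config user.email "student@example.com"

# First commit on main.
echo 'data <- read.csv("data.csv")' > analysis.R
git add analysis.R
git commit --quiet -m "Load the data"

# Create a branch and switch to it; main is left undisturbed.
git checkout --quiet -b add-plot
echo 'plot(data)' >> analysis.R
git add analysis.R
git commit --quiet -m "Add a plot of the data"

# Back on main, the plot line is not there yet.
git checkout --quiet main
cat analysis.R

# Merge the branch into main once the work is done.
git merge --quiet add-plot
git log --oneline
```

Because main did not change while the branch was being developed, the merge here simply moves main forward; if both had changed the same lines, git would ask you to resolve a merge conflict.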
Pull requests: contributing to other people’s software
A feature of GitHub for contributing to other people’s code: work on your own independent branch, then ask the maintainers to pull your code into the main branch
Final notes
git is a powerful tool for reproducible data science
- We won’t do it today, but git is most powerful from the Terminal
- We will intentionally run you through the common issues with git. Don’t get discouraged!
But really, the best thing is to…
- Try it out for yourself
- FOLLOWING TODAY’S EXERCISES POINT-BY-POINT IS SUPERCALIFRAGILISTICEXPIALIDOCIOUS!!!
- Break and then exercises