git, GitHub and make
Let us start by saying that today’s session is going to contain much less information, but it is perhaps more difficult to internalise. It will take you some time to get used to Git. Don’t worry, this is new for most people.
We will start by going through version control and at the end we will talk a bit more about make. In particular, we are going to talk about how you can automate an R project with make.
The lecture will be quite hands-on and will help you get started with Git. Please note that this is a very basic introduction and does not constitute a deep dive into Git.
First, we need to do a quick check to see if all the relevant software has been installed. Make sure you have,
R
RStudio
Git
A GitHub account
make
If you have these things installed you are ready to go!
The notebook for the lecture today follows the slides by Grant McDermott, which can be found here. Some of these notes are directly copied from his slides with Grant’s permission. I recommend you go and check out his notes. He has some really good material.
We are only going to cover the bare essentials of Git today. If you want a really good book to learn more about the interaction between Git, GitHub and R, I highly recommend the following by Jenny Bryan: https://happygitwithr.com/
Most of you probably use a horribly inefficient naming convention as a makeshift version control system. Does the following look familiar?
Final Draft.docx
Final Draft 1.docx
Final Draft 22 Nov.docx
Final Draft 22 Nov Comments.doc
Final Draft 22 Nov Supervisor Feedback.docx
Final Final Draft.docx
Final Final Draft 1.docx
If you are guilty of this terrible renaming of the same document then off to the gallows!
Don’t worry, I work with many people who do this. However, while working on my PhD I realised that I couldn’t save a million different versions of the same document with similar names. I was asking myself, which document is the right one? What changes did I make to my document? Why is my PhD folder 15GB in size?!?
The easiest form of version control if you are working on a project on your own is to use Dropbox / Google Drive / OneDrive / Box, etc. I don’t think it is always required to use git for every project, especially if you are working on your own. These online backup systems have their own built-in ways of providing version control.
I think git really starts to shine when you are working on projects with other people. I remember that things became quite problematic after my PhD when I was trying to work with other people! Collaboration without git was really painful. I started using git in 2014 and I believe it is one of the best investments I have made.
Like I mentioned above, the original goal of git was collaboration on big projects. You start with a repository and then everyone gets to work on the repository, where “track changes” are recorded.
In the data science space git is also used to store more than source code. Normally a data science project will contain data, figures, reports and source code.
In this lecture we are going to try and establish one way in which you can introduce git into your normal workflow. Initially it might seem strange, but after a while it will be second nature.
Note: We are working with an R project here. The same should be applicable to other languages like Julia or Python.
In order to get properly set up for a project you are going to need the following,
An RStudio project
A git repository
All of my projects have this basic structure. There are many details and nuances related to these steps, but these are the basic principles. At the end of the day working with git is going to be no different than saving your project and sending it to GitHub every now and then.
If you are going to be coding in R, then RStudio offers some really nice integration with GitHub. The same is true for VS Code and GitHub, if you are thinking about programming in other languages.
Note: Before you follow the steps above you need to set up the connection between GitHub and RStudio through a personal access token. Please read the instructions on how to do this here. You could also set up keys for SSH, which is actually the preferred method, but perhaps a bit more complex. Instructions can be found here.
We are going to start by linking an RStudio project to a GitHub repository. The steps are going to be as follows,
Create the repo on GitHub and initialise with a README.
Clone the repo into a new RStudio project.
We will do this practically in class so that you can see how it works.
Below is an animated guide (gif) to see how to do Steps 1 and 2.
For the first step, you can just call your repo DataScienceTest. If everyone has the same name for the repo it will make things easier down the line.
For Step 3, 4, 5 consider the animated guide below.
If you have done everything properly then you should be able to see a Git tab to the right of the Connections tab in RStudio.
Open the README file that you created when initialising the repo and type something in there. You should see some changes in the Git panel.
There are many graphical user interfaces that you can work with instead of relying on RStudio. I prefer to work with GitKraken, since there are many features that are useful when working on big projects with other people. There is a free version, but the Pro version has some really cool features.
In my opinion GitHub Desktop is the easiest to get started with for the beginner. You can slowly migrate to other software packages once you understand Git a bit better.
Git
There are four main Git operations: stage, commit, pull and push.
Stage and commit normally occur together. So does pushing and pulling.
Let us stage and commit changes to our README file. Then we can push our local changes to the GitHub repo.
NB: Always pull from the upstream repo before you push any changes. This makes sure the local repo is up to date.
Git at the command line
There is always the option to forgo GUIs entirely and operate everything through the terminal. While the combination of GitHub and RStudio is ideal for new users, there is a case to be made for knowing the shell commands. There are some things that you can easily do through the shell that are not possible with the RStudio Git GUI.
In addition, you might be working with projects that don’t focus primarily on R. I only use a handful of shell commands in my daily workflow, so I won’t burden you with too many. The easiest command is to clone a repo. I use this a lot.
$ git clone REPOSITORY-URL
You can test this out by cloning the DataScience-871 repo for this course. If you want to do this, you can cd into the directory where you want to save the content of the repo and issue the following command.
$ git clone https://github.com/DawievLill/DataScience-871
Now switch back to your test repo, DataScienceTest, that you created before. You must cd back to the location of this repo on your computer. Let me know if you are struggling with this.
We can see the commit history with the following command,
$ git log
We can also check which files have changes with the following,
$ git status
We can stage a file, or group of files, as follows
$ git add NAME-OF-FILE-OR-FOLDER
You can use wildcard characters to stage a group of files. There are a bunch of useful flag options too:
Stage all files.
$ git add -A
Stage updated files only (modified or deleted, but not new).
$ git add -u
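Stage everything in the current directory and below (new, modified and deleted files). Versions of Git before 2.0 ignored deleted files here.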
$ git add .
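If you want to see the difference between these flags for yourself, the following is a minimal sketch that you can run in a terminal. It works in a throwaway directory created with mktemp, and the file names (tracked.txt, new.txt) and the user details are made up purely for the illustration.

```shell
# Minimal sketch: compare `git add -u` with `git add -A` in a throwaway repo.
# File names and user details are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Test User"
git config user.email "test@example.com"

# Start from one committed file.
echo "first draft" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked.txt"

# Modify the tracked file and create a brand-new, untracked file.
echo "second draft" >> tracked.txt
echo "notes" > new.txt

# -u only stages changes to files Git already tracks, so new.txt is left out.
git add -u
staged_u=$(git diff --cached --name-only)
echo "after git add -u: $staged_u"

# -A stages everything, including the new file.
git add -A
staged_all=$(git diff --cached --name-only | sort | paste -sd ' ' -)
echo "after git add -A: $staged_all"
```

The command git diff --cached --name-only lists the files that are currently staged, which is a handy way to check what your next commit will contain.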
Commit your changes.
$ git commit -m "Helpful message"
Pull from the upstream repository (i.e. GitHub).
$ git pull
Push any local changes that you’ve committed to the upstream repo (i.e. GitHub).
$ git push
Remember to always pull before you push to GitHub.
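The commands above can be strung together into one complete cycle. The sketch below is only an illustration: it uses a bare repository on your own machine as a stand-in for GitHub, so that it runs without a network connection or an account. The directory names, commit messages and user details are assumptions, not part of the course setup.

```shell
# Minimal sketch of the clone -> stage -> commit -> pull -> push cycle.
# A local bare repo plays the role of GitHub; all names are hypothetical.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for GitHub: a bare repo that already contains a README,
# as if it had been initialised on the website.
git init -q seed
cd seed
git config user.name "Test User"
git config user.email "test@example.com"
echo "# DataScienceTest" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed upstream.git

# Clone the "remote", just as you would clone from a GitHub URL.
git clone -q upstream.git DataScienceTest
cd DataScienceTest
git config user.name "Test User"
git config user.email "test@example.com"
# The default branch is main or master, depending on your Git setup.
branch=$(git rev-parse --abbrev-ref HEAD)

# Edit, stage and commit locally.
echo "Some new text." >> README.md
git add README.md
git commit -q -m "Update README"

# Pull first, then push.
git pull -q origin "$branch"
git push -q origin "$branch"

# Count the commits that are now on the remote.
remote_commits=$(git --git-dir=../upstream.git rev-list --count "$branch")
echo "commits on the remote: $remote_commits"
```

With a real GitHub repo the only difference is the URL that you pass to git clone.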
Branches are an important feature of Git and you will make use of them when you work on collaborative projects. A branch allows you to take a snapshot of the repo and then try out some new ideas without affecting the main branch. Once you are satisfied with your changes you can try and merge back into the main branch.
You can create a new branch in many ways. You can use RStudio, VS Code, GitKraken, the command line, etc. We will quickly show how to do this with RStudio in the lecture. However, if you wanted to do this with shell commands you could proceed as follows.
Create a new branch on your local machine and switch to it:
$ git checkout -b NAME-OF-YOUR-NEW-BRANCH
Push the new branch to GitHub:
$ git push origin NAME-OF-YOUR-NEW-BRANCH
List all branches on your local machine:
$ git branch
Switch back to (e.g.) the master branch:
$ git checkout master
Delete a branch, first on your local machine and then on GitHub:
$ git branch -d NAME-OF-YOUR-FAILED-BRANCH
$ git push origin :NAME-OF-YOUR-FAILED-BRANCH
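To get a feel for these branch commands without touching a real project, you could try something like the sketch below. It only uses a temporary local repo, so the push commands are left out, and the user details are invented for the example.

```shell
# Minimal sketch: create, list, leave and delete a branch in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Test User"
git config user.email "test@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"
# Remember the name of the default branch (main or master).
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Create a branch and switch to it in one step.
git checkout -q -b new-idea
on_branch=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $on_branch"

# List local branches; the current one is marked with an asterisk.
git branch

# Switch back and delete the branch. The -d flag refuses to delete unmerged
# work, but new-idea has no extra commits here, so the deletion succeeds.
git checkout -q "$main_branch"
git branch -d new-idea
remaining=$(git branch --list new-idea)
```

If the branch did contain unmerged commits, git branch -d would stop with an error, which protects you from throwing work away by accident.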
Another important topic for collaboration is forking. If you create a fork of a repository you are creating your own copy of the original repository. You can now work on your own version. When you are ready you can submit changes to the original repository through a pull request. One good exercise is to fork the DataScience-871 repository and then look for spelling mistakes in the notes. You can then correct the mistake and submit a pull request. We will talk about how to do this in a second.
You have two options for merging branches:
You can merge locally. Commit changes to a new branch. You can switch between the main branch and the new branch using the checkout command. From the main branch, merge the new branch using the merge command.
$ git merge NAME-OF-YOUR-NEW-BRANCH
You can merge remotely (sometimes by creating a fork of the original repository). Merging remotely via pull requests is a way to notify collaborators that you have completed some feature. You provide a neat summary of all the changes that you made in your branch. Normally the pull request is then reviewed and can then be approved. Once approved, the pull request will be incorporated / merged on GitHub.
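The first option, merging locally, can be sketched as follows. This is just an illustration in a temporary repo, and the file idea.txt and the user details are made up. The second option happens on the GitHub website, so it is not shown here.

```shell
# Minimal sketch: commit on a branch, then merge it into the main branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Test User"
git config user.email "test@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Do some work on a separate branch.
git checkout -q -b new-idea
echo "a new idea" > idea.txt
git add idea.txt
git commit -q -m "Try out a new idea"

# Back on the main branch the new file does not exist yet.
git checkout -q "$main_branch"
if [ -e idea.txt ]; then before="present"; else before="absent"; fi

# Merge the branch. The main branch has not moved in the meantime,
# so Git can simply fast-forward it.
git merge -q new-idea
if [ -e idea.txt ]; then after="present"; else after="absent"; fi
echo "idea.txt before merge: $before, after merge: $after"
```

When both branches have changed the same lines, the merge stops with a conflict that you need to resolve by hand before committing.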
We are going to try and practice this in class by making changes to our test repository and then attempting a pull request.
First we need to create a new branch called new-idea.
$ git checkout -b new-idea
Make some local changes and then commit and push them to GitHub. You could change something in the README file, as an example.
After you have pushed the changes, go to the GitHub repo online. There should be a new green button that states: “Compare and pull request”. Click this button.
Add a description of what this pull request does. Then click on “Create pull request”. Since you are the owner of the repository you can simply click the “Merge pull request” button to merge the changes.
If you were contributing to someone else’s repository, then they would normally review the changes that you want to make.
If you fork a repo then you are basically creating a copy of it. To do this, just click the Fork button in the top-right corner of the repo you want to copy.
This creates an independent copy of the repo under your GitHub account.
Once you fork a repo you can do anything you like and it won’t affect the original repo. However, if you want to contribute to the original repo you can make changes in your fork and then send an upstream pull request. The person who owns the original repo is notified of the changes you want to make and can accept them or not.
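The relationship between the original repo, your fork and your local copy can be mimicked on your own machine. In the sketch below two bare repos stand in for the original repo and for your fork on GitHub. All names (original.git, fork.git, my-copy, fix-typo, notes.md) are hypothetical, and the pull request itself is not shown because it is created on the GitHub website.

```shell
# Minimal sketch of the fork layout, using local bare repos as stand-ins.
set -e
work=$(mktemp -d)
cd "$work"

# "original.git" stands in for the repo you want to contribute to.
git init -q seed
cd seed
git config user.name "Test User"
git config user.email "test@example.com"
echo "Lecture notes with a speling mistake." > notes.md
git add notes.md
git commit -q -m "Add notes"
cd ..
git clone -q --bare seed original.git

# "fork.git" stands in for the copy under your own GitHub account.
git clone -q --bare original.git fork.git

# Clone your fork and register the original repo as a second remote.
git clone -q fork.git my-copy
cd my-copy
git config user.name "Test User"
git config user.email "test@example.com"
git remote add upstream "$work/original.git"

# Fix the mistake on a branch and push it to the fork only.
git checkout -q -b fix-typo
echo "Lecture notes with a spelling mistake." > notes.md
git commit -q -a -m "Fix typo"
git push -q origin fix-typo

remotes=$(git remote | sort | paste -sd ' ' -)
in_fork=$(git --git-dir="$work/fork.git" branch --list fix-typo)
in_original=$(git --git-dir="$work/original.git" branch --list fix-typo)
echo "remotes: $remotes"
```

The branch exists in the fork but not in the original, which is exactly the situation in which you would open a pull request.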
For this course you will need to submit your final project by forking the repo and then submitting a pull request with your project!
This will help you to understand how to contribute to open source projects, even if it is something simple like fixing small bugs or correcting spelling errors. One of the most important contributions you can make is to help write documentation for packages that you enjoy using that are hosted on GitHub.
GitHub Issues are another great way to interact with your collaborators and/or package maintainers.
If you spot any problems with these lecture notes, please file an issue here!
Now we turn to make. This is an important tool for reproducible research. We will show how you can take a small project that utilises R code and automate the process. This provides a nice way for you to replicate your work in the future. It is almost like writing down a recipe for how you got to your results in the first place.
We will be using the notes from Jenny Bryan, which can be found here.
The goal of our research project is to create a pipeline that will…
We will automate this pipeline using make.
Makefile
Within our test repo we will now include a Makefile. You can run a Makefile directly from RStudio, but I prefer to use the shell. If you look at the notes here from Jenny Bryan, she shows how to do this in RStudio.
The first rule for the Makefile will download a speech that the Governor of the South African Reserve Bank gave at Stellenbosch in 2021. Remember from the first lecture that we can use curl to download this speech as follows:
$ cd data
$ curl -s "https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf" > stellies-speech.pdf
Alternatively you can use the R function download.file() to achieve the same thing. For argument’s sake, let us use the R function in this instance. The code would look something like this,
download.file("https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf", destfile = "data/stellies-speech.pdf", quiet = TRUE)
In terms of the Makefile we want to include the following rule so that we can start to automate the process. The target is named after the file that the recipe creates, so that make can skip the download when the file already exists. Note that the recipe line must start with a tab.
data/stellies-speech.pdf:
	Rscript -e 'download.file("https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf", destfile = "data/stellies-speech.pdf", quiet = TRUE)'
To make sure this works we can now enter the following command in the shell,
$ make data/stellies-speech.pdf
This should, at this stage, download the file and place it in the appropriate folder. Please adjust the directory to fit with your test repository that you have created!
The next thing we want to do is create some rules for all and clean. These are some of the more general commands that you find in almost every Makefile.
The code will be something like this,
all: data/stellies-speech.pdf

clean:
	rm -f data/stellies-speech.pdf

data/stellies-speech.pdf:
	Rscript -e 'download.file("https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf", destfile = "data/stellies-speech.pdf", quiet = TRUE)'
This allows you to now execute all of the commands in the Makefile and also remove any files you want. You can use make all from the shell to download the PDF. You can then also remove the PDF if you enter make clean.
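If you would like to experiment with all and clean without downloading anything, here is a small sketch that builds a toy Makefile in a temporary directory. It assumes that make is installed. The file data/speech.txt and the echo command are stand-ins for the real download step, not part of the course project.

```shell
# Minimal sketch: a toy Makefile with all and clean rules.
set -e
proj=$(mktemp -d)
cd "$proj"
mkdir data

# Write the Makefile. printf turns \t into the tab that make requires at the
# start of every recipe line. The echo recipe stands in for the download.
{
  printf 'all: data/speech.txt\n\n'
  printf 'clean:\n\trm -f data/speech.txt\n\n'
  printf 'data/speech.txt:\n\techo "downloaded" >> data/speech.txt\n'
} > Makefile

# Run make twice. The second run does nothing, because the target file
# already exists, so the recipe appends only one line.
make -s all
make -s all
lines=$(wc -l < data/speech.txt | tr -d ' ')
echo "the recipe ran $lines time(s)"

# clean removes the generated file again.
make -s clean
if [ -e data/speech.txt ]; then state="present"; else state="absent"; fi
echo "after make clean the file is $state"
```

This is the main reason for naming a target after the file that its recipe creates: make can then work out by itself which steps still need to run.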
The next thing we might want to do is to convert the PDF to text. There are packages in R that can do this for us. However, this is also relatively easy to do with the pdftotext command in the shell.
$ cd data
$ pdftotext stellies-speech.pdf
$ ls -a -h
## .
## ..
## stellies-speech.pdf
## stellies-speech.txt
We see that there is now a stellies-speech.txt file, which has been generated by our pdftotext command. If you want to do this with R you need to install a package such as pdftools and issue the following command.
library(pdftools)
stellies_speech_text <- pdf_text("stellies-speech.pdf")
It doesn’t really matter which one you use. I prefer to use the shell, but you might prefer to use R for this. This is mostly a matter of preference. There are also some really nice packages in Python that extract text from PDFs using machine learning methods. At the end of the day, you should use the tool that best fits your goal.
In order to generate a wordcloud we are now going to make use of an R script that I wrote that will do some basic text analysis. The Makefile now looks as follows.
all: move_plot

clean:
	rm -f ./data/stellies-speech.pdf
	rm -f ./data/stellies-speech.txt
	rm -f ./figures/wordcloud.pdf
	rm -f ./scripts/report.html

./data/stellies-speech.pdf:
	curl -s "https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf" > ./data/stellies-speech.pdf

./data/stellies-speech.txt: ./data/stellies-speech.pdf
	pdftotext ./data/stellies-speech.pdf

wordcloud: ./scripts/wordcloud.R ./data/stellies-speech.txt
	Rscript $<

move_plot: wordcloud
	mv Rplots.pdf ./figures/wordcloud.pdf
You will notice that the wordcloud rule has the recipe Rscript $<. Here $< is an automatic variable that make replaces with the first prerequisite of the rule, in this case ./scripts/wordcloud.R, so the recipe runs the R script from the shell. The stellies-speech.txt file is listed as a second prerequisite because the wordcloud.R script uses it as an input. This script then generates a PDF with a wordcloud. Our final command moves the generated PDF file to the figures folder.
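A quick way to convince yourself of what $< expands to is to let make print it. The sketch below assumes GNU make and uses made-up names (demo, script.R, input.txt). The recipe only echoes the automatic variables instead of running any R code.

```shell
# Minimal sketch: print the automatic variables $< and $^ for a toy rule.
set -e
proj=$(mktemp -d)
cd "$proj"
touch script.R input.txt

# printf turns \t into the tab that make requires before each recipe line.
# The single quotes stop the shell from touching $< and $^.
{
  printf 'demo: script.R input.txt\n'
  printf '\t@echo "first prerequisite: $<"\n'
  printf '\t@echo "all prerequisites: $^"\n'
} > Makefile

output=$(make -s demo)
echo "$output"
```

So $< picks out only the first prerequisite, while $^ gives you all of them. That is why the R script is listed first in a rule that runs Rscript $<.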
Finally we will use rmarkdown::render() to generate an HTML report. We create an RMarkdown file called report.rmd that displays the pre-made wordcloud called wordcloud.pdf.
report.html: ./scripts/report.rmd move_plot
	Rscript -e 'rmarkdown::render("$<")'
Makefile
The final Makefile should look something like this,
all: report.html

clean:
	rm -f ./data/stellies-speech.pdf
	rm -f ./data/stellies-speech.txt
	rm -f ./figures/wordcloud.pdf
	rm -f ./scripts/report.html

./data/stellies-speech.pdf:
	curl -s "https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf" > ./data/stellies-speech.pdf

./data/stellies-speech.txt: ./data/stellies-speech.pdf
	pdftotext ./data/stellies-speech.pdf

wordcloud: ./scripts/wordcloud.R ./data/stellies-speech.txt
	Rscript $<

move_plot: wordcloud
	mv Rplots.pdf ./figures/wordcloud.pdf

report.html: ./scripts/report.rmd move_plot
	Rscript -e 'rmarkdown::render("$<")'
If you type in make all you should get a generated report that includes your figure. You can also clearly see all the steps that you took in the process. I think this is quite useful. You could try and use this type of setup in your project if you wish, but it isn’t directly required.
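Before running a longer pipeline it can be useful to ask make what it would do, without actually doing it. The -n flag does this. The sketch below uses a toy three-step pipeline with invented file names (speech.txt, wordcount.txt, report.txt) as a stand-in for the real one, so that it runs without R or an internet connection.

```shell
# Minimal sketch: preview a pipeline with `make -n` (a dry run).
set -e
proj=$(mktemp -d)
cd "$proj"

# A toy pipeline: "download" -> count words -> "report".
# printf turns \t into the tab that make requires before each recipe line.
{
  printf 'all: report.txt\n\n'
  printf 'speech.txt:\n\techo "speech" > speech.txt\n\n'
  printf 'wordcount.txt: speech.txt\n\twc -w < speech.txt > wordcount.txt\n\n'
  printf 'report.txt: wordcount.txt\n\tcp wordcount.txt report.txt\n'
} > Makefile

# -n prints the commands in the order in which they would run,
# but does not execute any of them.
plan=$(make -n all)
echo "$plan"
steps=$(printf '%s\n' "$plan" | wc -l | tr -d ' ')
```

The printed plan follows the dependency chain from the first input to the final report, which is a nice check that the rules are wired up the way you intended.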
makepipe
There is an alternative to make that is probably more in line with the way you think about coding in R. It is called makepipe. If you want to see how this works you can look here. As a quick overview, if we wanted to perform our task using makepipe instead of make, the code would look as follows,
# Download the speech only if it is not there yet
if (!file.exists("data/stellies-speech.pdf")) download.file("https://www.resbank.co.za/content/dam/sarb/publications/speeches/speeches-by-governors/2021/gov-kganyago/A%20public%20lecture%20by%20Lesetja%20Kganyago%20Governor%20of%20the%20South%20African%20Reserve%20Bank%20at%20Stellenbosch%20University.pdf", destfile = "data/stellies-speech.pdf", quiet = TRUE)

# Run data_prep.R when its target is missing or older than its dependency
makepipe::make_with_source(
  source = "data_prep.R",
  targets = "data/histogram.tsv",
  dependencies = "data/words.txt"
)

makepipe::make_with_source(
  source = "data_viz.R",
  targets = "data/histogram.png",
  dependencies = "data/histogram.tsv"
)

# Render the report from an inline recipe
makepipe::make_with_recipe(
  recipe = {
    rmarkdown::render("report.Rmd", output_file = "output/report.html")
  },
  targets = "output/report.html",
  dependencies = c("report.Rmd", "data/histogram.png")
)