class: center, middle, inverse, title-slide # Reproducible Projects and Version Control ### Itamar Caspi ### March 18, 2019 (updated: 2019-03-18) --- # Replicating this presentation R packages used to produce this presentation ```r library(knitr) # for presenting tables library(xaringan) # for rendering xaringan presentations library(tidyverse) # for data wrangling and plotting library(tidymodels) # for modelling the tidy way ``` If you are missing a package, run the following command ``` install.packages("package_name") ``` Alternatively, you can just use the [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html) package that loads and installs packages: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load(tidyvers, tidymodels, knitr, xaringan) ``` --- # Outline 1. [Reproducibility](#projects) 2. [The Tidyverse](#tidyverse) 3. [Version Control](#git) 4. [GitHub](#github) --- class: title-slide-section-blue, center, middle name: projects # RStudio Projects ------------------ --- # Reproducibility - Reproducible research allows anyone to generate your exact same results. - To make your project reproducible you'll need to: - document what you did (code + explanations). - name the packages you used (including version numbers). - describe your R environment (R version number, operating system, etc.) - Being in a "reproducible" state-of-mind means putting yourself in the shoes of the consumers, rather than producers, of your code. (In "consumers" I also include the future you!) --- # An aside: Docker .pull-left[ <img src="https://www.docker.com/sites/default/files/social/docker_facebook_share.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - [Docker](https://www.docker.com/) is a virtual computer inside your computer. - Docker makes sure that anyone running your code will be able to perfectly reproduce your results. - Docker solves a major predictability barrier: replicating your entire development environment (operating system, R versions, dependencies, etc.). - For further details, see [rOpenSci's tutorial](http://ropenscilabs.github.io/r-docker-tutorial/). ] --- # RStudio project environment - If your R script starts with `setwd()` or `rm(list=ls())` then are [doing something wrong](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/)! - Instead: 1. Use RStudio's project environment. 2. Go to `Tools -> Global Options -> General` and set the "Save workspace to .RData on exit" to **NEVER**. --- # R Markdown - R Markdown notebooks, by RStudio, are perhaps THE go-to tool for conducting reproducible research in R. - The process of "knitting" an Rmd file starts with a clean slate. - An R Markdown file integrates text, code, links, figures, tables, and all that is related to your research project. - R Markdown is perfect for communicating research. One if its main advantages is that an *.Rmd file is a "meta-document" that can be exported as a: - document (word, PDF, html, markdown). - presentation (html, beamer, xaringan, power point) - website ([`blogdown`](https://bookdown.org/yihui/blogdown/)). - book ([`bookdown`](https://bookdown.org/home/)). - journal article ([`pagedown`](https://github.com/rstudio/pagedown)) - dashboard ([`flexdashboards`](https://rmarkdown.rstudio.com/flexdashboard/)). --- class: title-slide-section-blue, center, middle name: tidyverse # The Tidyverse --------------- --- # Prerequisite: `%>%` is a pipe - The "pipe" operator `%>%` introduced in the [`magrittr`](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html) package, is deeply rooted in the `tidyverse`. - To understand what `%>%` does, try associating it with the word "then". - Instead of `y <- f(x)`, we type `y <- x %>% f()`. This might seen cumbersome at first, but consider the following two lines of code: ```r > y <- h(g(f(x), z)) > y <- x %>% f() %>% g(z) %>% h() ``` The second line of code should be read as: "take `x`, _then_ put it through `f()`, _then_ put the result through `g(. , z)`, _then_ put the result through `h()`, and finally, keep the result in y. --- # Base R vs. the tidyverse - Consider the following data frame: ```r df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10)) ``` - Can you guess what this code chunk does? ```r df_new <- df[, c("x", "y")]["x" > 0] df_new$xx <- df_new$x^2 ``` -- - How about this? ```r df_new <- df %>% select(x, y) %>% filter(x > 0) %>% mutate(xx = x^2) ``` ??? the "tidyvers" code chunk should be read as "generate a new dataframe `df_new` by taking `df`, _then_ select `x` and `y`, _then_ filter rows where `x` is positive, _then_ mutate a new variable `xx = x^2` --- # The Tidyverse - Following a "tidy" approach makes your code more readable, and thus reproducible. - I believe that there is a growing consensus in the #rstats community that we should [learn the tidyverse first](http://varianceexplained.org/r/teach-tidyverse/). - Nevertheless, note that the tidyverse is "Utopian" in the sense that it strives toward _perfection_, and thus keeps changing. By contrast, base R was built to last. - As usual, begin proficient in both (base R and the tidyverse) will get you far... --- # Tidyverse packages Which packages come with `tidyverse`? ```r tidyverse::tidyverse_packages() ``` ``` ## [1] "broom" "cli" "crayon" "dplyr" "dbplyr" ## [6] "forcats" "ggplot2" "haven" "hms" "httr" ## [11] "jsonlite" "lubridate" "magrittr" "modelr" "purrr" ## [16] "readr" "readxl\n(>=" "reprex" "rlang" "rstudioapi" ## [21] "rvest" "stringr" "tibble" "tidyr" "xml2" ## [26] "tidyverse" ``` Note that not all these packages are loaded by default (e.g., `lubrudate`.) We now briefly introduce one the tidyvers flagships: `dplyr`. --- # `dplyr`: The grammar of data manipulation `dplyr` is THE go-to tool for data manipulation - Key "verbs": - `filter()` - selects observations (rows). - `select()` - selects variables (columns). - `mutate()` - generate new variables (columns). - `arrange()` - sort observations (rows). - `summarise()` - summary statistics (by groups). - Other useful verbs: - `group_by()` - groups observations by variables. - `sample_n()` - sample rows from a table. - And much more (see `dplyr` [documentation](https://dplyr.tidyverse.org/reference/index.html)) --- # Tidymodels - Tidymodels extends the tidyverse "grammar" philosophy to modelling tasks. ```r tidymodels::tidymodels_packages() ``` ``` ## [1] "broom" "cli" "crayon" "dials" ## [5] "dplyr" "ggplot2" "infer" "magrittr" ## [9] "parsnip" "pillar" "purrr" "recipes" ## [13] "rlang" "rsample" "rstudioapi" "tibble" ## [17] "tidytext" "tidypredict" "tidyposterior" "yardstick" ## [21] "tidymodels" ``` For further details, visit the [tidymodels GitHub repo](https://github.com/tidymodels/tidymodels). --- # Resources 1. [R for Data Science (r4ds)](http://r4ds.had.co.nz/) by Garrett Grolemund and Hadley Wickham. 2. [Data wrangling and tidying with the “Tidyverse”](https://raw.githack.com/uo-ec607/lectures/master/05-tidyverse/05-tidyverse.html) by Grant McDerrmot. 3. [Getting used to R, RStudio, and R Markdown](https://rbasics.netlify.com/index.html) by Chester Ismay and Patrick C. Kennedy. 4. [Data Visualiztion: A practical introduction](https://socviz.co/) by Kieran Healy. --- class: title-slide-section-blue, center, middle name: git # Version Control ----------------- --- # Version control .pull-left[ <img src="https://images-na.ssl-images-amazon.com/images/I/61h4UtvnGWL._SL1200_.jpg" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ What's wrong with the "*_FINAL_FINAL" method? - What changed? - Where?? - When??? - By who???? You get the picture... ] --- # Git .pull-left[ <img src="https://git-scm.com/images/logos/downloads/Git-Icon-1788C.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ - Git is a distributed version control system. - Huh?! - Sorry. Think of MS Word "track changes" for code projects. - Git has established itself as the de-facto standard for version control and software collaboration. ] --- # GitHub .pull-left[ <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ - GitHub is a web-based hosting service for version control using Git. - OK, OK! Think of "Dropbox" for git projects. On steroids. And then some. - GitHub is where and how a large share of open-source projects (e.g., R packages) are being developed. ] ??? The source for the definition of GitHub is [Wikipedia](https://en.wikipedia.org/wiki/GitHub). --- # Resources 1. [Happy Git and GitHub for the useR](https://happygitwithr.com/) by Jenny Bryan. 2. [Version Control with Git(Hub)](https://raw.githack.com/uo-ec607/lectures/master/02-git/02-Git.html) by Grant McDerrmot. 2. [Pro Git](https://git-scm.com/book/en/v2). --- class: title-slide-section-blue, center, middle name: practice # Let's practice! ----------------- --- # Your mission 1. Open your first R project and publish it on GitHub. 2. Clone this course's lecture notes GitHub repository. --- # Suggested workflow for starting a new R project RStudio: 1. Open RStudio. 2. File -> New Project -> New Directory -> New Project. 3. Name your project under "Directory name:". Make sure to check "Create git repository".<sup>1</sup> GitHub Desktop: 1. Open GitHub Desktop. 2. File -> Add local repository. 3. Set "Local path" to your RStudio project's folder. 4. Publish local git repo on GitHub (choose private or public repo). .footnote[ <sup>1</sup> RStudio automatically generates a `.gitignore` file that tells git which files to ignore (duh!). Click [here](https://raw.githack.com/uo-ec607/lectures/master/02-git/02-Git.html#57) for further details on how to configure what to ignore. ] --- # Git workflow The __pull -> stage -> commit -> push__ workflow: 1. Open GitHub Desktop. 2. Change "Current repository" to the cloned repo. 3. Click "Fetch origin" and __pull__ any changes made to the GitHub repo. 4. Open your project. 5. Make changes to one or more of your files. 6. Save. 7. __stage__ or unstage changed files. 8. write a summary (and description) of your changes. 9. Click "__Commit__ to master". 10. Update remote: Click "__Push__ origin" (Ctrl + P). --- # Clone and sync a remote GitHub repository Cloning: 1. Open GitHub Desktop. 2. Open the remote repository. 3. Click on "Clone or download". 4. Set the local path of your cloned repo (e.g., "C:/Documents/CLONED_REPO". Syncing: 1. Open GitHub Desktop. 2. Change "Current repository" to the cloned repo. 3. Click the "Fetch origin" button. 4. __Pull__ any changes made on the remote repo. --- class: .title-slide-final, center, inverse, middle # `slides %>% end()` [<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/notes-spring2019/tree/master/03-reprod-vc)