class: inverse, center, title-slide, middle <style> .title-slide .remark-slide-number { display: none; } </style> # .title-wrap[Intro to Programming with R for Political Scientists] <br /> ## .header-fancy[Session 3: Data Wrangling I] ### Markus Freitag ### Geschwister Scholl Institute of Political Science, LMU ### [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/MarkusGFreitag) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg>](https://markusfreitag.netlify.app/) ### July 12, 2021 <a href="https://github.com/m-freitag" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#415564; color:#f6f3f2; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style> --- # Overview 1. Intro + R-Studio and (Git)Hub 3. Base R & Tidyverse Basics 3. .hl[Data Wrangling I] 4. Data Wrangling II 5. Data Viz 6. Writing Functions --- # Workflow - Navigate to `Session Scripts > Session 3` and open `Session_3_script.R`. - You will see a pre-formatted Script with all the steps I do on the slides. - Explore as you follow. - If you have a second monitor, great! If not, split your screen. --- # Tidyverse Packages <img src="data:image/png;base64,#Figs/tidyverse.png" width="45%" style="display: block; margin: auto;" /> - These are only the tidyverse packages loaded by default. There are [more](https://tidyverse.tidyverse.org/) (e.g., [lubridate](https://github.com/tidyverse/lubridate)). --- # The Philosophy: Tidy Data > 1. Each variable forms a column. > 2. Each observation forms a row. > 3. Each type of observational unit forms a table. [(Wickham 2014)](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf) - Sounds pretty reasonable. 3. is related to constructing clean relational data bases (we will see an example later). - However, we can relax this from time to time given our needs/statistical methods. .hl2[Messy data] is everything else, e.g.: .code60[ ``` ## race character gender age_0-100 age_100-500 age_500-100 age_>1000 ## 1 hobbits Frodo male 1 0 0 0 ## 2 hobbits Sam male 1 0 0 0 ## 3 elves Arwen female 0 0 0 1 ## 4 hobbits Golum male 0 0 1 0 ## 5 dwarves Gimli male 0 1 0 0 ## 6 men Eowyn female 1 0 0 0 ``` ] --- # The Philosophy: Tidy Data Tidy: ``` ## race character gender age_cat ## 1 hobbits Frodo male 0-100 ## 2 hobbits Sam male 0-100 ## 3 elves Arwen female >1000 ## 4 hobbits Golum male 500-100 ## 5 dwarves Gimli male 100-500 ## 6 men Eowyn female 0-100 ``` --- # Tibbles - Tibbles are data.frames but with some perks. For instance, they subset more strictly. - They also print differently (better): ```r as_tibble(lotr) ``` ``` ## # A tibble: 6 x 4 ## race character gender age_cat ## <chr> <chr> <chr> <chr> ## 1 hobbits Frodo male 0-100 ## 2 hobbits Sam male 0-100 ## 3 elves Arwen female >1000 ## 4 hobbits Golum male 500-100 ## 5 dwarves Gimli male 100-500 ## 6 men Eowyn female 0-100 ``` ```r # Hint: You can create tibbles just like data frames but with tibble(). ``` --- # Importing/Exporting Data - R comes with its own two file formats, `.rds` (single objects) and `.rda` (multiple objects/tabular data). - However, for saving the latter you likely will use `.csv` or `.json` (human readable) most of the time. -- - For R novices, importing and exporting can be a bit of a pain. - There are different functions/packages for reading different file formats (`haven`, `data.table`, `readxl`, etc.). - Fortunately, the [rio](https://github.com/leeper/rio) package by Thomas Leeper and Chung-hong Chan et al. makes our life easier. - Provides `export()` and `import` as wrappers fot the above mentioned packages. ```r install.packages("rio") ``` --- # Importing/Exporting Data ```r library(rio) # Export export(mtcars, "mtcars.csv") # R's built-in mtcars data-set. export(mtcars, "mtcars.rds") # R serialized export(mtcars, "mtcars.dta") # Stata export(mtcars, "mtcars.json") # Import W <- import("mtcars.csv") X <- import("mtcars.rds") Y <- import("mtcars.dta") Z <- import("mtcars.json") ``` --- # Importing/Exporting Data .code70[ ```r # Exporting/importing several data frames: export_list()/import_list() # Make a list of two built-in data sets. # tibble::lst() automatically names the elements: df_list <- tibble::lst(mtcars, iris) export_list(df_list, file = paste0(names(df_list), ".csv")) # export_file takes a character vector; hence, we build one from the names of our element # With the paste0() we paste ".csv" to every element of the character vector # produced by names(df_list). Z <- import_list(dir(pattern = "csv$")) # import_file takes achr vector holding file paths/files. # With dir() we get all names of the files that match a specific pattern (regular expression). # In this case, all files that end with csv ($ matches the end of the string).[1] ``` ] .font70[<sup>[1]</sup> .hl[Fine Point:] [Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) were developed in computer science to specify search patterns/make character matching possible. They are very useful (e.g. for manipulating/cleaning textual data) but pretty hard to memorize. Even if you know the basics, you will google a lot. In R, I recommend the [stringi/stringr](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) packages.] --- # Pipes: %>% and |> - Pipes are crucial to the tidyverse workflow but also in general. - They make code more readable. Idea: Take the output of a function and to pass it **as the first argument** of another function. `$$f(g(x)) \quad \text{becomes (in R)} \quad x \quad \text{%>%}^{[2]} \quad g() \quad \text{%>%} \quad f()$$` -- .code80[ ```r x <- c(1, 2, 3, 4) sqrt(mean(x)) x %>% mean() %>% sqrt() ``` ] .font70[<sup>[2]</sup> .hl[TIPP:] Press ctrl-shift-m to produce the [magrittr](https://magrittr.tidyverse.org/) pipe, `%>%`, in RStudio.] --- # Pipes: %>% and |> - We can also pass the output as the second, third, ... argument using `.` as a placeholder. - `f(a, b = c)` can be written as `c %>% f(a, b = .)` Example: ```r "Ceci n%est pas une pipe" %>% gsub("%", "'", x = .) # gsub() performs replacement of all matches in a chr. vector. ``` ``` ## [1] "Ceci n'est pas une pipe" ``` -- - They are so useful, the R core team even introduced a base pipe (`|>`) in May 2021. Something like this happens **very** rarely: ```r x |> mean() |> sqrt() ``` --- # Pipes: %>% and |> - The base pipe is a [tiny](https://statistik-dresden.de/archives/17227) bit faster. - It also does not need an extra dependency. This may prove useful for package development (using the `magrittr` pipe in packages can be annoying sometimes). -- - .hl[BUT]: it does not (yet) properly support the `.` placeholder (as of June 2021). We can hack it tho: ```r Sys.setenv("_R_USE_PIPEBIND_" = "true") "Ceci n%est pas une pipe" |> . => gsub("%", "'", x = .) ``` ``` ## [1] "Ceci n'est pas une pipe" ``` - That's pretty ugly and [apparently](https://twitter.com/rdataberlin/status/1396170545046097928) still buggy. We will stick to `%>%`. --- class: inverse, center, middle name: intro # Let's Wrangle --- # The Data - We will use the [parlgov](http://www.parlgov.org/) database: > ParlGov is a data infrastructure for political science and contains information for all EU and most OECD democracies (37 countries). The database combines approximately 1700 parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties). - It's relational, i.e., consists of different tables (parties, elections, cabinets) that can be **joined** using key variables. It can also be joined with the [partyfacts](https://partyfacts.herokuapp.com/) dataset that provides ids for many other datasets (e.g., CLEA, ESS). --- # The Data <img src="data:image/png;base64,#Figs/relational.svg" width="60%" style="display: block; margin: auto;" /> --- # The Data - We will use the [parlgov](http://www.parlgov.org/) database: > ParlGov is a data infrastructure for political science and contains information for all EU and most OECD democracies (37 countries). The database combines approximately 1700 parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties). - It's relational, i.e. consists of different tables (parties, elections, cabinets) that can be **joined** using key variables. It can also be joined with the [partyfacts](https://partyfacts.herokuapp.com/) dataset that provides id's for many other datasets (e.g. CLEA, ESS). - It makes for pretty simple examples and hence we use it. Let's import the election data: ```r parlgov_elec <- import("http://www.parlgov.org/static/data/development-cp1252/view_election.csv") ``` --- # The Data: Getting an Overview .code70[ ```r glimpse(parlgov_elec) # enhanced version of str() ``` ``` ## Rows: 8,665 ## Columns: 16 ## $ country_name_short <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "AU~ ## $ country_name <chr> "Australia", "Australia", "Australia",~ ## $ election_type <chr> "parliament", "parliament", "parliamen~ ## $ election_date <date> 1901-03-30, 1901-03-30, 1901-03-30, 1~ ## $ vote_share <dbl> 44.4, 34.2, 19.4, 1.4, 0.6, 29.7, 34.4~ ## $ seats <int> 32, 26, 15, 1, 1, 26, 25, 23, 1, 0, 27~ ## $ seats_total <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75~ ## $ party_name_short <chr> "PP", "FTP", "ALP", "none", "one-seat"~ ## $ party_name <chr> "Protectionist Party", "Free Trade Par~ ## $ party_name_english <chr> "Protectionist Party", "Free Trade Par~ ## $ left_right <dbl> 7.4000, 6.0000, 3.8833, NA, NA, 7.4000~ ## $ country_id <int> 33, 33, 33, 33, 33, 33, 33, 33, 33, 33~ ## $ election_id <int> 731, 731, 731, 731, 731, 730, 730, 730~ ## $ previous_parliament_election_id <int> NA, NA, NA, NA, NA, 731, 731, 731, 731~ ## $ previous_cabinet_id <int> NA, NA, NA, NA, NA, 997, 997, 997, 997~ ## $ party_id <int> 1898, 1938, 1253, 1396, 2299, 1898, 19~ ``` ] --- # The Data: Getting an Overview
.hl[Q]: Is this data set tidy? --- # The Data: Getting an Overview .code70[ ```r datasummary_skim(parlgov_elec) # from modelsummary package. Set type = "categorical" for character vars. # Of course, not super informative in our hierarchical data set: ``` <table class="table table" style="width: auto !important; margin-left: auto; margin-right: auto; font-size: 17px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Unique (#) </th> <th style="text-align:right;"> Missing (%) </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Max </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> vote_share </td> <td style="text-align:right;"> 2459 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 11.9 </td> <td style="text-align:right;"> 12.9 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 6.6 </td> <td style="text-align:right;"> 71.2 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.12" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.90" y="8.10" width="3.12" height="3.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.02" y="9.69" width="3.12" height="1.98" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.14" y="10.42" width="3.12" height="1.25" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.26" y="10.63" width="3.12" height="1.03" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.38" y="10.76" width="3.12" height="0.90" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.50" y="10.94" width="3.12" height="0.72" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.63" y="11.03" width="3.12" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.75" y="11.12" width="3.12" height="0.54" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.87" y="11.30" width="3.12" height="0.36" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="32.99" y="11.54" width="3.12" height="0.12" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.11" y="11.63" width="3.12" height="0.031" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.23" y="11.66" width="3.12" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.35" y="11.66" width="3.12" height="0.0048" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.47" y="11.66" width="3.12" height="0.0024" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> seats </td> <td style="text-align:right;"> 291 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 23.3 </td> <td style="text-align:right;"> 45.7 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 470.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="4.73" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.51" y="10.93" width="4.73" height="0.73" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.23" y="11.42" width="4.73" height="0.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.96" y="11.53" width="4.73" height="0.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.69" y="11.60" width="4.73" height="0.058" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.42" y="11.60" width="4.73" height="0.060" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.15" y="11.64" width="4.73" height="0.025" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.87" y="11.65" width="4.73" height="0.010" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.60" y="11.66" width="4.73" height="0.0034" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.33" y="11.66" width="4.73" height="0.0011" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> seats_total </td> <td style="text-align:right;"> 209 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 212.2 </td> <td style="text-align:right;"> 181.3 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 151.0 </td> <td style="text-align:right;"> 709.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.46" y="4.82" width="3.16" height="6.84" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.62" y="4.24" width="3.16" height="7.43" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.78" y="3.76" width="3.16" height="7.90" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.93" y="3.22" width="3.16" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.09" y="8.73" width="3.16" height="2.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.24" y="10.52" width="3.16" height="1.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.40" y="9.38" width="3.16" height="2.28" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.56" y="11.29" width="3.16" height="0.37" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.71" y="11.42" width="3.16" height="0.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.87" y="9.27" width="3.16" height="2.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.03" y="10.91" width="3.16" height="0.75" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.18" y="10.50" width="3.16" height="1.16" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.34" y="8.92" width="3.16" height="2.74" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.50" y="11.37" width="3.16" height="0.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.65" y="11.55" width="3.16" height="0.11" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> left_right </td> <td style="text-align:right;"> 413 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 5.1 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 5.7 </td> <td style="text-align:right;"> 9.8 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="10.29" width="4.52" height="1.38" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.30" y="6.81" width="4.52" height="4.85" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.83" y="7.42" width="4.52" height="4.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.35" y="3.67" width="4.52" height="7.99" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.87" y="7.86" width="4.52" height="3.80" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.40" y="3.79" width="4.52" height="7.88" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.92" y="5.30" width="4.52" height="6.37" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.44" y="3.22" width="4.52" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.97" y="7.51" width="4.52" height="4.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.49" y="10.53" width="4.52" height="1.13" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> country_id </td> <td style="text-align:right;"> 37 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 37.2 </td> <td style="text-align:right;"> 20.5 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 37.0 </td> <td style="text-align:right;"> 75.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.18" y="8.46" width="3.00" height="3.20" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.18" y="3.22" width="3.00" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.18" y="8.54" width="3.00" height="3.12" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.19" y="11.03" width="3.00" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.19" y="5.87" width="3.00" height="5.79" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.19" y="4.48" width="3.00" height="7.19" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.20" y="4.26" width="3.00" height="7.40" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.20" y="5.13" width="3.00" height="6.53" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.20" y="3.64" width="3.00" height="8.02" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.20" y="11.66" width="3.00" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="31.21" y="6.85" width="3.00" height="4.81" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.21" y="7.03" width="3.00" height="4.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.21" y="5.09" width="3.00" height="6.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.22" y="7.74" width="3.00" height="3.93" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.22" y="8.76" width="3.00" height="2.90" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> election_id </td> <td style="text-align:right;"> 998 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 569.5 </td> <td style="text-align:right;"> 320.2 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 583.0 </td> <td style="text-align:right;"> 1094.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.74" y="5.51" width="4.07" height="6.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.80" y="5.49" width="4.07" height="6.17" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.87" y="5.70" width="4.07" height="5.96" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.94" y="5.60" width="4.07" height="6.06" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.00" y="5.77" width="4.07" height="5.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.07" y="4.99" width="4.07" height="6.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.13" y="5.23" width="4.07" height="6.43" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.20" y="4.27" width="4.07" height="7.40" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.27" y="4.97" width="4.07" height="6.70" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.33" y="6.08" width="4.07" height="5.58" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.40" y="3.22" width="4.07" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> previous_parliament_election_id </td> <td style="text-align:right;"> 794 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 543.0 </td> <td style="text-align:right;"> 301.5 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 563.0 </td> <td style="text-align:right;"> 1079.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.74" y="6.12" width="4.12" height="5.54" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.86" y="5.51" width="4.12" height="6.16" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.98" y="6.04" width="4.12" height="5.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.11" y="6.28" width="4.12" height="5.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.23" y="5.48" width="4.12" height="6.18" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.35" y="5.23" width="4.12" height="6.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.47" y="4.97" width="4.12" height="6.69" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.60" y="3.22" width="4.12" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.72" y="7.09" width="4.12" height="4.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.84" y="5.95" width="4.12" height="5.71" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.97" y="7.26" width="4.12" height="4.40" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> previous_cabinet_id </td> <td style="text-align:right;"> 833 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 752.0 </td> <td style="text-align:right;"> 468.0 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 734.0 </td> <td style="text-align:right;"> 1634.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.72" y="3.22" width="2.72" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.45" y="3.54" width="2.72" height="8.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.17" y="4.06" width="2.72" height="7.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.89" y="5.89" width="2.72" height="5.77" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.62" y="4.39" width="2.72" height="7.27" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.34" y="4.72" width="2.72" height="6.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.06" y="5.27" width="2.72" height="6.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.79" y="4.10" width="2.72" height="7.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.51" y="3.93" width="2.72" height="7.73" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.23" y="5.87" width="2.72" height="5.80" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.96" y="3.37" width="2.72" height="8.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="31.68" y="5.88" width="2.72" height="5.78" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.40" y="6.99" width="2.72" height="4.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.13" y="9.05" width="2.72" height="2.61" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.85" y="7.32" width="2.72" height="4.34" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.57" y="4.05" width="2.72" height="7.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.30" y="9.85" width="2.72" height="1.81" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> party_id </td> <td style="text-align:right;"> 1559 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1120.2 </td> <td style="text-align:right;"> 738.8 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 1015.0 </td> <td style="text-align:right;"> 2812.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.75" y="4.10" width="3.16" height="7.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.91" y="3.22" width="3.16" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.07" y="4.25" width="3.16" height="7.41" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.24" y="3.25" width="3.16" height="8.42" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.40" y="3.70" width="3.16" height="7.97" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.56" y="4.39" width="3.16" height="7.27" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.73" y="3.44" width="3.16" height="8.22" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.89" y="4.19" width="3.16" height="7.48" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.05" y="7.92" width="3.16" height="3.74" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.22" y="10.00" width="3.16" height="1.66" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.38" y="8.77" width="3.16" height="2.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.54" y="8.15" width="3.16" height="3.51" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.71" y="8.06" width="3.16" height="3.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.87" y="8.56" width="3.16" height="3.10" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="46.03" y="11.54" width="3.16" height="0.12" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> </tbody> </table> ] .font70[.hl[TIPP:] The [collapse](https://github.com/SebKrantz/collapse) package also provides some data summary functions (`descr()` and `qsu()`) that work really well on huge data sets. Stata users will love it as it's similar to `summarize`.] --- # Tidyverse's Wrangling "Wunderkind": dplyr <img src="data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/dplyr_wrangling.png" width="40%" style="display: block; margin: auto;" /> <center> .font70[© [Alison Horst](https://github.com/allisonhorst/stats-illustrations)] <center/> A full list of `dplyr` functions can be found [here](https://dplyr.tidyverse.org/reference/index.html). --- # Data Masking - Before we jump right in, remember when we talked about R environments in the last session? - When we wanted to pass a variable of a data frame (or element of a list) to a function, we needed to make that .hl[explicit], e.g., using `$` for subsetting. ```r lm(df$y ~ df$x) ``` - Some important dplyr functions use [data masking](https://dplyr.tidyverse.org/articles/programming.html) such that you can refer to variables just like objects in the (global) environment. - Get's rid of .hl[$]s. --- # Filtering Rows - Let's start with something very basic: filtering rows. - For instance, we might want to obtain only the results of all German elections: .code70[ ```r parlgov_elec_de <- parlgov_elec %>% # add, e.g., _de if we want to keep our original df filter(country_name_short == "DEU") ``` ] -- - Now, say we want the mean SPD vote share across all Bundestag elections (bit silly but alas): .code60[ ```r parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD", election_type == "parliament") %>% mean(vote_share) ``` ] - That does not work....hl[Q]: Why? .font70[.hl[Note] that using convenience wrappers such as `datasummary()` is no substitute for learning the underlying base, tidyverse or data.table-way of achieving this.] --- # pull() - Base-R's summary stats functions can't take a df or tibble, and we did not wrap it into a dplyr function (.hl[in order for data masking to kick in]). Two options: 1\. We can use `pull()` to pull out a vector from a df; works like `$` but for pipes. ```r parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD", election_type == "parliament") %>% pull(vote_share) %>% mean() ``` ``` ## [1] 31.70964 ``` --- # summarise() 2\. We can wrap `mean()` into `summarise()` ("data-masking" for built-in summary functions): ```r parlgov_elec_de %>% filter(party_name_short == "SPD", election_type == "parliament") %>% summarise(mean(vote_share)) # summarise() takes summary functions such as mean(), sd(), etc. ``` ``` ## mean(vote_share) ## 1 31.70964 ``` --- # summarise() - Puts out a tibble we can pass to other dplyr functions. - We can also compute more summary statistics with `summarise()` .hl[AND] also turn them into variables: -- ```r parlgov_elec_de %>% filter(party_name_short == "SPD", election_type == "parliament") %>% summarise(mean = mean(vote_share), sd = sd(vote_share), n = n()) ``` ``` ## mean sd n ## 1 31.70964 8.374648 28 ``` ```r # Aside: you can also get the number of rows (in a group) by using the base # function nrow(.) with a placeholder in the above pipe. ``` - In that, it is somewhat similar to the `mutate()` command... but it .hl2[collapses the data]. --- # mutate() - With `mutate()` you can add columns - or overwrite existing ones (with the same name) - based on transformations of other variables - For instance, say we want to add a new column "year" based on the election date variable: ```r parlgov_elec_de <- parlgov_elec_de %>% # here, we "overwrite" our df mutate(year = lubridate::year(election_date)) ``` -- - Here, we used the `year()` function of the [`lubridate`](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf) package. - We pulled out an integer vector indicating the election year from the `date` variable and assigned it to a new variable.<sup>[3]<sup/> - `transmute()` is the opposite of `mutate()`; i.e. adds a new variable and drops the rest. .font70[ <sup>[3]<sup/>] .font70[.hl[NOTE] We will use some core `lubridate` functions on the fly as they are rather intuitive. If you deal with dates often, give [this](https://r4ds.had.co.nz/dates-and-times.html) intro a read.] --- # select() and arrange() - Another basic wrangling operation we occasionally need to do do is selecting columns. - And sometimes we also want to sort columns: .code70[ ```r parlgov_elec_de %>% filter(year == 2017) %>% select(party_name_short, vote_share, left_right) %>% arrange(desc(left_right)) #default is ascending; we can wrap the masked vector with desc() to sort descending ``` ``` ## party_name_short vote_share left_right ## 1 AfD 12.6 8.8000 ## 2 FW 1.0 7.4000 ## 3 CSU 6.2 7.2871 ## 4 CDU 26.8 6.2503 ## 5 FDP 10.7 5.9233 ## 6 SPD 20.5 3.6451 ## 7 B90/Gru 8.9 2.9308 ## 8 PDS|Li 9.2 1.2152 ## 9 PARTEI 1.0 NA ``` ] --- # group_by() - The last of the most fundamental dplyr "verbs" we will learn is `group_by`. - Given the variables you supply, it converts a data frame into a grouped version of it such that you can do transformations, call functions, manipulate variables, etc. - For instance, let's use the whole data to get the party with the most votes ever in a parliamentary election by country... --- # group_by() .code60[ ```r parlgov_elec %>% filter(election_type == "parliament") %>% group_by(country_name_short) %>% summarise(share_max = max(vote_share, na.rm = TRUE), # In many base functions, default is not to ignore missings. party = party_name_english[1], election_date = election_date[1] ) %>% # note the "collapsing" I mentioned earlier arrange(desc(share_max)) ``` ``` ## # A tibble: 37 x 4 ## country_name_short share_max party election_date ## <chr> <dbl> <chr> <date> ## 1 CYP 71.2 Democratic Party 1976-09-05 ## 2 LVA 68.2 Popular Front of Latvia 1990-04-29 ## 3 ROU 66.3 Democratic Party 1990-05-20 ## 4 MLT 59.9 Malta Labour Party 1947-10-27 ## 5 JPN 59 Japan Liberal Party 1946-04-10 ## 6 NZL 58.7 New Zealand Liberal Party 1902-11-25 ## 7 CAN 57 Liberal Party of Canada 1900-11-07 ## 8 CHE 56.2 Radical Democratic Party 1902-10-26 ## 9 BEL 56 Catholic Party 1900-05-27 ## 10 GBR 55.0 Conservatives 1918-12-14 ## # ... with 27 more rows ``` ] --- # group_by() - Too complex? - Ok, let's do something more simple and add a variable that tells us the maximum number of total seats for each country. -- ```r parlgov_elec <- parlgov_elec %>% group_by(country_id) %>% mutate(max_seats = max(seats_total)) %>% ungroup() # to remove the grouping ``` --- # group_by() .code70[ ```r parlgov_elec %>% filter(seats_total == max_seats) %>% select(country_name_short, election_date, max_seats) %>% distinct() # select distinct rows ``` ``` ## # A tibble: 309 x 3 ## country_name_short election_date max_seats ## <chr> <date> <int> ## 1 AUS 2019-05-18 151 ## 2 AUT 1920-10-17 183 ## 3 AUT 1971-10-10 183 ## 4 AUT 1975-10-05 183 ## 5 AUT 1979-05-06 183 ## 6 AUT 1983-04-24 183 ## 7 AUT 1986-11-23 183 ## 8 AUT 1990-10-07 183 ## 9 AUT 1994-10-09 183 ## 10 AUT 1995-12-17 183 ## # ... with 299 more rows ``` ] --- # Other usefull stuff - `tidyr::separate()` to separate entries in columns; `tidyr::unite` to unite columns - `dplyr::rename()` to rename variables. - `dplyr::count()` count unique values of one or multiple variables. - `dplyr::distinct()` or Base R `unique()` to remove duplicate rows - `dplyr::slice()` to get a slice of rows: `parlgov_elec %>% slice(1:5)` to get the first 5 rows. - `dplyr::across()` apply transformations to multiple columns. .font70[.hl[TIPP] Check out the [`modelsummary`](https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html) functions in more detail for easy two- or multi-dimensional cross tabulations (for Stata `tabulate` addicts, see e.g., `datasummary_crosstab()`). ALternatively, use [`janitor::tabyl`](https://github.com/sfirke/janitor).] --- class: inverse, center, middle name: intro # Session 3 - Problem Set --- class: inverse, center, middle name: intro # Next Up: Data Wrangling II (Next Week) --- class: inverse, center, middle name: intro # Q & A