class: inverse, center, title-slide, middle <style> .title-slide .remark-slide-number { display: none; } </style> # .title-wrap[Intro to Programming with R for Political Scientists] <br /> ## .header-fancy[Session 4: Data Wrangling II] ### Markus Freitag ### Geschwister Scholl Institute of Political Science, LMU ### [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/MarkusGFreitag) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg>](https://markusfreitag.netlify.app/) ### July 19, 2021 <a href="https://github.com/m-freitag" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#415564; color:#f6f3f2; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style> --- # Overview 1. Intro + R-Studio and (Git)Hub 3. Base R & Tidyverse Basics 3. Data Wrangling I 4. .hl[Data Wrangling II] 5. Data Viz 6. Writing Functions --- # Workflow - As I changed some file names and appended some things from last week, merging your forked version with mine will surely generate some merge conflicts. - As we don't want to spend too much time on this, just download a `.zip` of the course repo and manually copy the files you want in your repo or initialize a new repo. - Navigate to `Session Scripts` and open `Session_3_script.R` (Same as Data Wrangling I). - You will see a pre-formatted Script with all the steps I do on the slides. - Explore as you follow. - If you have a second monitor, great! If not, split your screen. --- # The Data - We will use the [parlgov](http://www.parlgov.org/) database: > ParlGov is a data infrastructure for political science and contains information for all EU and most OECD democracies (37 countries). The database combines approximately 1700 parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties). - It's relational, i.e. consists of different tables (parties, elections, cabinets) that can be **joined** using key variables. It can also be joined with the [partyfacts](https://partyfacts.herokuapp.com/) dataset that provides id's for many other datasets (e.g. CLEA, ESS). - It makes for pretty simple examples and hence we use it. --- # The Data Let's import the election data: ```r parlgov_elec <- import("http://www.parlgov.org/static/data/development-cp1252/view_election.csv") ``` And filter for German elections: ```r parlgov_elec_de <- parlgov_elec %>% # add, e.g., _de if we want to keep our original df filter(country_name_short == "DEU") ``` --- # Dealing with factor variables: forcats - Factor variables are useful, especially for plotting and modelling. - With `factor_recode`, we can easily recode levels: ```r parlgov_elec_de %>% mutate(election_type = fct_recode(election_type, # Coerces the type automatically from chr to fct. Bundestagswahl = "parliament", Europawahl = "ep" )) %>% count(election_type) ``` ``` ## election_type n ## 1 Europawahl 82 ## 2 Bundestagswahl 264 ``` - With `fct_recorder` we can reorder factors (which will be useful for plotting factors). --- # Complex conditions: if_else and case_when - Often, we also want to manipulate variables by means of complex conditions - We will go deeper into control flow statements next week, but here is a sneak preview for data wrangling. - Say we want to create a variable, "family", that puts parties into some party family based on some arbitrary cutoff of the time-invariant left_right position: -- ```r parlgov_elec_de <- parlgov_elec_de %>% mutate(family = if_else(left_right > 5, "right", "left")) ``` - Vectorised if: `if_else(condition, true, false)`. --- # Complex conditions: if_else and case_when - A generalised version of `if_else` is `case_when`. - This is .hl2[very] useful: ```r parlgov_elec_de <- parlgov_elec_de %>% mutate(family = case_when( left_right <= 2.5 ~ "left", left_right > 2.5 & left_right < 5 ~ "centre-left", left_right > 5 & left_right < 7.5 ~ "centre-right", left_right >= 7.5 ~ "right")) ``` - Two-sided formula: LHS = logical test; RHS = value to assign if the test is `TRUE`. - Values that do not fall into any of the conditions become `NA` which can be prevented by adding `TRUE ~ something` as the last argument. --- # tidyr: reshaping data - Reshaping data is one of the key things you need to when cleaning/analysing data. - [Two functions](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf): - `tidyr::pivot_wider/longer()` is for reshaping from long (wide) to wide (long). Two main arguments: - `names_*` and `values_*`, where "*" is "to" for `pivot_longer()` and "from" for `pivot_wider()`. -- .pull-left[ <img src="data:image/png;base64,#Figs/wide.png" width="60%" style="display: block; margin: auto;" /> <center> long → wide: `pivot_wider()` <center/> ] .pull-right[ <img src="data:image/png;base64,#Figs/longer.png" width="60%" style="display: block; margin: auto;" /> <br/> <center> wide → long: `pivot_longer()` <center/> ] --- # tidyr: reshaping data .hl[Example:] - Say, we want a table of the vote shares of all major parties for each post-WW2 parliamentary election. - Where each row is an election: ```r wide <- parlgov_elec_de %>% filter(election_type == "parliament", vote_share >= 5, year(election_date) >= 1945) %>% select(election_date, party_name_short, vote_share) %>% pivot_wider(names_from = party_name_short, values_from = vote_share) ``` --- # tidyr: reshaping data
--- # tidyr: reshaping data - We can revert back to long format: ```r long <- wide %>% pivot_longer(!election_date, names_to = "party_name_short", values_to = "vote_share") %>% filter(is.na(vote_share) == FALSE) # alternatively, simply set values_drop_na to TRUE in pivot_longer(). head(long) ``` ``` ## # A tibble: 6 x 3 ## election_date party_name_short vote_share ## <date> <chr> <dbl> ## 1 1949-08-14 SPD 29.2 ## 2 1949-08-14 CDU 25.2 ## 3 1949-08-14 FDP 11.9 ## 4 1949-08-14 CSU 5.8 ## 5 1949-08-14 KPD 5.7 ## 6 1953-09-06 SPD 28.8 ``` --- # dplyr: joins - Let's come back to the relational nature of our data... --- # The Data <img src="data:image/png;base64,#Figs/relational.svg" width="60%" style="display: block; margin: auto;" /> --- # dplyr: joins - Let's come back to the [relational](https://en.wikipedia.org/wiki/Relational_database) nature of our data... - Remember, it consists of different tables, each representing one "entity type" (c.f. the 3rd. point in Wickham's tidy data framework) - Each table has a unique key, representing each row. This key variable is used to link/join tables - .hl[Suppose we want to join the party and the election table. How do we do that?] --- # dplyr: joins <img src="data:image/png;base64,#Figs/joins.png" width="35%" style="display: block; margin: auto;" /> <img src="data:image/png;base64,#Figs/fjoins.png" width="35%" style="display: block; margin: auto;" /> .hl2[Q] Which join do we need? --- # dplyr: joins We want a left join here... .code70[ ```r parlgov_party <- rio::import("http://www.parlgov.org/static/data/development-utf-8/view_party.csv") l_joined <- left_join(parlgov_elec_de, parlgov_party, by = "party_id") head(l_joined) ``` ``` ## country_name_short.x country_name.x election_type election_date vote_share ## 1 DEU Germany parliament 1919-01-19 37.87 ## 2 DEU Germany parliament 1919-01-19 18.32 ## 3 DEU Germany parliament 1919-01-19 15.45 ## 4 DEU Germany parliament 1919-01-19 10.26 ## 5 DEU Germany parliament 1919-01-19 4.66 ## 6 DEU Germany parliament 1919-01-19 7.63 ## seats seats_total party_name_short.x ## 1 165 423 SPD ## 2 74 423 DDP ## 3 73 423 DZ ## 4 41 423 DNVP ## 5 23 423 DVP ## 6 22 423 USPD ## party_name.x ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_english.x left_right.x country_id.x ## 1 Social Democratic Party of Germany 3.6451 54 ## 2 German Democratic Party 6.0000 54 ## 3 Centre Party 6.2000 54 ## 4 German National People's Party 8.8000 54 ## 5 German People's Party 7.4000 54 ## 6 Independent Social Democratic Party of Germany 1.3000 54 ## election_id previous_parliament_election_id previous_cabinet_id party_id ## 1 1031 NA NA 558 ## 2 1031 NA NA 2691 ## 3 1031 NA NA 137 ## 4 1031 NA NA 2692 ## 5 1031 NA NA 2693 ## 6 1031 NA NA 2690 ## family country_name_short.y country_name.y party_name_short.y ## 1 centre-left DEU Germany SPD ## 2 centre-right DEU Germany DDP ## 3 centre-right DEU Germany DZ ## 4 right DEU Germany DNVP ## 5 centre-right DEU Germany DVP ## 6 left DEU Germany USPD ## party_name_english.y ## 1 Social Democratic Party of Germany ## 2 German Democratic Party ## 3 Centre Party ## 4 German National People's Party ## 5 German People's Party ## 6 Independent Social Democratic Party of Germany ## party_name.y ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_ascii family_name_short ## 1 Sozialdemokratische Partei Deutschlands soc ## 2 Deutsche Demokratische Partei lib ## 3 Deutsche Zentrumspartei chr ## 4 Deutschnationale Volkspartei right ## 5 Deutsche Volkspartei con ## 6 Unabhaengige Sozialdemokratische Partei Deutschland com ## family_name left_right.y state_market liberty_authority eu_anti_pro ## 1 Social democracy 3.6451 3.8443 3.9791 7.9318 ## 2 Liberal 6.0000 6.7000 3.6000 8.7000 ## 3 Christian democracy 6.2000 5.7000 7.1000 8.3000 ## 4 Right-wing 8.8000 5.9000 8.5000 2.3000 ## 5 Conservative 7.4000 6.4000 7.0000 7.9000 ## 6 Communist/Socialist 1.3000 1.4000 3.0000 3.3000 ## cmp euprofiler ees castles_mair huber_inglehart ray benoit_laver chess ## 1 41320 120 1276320 803 1503 302 6087 302 ## 2 NA NA NA NA NA NA NA NA ## 3 41522 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA NA ## country_id.y family_id ## 1 54 11 ## 2 54 6 ## 3 54 3 ## 4 54 40 ## 5 54 26 ## 6 54 14 ``` ] .hl[Q] Why do we get `party_name_short.x` etc? --- # dplyr: joins - There are multiple matching (by name) variables in both tables. - Hence, we need to specify all keys, or let `dplyr` do its magic: .code70[ ```r l_joined <- left_join(parlgov_elec_de, parlgov_party) head(l_joined) ``` ``` ## country_name_short country_name election_type election_date vote_share seats ## 1 DEU Germany parliament 1919-01-19 37.87 165 ## 2 DEU Germany parliament 1919-01-19 18.32 74 ## 3 DEU Germany parliament 1919-01-19 15.45 73 ## 4 DEU Germany parliament 1919-01-19 10.26 41 ## 5 DEU Germany parliament 1919-01-19 4.66 23 ## 6 DEU Germany parliament 1919-01-19 7.63 22 ## seats_total party_name_short ## 1 423 SPD ## 2 423 DDP ## 3 423 DZ ## 4 423 DNVP ## 5 423 DVP ## 6 423 USPD ## party_name ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_english left_right country_id ## 1 Social Democratic Party of Germany 3.6451 54 ## 2 German Democratic Party 6.0000 54 ## 3 Centre Party 6.2000 54 ## 4 German National People's Party 8.8000 54 ## 5 German People's Party 7.4000 54 ## 6 Independent Social Democratic Party of Germany 1.3000 54 ## election_id previous_parliament_election_id previous_cabinet_id party_id ## 1 1031 NA NA 558 ## 2 1031 NA NA 2691 ## 3 1031 NA NA 137 ## 4 1031 NA NA 2692 ## 5 1031 NA NA 2693 ## 6 1031 NA NA 2690 ## family party_name_ascii family_name_short ## 1 centre-left Sozialdemokratische Partei Deutschlands soc ## 2 centre-right Deutsche Demokratische Partei lib ## 3 centre-right Deutsche Zentrumspartei chr ## 4 right Deutschnationale Volkspartei right ## 5 centre-right Deutsche Volkspartei con ## 6 left <NA> <NA> ## family_name state_market liberty_authority eu_anti_pro cmp ## 1 Social democracy 3.8443 3.9791 7.9318 41320 ## 2 Liberal 6.7000 3.6000 8.7000 NA ## 3 Christian democracy 5.7000 7.1000 8.3000 41522 ## 4 Right-wing 5.9000 8.5000 2.3000 NA ## 5 Conservative 6.4000 7.0000 7.9000 NA ## 6 <NA> NA NA NA NA ## euprofiler ees castles_mair huber_inglehart ray benoit_laver chess ## 1 120 1276320 803 1503 302 6087 302 ## 2 NA NA NA NA NA NA NA ## 3 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA ## family_id ## 1 11 ## 2 6 ## 3 3 ## 4 40 ## 5 26 ## 6 NA ``` ] .font70[.hl[NOTE] Be careful if variables are named the same but measure different things.] --- # Alternative approaches - For every tidyverse function ("verb"), there is, of course, [a base R way to do it](https://dplyr.tidyverse.org/articles/base.html). - There are alternatives. - For instance, the [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) and [collapse](https://raw.githubusercontent.com/SebKrantz/cheatsheets/master/collapse.pdf) (also comes with fast versions of summary stats and models) package provide great and fast data wrangling alternatives. - Data.table syntax is closer to the base R way of indexing/manipulating data frames. Some like that. - Don't be dogmatic. Use whatever suits you and your context. Mix stuff. - You can find a great comparison of `data.table` and `dplyr` [here](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/). --- # A glimpse at data.table - Comes with its own interpretation of data frames, "data tables". Special structure to work faster. - Looks similar to basic `df[]` indexing but with alot of twists. - Three elements: .hl[which observations/rows] COMMA .hl2[transformations or other functions] COMMA grouping. Rough `dplyr` equivalent: <center> DT[.hl[`slice(); filter(); arrange()`], .hl2[`select(); mutate()`], `group_by()`] <center/> --- # A glimpse at data.table Example: .pull-left[ ```r parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD") %>% summarise(mean(vote_share, na.rm = T)) ``` ``` ## mean(vote_share, na.rm = T) ## 1 31.12622 ``` ] .pull-right[ ```r setDT(parlgov_elec_de) parlgov_elec_de[party_name_short == "SPD", mean(vote_share, na.rm = T)] ``` ``` ## [1] 31.12622 ``` ] --- # Session 4 Problem Set/"Homework" - We will not be able to do this task in class/break-out sessions. - I highly recommend doing it at home - it's great practice! .hl[If you like, send me your solution via mail and I will comment/give you feedback!] --- class: inverse, center, middle name: intro # Next Up: Data Viz