Session 4: Data Wrangling II

class: inverse, center, title-slide, middle

# .title-wrap[Intro to Programming with R for Political Scientists]

<br />

## .header-fancy[Session 4: Data Wrangling II]

### Markus Freitag

### Geschwister Scholl Institute of Political Science, LMU

### [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg">  <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/MarkusGFreitag) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg">  <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg>](https://markusfreitag.netlify.app/)

### July 19, 2021

<a href="https://github.com/m-freitag" class="github-corner" aria-label="View
source on Github"><svg width="80" height="80" viewBox="0 0 250 250"
style="fill:#415564; color:#f6f3f2; position: absolute; top: 0; border: 0;
right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250
L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6
C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9
125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor"
style="transform-origin: 130px 106px;" class="octo-arm"></path><path
d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2
139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2
159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2
C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2
216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4
203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9
156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z"
fill="currentColor"
class="octo-body"></path></svg></a><style>.github-corner:hover
.octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes
octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media
(max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner
.octo-arm{animation:octocat-wave 560ms ease-in-out}}</style>

---

# Overview

1. Intro + R-Studio and (Git)Hub

3. Base R & Tidyverse Basics

3.  Data Wrangling I

4. .hl[Data Wrangling II]

5. Data Viz

6. Writing Functions

---

# Workflow

- As I changed some file names and appended some things from last week, merging your forked
version with mine will surely generate some merge conflicts.

- As we don't want to spend too much time on this, just download a `.zip` of the course repo and manually copy the files you want in your repo
or initialize a new repo.

- Navigate to `Session Scripts` and open `Session_3_script.R` (Same as Data Wrangling I).

- You will see a pre-formatted Script with all the steps I do on the slides.

- Explore as you follow.

- If you have a second monitor, great! If not, split your screen.

---

# The Data

- We will use the [parlgov](http://www.parlgov.org/) database:

> ParlGov is a data infrastructure for political 
science and contains information for all EU 
and most OECD democracies (37 countries). 
The database combines approximately 1700 
parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties).

- It's relational, i.e. consists of different tables (parties, elections, cabinets)
that can be **joined** using key variables. It can also be joined with the 
[partyfacts](https://partyfacts.herokuapp.com/) dataset that provides id's for 
many other datasets (e.g. CLEA, ESS).

- It makes for pretty simple examples and hence we use it.

---

# The Data

Let's import the election data:

```r
parlgov_elec <- import("http://www.parlgov.org/static/data/development-cp1252/view_election.csv")
```

And filter for German elections:

```r
parlgov_elec_de <- parlgov_elec %>% # add, e.g., _de if we want to keep our original df 
  filter(country_name_short == "DEU")
```

---

# Dealing with factor variables: forcats

- Factor variables are useful, especially for plotting and modelling.

- With `factor_recode`, we can easily recode levels:

```r
parlgov_elec_de %>%
  mutate(election_type = fct_recode(election_type, # Coerces the type automatically from chr to fct.
    Bundestagswahl = "parliament",
    Europawahl = "ep"
  )) %>%
  count(election_type)
```

```
##    election_type   n
## 1     Europawahl  82
## 2 Bundestagswahl 264
```

- With `fct_recorder` we can reorder factors (which will be useful for plotting
factors).

---

# Complex conditions: if_else and case_when

- Often, we also want to manipulate variables by means of complex conditions

- We will go deeper into control flow statements next week, but here is a sneak preview for data wrangling.

- Say we want to create a variable, "family", that puts parties into some party family
based on some arbitrary cutoff of the time-invariant left_right position:

```r
parlgov_elec_de <- parlgov_elec_de %>%
  mutate(family = if_else(left_right > 5, "right", "left"))
```

- Vectorised if: `if_else(condition, true, false)`.

---

# Complex conditions: if_else and case_when

- A generalised version of `if_else` is `case_when`.

- This is .hl2[very] useful:

```r
parlgov_elec_de <- parlgov_elec_de %>%
  mutate(family = case_when(
    left_right <= 2.5 ~ "left",
    left_right > 2.5 & left_right < 5 ~ "centre-left",
    left_right > 5 & left_right < 7.5 ~ "centre-right",
    left_right >= 7.5 ~ "right"))
```

- Two-sided formula: LHS = logical test; RHS = value to assign if the test
is `TRUE`.

- Values that do not fall into any of the conditions become `NA` which can
be prevented by adding `TRUE ~ something` as the last argument.

---
# tidyr: reshaping data

- Reshaping data is one of the key things you need to when cleaning/analysing data.

- [Two functions](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf):

- `tidyr::pivot_wider/longer()` is for reshaping from long (wide) to wide (long).

Two main arguments:

- `names_*` and `values_*`, where "*" is "to" for `pivot_longer()` and "from" for 
  `pivot_wider()`.

--
  
.pull-left[

<img src="data:image/png;base64,#Figs/wide.png" width="60%" style="display: block; margin: auto;" />
<center> long &rarr; wide: `pivot_wider()` <center/> 
]

.pull-right[

<img src="data:image/png;base64,#Figs/longer.png" width="60%" style="display: block; margin: auto;" />
<br/>
<center> wide &rarr; long: `pivot_longer()` <center/> 
]

---

# tidyr: reshaping data

.hl[Example:]

- Say, we want a table of the vote shares of all major parties for each 
post-WW2 parliamentary election.

- Where each row is an election:

```r
wide <- parlgov_elec_de %>%
  filter(election_type == "parliament", vote_share >= 5, year(election_date) >= 1945) %>%
  select(election_date, party_name_short, vote_share) %>%
  pivot_wider(names_from = party_name_short, values_from = vote_share)
```

---

# tidyr: reshaping data

<div id="htmlwidget-ff0fd2fba8e5d55c8267" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ff0fd2fba8e5d55c8267">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19"],["1949-08-14","1953-09-06","1957-09-15","1961-09-17","1965-09-19","1969-09-28","1972-11-19","1976-10-03","1980-10-05","1983-03-06","1987-01-25","1990-12-02","1994-10-16","1998-09-27","2002-09-22","2005-09-18","2009-09-27","2013-09-22","2017-09-24"],[29.2,28.8,31.8,36.2,39.3,42.7,45.8,42.6,42.9,38.2,37.2,33.5,36.4,40.9,38.5,34.2,23,25.7,20.5],[25.2,36.4,39.7,35.8,38.2,36.6,35.2,38.1,34.2,38.2,34.5,36.7,34.2,28.4,29.5,27.8,27.3,34.1,26.8],[11.9,9.5,7.7,12.8,9.5,5.8,8.4,7.9,10.6,7,9.1,11.1,6.9,6.2,7.4,9.8,14.6,null,10.7],[5.8,8.8,10.5,9.6,9.6,9.5,9.7,10.6,10.3,10.6,9.8,7.1,7.3,6.7,9,7.4,6.5,7.4,6.2],[5.7,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],[null,5.9,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],[null,null,null,null,null,null,null,null,null,5.6,8.3,null,7.3,6.7,8.6,8.1,10.7,8.4,8.9],[null,null,null,null,null,null,null,null,null,null,null,null,null,5.1,null,8.7,11.9,8.6,9.2],[null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,12.6]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>election_date<\/th>\n      <th>SPD<\/th>\n      <th>CDU<\/th>\n      <th>FDP<\/th>\n      <th>CSU<\/th>\n      <th>KPD<\/th>\n      <th>GB/BHE<\/th>\n      <th>B90/Gru<\/th>\n      <th>PDS|Li<\/th>\n      <th>AfD<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":5,"scrollX":true,"columnDefs":[{"className":"dt-right","targets":[2,3,4,5,6,7,8,9,10]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[5,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

---

# tidyr: reshaping data

- We can revert back to long format:

```r
long <- wide %>%
  pivot_longer(!election_date, names_to = "party_name_short", values_to = "vote_share") %>%
  filter(is.na(vote_share) == FALSE) # alternatively, simply set values_drop_na to TRUE in pivot_longer().
head(long)
```

```
## # A tibble: 6 x 3
##   election_date party_name_short vote_share
##   <date>        <chr>                 <dbl>
## 1 1949-08-14    SPD                    29.2
## 2 1949-08-14    CDU                    25.2
## 3 1949-08-14    FDP                    11.9
## 4 1949-08-14    CSU                     5.8
## 5 1949-08-14    KPD                     5.7
## 6 1953-09-06    SPD                    28.8
```

---

# dplyr: joins

- Let's come back to the relational nature of our data...

---

# The Data

---
  
# dplyr: joins

- Let's come back to the [relational](https://en.wikipedia.org/wiki/Relational_database) nature of our data...

- Remember, it consists of different tables, each representing one "entity type" 
(c.f. the 3rd. point in Wickham's tidy data framework)

- Each table has a unique key, representing each row. This key variable is used to link/join tables

- .hl[Suppose we want to join the party and the election table. How do we do that?]

---
  
# dplyr: joins

.hl2[Q] Which join do we need?

---
  
# dplyr: joins

We want a left join here...

.code70[

```r
parlgov_party <- rio::import("http://www.parlgov.org/static/data/development-utf-8/view_party.csv")

l_joined <- left_join(parlgov_elec_de, parlgov_party, by = "party_id")

head(l_joined)
```

```
##   country_name_short.x country_name.x election_type election_date vote_share
## 1                  DEU        Germany    parliament    1919-01-19      37.87
## 2                  DEU        Germany    parliament    1919-01-19      18.32
## 3                  DEU        Germany    parliament    1919-01-19      15.45
## 4                  DEU        Germany    parliament    1919-01-19      10.26
## 5                  DEU        Germany    parliament    1919-01-19       4.66
## 6                  DEU        Germany    parliament    1919-01-19       7.63
##   seats seats_total party_name_short.x
## 1   165         423                SPD
## 2    74         423                DDP
## 3    73         423                 DZ
## 4    41         423               DNVP
## 5    23         423                DVP
## 6    22         423               USPD
##                                         party_name.x
## 1            Sozialdemokratische Partei Deutschlands
## 2                      Deutsche Demokratische Partei
## 3                            Deutsche Zentrumspartei
## 4                       Deutschnationale Volkspartei
## 5                               Deutsche Volkspartei
## 6 Unabhängige Sozialdemokratische Partei Deutschland
##                             party_name_english.x left_right.x country_id.x
## 1             Social Democratic Party of Germany       3.6451           54
## 2                        German Democratic Party       6.0000           54
## 3                                   Centre Party       6.2000           54
## 4                 German National People's Party       8.8000           54
## 5                          German People's Party       7.4000           54
## 6 Independent Social Democratic Party of Germany       1.3000           54
##   election_id previous_parliament_election_id previous_cabinet_id party_id
## 1        1031                              NA                  NA      558
## 2        1031                              NA                  NA     2691
## 3        1031                              NA                  NA      137
## 4        1031                              NA                  NA     2692
## 5        1031                              NA                  NA     2693
## 6        1031                              NA                  NA     2690
##         family country_name_short.y country_name.y party_name_short.y
## 1  centre-left                  DEU        Germany                SPD
## 2 centre-right                  DEU        Germany                DDP
## 3 centre-right                  DEU        Germany                 DZ
## 4        right                  DEU        Germany               DNVP
## 5 centre-right                  DEU        Germany                DVP
## 6         left                  DEU        Germany               USPD
##                             party_name_english.y
## 1             Social Democratic Party of Germany
## 2                        German Democratic Party
## 3                                   Centre Party
## 4                 German National People's Party
## 5                          German People's Party
## 6 Independent Social Democratic Party of Germany
##                                          party_name.y
## 1             Sozialdemokratische Partei Deutschlands
## 2                       Deutsche Demokratische Partei
## 3                             Deutsche Zentrumspartei
## 4                        Deutschnationale Volkspartei
## 5                                Deutsche Volkspartei
## 6 UnabhÃ¤ngige Sozialdemokratische Partei Deutschland
##                                      party_name_ascii family_name_short
## 1             Sozialdemokratische Partei Deutschlands               soc
## 2                       Deutsche Demokratische Partei               lib
## 3                             Deutsche Zentrumspartei               chr
## 4                        Deutschnationale Volkspartei             right
## 5                                Deutsche Volkspartei               con
## 6 Unabhaengige Sozialdemokratische Partei Deutschland               com
##           family_name left_right.y state_market liberty_authority eu_anti_pro
## 1    Social democracy       3.6451       3.8443            3.9791      7.9318
## 2             Liberal       6.0000       6.7000            3.6000      8.7000
## 3 Christian democracy       6.2000       5.7000            7.1000      8.3000
## 4          Right-wing       8.8000       5.9000            8.5000      2.3000
## 5        Conservative       7.4000       6.4000            7.0000      7.9000
## 6 Communist/Socialist       1.3000       1.4000            3.0000      3.3000
##     cmp euprofiler     ees castles_mair huber_inglehart ray benoit_laver chess
## 1 41320        120 1276320          803            1503 302         6087   302
## 2    NA         NA      NA           NA              NA  NA           NA    NA
## 3 41522         NA      NA           NA              NA  NA           NA    NA
## 4    NA         NA      NA           NA              NA  NA           NA    NA
## 5    NA         NA      NA           NA              NA  NA           NA    NA
## 6    NA         NA      NA           NA              NA  NA           NA    NA
##   country_id.y family_id
## 1           54        11
## 2           54         6
## 3           54         3
## 4           54        40
## 5           54        26
## 6           54        14
```
]

.hl[Q] Why do we get `party_name_short.x` etc?

---

# dplyr: joins

- There are multiple matching (by name) variables in both tables.

- Hence, we need to specify all keys, or let `dplyr` do its magic:

.code70[

```r
l_joined <- left_join(parlgov_elec_de, parlgov_party)

head(l_joined)
```

```
##   country_name_short country_name election_type election_date vote_share seats
## 1                DEU      Germany    parliament    1919-01-19      37.87   165
## 2                DEU      Germany    parliament    1919-01-19      18.32    74
## 3                DEU      Germany    parliament    1919-01-19      15.45    73
## 4                DEU      Germany    parliament    1919-01-19      10.26    41
## 5                DEU      Germany    parliament    1919-01-19       4.66    23
## 6                DEU      Germany    parliament    1919-01-19       7.63    22
##   seats_total party_name_short
## 1         423              SPD
## 2         423              DDP
## 3         423               DZ
## 4         423             DNVP
## 5         423              DVP
## 6         423             USPD
##                                           party_name
## 1            Sozialdemokratische Partei Deutschlands
## 2                      Deutsche Demokratische Partei
## 3                            Deutsche Zentrumspartei
## 4                       Deutschnationale Volkspartei
## 5                               Deutsche Volkspartei
## 6 Unabhängige Sozialdemokratische Partei Deutschland
##                               party_name_english left_right country_id
## 1             Social Democratic Party of Germany     3.6451         54
## 2                        German Democratic Party     6.0000         54
## 3                                   Centre Party     6.2000         54
## 4                 German National People's Party     8.8000         54
## 5                          German People's Party     7.4000         54
## 6 Independent Social Democratic Party of Germany     1.3000         54
##   election_id previous_parliament_election_id previous_cabinet_id party_id
## 1        1031                              NA                  NA      558
## 2        1031                              NA                  NA     2691
## 3        1031                              NA                  NA      137
## 4        1031                              NA                  NA     2692
## 5        1031                              NA                  NA     2693
## 6        1031                              NA                  NA     2690
##         family                        party_name_ascii family_name_short
## 1  centre-left Sozialdemokratische Partei Deutschlands               soc
## 2 centre-right           Deutsche Demokratische Partei               lib
## 3 centre-right                 Deutsche Zentrumspartei               chr
## 4        right            Deutschnationale Volkspartei             right
## 5 centre-right                    Deutsche Volkspartei               con
## 6         left                                    <NA>              <NA>
##           family_name state_market liberty_authority eu_anti_pro   cmp
## 1    Social democracy       3.8443            3.9791      7.9318 41320
## 2             Liberal       6.7000            3.6000      8.7000    NA
## 3 Christian democracy       5.7000            7.1000      8.3000 41522
## 4          Right-wing       5.9000            8.5000      2.3000    NA
## 5        Conservative       6.4000            7.0000      7.9000    NA
## 6                <NA>           NA                NA          NA    NA
##   euprofiler     ees castles_mair huber_inglehart ray benoit_laver chess
## 1        120 1276320          803            1503 302         6087   302
## 2         NA      NA           NA              NA  NA           NA    NA
## 3         NA      NA           NA              NA  NA           NA    NA
## 4         NA      NA           NA              NA  NA           NA    NA
## 5         NA      NA           NA              NA  NA           NA    NA
## 6         NA      NA           NA              NA  NA           NA    NA
##   family_id
## 1        11
## 2         6
## 3         3
## 4        40
## 5        26
## 6        NA
```
]

.font70[.hl[NOTE] Be careful if variables are named the same but measure different things.]

---

# Alternative approaches

- For every tidyverse function ("verb"), there is, of course, [a base
R way to do it](https://dplyr.tidyverse.org/articles/base.html).

- There are alternatives.

- For instance, the [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) 
and [collapse](https://raw.githubusercontent.com/SebKrantz/cheatsheets/master/collapse.pdf) (also comes with fast versions of summary stats and models) package provide great
and fast data wrangling alternatives.

- Data.table syntax is closer to the base R way of indexing/manipulating data frames.
Some like that.

- Don't be dogmatic. Use whatever suits you and your context. Mix stuff.

- You can find a great comparison of `data.table` and `dplyr` [here](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/).

---

# A glimpse at data.table

- Comes with its own interpretation of data frames, "data tables". Special structure to work faster.

- Looks similar to basic `df[]` indexing but with alot of twists.

- Three elements:  .hl[which observations/rows] COMMA .hl2[transformations or other functions] COMMA grouping.

Rough `dplyr` equivalent:

<center> DT[.hl[`slice(); filter(); arrange()`], .hl2[`select(); mutate()`], `group_by()`] <center/>

---

# A glimpse at data.table

Example:

.pull-left[

```r
parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df
  filter(party_name_short == "SPD") %>%
  summarise(mean(vote_share, na.rm = T))
```

```
##   mean(vote_share, na.rm = T)
## 1                    31.12622
```
]

.pull-right[

```r
setDT(parlgov_elec_de)

parlgov_elec_de[party_name_short == "SPD", mean(vote_share, na.rm = T)]
```

```
## [1] 31.12622
```
]

---

# Session 4 Problem Set/"Homework"

- We will not be able to do this task in class/break-out sessions.

- I highly recommend doing it at home - it's great practice!

.hl[If you like, send me your solution via mail and I will comment/give you feedback!]

---

class: inverse, center, middle name: intro

# Next Up: Data Viz