Topic 8: Exploratory Analysis

class: center, middle, inverse, title-slide

.title[
# Topic 8: Exploratory Analysis
]
.subtitle[
## Part 1: Understand individual variables
]
.author[
### Nick Hagerty <br> ECNS 460/560 <br> Montana State University
]

---

name: toc

# Table of contents

1. [Get to know your data](#first)

1. [Describe your categorical variables](#categories)

1. [Describe your continuous distributions](#distributions)

1. [Handle your extreme values](#extreme)

1. [Choose whether to transform variables](#transform)

---

# Why?

**Before you analyze your data, you need to understand your data.**
- Never rush into regressions or other fancy analysis.
- You will often get wrong or misleading results!

**What's the point of the steps we'll cover today?**
- Verify the data you have is the data you expected.
- Detect errors in data cleaning.
- Learn surprising new features of the context you're studying.
- Make more appropriate decisions in downstream analysis.
- Better interpret your results (what's driving them?).

**Do I really have to do these steps for every variable in my data?**
- Only for the ones that you want to give you the right answers.
- Especially crucial for your **outcome** and **treatment** variables.
- Still important for other (e.g., control) variables.
- Maybe less critical for *some* types of prediction (ML) techniques.

---
class: inverse, middle
name: first

# Get to know your data

---

# Setup

Load the tidyverse if necessary:

```r
library(tidyverse)
```

Download this data on hotel listings in Vienna, Austria in November 2017:

```r
vienna = read_csv("https://osf.io/y6jvb/download")
```

.small[Data from [Gabors Data Analysis](https://osf.io/7epdj/) by Gábor Békés and Gábor Kézdi, used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).]

---

# First look

You already know these, but let's go over them:

```r
head(vienna)
```

```
## # A tibble: 6 × 24
##   country city_actual rating_count center1label center2label neighbourhood price
##   <chr>   <chr>              <dbl> <chr>        <chr>        <chr>         <dbl>
## 1 Austria Vienna                36 City centre  Donauturm    17. Hernals      81
## 2 Austria Vienna               189 City centre  Donauturm    17. Hernals      81
## 3 Austria Vienna                53 City centre  Donauturm    Alsergrund       85
## 4 Austria Vienna                55 City centre  Donauturm    Alsergrund       83
## 5 Austria Vienna                33 City centre  Donauturm    Alsergrund       82
## 6 Austria Vienna                25 City centre  Donauturm    Alsergrund      229
## # ℹ 17 more variables: city <chr>, stars <dbl>, ratingta <dbl>,
## #   ratingta_count <dbl>, scarce_room <dbl>, hotel_id <dbl>, offer <dbl>,
## #   offer_cat <chr>, year <dbl>, month <dbl>, weekend <dbl>, holiday <dbl>,
## #   distance <dbl>, distance_alter <dbl>, accommodation_type <chr>,
## #   nnights <dbl>, rating <dbl>
```

---

# First look

.scroll-output-full[

```r
View(vienna)

summary(vienna)
```

```
##    country          city_actual         rating_count  center1label      
##  Length:428         Length:428         Min.   :   1   Length:428        
##  Class :character   Class :character   1st Qu.:  27   Class :character  
##  Mode  :character   Mode  :character   Median :  84   Mode  :character  
##                                        Mean   : 155                     
##                                        3rd Qu.: 203                     
##                                        Max.   :1541                     
##                                        NA's   :35                       
##  center2label       neighbourhood          price            city          
##  Length:428         Length:428         Min.   :  27.0   Length:428        
##  Class :character   Class :character   1st Qu.:  83.0   Class :character  
##  Mode  :character   Mode  :character   Median : 109.5   Mode  :character  
##                                        Mean   : 131.4                     
##                                        3rd Qu.: 146.0                     
##                                        Max.   :1012.0                     
##                                                                           
##      stars          ratingta     ratingta_count    scarce_room    
##  Min.   :1.000   Min.   :2.000   Min.   :   2.0   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:3.500   1st Qu.: 129.0   1st Qu.:0.0000  
##  Median :3.500   Median :4.000   Median : 335.0   Median :1.0000  
##  Mean   :3.435   Mean   :3.991   Mean   : 556.5   Mean   :0.5981  
##  3rd Qu.:4.000   3rd Qu.:4.500   3rd Qu.: 811.0   3rd Qu.:1.0000  
##  Max.   :5.000   Max.   :5.000   Max.   :3171.0   Max.   :1.0000  
##                  NA's   :103     NA's   :103                      
##     hotel_id         offer         offer_cat              year     
##  Min.   :21894   Min.   :0.0000   Length:428         Min.   :2017  
##  1st Qu.:22028   1st Qu.:0.0000   Class :character   1st Qu.:2017  
##  Median :22156   Median :1.0000   Mode  :character   Median :2017  
##  Mean   :22154   Mean   :0.6799                      Mean   :2017  
##  3rd Qu.:22279   3rd Qu.:1.0000                      3rd Qu.:2017  
##  Max.   :22409   Max.   :1.0000                      Max.   :2017  
##                                                                    
##      month       weekend     holiday     distance      distance_alter  
##  Min.   :11   Min.   :0   Min.   :0   Min.   : 0.000   Min.   : 0.600  
##  1st Qu.:11   1st Qu.:0   1st Qu.:0   1st Qu.: 0.700   1st Qu.: 2.700  
##  Median :11   Median :0   Median :0   Median : 1.300   Median : 3.400  
##  Mean   :11   Mean   :0   Mean   :0   Mean   : 1.659   Mean   : 3.718  
##  3rd Qu.:11   3rd Qu.:0   3rd Qu.:0   3rd Qu.: 2.000   3rd Qu.: 4.400  
##  Max.   :11   Max.   :0   Max.   :0   Max.   :13.000   Max.   :13.000  
##                                                                        
##  accommodation_type    nnights      rating     
##  Length:428         Min.   :1   Min.   :1.000  
##  Class :character   1st Qu.:1   1st Qu.:3.700  
##  Mode  :character   Median :1   Median :4.000  
##                     Mean   :1   Mean   :3.971  
##                     3rd Qu.:1   3rd Qu.:4.400  
##                     Max.   :1   Max.   :5.000  
##                                 NA's   :35
```
]

---

# Better summaries with skimr

```r
install.packages("skimr")
library(skimr)
skim(vienna)
```
.scroll-output-75[
  .small[

Table: Data summary

|                         |       |
|:------------------------|:------|
|Name                     |vienna |
|Number of rows           |428    |
|Number of columns        |24     |
|_______________________  |       |
|Column type frequency:   |       |
|character                |8      |
|numeric                  |16     |
|________________________ |       |
|Group variables          |None   |

**Variable type: character**

|skim_variable      | n_missing| complete_rate| min| max| empty| n_unique| whitespace|
|:------------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|
|country            |         0|             1|   7|   7|     0|        1|          0|
|city_actual        |         0|             1|   6|  10|     0|        4|          0|
|center1label       |         0|             1|  11|  11|     0|        1|          0|
|center2label       |         0|             1|   9|   9|     0|        1|          0|
|neighbourhood      |         0|             1|   6|  20|     0|       22|          0|
|city               |         0|             1|   6|   6|     0|        1|          0|
|offer_cat          |         0|             1|  10|  13|     0|        5|          0|
|accommodation_type |         0|             1|   5|  19|     0|        8|          0|

**Variable type: numeric**

|skim_variable  | n_missing| complete_rate|     mean|     sd|      p0|      p25|     p50|      p75|  p100|hist  |
|:--------------|---------:|-------------:|--------:|------:|-------:|--------:|-------:|--------:|-----:|:-----|
|rating_count   |        35|          0.92|   155.05| 191.22|     1.0|    27.00|    84.0|   203.00|  1541|▇▁▁▁▁ |
|price          |         0|          1.00|   131.37|  91.58|    27.0|    83.00|   109.5|   146.00|  1012|▇▁▁▁▁ |
|stars          |         0|          1.00|     3.43|   0.77|     1.0|     3.00|     3.5|     4.00|     5|▁▂▆▇▂ |
|ratingta       |       103|          0.76|     3.99|   0.48|     2.0|     3.50|     4.0|     4.50|     5|▁▁▃▇▆ |
|ratingta_count |       103|          0.76|   556.52| 586.87|     2.0|   129.00|   335.0|   811.00|  3171|▇▂▁▁▁ |
|scarce_room    |         0|          1.00|     0.60|   0.49|     0.0|     0.00|     1.0|     1.00|     1|▆▁▁▁▇ |
|hotel_id       |         0|          1.00| 22153.50| 146.86| 21894.0| 22027.75| 22155.5| 22279.25| 22409|▇▇▇▇▇ |
|offer          |         0|          1.00|     0.68|   0.47|     0.0|     0.00|     1.0|     1.00|     1|▃▁▁▁▇ |
|year           |         0|          1.00|  2017.00|   0.00|  2017.0|  2017.00|  2017.0|  2017.00|  2017|▁▁▇▁▁ |
|month          |         0|          1.00|    11.00|   0.00|    11.0|    11.00|    11.0|    11.00|    11|▁▁▇▁▁ |
|weekend        |         0|          1.00|     0.00|   0.00|     0.0|     0.00|     0.0|     0.00|     0|▁▁▇▁▁ |
|holiday        |         0|          1.00|     0.00|   0.00|     0.0|     0.00|     0.0|     0.00|     0|▁▁▇▁▁ |
|distance       |         0|          1.00|     1.66|   1.60|     0.0|     0.70|     1.3|     2.00|    13|▇▁▁▁▁ |
|distance_alter |         0|          1.00|     3.72|   1.63|     0.6|     2.70|     3.4|     4.40|    13|▆▇▁▁▁ |
|nnights        |         0|          1.00|     1.00|   0.00|     1.0|     1.00|     1.0|     1.00|     1|▁▁▇▁▁ |
|rating         |        35|          0.92|     3.97|   0.58|     1.0|     3.70|     4.0|     4.40|     5|▁▁▁▇▆ |
  ]
]

---

# Better summaries with skimr

```r
vienna |> 
  mutate(stars = factor(stars)) |>
  skim()
```
.scroll-output-75[
  .small[

Table: Data summary

|                         |                             |
|:------------------------|:----------------------------|
|Name                     |mutate(vienna, stars = fa... |
|Number of rows           |428                          |
|Number of columns        |24                           |
|_______________________  |                             |
|Column type frequency:   |                             |
|character                |8                            |
|factor                   |1                            |
|numeric                  |15                           |
|________________________ |                             |
|Group variables          |None                         |

**Variable type: character**

**Variable type: factor**

|skim_variable | n_missing| complete_rate|ordered | n_unique|top_counts                     |
|:-------------|---------:|-------------:|:-------|--------:|:------------------------------|
|stars         |         0|             1|FALSE   |        8|4: 143, 3: 140, 3.5: 57, 2: 47 |

**Variable type: numeric**

|skim_variable  | n_missing| complete_rate|     mean|     sd|      p0|      p25|     p50|      p75|  p100|hist  |
|:--------------|---------:|-------------:|--------:|------:|-------:|--------:|-------:|--------:|-----:|:-----|
|rating_count   |        35|          0.92|   155.05| 191.22|     1.0|    27.00|    84.0|   203.00|  1541|▇▁▁▁▁ |
|price          |         0|          1.00|   131.37|  91.58|    27.0|    83.00|   109.5|   146.00|  1012|▇▁▁▁▁ |
|ratingta       |       103|          0.76|     3.99|   0.48|     2.0|     3.50|     4.0|     4.50|     5|▁▁▃▇▆ |
|ratingta_count |       103|          0.76|   556.52| 586.87|     2.0|   129.00|   335.0|   811.00|  3171|▇▂▁▁▁ |
|scarce_room    |         0|          1.00|     0.60|   0.49|     0.0|     0.00|     1.0|     1.00|     1|▆▁▁▁▇ |
|hotel_id       |         0|          1.00| 22153.50| 146.86| 21894.0| 22027.75| 22155.5| 22279.25| 22409|▇▇▇▇▇ |
|offer          |         0|          1.00|     0.68|   0.47|     0.0|     0.00|     1.0|     1.00|     1|▃▁▁▁▇ |
|year           |         0|          1.00|  2017.00|   0.00|  2017.0|  2017.00|  2017.0|  2017.00|  2017|▁▁▇▁▁ |
|month          |         0|          1.00|    11.00|   0.00|    11.0|    11.00|    11.0|    11.00|    11|▁▁▇▁▁ |
|weekend        |         0|          1.00|     0.00|   0.00|     0.0|     0.00|     0.0|     0.00|     0|▁▁▇▁▁ |
|holiday        |         0|          1.00|     0.00|   0.00|     0.0|     0.00|     0.0|     0.00|     0|▁▁▇▁▁ |
|distance       |         0|          1.00|     1.66|   1.60|     0.0|     0.70|     1.3|     2.00|    13|▇▁▁▁▁ |
|distance_alter |         0|          1.00|     3.72|   1.63|     0.6|     2.70|     3.4|     4.40|    13|▆▇▁▁▁ |
|nnights        |         0|          1.00|     1.00|   0.00|     1.0|     1.00|     1.0|     1.00|     1|▁▁▇▁▁ |
|rating         |        35|          0.92|     3.97|   0.58|     1.0|     3.70|     4.0|     4.40|     5|▁▁▁▇▆ |
  ]
]

---
class: inverse, middle
name: categories

# Describe your categorical variables

---

# Frequency tables

Base R:

```r
table(vienna$stars, useNA = "ifany")
```

```
## 
##   1   2 2.5   3 3.5   4 4.5   5 
##   1  47   5 140  57 143   8  27
```

<br>

Literally just the basics. Not terribly informative, well-formatted, or tidy.

---

# Frequency tables

Using the tidyverse (we've done this before):

```r
vienna |> count(stars)
```

```
## # A tibble: 8 × 2
##   stars     n
##   <dbl> <int>
## 1   1       1
## 2   2      47
## 3   2.5     5
## 4   3     140
## 5   3.5    57
## 6   4     143
## 7   4.5     8
## 8   5      27
```

<br>

Advantage: output is a tibble; can be piped elsewhere.
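
---

# Frequency tables

Because `count()` returns a tibble, it is easy to add shares and sort. A minimal sketch: `hotels` is a stand-in for `vienna`, rebuilt from the star counts shown earlier so the code runs on its own, and `stars_freq` and `share` are names chosen here for illustration.

```r
library(dplyr)

# Stand-in for `vienna`, rebuilt from the star counts shown earlier
hotels <- tibble(
  stars = rep(c(1, 2, 2.5, 3, 3.5, 4, 4.5, 5),
              times = c(1, 47, 5, 140, 57, 143, 8, 27))
)

stars_freq <- hotels |>
  count(stars) |>                # one row per category, with its count n
  mutate(share = n / sum(n)) |>  # proportion of all hotels
  arrange(desc(n))               # most common categories first

stars_freq
```

With the real data, replace `hotels` with `vienna`.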

---

# Frequency tables

Using `summarytools::freq`:

```r
install.packages("summarytools")
```

```r
library(summarytools)
freq(vienna$stars)
```

```
## Frequencies  
## vienna$stars  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1      1      0.23           0.23      0.23           0.23
##           2     47     10.98          11.21     10.98          11.21
##         2.5      5      1.17          12.38      1.17          12.38
##           3    140     32.71          45.09     32.71          45.09
##         3.5     57     13.32          58.41     13.32          58.41
##           4    143     33.41          91.82     33.41          91.82
##         4.5      8      1.87          93.69      1.87          93.69
##           5     27      6.31         100.00      6.31         100.00
##        <NA>      0                               0.00         100.00
##       Total    428    100.00         100.00    100.00         100.00
```

---

# Frequency tables

Using `summarytools::freq`:

```r
install.packages("summarytools")
```

```r
library(summarytools)
freq(vienna$stars, order = "freq")
```

```
## Frequencies  
## vienna$stars  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           4    143     33.41          33.41     33.41          33.41
##           3    140     32.71          66.12     32.71          66.12
##         3.5     57     13.32          79.44     13.32          79.44
##           2     47     10.98          90.42     10.98          90.42
##           5     27      6.31          96.73      6.31          96.73
##         4.5      8      1.87          98.60      1.87          98.60
##         2.5      5      1.17          99.77      1.17          99.77
##           1      1      0.23         100.00      0.23         100.00
##        <NA>      0                               0.00         100.00
##       Total    428    100.00         100.00    100.00         100.00
```

---

# Crosstabs (two-way frequency tables)

Using `summarytools::ctable`:

```r
ctable(vienna$city_actual, as_factor(vienna$scarce_room))
```

```
## Cross-Tabulation, Row Proportions  
## city_actual * as_factor(vienna$scarce_room)  
## 
## ------------- ------------------------------- -------------- ------------- --------------
##                 as_factor(vienna$scarce_room)              0             1          Total
##   city_actual                                                                            
##    Fischamend                                     1 (100.0%)     0 ( 0.0%)     1 (100.0%)
##     Schwechat                                     5 ( 71.4%)     2 (28.6%)     7 (100.0%)
##        Vienna                                   164 ( 39.2%)   254 (60.8%)   418 (100.0%)
##    Voesendorf                                     2 (100.0%)     0 ( 0.0%)     2 (100.0%)
##         Total                                   172 ( 40.2%)   256 (59.8%)   428 (100.0%)
## ------------- ------------------------------- -------------- ------------- --------------
```

By default, percentages are row proportions. Set `prop = "c"` or `prop = "t"` to get column or table proportions instead.

Could we make a similar table with `dplyr` and `tidyr`?

---

# Crosstabs (two-way frequency tables)

Yes, but it's fairly complicated and not nicely formatted...

```
## # A tibble: 4 × 6
## # Rowwise:  city_actual
##   city_actual   n_0   n_1 percent_0 percent_1 row_sum
##   <chr>       <int> <int>     <dbl>     <dbl>   <dbl>
## 1 Fischamend      1     0     1         0           1
## 2 Schwechat       5     2     0.714     0.286       7
## 3 Vienna        164   254     0.392     0.608     418
## 4 Voesendorf      2     0     1         0           2
```
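
---

# Crosstabs (two-way frequency tables)

One possible `dplyr`/`tidyr` version is sketched below. It is not necessarily the code behind the table above. `hotels` is a stand-in for `vienna`, rebuilt from the cell counts in the `ctable` output so the sketch runs on its own, and `crosstab` is a name chosen here.

```r
library(dplyr)
library(tidyr)

# Stand-in for `vienna`: one row per hotel, rebuilt from the ctable cell counts
hotels <- tribble(
  ~city_actual, ~scarce_room, ~n,
  "Fischamend", 0,   1,
  "Schwechat",  0,   5,
  "Schwechat",  1,   2,
  "Vienna",     0, 164,
  "Vienna",     1, 254,
  "Voesendorf", 0,   2
) |>
  uncount(n)

crosstab <- hotels |>
  count(city_actual, scarce_room) |>          # long table of cell counts
  pivot_wider(names_from = scarce_room,       # one column per scarce_room value
              values_from = n,
              names_prefix = "n_",
              values_fill = 0) |>             # empty cells become 0, not NA
  mutate(row_sum = n_0 + n_1,                 # sum the count columns only
         percent_0 = n_0 / row_sum,
         percent_1 = n_1 / row_sum)

crosstab
```

Summing only the count columns keeps the row totals equal to the number of hotels in each city.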

---

# Bar plots

We're going to use `ggplot2` to make graphs. Don't worry too much about the syntax yet; we'll talk about it more in the unit on visualization (coming soon!).

```r
ggplot(vienna, aes(y = neighbourhood)) + 
  geom_bar()
```

---
class: inverse, middle
name: distributions

# Describe your continuous distributions

---

# Histograms

With default settings:

```r
ggplot(vienna, aes(price)) + 
  geom_histogram()
```

---

# Histograms

Make bins line up with nice round numbers:

```r
ggplot(vienna, aes(price)) + 
  geom_histogram(boundary = 0, binwidth = 25)
```

---

# Histograms

Too much detail? Use a larger bin width:

```r
ggplot(vienna, aes(price)) + 
  geom_histogram(boundary = 0, binwidth = 100)
```

---

# Histograms

Want more detail? Use a smaller bin width:

```r
ggplot(vienna, aes(price)) + 
  geom_histogram(boundary = 0, binwidth = 1)
```

---

# Kernel density plots

A smoothed version of a histogram.

```r
ggplot(vienna, aes(rating)) + 
  geom_density()
```

---

# Kernel density plots

The **bandwidth** controls the degree of smoothing. Smaller:

```r
ggplot(vienna, aes(rating)) + 
  geom_density(adjust = .25)
```

---

# Kernel density plots

The **bandwidth** controls the degree of smoothing. Larger:

```r
ggplot(vienna, aes(rating)) + 
  geom_density(adjust = 2)
```

---

# Kernel density plots

Show multiple groups:

```r
vienna |>
  mutate(stars_rounded = factor(round(stars))) |>
  ggplot(aes(rating, color = stars_rounded)) + 
    geom_density(adjust = 2)
```

---

# Kernel density plots

Show multiple groups:

```r
vienna |>
  mutate(stars_rounded = factor(round(stars))) |>
  ggplot(aes(rating, fill = stars_rounded)) + 
    geom_density(adjust = 2, alpha = 0.4)
```

---

# Kernel density plots

When might you prefer a density plot vs. a histogram?

Histogram:
- Want to see your raw data as literally as possible
- Want to count extreme values
- Care about thresholds

Density plot:
- Want a more general idea of the distribution
- Want to compare tendencies of distributions

---

# Kernel density plots

How does the smoothing work?

<img src="img/kernel-density-animation.gif" width="60%" style="display: block; margin: auto;" />
.small[Image by [David Robinson](http://varianceexplained.org/files/bandwidth.html) not included under the CC license.]
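
---

# Kernel density plots

The smoothing can be written out by hand. A minimal sketch, assuming a Gaussian kernel: the estimate at each point is the average of normal densities centered on the observations. The data `x` and the helper `kde_manual()` are hypothetical, made up for illustration.

```r
# Hypothetical toy data standing in for a variable like `rating`
x <- c(3.2, 3.9, 4.0, 4.1, 4.4, 4.8)

# Gaussian kernel density estimate at each point of `grid`, with bandwidth `h`:
# the average of normal densities centered on each observation
kde_manual <- function(grid, x, h) {
  sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))
}

step <- 0.01
grid <- seq(1, 7, by = step)
h <- bw.nrd0(x)  # default bandwidth rule of stats::density()

f_hat   <- kde_manual(grid, x, h)         # default amount of smoothing
f_rough <- kde_manual(grid, x, h * 0.25)  # like adjust = .25: taller, wigglier

sum(f_hat) * step  # the curve integrates to roughly 1
```

Multiplying `h` by a constant plays the same role as the `adjust` argument of `geom_density()`.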

---

# Bias-variance tradeoff

Bandwidth choice illustrates a **bias-variance tradeoff**.

Smaller bandwidth:
- Less bias (more literal representation of your raw data).
- But higher variance (how meaningful are those wiggles?).

---

# Bias-variance tradeoff

Bandwidth choice illustrates a **bias-variance tradeoff**.

Larger bandwidth:
- Lower variance (smoother lines).
- But more bias (less directly showing your raw data).

---

# Bias-variance tradeoff

Bias is not always bad! Often we want to accept some bias (inaccuracy) in exchange for less variance (more precision).

<img src="img/bias-variance.png" width="60%" style="display: block; margin: auto;" />
.small[Image by [Scott Fortmann-Roe](http://scott.fortmann-roe.com/docs/BiasVariance.html) not included under the CC license.]

---
class: inverse, middle
name: extreme

# Handle your extreme values

---

# Extreme values

There are a couple of really high prices in the `price` variable.

Our histogram would look a lot nicer if we could get rid of them...

```r
ggplot(vienna, aes(price)) + 
  geom_histogram(boundary = 0, binwidth = 25)
```

```r
ggplot(vienna, aes(distance)) + 
  geom_histogram()
```

---

# Extreme values

**Should we get rid of them? How do we decide?**

> 🗣I don’t know who needs to hear this but we ✨don’t✨ get rid of outliers *because* they’re extreme...<br><br>we get rid of them when their extreme-ness indicates they’re not a part of the data generating process we want to study (like a typo that says your newborn is 1000 lbs)<br><br>Chelsea Parlett-Pelleriti (@ChelseaParlett), [February 1, 2021](https://twitter.com/ChelseaParlett/status/1356285012375556109)

---

# Extreme values

Values that are much larger or smaller than the rest of your distribution, or that fail logical checks, should be investigated until you are able to classify them into one of these categories:
1. They are **erroneous** (and can be corrected, or else excluded).
2. They are part of a **different** data generating process (and should be excluded).
3. They are **correct** and produced by the same process as less extreme values (and should be retained).

Making sound judgments about extreme values requires **domain knowledge**.
- If you don't know yourself, it's time to ask someone else! Go back to the documentation of your raw data, or contact the person who collected or gave you the data.
- This can be one of the most labor-intensive aspects of data analysis, but it is critical to ensure good data quality and accurate conclusions.
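
---

# Extreme values

A first step in investigating is simply to look at the rows in question. A minimal sketch: `hotels` is a hypothetical stand-in for a few rows of `vienna` (the `distance` and `stars` values are invented for illustration), and the logical checks are examples, not an exhaustive list.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of `vienna` (values for illustration)
hotels <- tibble(
  hotel_id = 1:6,
  price    = c(81, 85, 109, 146, 229, 1012),
  distance = c(1.7, 0.5, 1.3, 2.0, 0.1, 13.0),
  stars    = c(3, 4, 3.5, 4, 5, 2)
)

# The largest values, shown with the other columns as context for judging them
top_prices <- hotels |> slice_max(price, n = 3)
top_prices

# Logical checks: rows with values that should be impossible
failed_checks <- hotels |>
  filter(price <= 0 | distance < 0 | stars < 1 | stars > 5)
failed_checks
```

Whether a row like the most expensive one is an error, a different kind of listing, or a genuine price is a question for the documentation or the data provider, not for the code.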

---

# Extreme values

Even when extreme values are correct and truly belong to your distribution, they can exert inordinate influence on your analysis. Your results may reflect the extreme values more than the **central tendency** of your data.
- But this alone is ***not a reason*** to remove extreme values.
- Instead, it's a sign that you should consider applying a **transformation** to your variable.

---
class: inverse, middle
name: transform

# Choose whether to transform variables

---

# Transformations

Take a look at the `rating_count` variable.

```r
ggplot(vienna, aes(rating_count)) +
  geom_histogram()
```

---

# Transformations

We can get a different view of this distribution by applying a (natural) logarithmic transformation:

```r
vienna = mutate(vienna, ln_rating_count = log(rating_count))
ggplot(vienna, aes(ln_rating_count)) + 
  geom_histogram()
```

---

# Transformations

Now we can see finer differences among values in the left (bottom) of the distribution... at the expense of compressing values in the right (top) of the distribution.

Many common and important variables tend to follow approximately lognormal distributions (e.g., income, landholdings, trade quantities).
- I.e., they are right-skewed before taking the log, but approximately normally distributed afterward.

**Which should we prefer,** the raw variable or the log-transformed variable?

For right-skewed data, log-transformed variables...

1. Are **less sensitive to extremely large values.**
2. Better reflect the **central tendency** (the mean is closer to the median and mode).

These are nice properties, but not always the best reason to transform (or not). Two other considerations are even more fundamental:
1. What type of **variation** you care about most.
2. What type of **data generating process** you think created the variation.
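
---

# Transformations

The effect on central tendency can be checked on simulated data. A minimal sketch, assuming a lognormal variable. The parameters and the seed are arbitrary choices for illustration.

```r
set.seed(460)  # arbitrary seed, for reproducibility

# Simulated right-skewed variable: exp() of a normal draw is lognormal
x <- rlnorm(10000, meanlog = 4, sdlog = 1)

raw_stats <- c(mean = mean(x), median = median(x))
log_stats <- c(mean = mean(log(x)), median = median(log(x)))

raw_stats  # mean sits well above the median, pulled up by the right tail
log_stats  # mean and median nearly coincide
```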

---

# Transformations

**1. What type of variation do you care about more: level changes or proportional changes?**

**Levels:** Each tick on the raw histogram **increments** # of ratings by the same **number.**
- 10 `$\rightarrow$` 20 matters as much as 110 `$\rightarrow$` 120. 100 `$\rightarrow$` 200 matters much more.

**Logs:** Each tick on the log-transformed histogram **multiplies** # of ratings by the same **factor.**
- 10 `$\rightarrow$` 20 matters as much as 100 `$\rightarrow$` 200. 110 `$\rightarrow$` 120 matters much less.
- Why? If `$\log(x_1) - \log(x_0) = c$`, then `$x_1 = k x_0$` (where `$k \equiv e^c$`).
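
---

# Transformations

The same comparisons in code, as a quick arithmetic check (the object names are chosen here for illustration):

```r
# Level differences: 10 -> 20 and 110 -> 120 are the same size
level_diffs <- c(20 - 10, 120 - 110, 200 - 100)
level_diffs

# Log differences: 10 -> 20 and 100 -> 200 are the same size
log_diffs <- c(log(20) - log(10), log(120) - log(110), log(200) - log(100))
log_diffs

# Undoing the log: a difference of c corresponds to multiplying by k = exp(c)
exp(log_diffs)  # approximately 2, 1.09, 2
```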

---

# Transformations

**1. What type of variation do you care about more: level changes or proportional changes?**

Often, proportional changes seem more policy-relevant than level changes (everyone's income increases by 2% vs. by $1000).

We are going to use this variation to learn about how this variable relates to other variables -- so we want to match the variation we're working with to the variation we care about.

<br>

---

# Transformations

**2. Is the variation created by an additive or multiplicative process?**

We are going to use this **variation** to learn about how this variable relates to other variables -- so we want to match the actual data generating process as closely as possible.
- Estimates are likely to be more precise.
- Their interpretation is likely to be more meaningful.

Often, proportional effects seem more likely than level effects (a policy increases everyone's income by 2% vs. by $1000).

Example of an **additive** process:
`$$Rating = 4 + 0.2 \cdot Clean + 0.3 \cdot ListingAccurate$$`

Example of a **multiplicative** process:
`$$CountRatings = CountStays \cdot \textrm{30% give ratings}$$`
`$$\log(CountRatings) = \log(CountStays) + \log(0.3)$$`

---

# Transformations

**2. Is the variation created by an additive or multiplicative process?**

Income is likely a chiefly **multiplicative** process:

`$$Income = I_0 \cdot (1 + 5\%)^{Experience}$$`
`$$Income = I_0 \cdot (1.05)^{Experience}$$`
`$$Income = I_0 \cdot (1.05)^{Experience} \cdot (1.2)^{GradDegree}$$`

Taking logs lets us represent it linearly:

`$$\log(Income) = \log(I_0) + \log(1.05^{Experience}) + \log(1.2^{GradDegree})$$`
`$$\log(Income) = \log(I_0) + \log(1.05)\cdot Experience + \log(1.2) \cdot GradDegree$$`
`$$\log(Income) \approx \log(I_0) + 0.05\cdot Experience + 0.2 \cdot GradDegree$$`
(using the approximation `$\log(1 + x) \approx x$` for small `$x$`).
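
---

# Transformations

How good is the approximation? A quick check over a few growth rates (the rates and the name `approx_check` are chosen here for illustration):

```r
growth <- c(0.01, 0.05, 0.2, 0.5)

approx_check <- data.frame(
  growth    = growth,                   # x
  log_exact = log(1 + growth),          # log(1 + x)
  gap       = growth - log(1 + growth)  # error from using x in place of log(1 + x)
)
approx_check
```

The gap is small at 5% (about 0.049 versus 0.05) but noticeable at 20% (about 0.182 versus 0.2), so for larger effects it is safer to convert a log-point coefficient `b` with `exp(b) - 1`.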

---

# Transformations

**Which type of logarithm should you use?**

- For visualizing data, the base-10 logarithm is easiest to interpret.

- For regression estimation, the natural log has the nice property that coefficients can be directly interpreted as approximate percentage changes.
   * Allows us to directly estimate elasticities, which we like because they're unitless and convenient in many theoretical models.

- The choice of base does not change your results, only their interpretation (logarithms in different bases differ only by a constant factor).
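
---

# Transformations

A small simulation illustrates that the base of the logarithm changes the scale of a coefficient but not the fit. A minimal sketch on hypothetical simulated data. The coefficients, sample size, and seed are arbitrary.

```r
set.seed(460)  # arbitrary seed, for reproducibility

# Simulated data (hypothetical): y depends linearly on log(x)
x <- rlnorm(200, meanlog = 4, sdlog = 1)
y <- 2 + 0.5 * log(x) + rnorm(200, sd = 0.3)

fit_ln    <- lm(y ~ log(x))    # natural log
fit_log10 <- lm(y ~ log10(x))  # base-10 log

# Same fitted values: the two models describe the data identically
same_fit <- all.equal(fitted(fit_ln), fitted(fit_log10))
same_fit

# Slopes differ only by the constant factor log(10)
slope_ln    <- unname(coef(fit_ln)[2])
slope_log10 <- unname(coef(fit_log10)[2])
same_slope <- all.equal(slope_log10, slope_ln * log(10))
same_slope
```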

---

# Transformations

One other common transformation: **z-score normalization.**

`$$z_i = \frac{x_i-\mu}{\sigma}$$`

where `$\mu$` is the variable's mean and `$\sigma$` is its standard deviation.

- Centers the variable at 0.
- One unit of a z-score = one standard deviation.

In causal inference, often used for variables that have no intrinsic quantitative meaning.
- E.g., student test scores in education.

In prediction, often required for inputs to many machine learning algorithms.
- Puts all variables on similar scales, which helps with numerical stability and keeps large-scale variables from dominating penalties or distance measures.
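
---

# Transformations

Computing z-scores takes one line. A minimal sketch on hypothetical test scores (values invented for illustration), using the sample mean and standard deviation in place of `$\mu$` and `$\sigma$`:

```r
# Hypothetical test scores (values invented for illustration)
score <- c(52, 61, 70, 74, 88, 95)

z <- (score - mean(score)) / sd(score)

mean(z)  # zero, up to floating-point error
sd(z)    # one

# scale() does the same centering and scaling (it returns a one-column matrix)
matches_scale <- all.equal(z, as.numeric(scale(score)))
matches_scale
```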

---

# Transformations

**What about other transformations?**

Rarer in economics, because they are harder to interpret.

- Box-Cox, square root, exponential.
- Normalization to a bounded interval such as [0, 1]: min-max, sigmoidal, hyperbolic tangent.

Avoid unless you have a strong reason for using a specific one.

---

# Summary of Part 1

### Understand each of your variables
* Get an initial **summary** of your dataset with `skimr::skim`.
* Explore categorical variables with **frequency tables** and **crosstabs.**
* For numerical variables, look at the **histograms** before you do anything else.
* Choosing the bandwidth of a **kernel density** plot involves a bias-variance tradeoff.

### Data handling decisions
* **Extreme values** call for scrutiny but should be excluded only if they are uncorrectable errors or were not created by the data generating process you want to study.
* Logarithmic **transformations** are widely useful for describing right-skewed variables.