class: center, middle, inverse, title-slide .title[ # ScPoEconometrics ] .subtitle[ ## Linear Regression Extensions ] .author[ ### Florian Oswald, Mylène Feuillade, Gustave Kenedi and Junnan He ] .date[ ### SciencesPo Paris 2022-10-24 ] --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Today - Linear Regression Extensions Depending on the data and the relationships between the variables of interest, you may need to move away from the baseline model. -- We will focus on 3 important variations: 1. ***Non-linear relationships***: log and polynomial models 1. ***Interactions*** between variables 1. ***Standardized*** regression -- In each case, the way we estimate these coefficients does not change (i.e. OLS). -- Empirical applications: (i) *college tuition* and *earnings potential*, (ii) *wage*, *education* and *gender*, (iii) *class size* and *student performance* --- layout: false class: title-slide-section-red, middle # Non-Linear Relationships --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Accounting for Non-Linear Relationships There are two main "methods": -- 1. ***Log*** models -- 1. ***Polynomial*** models --- # Log Models * The models we have seen so far can be called ***level-level*** specifications. Both the dependent and the independent variables have been measured in levels. -- * This *level* can be: euros, years, number of students,... and even percentages. -- * Taking the *natural* logarithm of the dependent and/or the independent variable(s) leads us to define 3 other types of regressions (abuse of notation: `\(\ln(x) = \log_{e}(x)=\log(x)\)`): * ***Log - level***: `\(\quad \log(y_i) = b_0 + b_1 x_{1,i} + ... + e_i\)` * ***Level - log***: `\(\quad \textrm{y}_i = b_0 + b_1 \log(x_{1,i}) + ... + e_i\)` * ***Log - log***: `\(\quad \log(y_i) = b_0 + b_1 \log(x_{1,i}) + ... 
+ e_i\)` --- # The (natural) log Function: A Primer <img src="chapter_regext_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" /> --- # The (natural) log Function: A Primer The [natural log function](https://en.wikipedia.org/wiki/Natural_logarithm) is the inverse function of the exponential function, i.e. `\(\log(\exp(x))=x\)` -- `\(\rightarrow\)` since for all `\(x\)`, `\(\exp(x)>0 \implies\)` the natural log function is only defined for ***strictly positive values***! (It is not defined at 0!) -- ⚠️ You can only log your variables if they don't take 0 or negative values! Always think about this when taking the log of your dependent or independent variable(s). --- # The (natural) log Function: A Primer If a variable has a very ***skewed distribution***, taking the log will render it more ***normally distributed*** -- .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] --- # Log Models: Simplified Interpretations | Specification | Model | Interpretation of `\(b_1\)` | |--------------------|:---------:|:-----------------------------------:| | Level - Level | `\(y = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1\)` .small[**unit change** in y] | | Log - Level | `\(\log(y) = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(b_1 \times 100\)` .small[ **percent change** in y] | | Level - Log | `\(y = b_0 + b_1 \log(x) + e\)` | .small[A **one percent** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1 / 100\)` .small[**unit change** in y] | | Log - Log | `\(\log(y) = b_0 + b_1 \log(x) + e\)` | .small[A **one percent** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(b_1\)` 
.small[**percent change** in y] | -- * This may look like a set of cooking recipes, but of course it can be [derived with some relatively simple maths](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/). -- * ⚠️ These interpretations are only true for ***small*** changes in `\(x\)` and/or small `\(b_1\)`. What happens if we want to know the change in `\(y\)` for big changes in `\(x\)` or when `\(b_1\)` is large? --- name: gen_log # Log Models: General Interpretations For ***any increase in `\(x\)`, `\(\Delta x,\)` and any `\(b_1\)`*** `\((\Delta x = 5\% = 0.05 \implies 1 + \Delta x = 1.05)\)`: | Specification | Model | Interpretation of `\(b_1\)` | |--------------------|:---------:|:-----------------------------------:| | Level - Level | `\(y = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1\)` .small[**unit change** in y] | | Log - Level | `\(\log(y) = b_0 + b_1 x + e\)` | .small[A **one unit** increase in ] `\(x\)` .small[ is associated, on average, with a] `\((e^{b_1} - 1) \times 100\)` .small[ **percent change** in y] | | Level - Log | `\(y = b_0 + b_1 \log(x) + e\)` | .small[A ] ** `\(\Delta x\)`** .small[**percent** increase in ] `\(x\)` .small[ is associated, on average, with a ] `\(b_1 \times \log(1 + \Delta x)\)` .small[**unit change** in y] | | Log - Log | `\(\log(y) = b_0 + b_1 \log(x) + e\)` | .small[A ] ** `\(\Delta x\)`** .small[**percent** increase in ] `\(x\)` .small[ is associated, on average, with a] `\(((1 + \Delta x)^{b_1} - 1) \times 100\)` .small[**percent change** in y] | ([*Appendix:*](#log_approx) Why are the previously shown approximations true?) --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. 
-- .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> ] --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> ] --- # When Should You Use log Models? 1. If the relationship between `\(x\)` and `\(y\)` looks like a log or exponential function. 1. To easily interpret coefficients as <a href="https://en.wikipedia.org/wiki/Elasticity_(economics)">***elasticities***</a> which play a central role in economic theory. ***Elasticity of `\(y\)` with respect to `\(x\)`:*** percent change in `\(y\)` following a 1% increase in `\(x\)`. --- # Accounting for Other Types of Non-Linear Relationships What if the relationship between `\(x\)` and `\(y\)` is not exponential/log? -- `\(\rightarrow\)` ***polynomial*** regressions: just take a polynomial function of the regressor! --- # Polynomial Wut? .pull-left[ <img src="chapter_regext_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> ] --- # Polynomial Regressions What does this mean in practice? 
-- `\(\rightarrow\)` add a higher order of the regressor to the regression, depending on the visual (or expected) relationship -- Several ways of doing this in `R`: .pull-left[ ```r lm(y ~ x + I(x^2) + I(x^3), data) ``` ] -- .pull-right[ ```r lm(y ~ poly(x, 3, raw = TRUE), data) ``` ] --- # Polynomial Regressions .pull-left[ ***2nd order:*** <img src="chapter_regext_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> ] -- .pull-right[ ***3rd order:*** <img src="chapter_regext_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> ] --- <img src="chapter_regext_files/figure-html/curve_fitting.png" width="378px" height="600px" style="display: block; margin: auto;" /> --- class:inverse # Task 1: Non-linear relationships
1. Load the data [here](https://www.dropbox.com/s/2v5mb04nzw2u7bd/college_tuition_income.csv?dl=1). This dataset contains information about tuition and estimated incomes of graduates for universities in the US. More details can be found [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-10/readme.md). 2. Create a scatter plot of estimated mid career pay (`mid_career_pay`) (*y*-axis) as a function of out of state tuition (`out_of_state_tuition`) (*x*-axis). Would you say the relationship is broadly linear or rather non-linear? Use `geom_smooth(method = "lm", se = F) + geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 2, raw = T))` to fit both a linear and a 2nd order regression line. Which one seems more appropriate this time? 3. Create a variable equal to out of state tuition divided by 1000. Regress mid career pay on out of state tuition divided by 1000. Interpret the coefficient. 4. Regress mid career pay on out of state tuition divided by 1000 and its square. *Hint:* you can use either `poly(x, 2, raw = T)` or `x + I(x^2)`, where `x` is your regressor. What does the positive sign on the squared term imply? --- layout: false class: title-slide-section-red, middle # Interaction Terms --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Interacting Regressors * We interact two regressors when we believe ***the effect of one depends on the value of the other***. * *Example:* The returns to education on wages vary by gender. -- * In practice, if we interact `\(x_1\)` and `\(x_2\)`, we would write our model like this: `$$y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + b_3 x_{1,i} \times x_{2,i} + ... + e_i$$` -- * The interpretation of `\(b_1\)`, `\(b_2\)`, and `\(b_3\)` will depend on the type of `\(x_1\)` and `\(x_2\)`. -- * Let's focus on the cases where one regressor is a ***dummy/categorical*** variable and the other is ***continuous***. 
* It will give you the intuition for the other cases: * Both regressors are dummies/categorical variables, * Both regressors are continuous variables. --- # Interacting Regressors Let's go back to the *STAR* experiment data. -- How does the effect of being in a small vs regular class vary with the experience of the teacher? -- Our regression model becomes: $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i \times \textrm{experience}_i + e_i$$ -- Effect of a small class with a teacher with 10 years of experience? -- `\(\mathbb{E}[\textrm{score}_i | \textrm{small}_i = 1 \textrm{ & experience}_i = 10] = \color{#d96502}{b_0} + \color{#027D83}{b_1} + \color{#02AB0D}{b_2}*10 + \color{#d90502}{b_3}*10\)` -- `\(\mathbb{E}[\textrm{score}_i | \textrm{small}_i = 0 \textrm{ & experience}_i = 10] = \color{#d96502}{b_0} + \color{#02AB0D}{b_2}*10\)` -- `\(\begin{split} \mathbb{E}[\textrm{score}_i &| \textrm{small}_i = 1 \textrm{ & experience}_i = 10] - \mathbb{E}[\textrm{score}_i | \textrm{small}_i = 0 \textrm{ & experience}_i = 10] \\ &= \color{#d96502}{b_0} + \color{#027D83}{b_1} + \color{#02AB0D}{b_2}*10 + \color{#d90502}{b_3}*10 - (\color{#d96502}{b_0} + \color{#02AB0D}{b_2}*10) \\ &= \color{#027D83}{b_1} + \color{#d90502}{b_3}*10 \end{split}\)` --- # Interacting Regressors Running the regression for the `math` score (for all grades), we obtain: ```r lm(math ~ small + experience + small*experience, star_df) ``` ``` ## ## Call: ## lm(formula = math ~ small + experience + small * experience, ## data = star_df) ## ## Coefficients: ## (Intercept) smallTRUE experience ## 534.1919 15.8906 1.3305 ## smallTRUE:experience ## -0.3034 ``` ***Interpretation:*** -- * The interaction term allows the impact of being in a small class to vary with the experience of the teacher. 
-- * In particular, we still observe a ***positive impact of being in a small class*** on math score, * but this ***effect is decreasing in the experience of the teacher***. --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_base.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1_b2.png" width="90%" style="display: block; margin: auto;" /> --- # Interacting Regressors: Visually $$ 
\textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i * \textrm{experience}_i + e_i$$ <img src="chapter_regext_files/figure-html/graph_reg_b0_b1_b2_b3.png" width="90%" style="display: block; margin: auto;" /> --- class:inverse # Task 2: Wages, education and gender in 1985
1. Load the data `CPS1985` from the `AER` package. 1. Look at the `help` to get the definition of each variable: `?CPS1985` 1. We don't know if people are working part-time or full-time; does it matter here? 1. Create the `log_wage` variable equal to the log of `wage`. 1. Regress `log_wage` on `gender` and `education`, and save it as `reg1`. Interpret each coefficient. 1. Regress `log_wage` on `gender`, `education` and their interaction `gender*education`, and save it as `reg2`. Interpret each coefficient. Does the gender wage gap decrease with education? 1. Create a plot showing this interaction. (*Hint:* use the `color = gender` argument in `aes` and `geom_smooth(method = "lm", se = F)` to obtain a regression line per gender.) --- layout: false class: title-slide-section-red, middle # Standardized Regression --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Standardized Regression Let's define what *standardizing* a variable means. > ***Standardizing*** a variable `\(z\)` means to *demean* the variable and to divide the demeaned value by its own standard deviation: $$ z_i^{stand} = \frac{z_i - \bar z}{\sigma(z)}$$ where `\(\bar z\)` is the mean of `\(z\)` and `\(\sigma(z)\)` is the standard deviation of `\(z\)`, i.e. `\(\sigma(z) = \sqrt{\textrm{Var}(z)}\)`. -- `\(z^{stand}\)` now has mean 0 and standard deviation 1, i.e. `\(\overline{z^{stand}} = 0\)` and `\(\sigma(z^{stand}) = 1\)`. -- Intuitively, standardizing ***puts variables on the same scale*** so we can compare them. In our class size and student performance example, it will help to interpret: * The **magnitude** of the effects, * The **relative importance of each variable**. 
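To make the definition concrete, here is a minimal `R` sketch (on simulated data, not one of our datasets) showing that the by-hand formula and `R`'s built-in `scale()` function agree:

```r
# Standardize a simulated variable by hand and with scale()
set.seed(1)
z <- rnorm(100, mean = 50, sd = 10)

z_stand_hand  <- (z - mean(z)) / sd(z)  # (z_i - mean of z) / sd of z
z_stand_scale <- as.numeric(scale(z))   # same operation via scale()

all.equal(z_stand_hand, z_stand_scale)  # TRUE
round(mean(z_stand_hand), 10)           # 0: standardized mean is zero
sd(z_stand_hand)                        # 1: standardized sd is one
```

Note that `scale()` returns a matrix, hence the `as.numeric()` wrapper when you want a plain vector (e.g. inside `mutate`).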
--- # Standardized Regression: Graphically .pull-left[ <img src="chapter_regext_files/figure-html/graph_before.png" width="3200" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="chapter_regext_files/figure-html/graph_after.png" width="3200" style="display: block; margin: auto;" /> ] --- # Standardized Regression: Interpretation If the ***dependent*** variable `\(y\)` is standardized, i.e. the model is `\(\color{#d90502}{y^{stand}} = b_0 + \sum_{k=1}^Kb_kx_k +e\)`: -- * By definition, `\(b_k\)` measures the predicted change in ** `\(y^{stand}\)` ** associated with a one unit increase in `\(x_k\)`. * If `\(y^{stand}\)` increases by one, it means that `\(y\)` increases by one standard deviation. So `\(b_k\)` measures the change in `\(y\)` **as a share of `\(y\)`'s standard deviation**. -- If the ***regressor*** `\(x_k\)` is standardized, i.e. the model is `\(y = b_0 + \sum_{k=1}^Kb_k\color{#d90502}{x_k^{stand}} +e\)`: -- * By definition, `\(b_k\)` measures the predicted change in `\(y\)` associated with a one unit increase in ** `\(x_k^{stand}\)` **. * If `\(x_k^{stand}\)` increases by one unit, it means that `\(x_k\)` increases by one standard deviation. So `\(b_k\)` measures the predicted change in `\(y\)` **associated with an increase in `\(x_k\)` by one standard deviation**. --- class:inverse # Task 3: Standardized regression
Let's go back to our [grades](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) dataset. Remember that to load the data you need to use the `read_dta()` function from the `haven` package. These are the estimates we got from regressing average math test scores on the full set of regressors. ``` ## (Intercept) classize disadvantaged school_enrollment ## 78.560725298 0.003320773 -0.389333008 0.000758258 ## female religious ## 0.923710499 2.876146701 ``` 1. Create a new variable `avgmath_stand` equal to the standardized math score. You can use the `scale()` function (combined with `mutate`) or do it by hand with base `R`. 1. Run the full regression using the standardized math test score as the dependent variable. Interpret the coefficients and their magnitude. 1. Create the standardized variables for each *continuous* regressor as `<regressor>_stand`. * Would it make sense to standardize the `religious` variable? 1. Regress `avgmath_stand` on the full set of standardized regressors and `religious`. Discuss the relative influence of the regressors. --- # Teaser for the Next 3 Lectures * You may have noticed that since the beginning we have always worked with **samples** drawn from the overall population. -- * Each time, imagine we could draw another sample from the population: * Would we obtain the same results? * In other words, how confident can we be that our estimates (sign, magnitude) are not just driven by randomness? -- * We will answer those kinds of questions: * We'll present the notion of **sampling**, and * Understand what **statistical inference** is and how to do it. --- # On the way to causality ✅ How to manage data? Read it, tidy it, visualise it! ✅ **How to summarise relationships between variables?** Simple and multiple linear regression, non-linear regressions, interactions... ❌ What is causality? ❌ What if we don't observe an entire population? ❌ Are our findings just due to randomness? ❌ How to find exogeneity in practice? 
--- class: title-slide-final, middle background-image: url(../img/logo/ScPo-econ.png) background-size: 250px background-position: 9% 19% # SEE YOU NEXT WEEK! | | | | :--------------------------------------------------------------------------------------------------------- | :-------------------------------- | | <a href="mailto:florian.oswald@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>] | florian.oswald@sciencespo.fr | | <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides | | <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>] | Book | | <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>] | @ScPoEcon | | <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>] | @ScPoEcon | --- name: log_approx # Log Models: Approximations Why are the approximations shown previously true? -- ***Log-Level*** *General interpretation:* A **one unit** increase in `\(x\)` is associated, on average, with a `\((e^{b_1} - 1) \times 100\)` **percent change** in y. *Simplified interpretation:* A **one unit** increase in `\(x\)` is associated, on average, with a `\(b_1 \times 100\)` **percent change** in y. -- This is because, for small `\(b_1\)`, `\(e^{b_1} \approx 1 + b_1 \iff b_1 \approx e^{b_1} - 1\)` -- `\(\rightarrow\)` for `\(b_1 = \color{#d90502}{0.04}\)`, `\(e^{b_1} - 1 = e^{0.04} - 1 = 0.0408\)` -- `\(\rightarrow\)` for `\(b_1 = \color{#d90502}{0.5}\)`, `\(e^{b_1} - 1 = e^{0.5} - 1 = 0.6487\)` --- # Log Models: Approximations Why are the approximations shown previously true? ***Level-Log*** *General interpretation:* A ** `\(\Delta x\)`** **percent** increase in `\(x\)` is associated, on average, with a `\(b_1 \times \log(1 + \Delta x)\)` **unit change** in y. *Simplified interpretation:* A **one percent** increase in `\(x\)` is associated, on average, with a `\(b_1 / 100\)` **unit change** in y. 
-- This is because for small `\(\Delta x\)`, `\(\log(1 + \Delta x) \approx \Delta x\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{1\%}=0.01\)`, `\(\log(1+\Delta x) = \log(1.01) = 0.01\)` (hence the `\(/100\)` in the simplified interpretation) -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{20\%}=0.20\)`, `\(\log(1+\Delta x) = \log(1.20) = 0.18\)` --- # Log Models: Approximations Why are the approximations shown previously true? ***Log-Log*** *General interpretation:* A ** `\(\Delta x\)`** **percent** increase in `\(x\)` is associated, on average, with a `\(((1 + \Delta x)^{b_1} - 1) \times 100\)` **percent change** in y. *Simplified interpretation:* A **one percent** increase in `\(x\)` is associated, on average, with a `\(b_1\)` **percent change** in y. -- This is because for small `\(|b_1|\times \Delta x\)`, `\((1 + \Delta x)^{b_1} \approx 1 + b_1 \times \Delta x \iff b_1 \times \Delta x \times 100 \approx ((1 + \Delta x)^{b_1} - 1) \times 100\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{1\%}=0.01\)` and `\(b_1 = \color{#d90502}{0.5}\)`, `\(((1+\Delta x)^{b_1} - 1) \times 100 = (1.01^{0.5} - 1) \times 100 = 0.5\)` -- `\(\rightarrow\)` for `\(\Delta x = \color{#d90502}{10\%}=0.10\)` and `\(b_1 = \color{#d90502}{10}\)`, `\(((1+\Delta x)^{b_1} - 1) \times 100 = (1.1^{10} - 1) \times 100 = 159.37\)` [back](#gen_log)
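As a quick sanity check, the numbers on these appendix slides can be reproduced directly in `R`:

```r
# Log-level: exact vs. approximate percent change for a one-unit increase in x
(exp(0.04) - 1) * 100  # 4.08, close to the approximation b1 * 100 = 4
(exp(0.5) - 1) * 100   # 64.87, far from the approximation b1 * 100 = 50

# Level-log: log(1 + delta_x) vs. delta_x
log(1.01)              # 0.00995, close to 0.01
log(1.20)              # 0.18

# Log-log: exact percent change in y for a delta_x percent increase in x
(1.01^0.5 - 1) * 100   # 0.50
(1.1^10 - 1) * 100     # 159.37
```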