ECON 4050: Introduction to Econometrics

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Linear Regression Extensions
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

# Today - Linear Regression Extensions

Depending on the data and the relationships between the variables of interest, you may need to move away from the baseline model.

We will focus on 3 important variations:
  
1. ***Non-linear relationships***

1. ***Interactions*** between variables

1. ***Standardized*** regression

In each case, the way we estimate the coefficients does not change: we still use OLS.

Empirical applications:

* *college tuition* and *earnings potential*,
* *wage*, *education* and *gender*,
* *class size* and *student performance*

---

# Non-Linear Relationships

---

# Accounting for Non-Linear Relationships

There are two main "methods":

1. ***Log*** models

1. ***Polynomial*** models (not the focus of lecture)

---

# Log Models

* The models we have seen so far can be called ***level-level*** specifications. Both the dependent and the independent variables have been measured in levels.

* The *level* can be dollars, years, number of students...even a percentage.
  
--

* Taking the *natural* logarithm of the dependent and/or the independent variable(s) leads us to define 3 other types of regressions (with a common abuse of notation where `$\ln(x) = \log_{e}(x)=\log(x)$`):

* ***Log - level***: `$\quad \log(y_i) = b_0 + b_1 x_{1,i} + ... + e_i$`

* ***Level - log***: `$\quad \textrm{y}_i = b_0 + b_1 \log(x_{1,i}) + ... + e_i$`

* ***Log - log***: `$\quad \log(y_i) = b_0 + b_1 \log(x_{1,i}) + ... + e_i$`
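
In `R`, each specification can be written by wrapping the relevant variable in `log()` directly inside the formula. A minimal sketch on simulated data (the data frame `sim` and the variables `x` and `y` are made up for illustration and are not part of the course data):

``` r
# Simulated, strictly positive data (hypothetical; for illustration only)
set.seed(1)
sim <- data.frame(x = runif(200, min = 1, max = 10))
sim$y <- exp(0.5 + 0.2 * sim$x + rnorm(200, sd = 0.1))

level_level <- lm(y ~ x, data = sim)            # y on x
log_level   <- lm(log(y) ~ x, data = sim)       # log(y) on x
level_log   <- lm(y ~ log(x), data = sim)       # y on log(x)
log_log     <- lm(log(y) ~ log(x), data = sim)  # log(y) on log(x)

coef(log_level)  # slope should be close to the 0.2 used in the simulation
```

The estimation step is the same `lm()` call in all four cases; only the interpretation of the slope changes.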
  
---

# The (natural) log Function: A Primer 😉

* The [natural log function](https://en.wikipedia.org/wiki/Natural_logarithm) is the inverse function of the exponential function, i.e. `$\log(\exp(x))=x$`

* Since `$\exp(x)>0$` for all `$x$`, the natural log function is only defined for ***strictly positive values***! (It is not defined at 0!)

⚠️ Note that you can only take the log of a variable if it never takes 0 or negative values! Always think about this when taking the log of your dependent or independent variable(s).
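
A quick way to see this in `R`, using a made-up vector `x` that contains a zero and a negative value:

``` r
# Hypothetical variable containing a zero and a negative value
x <- c(2.5, 1, 0, -3)

log(x[x > 0])              # fine: only strictly positive values
log(0)                     # -Inf
suppressWarnings(log(-3))  # NaN (without suppressWarnings, R warns "NaNs produced")

# Simple check to run before log-transforming a variable
all(x > 0)                 # FALSE here, so log(x) is not usable as is
```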

---

# The (natural) log Function: A Primer 😉

If a variable has a very ***skewed distribution*** (e.g. a long right tail), taking the log will often make it closer to ***normally distributed***.

.pull-left[
<img src="chapter_regext_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_regext_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]
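
A small simulation can illustrate the point. This is only a sketch: the variable `income` is simulated from a log-normal distribution (an assumption chosen so that the log is exactly normal), and `skewness()` is a helper defined here, not a base `R` function.

``` r
# Simulated right-skewed variable (log-normal), for illustration only
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Sample skewness: roughly 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(income)       # strongly positive
skew_log <- skewness(log(income))  # close to 0
c(raw = skew_raw, logged = skew_log)
```

With real data the log rarely produces an exactly normal variable, but it usually pulls in a long right tail.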

---

# Log Models: Simplified Interpretations

|    Specification           | Model  |  Interpretation of `$b_1$`            |
|--------------------|:---------:|:-----------------------------------:|
| Level - Level | `$y = b_0 + b_1 x + e$` | .small[A **one unit** increase in ] `$x$` .small[ is associated, on average, with a ] `$b_1$` .small[**unit change** in y]  |
| Log - Level | `$\log(y) = b_0 + b_1 x + e$` | .small[A **one unit** increase in ] `$x$` .small[ is associated, on average, with a] `$b_1 \times 100$` .small[ **percent change** in y]  |
| Level - Log | `$y = b_0 + b_1 \log(x)  + e$` | .small[A **one percent** increase in ] `$x$` .small[ is associated, on average, with a ] `$b_1 / 100$`  .small[**unit change** in y] |
| Log - Log  |  `$\log(y) = b_0 + b_1 \log(x) + e$` | .small[A **one percent** increase in ] `$x$` .small[ is associated, on average, with a] `$b_1$` .small[**percent change** in y]  |

* This may look like cooking recipes but of course it can be [derived with some relatively simple math](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/).

* ⚠️ These interpretations are only accurate for ***small*** changes in `$x$` and/or small values of `$b_1$`. What happens if we want to know the change in `$y$` for big changes in `$x$` or when `$b_1$` is large?

---
name: gen_log

# Log Models: General Interpretations

For ***any increase in `$x$`, `$\Delta x,$` and any `$b_1$`*** (e.g. `$\Delta x = 5\% = 0.05 \implies 1 + \Delta x = 1.05$`):

|    Specification           | Model  |  Interpretation of `$b_1$`            |
|--------------------|:---------:|:-----------------------------------:|
| Level - Level | `$y = b_0 + b_1 x + e$` | .small[A **one unit** increase in ] `$x$` .small[ is associated, on average, with a ] `$b_1$` .small[**unit change** in y]  |
| Log - Level | `$\log(y) = b_0 + b_1 x + e$` | .small[A **one unit** increase in ] `$x$` .small[ is associated, on average, with a] `$(e^{b_1} - 1) \times 100$` .small[ **percent change** in y]  |
| Level - Log | `$y = b_0 + b_1 \log(x)  + e$` | .small[A ] `$\Delta x$` .small[**percent** increase in ] `$x$` .small[ is associated, on average, with a ] `$b_1 \times \log(1 + \Delta x)$`  .small[**unit change** in y] |
| Log - Log  |  `$\log(y) = b_0 + b_1 \log(x) + e$` | .small[A ] `$\Delta x$` .small[**percent** increase in ] `$x$` .small[ is associated, on average, with a] `$((1 + \Delta x)^{b_1} - 1) \times 100$` .small[**percent change** in y]  |

([*Appendix:*](#log_approx) Why are the previously shown approximations true?)
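
To get a feel for when the simplified rules break down, the two sets of formulas can be compared numerically. The slope values and the sizes of the change in `$x$` below are hypothetical, chosen only to contrast small and large cases:

``` r
# Log-level: approximate (100 * b1) vs exact (100 * (exp(b1) - 1)) percent change
b1 <- c(0.01, 0.05, 0.10, 0.50)   # hypothetical slope values
log_level_cmp <- data.frame(
  b1     = b1,
  approx = 100 * b1,
  exact  = 100 * (exp(b1) - 1)
)
log_level_cmp

# Log-log: a 5% vs a 50% increase in x, with a hypothetical b1 = 0.8
b1_ll <- 0.8
dx <- c(0.05, 0.50)
approx_ll <- 100 * b1_ll * dx             # simplified rule
exact_ll  <- 100 * ((1 + dx)^b1_ll - 1)   # general formula
rbind(approx_ll, exact_ll)
```

The two columns are nearly identical for small values and drift apart as `$b_1$` or `$\Delta x$` grows.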

---

# When Should You Use log Models?

1. If the relationship between `$x$` and `$y$` looks like a log or exponential function

.pull-left[
<img src="chapter_regext_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_regext_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# When Should You Use log Models?

1. If the relationship between `$x$` and `$y$` looks like a log or exponential function

.pull-left[
<img src="chapter_regext_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_regext_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
]

---

# When Should You Use log Models?

1. If the relationship between `$x$` and `$y$` looks like a log or exponential function.

1. To easily interpret coefficients as <a href="https://en.wikipedia.org/wiki/Elasticity_(economics)">***elasticities***</a> which play a central role in economic theory

***Elasticity of `$y$` with respect to `$x$`:*** percent change in `$y$` following a 1% increase in `$x$`.
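
As a sketch of why the log-log slope reads as an elasticity, the simulation below builds data with a known elasticity of 0.7 and recovers it with `lm()`. The data, the constant 3 and the value 0.7 are all assumptions made for the example:

``` r
# Simulated data with a true elasticity of 0.7 (hypothetical; for illustration only)
set.seed(123)
x <- runif(500, min = 1, max = 100)
y <- 3 * x^0.7 * exp(rnorm(500, sd = 0.1))

# In the log-log model the slope estimates the elasticity of y with respect to x
fit <- lm(log(y) ~ log(x))
elasticity <- coef(fit)[["log(x)"]]
elasticity   # should be close to 0.7
```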

---

# An example of interpretation of a log model

- We estimate the effect of fertility on GDP with the **Gapminder** dataset, using the model
$$ \log(\text{gdp}) = \beta_0 + \beta_1 \cdot \text{fertility} + \epsilon $$

``` r
model <- lm(log(gdp) ~ fertility, gapminder)
model
```

```
## 
## Call:
## lm(formula = log(gdp) ~ fertility, data = gapminder)
## 
## Coefficients:
## (Intercept)    fertility  
##     25.4027      -0.5899
```

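- Approximate interpretation of the slope, which only works well for small changes: a one-unit increase in the fertility rate is associated with an approximately `$100 \times (-0.5899) = -58.99\%$` change in GDP, i.e. a decrease of about 59%.

- Exact interpretation of the slope: `$(e^{\beta_1} - 1) \times 100 = (e^{-0.5899} - 1) \times 100 \approx -44.56\%$`, i.e. a decrease of about 44.6%. The gap between the two numbers shows that the approximation is poor for a coefficient this large.

- The intercept (25.4027) is the predicted value of log(gdp) when the fertility rate is 0. This doesn't have a meaningful real-world interpretation, but you would exponentiate it to get the corresponding GDP level, i.e. exp(25.4027).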
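
The two interpretations of the slope can be checked directly in `R`. The sketch below only uses the rounded coefficient printed in the output above, so it does not require the `gapminder` data:

``` r
b1 <- -0.5899   # slope reported in the regression output above (rounded)

approx_pct <- 100 * b1              # small-change approximation
exact_pct  <- 100 * (exp(b1) - 1)   # general formula

round(c(approximate = approx_pct, exact = exact_pct), 2)
```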

---

# Task 1: Non-linear relationships

1. Load the data [here](https://www.dropbox.com/scl/fi/68taafng8vjpzmub6q1xj/college_tuition_income.csv?rlkey=0c59a2kj0cwmcecjbg20jrkva&dl=0) and call the object `college`. This dataset contains information about tuition and estimated incomes of graduates for universities in the US.

2. Create a scatter plot of estimated mid-career pay (`mid_career_pay`) on the y-axis as a function of out-of-state tuition (`out_of_state_tuition`) on the x-axis. Would you say the relationship is broadly linear or rather non-linear?

3. Filter the variable `type` to keep only the values of ("For Profit", "Private", "Public"). Call this object `college_clean`. USE THIS DATA OBJECT FROM NOW ON!

4. Regress mid-career pay on university type. Interpret each coefficient. What is the reference category?

5. Regress mid-career pay on the logarithm of out-of-state tuition. You can either create a new variable using `mutate`, or incorporate the logarithm directly into the regression. Interpret the slope coefficient.

6. Regress the logarithm of mid-career pay on out-of-state tuition in levels (`out_of_state_tuition`). Interpret the slope coefficient.

7. Regress the logarithm of mid-career pay on the logarithm of out-of-state tuition. Interpret the slope coefficient.

---

# Teaser for the Next Few Topics

* You may have noticed that since the beginning we always work with **samples** drawn from the overall population.

* Each time, imagine we could draw another sample from the population:

  * Would we obtain the same results?

  * In other words, how confident can we be that our estimates (sign, magnitude) are not just driven by randomness?

* We will answer these kinds of questions:

  * We'll present the notion of **sampling**, and

  * Understand what **statistical inference** is and how to do it.

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it!

✅  **How to summarise relationships between variables?** Simple and multiple linear regression, non-linear regressions, interactions...

✅ What is causality?

❌ What if we don't observe an entire population?

❌  Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

# Bonus Material

## Accounting for Other Types of Non-Linear Relationships

What if the relationship between `$x$` and `$y$` is not exponential/log?

`$\rightarrow$` ***polynomial*** regressions: just take a polynomial function of the regressor!

---

# Polynomial Wut? 😟

.pull-left[
<img src="chapter_regext_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_regext_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />
]

---

# Polynomial Regressions

What does this mean in practice?

`$\rightarrow$` add a higher order of the regressor to the regression, depending on the visual (or expected) relationship

There are several ways of doing this in `R`; here are two equivalent ones:

``` r
lm(y ~ x + I(x^2) + I(x^3), data)
```

``` r
lm(y ~ poly(x, 3, raw = TRUE), data)
```
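
To check that the two calls really give the same fit, here is a minimal sketch on simulated data (the data frame `sim` and the cubic relationship are made up for the example):

``` r
# Simulated cubic relationship (hypothetical; for illustration only)
set.seed(7)
sim <- data.frame(x = runif(300, min = -3, max = 3))
sim$y <- 1 + 2 * sim$x - 0.5 * sim$x^2 + 0.3 * sim$x^3 + rnorm(300)

fit_I    <- lm(y ~ x + I(x^2) + I(x^3), sim)
fit_poly <- lm(y ~ poly(x, 3, raw = TRUE), sim)

# Same coefficients (only their names differ) and same fitted values
all.equal(unname(coef(fit_I)), unname(coef(fit_poly)))
all.equal(fitted(fit_I), fitted(fit_poly))
```

Note that `raw = TRUE` matters here: by default `poly()` uses orthogonal polynomials, which give the same fitted values but different coefficients.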

---

# Polynomial Regressions

***2nd order:***

<img src="chapter_regext_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

***3rd order:***

<img src="chapter_regext_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---


# Interaction Terms

---

# Interacting Regressors

* In a linear model *without* interaction terms, we assume that the effect of each predictor on the dependent variable (target) is independent of other predictors in the model.

* We interact two regressors when we believe ***the effect of one depends on the value of the other***.

* *Example:* The returns to education on wage vary by gender.
  
--
  
* In practice, if we interact `$x_1$` and `$x_2$`, we would write our model like this:

`$$y_i =  b_0 + b_1 x_{1,i} + b_2 x_{2,i} + b_3x_{1,i} \times x_{2,i} + ... + e_i$$`

* The interpretation of `$b_1$`, `$b_2$`, and `$b_3$` will depend on the type of `$x_1$` and `$x_2$`.
  
* We will focus on the cases where one regressor is a ***dummy/categorical*** variable and the other is ***continuous***.

* It will give you the intuition for the other cases:

  * Both regressors are dummies/categorical variables,

  * Both regressors are continuous variables.
  
---

# Interacting Regressors

Let's go back to the *STAR* experiment data.

How does the effect of being in a small vs regular class vary with the experience of the teacher?

Our regression model becomes:

$$ \textrm{score}_i = \color{#d96502}{b_0} + \color{#027D83}{b_1} \textrm{small}_i + \color{#02AB0D}{b_2} \textrm{experience}_i + \color{#d90502}{b_3} \textrm{small}_i \times \textrm{experience}_i + e_i$$

Let's say we estimate it in `R` with real data, and we get the following coefficients for each `$b$`:

`$$\hat{score}_i = \color{#d96502}{535} + \color{#027D83}{16} \textrm{small}_i + \color{#02AB0D}{1.3} \textrm{experience}_i + \color{#d90502}{-0.3} \textrm{small}_i \times \textrm{experience}_i$$` 
--

Because `$\textrm{small}_i$` is a binary variable, we can split up the previous equation for each class type.

* For those in regular classes, or when `$small_i = 0$`, this would look like:

`$$\hat{score}_i = \color{#d96502}{535} + \color{#02AB0D}{1.3} \textrm{experience}_i,$$`

* and for those in small classes, or when `$small_i = 1$`, we would get:

`$$\hat{score}_i = \color{#d96502}{535} + \color{#027D83}{16} + (\color{#02AB0D}{1.3} - \color{#d90502}{0.3}) \textrm{experience}_i = 551 + \textrm{experience}_i$$`
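
The same split can be written once for any value of the dummy by grouping the terms of the model (this is only a rearrangement of the equation above, not a new model):

`$$\textrm{score}_i = (b_0 + b_1 \textrm{small}_i) + (b_2 + b_3 \textrm{small}_i) \, \textrm{experience}_i + e_i$$`

Read this way, `$b_1$` shifts the intercept and `$b_3$` shifts the slope for small classes relative to regular classes.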

---

# Interacting Regressors

Running the regression for the `math` score (for all grades), we obtain:

``` r
lm(math ~ small+ experience + small*experience, star_df)
```

```
## 
## Call:
## lm(formula = math ~ small + experience + small * experience, 
##     data = star_df)
## 
## Coefficients:
##          (Intercept)             smallTRUE            experience  
##             534.1919               15.8906                1.3305  
## smallTRUE:experience  
##              -0.3034
```

* The interaction term allows the impact of being in a small class to vary with teacher experience.
  
* In particular, we still observe a ***positive impact of being in a small class*** on math score, but this ***effect is decreasing in the experience of the teacher***.
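
One way to see the decreasing effect is to compute the predicted small-class gap at a few experience levels, using the rounded coefficients printed above. The function `small_effect()` and the chosen experience values are illustrative assumptions:

``` r
# Coefficients from the regression output above (rounded)
b1 <- 15.8906   # smallTRUE
b3 <- -0.3034   # smallTRUE:experience

# Predicted small-class effect at a given level of teacher experience
small_effect <- function(experience) b1 + b3 * experience

small_effect(c(0, 10, 20))

# Experience level at which the predicted effect would reach zero
-b1 / b3
```

The last number lies far outside typical teacher experience, so it should be read as an extrapolation of the linear model rather than a finding.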

---

# Interacting Regressors (more formally)

Running the regression for the `math` score (for all grades), we obtain:

``` r
lm(math ~ small+ experience + small*experience, star_df)
```

* `$\color{#d96502}{b_0}$` : 534.1919 is the predicted average math score for students in the *regular* classes whose teacher has zero years of experience
* `$\color{#027D83}{b_1}$` : 15.8906 is the average difference in math scores between those in a *small* class and those in the regular class when teacher experience is zero
* `$\color{#02AB0D}{b_2}$` : 1.3305 means that for those in the *regular* class, average math scores increase by 1.3305 per year of teacher experience (i.e., it is the slope of the regular class line...think when `$small = 0$`)
* `$\color{#02AB0D}{b_2} + \color{#d90502}{b_3}$` : 1.3305 - 0.3034 = 1.0271 is the slope of the small class line, i.e., for those in the *small* class, average math scores increase by 1.0271 per year of teacher experience. `$\color{#d90502}{b_3}$` is therefore the difference in slopes between the two class types

---

# How is this different from the base case?

``` r
lm(math ~ small+ experience, star_df)
```

```
## 
## Call:
## lm(formula = math ~ small + experience, data = star_df)
## 
## Coefficients:
## (Intercept)    smallTRUE   experience  
##     535.813       12.336        1.188
```

``` r
lm(math ~ small+ experience + small*experience, star_df)
```
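
A side note on the formula syntax in these calls: in `R`, `small*experience` already expands to both main effects plus the interaction, so writing `small + experience` in front of it is harmless but redundant. The sketch below checks this on simulated data (the data frame `sim` is made up, with coefficients loosely inspired by the slides; it is not the real `star_df`):

``` r
# Simulated data (hypothetical; the real `star_df` is not reproduced here)
set.seed(2024)
n <- 400
sim <- data.frame(
  small      = rbinom(n, 1, 0.5) == 1,
  experience = sample(0:30, n, replace = TRUE)
)
sim$math <- 535 + 16 * sim$small + 1.3 * sim$experience -
  0.3 * sim$small * sim$experience + rnorm(n, sd = 20)

fit_long  <- lm(math ~ small + experience + small * experience, sim)
fit_short <- lm(math ~ small * experience, sim)
fit_colon <- lm(math ~ small + experience + small:experience, sim)

# All three calls estimate the same four coefficients
all.equal(coef(fit_long), coef(fit_short))
all.equal(coef(fit_long), coef(fit_colon))
names(coef(fit_long))
```

The `:` operator adds only the interaction term, which is why the third call has to list the main effects explicitly.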

---

# Interacting Regressors: Visually