class: center, middle, inverse, title-slide

.title[ # ScPoEconometrics ]
.subtitle[ ## Multiple Regression Model ]
.author[ ### Florian Oswald, Gustave Kenedi, Mylène Feuillade and Junnan He ]
.date[ ### SciencesPo Paris 2022-10-02 ]

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Today

- Multiple Regression Model
  * Multiple independent variables in our model
  * Interpretation for continuous and dummy regressors
  * Dummy variable trap
  * Omitted variable bias
  * Adjusted `\(R^2\)`

* Empirical applications:
  * *Class size* and *student performance*

---

# Class size and student performance

* Let's go back to Angrist and Lavy's (1999) analysis of the effect of class size on student performance in Israel.

--

* With a **simple linear regression**, we found that class size was positively ***associated*** with students' scores in maths and reading.

---

# Class size and student performance: Raw relationship

<img src="chapter_mlr_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Raw relationship

<img src="chapter_mlr_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Raw relationship

.pull-left[
<img src="chapter_mlr_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
```r
lm(avgmath ~ classize, grades)
```

```
## 
## Call:
## lm(formula = avgmath ~ classize, data = grades)
## 
## Coefficients:
## (Intercept)     classize  
##     57.7939       0.3175
```
]

---

# Class size and student performance

* Let's go back to Angrist and Lavy's (1999) analysis of the effect of class size on student performance in Israel.

* With a **simple linear regression**, we found that class size was positively ***associated*** with students' scores in maths and reading.

* This is counterintuitive, and contrasts with the results from the *STAR* randomized experiment.

--

* Could it be that some other variable is related to class size ***as well as*** students' performance?

* In particular, we mentioned the **location effect**: large classes may be more common in wealthier and bigger cities, while small classes may be more likely in poorer rural areas.

--

* Let's investigate this hypothesis.

---

# Class size and student performance: Confounders

.pull-left[
Link between **class size** and **the share of students who come from `disadvantaged` backgrounds** in the class.

<img src="chapter_mlr_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

👉 On average, there is a greater % of disadvantaged students in smaller classes.
]

--

.pull-right[
Link between **average math score** and **the share of students who come from `disadvantaged` backgrounds** in the class.

<img src="chapter_mlr_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

👉 On average, the greater the % of students coming from a disadvantaged background, the lower the average math score.
]

---

# Class size and student performance: Multiple regression

* Suppose we want to know the effect of class size on average math scores, ***controlling for*** the fact that the % of disadvantaged students is negatively related to class size **AND** to average math scores.

--

* To do so, we have to include both `classize` and `disadvantaged` variables as *regressors* in the regression.
--

* As such, we can obtain an estimate of the effect of class size on average math score, **_purged_ of the effect of the `disadvantaged` variable**.

--

* The model we want to estimate becomes:

$$ \textrm{average math score}_i = b_0 + b_1 \textrm{class size}_i + b_2 \textrm{% disadvantaged}_i + e_i $$

--

* This is ***multiple regression***! We will estimate this model in a few slides. Let's first formalize what we have seen so far.

---

# Multiple Regression's Purpose

* Recall from two weeks ago that the **Simple Linear Model** can be written as

`$$y_i = b_0 + b_1 x_i + e_i,$$`

where `\(y_i\)` is the ***dependent variable*** and `\(x_i\)` is the ***independent variable***.

--

* Remember: we say that `X` *causes* `Y` if, were we to intervene and change the value of `X` ***without changing anything else***, `Y` would also change as a result.

--

⚠️ Unless all other factors affecting `\(y_i\)` are uncorrelated with `\(x_i\)`, `\(b_1\)` **cannot be interpreted as a causal effect**.

--

We need to **enrich the model** and take into account factors that are simultaneously related to `\(y_i\)` **and** `\(x_i\)`.

---

# Multiple Regression Model

The expanded model can be written as:

`$$y_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + b_3 x_{3,i} + \dots + b_k x_{k,i} + e_i,$$`

where `\(x_1\)`, `\(x_2\)`, ..., `\(x_k\)` are `\(k\)` regressors, and `\(b_1\)`, `\(b_2\)`, ..., `\(b_k\)` are the associated `\(k\)` coefficients.

--

***Estimation***: We obtain the values of `\((b_0, b_1, b_2, ..., b_k)\)` in the same way as before, using **OLS**.

* `\((b_0^{OLS}, b_1^{OLS}, b_2^{OLS}, ..., b_k^{OLS})\)` are the values that minimize the **Sum of Squared Residuals**.

* That is, they minimize

$$
`\begin{align}
\sum_{i}{e_i^2} &= \sum_{i}{(y_i - \hat{y_i})^2} \\
&= \sum_{i}{[y_i - (b_0 + b_1 x_{1,i} + b_2 x_{2,i} + b_3 x_{3,i} + \dots + b_k x_{k,i})]^2}
\end{align}`
$$

---

# Multiple Regression Model: Interpretation

For now assume both the dependent variable `\((y_i)\)` and the independent variables `\((x_k)\)` are numeric.

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if all the regressors `\((x_1, x_2, x_3,...)\)` are equal to 0.**

--

> Slope `\((b_k)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x_k\)`...** <br/>
> `\(\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad\)` **... keeping all the other regressors constant!**

--

* Notice that *keeping all the other regressors constant* is the only part that changes compared to the SLM.

* In other words, you are considering the individual effect of the variable `\(x_k\)` on `\(y\)` **in isolation** from the effect that the other regressors might have on `\(y\)`.

--

* **Link with causal inference**: Only the regressors included in the model are held constant; those that are not in the model can still vary and "bias" your estimates.
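---

# Multiple Regression Model: Estimation Check

As a quick sanity check, here is a minimal sketch (not from the original slides) that minimizes the Sum of Squared Residuals numerically with `optim()` and compares the result to `lm()`. It assumes the `grades` data from before, and the function name `ssr` is ours; `optim()`'s default optimizer is only approximate, so expect small numerical differences.

```r
# sum of squared residuals for candidate coefficients b = (b0, b1, b2)
ssr <- function(b, y, x1, x2) {
  sum((y - (b[1] + b[2] * x1 + b[3] * x2))^2)
}

# minimize the SSR numerically, starting from (0, 0, 0);
# the default Nelder-Mead method may need tighter controls for full precision
optim(par = c(0, 0, 0), fn = ssr,
      y = grades$avgmath, x1 = grades$classize, x2 = grades$disadvantaged)$par

# should be (approximately) the OLS coefficients returned by lm()
coef(lm(avgmath ~ classize + disadvantaged, data = grades))
```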
---

# Multiple Regression with `R`

* Very similar to simple linear regression:

```r
lm(formula = dependent variable ~ independent variable 1 + independent variable 2 + ...,
   data = data.frame containing the data)
```

--

## Class size and student performance: Multiple regression

Let's estimate the model from earlier by OLS: `\(\textrm{average math score}_i = b_0 + b_1 \textrm{class size}_i + b_2 \textrm{% disadvantaged}_i + e_i\)`

```r
lm(avgmath ~ classize + disadvantaged, grades)
```

```
## 
## Call:
## lm(formula = avgmath ~ classize + disadvantaged, data = grades)
## 
## Coefficients:
##   (Intercept)       classize  disadvantaged  
##      69.94438        0.07168       -0.33958
```

---

# Class size and student performance: Multiple regression

```
## 
## Call:
## lm(formula = avgmath ~ classize + disadvantaged, data = grades)
## 
## Coefficients:
##   (Intercept)       classize  disadvantaged  
##      69.94438        0.07168       -0.33958
```

***Questions***

1. How do you interpret each of these coefficients?
1. How do you explain the change in the `classize` coefficient compared to the SLM case?

---

# Class size and student performance: Multiple regression

```
## 
## Call:
## lm(formula = avgmath ~ classize + disadvantaged, data = grades)
## 
## Coefficients:
##   (Intercept)       classize  disadvantaged  
##      69.94438        0.07168       -0.33958
```

***Answers***

1\. How do you interpret each of these coefficients?

* `\(b_0\)` = 69.94: When `class size` and `disadvantaged` are set to 0, the *predicted* value of the average math score is 69.94.

--

* `\(b_1\)` = 0.07: Keeping the percentage of *disadvantaged students* in the class constant, a 1-student increase in class size is ***associated, on average,*** with a 0.07 point increase in average math score.

--

* `\(b_2\)` = -0.34: Keeping the *class size* constant, a 1-***percentage point*** increase in the share of *disadvantaged students* is ***associated, on average,*** with a 0.34 point decrease in average math score.

---

# Class size and student performance: Multiple regression

```
## 
## Call:
## lm(formula = avgmath ~ classize + disadvantaged, data = grades)
## 
## Coefficients:
##   (Intercept)       classize  disadvantaged  
##      69.94438        0.07168       -0.33958
```

***Answers***

2\. How do you explain the change in the `classize` coefficient compared to the SLM case?

--

* `\(b_1\)` decreases when the `disadvantaged` variable is taken into account. This was expected, since part of the positive association between class size and math scores was due to the smaller share of disadvantaged students in bigger classes.

---

# Percentage vs. Percentage Point: A Primer

Example: the % of disadvantaged students in a class increases from 10% to 25%.

--

***Questions:***

1. What's the *percentage point* change?
1. What's the *percentage* change?

---

# Percentage vs. Percentage Point: A Primer

Example: the % of disadvantaged students in a class increases from 10% to 25%.

***Answers:***

1. This is a `\(25-10=15\)` ***percentage point*** increase.

--

1. This is a `\(\frac{25-10}{10} \times 100 = 150\)` ***percent*** increase.

--

You ***need*** to pay attention to whether you are talking about ***percentage point*** or ***percentage*** changes! They imply drastically different magnitudes!

---

class: inverse

# Task 1
Let's analyse the regression results using **reading** score as the dependent variable.

1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) using the `read_dta()` function from the `haven` package. Assign it to an object `grades`.
1. Regress `avgverb` on `classize` and `disadvantaged` and assign the output to a new object `reg`. Interpret the coefficients. How do they compare to those from the simple linear regression? How do they compare with the math score regression coefficients?
1. (Optional) What are the other available variables that we could add to the regression?
  * Run the regression with all these variables and assign it to `reg_full`.
  * Look at the coefficients.
  * Discuss all coefficients: sign and magnitude.

---

# A Numeric and a Dummy Regressor: Interpretation

You know how to interpret coefficients when the variable is numeric (i.e. continuous).

--

What if one of the regressors is a ***dummy variable***, that is, a variable that takes the value 1 if some condition is `TRUE` and 0 otherwise?

--

*Example:* How do I interpret the coefficients in the following model?

$$ \text{average math score}_i = b_0 + b_1\text{class size}_i + b_2\text{religious}_i +e_i $$

`religious` is a dummy variable equal to 1 if the school is a religious school, and 0 if it isn't.

--

```r
lm(avgmath ~ classize + religious, grades)
```

```
## 
## Call:
## lm(formula = avgmath ~ classize + religious, data = grades)
## 
## Coefficients:
## (Intercept)     classize    religious  
##     61.3092       0.2311      -3.7800
```

---

# A Numeric and a Dummy Regressor: Formally

Our model is:

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

We have the following equalities:

`\begin{align} \mathbb{E}(\text{average math score} | \text{religious} = 0 \text{ & } \text{class size} = 0) &= \color{#d96502}{b_0} + \color{#d90502}{b_1} \times 0 + \color{#027D83}{b_2} \times 0 \\ &= \color{#d96502}{b_0} \end{align}`

--

`\(\rightarrow\)` `\(\color{#d96502}{b_0}\)` corresponds to the expected average math score when class size is 0 and the school is not religious.

---

# A Numeric and a Dummy Regressor: Formally

Our model is:

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

We have the following equalities:

`\begin{equation} \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1}) = \color{#d96502}{b_0} + \color{#d90502}{b_1} \times n_1 + \color{#027D83}{b_2} \times \text{religious} \end{equation}`

--

`\begin{multline} \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1+1}) = \\ \color{#d96502}{b_0} + \color{#d90502}{b_1} \times (n_1+1) + \color{#027D83}{b_2} \times \text{religious} \end{multline}`

--

`\begin{multline} \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1+1}) - \\ \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1}) \\ = \color{#d96502}{b_0} + \color{#d90502}{b_1} \times (n_1+1) + \color{#027D83}{b_2} \times \text{religious} - (\color{#d96502}{b_0} + \color{#d90502}{b_1} \times n_1 + \color{#027D83}{b_2} \times \text{religious}) = \color{#d90502}{b_1} \end{multline}`

--

`\(\rightarrow\)` `\(\color{#d90502}{b_1}\)` corresponds to the expected change in average math score associated, on average,
with a 1-student increase in class size, controlling for the religious status of the school (= keeping the religious status constant).

---

# A Numeric and a Dummy Regressor: Formally

Our model is:

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

We have the following equalities:

`\begin{align} \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} = 1} \text{ & } \text{class size} \in \mathbb{N}) &= \color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size} + \color{#027D83}{b_2} \times 1 \\ &= \color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size} + \color{#027D83}{b_2} \end{align}`

--

`\begin{align} \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} = 0} \text{ & } \text{class size} \in \mathbb{N}) &= \color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size} + \color{#027D83}{b_2} \times 0 \\ &= \color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size} \end{align}`

--

`\begin{multline} \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} = 1} \text{ & } \text{class size} \in \mathbb{N}) - \\ \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} =0} \text{ & } \text{class size} \in \mathbb{N}) \\ = \color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size} + \color{#027D83}{b_2}- (\color{#d96502}{b_0} + \color{#d90502}{b_1} \times \text{class size}) = \color{#027D83}{b_2} \end{multline}`

--

`\(\rightarrow\)` `\(\color{#027D83}{b_2}\)` corresponds to the expected difference in average math score between religious and non-religious schools, keeping class size constant.

---

# A Numeric and a Dummy Regressor: Summary

Our model is:

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

We have the following equalities:

`\begin{equation} \color{#d96502}{b_0} = \mathbb{E}(\text{average math score} | \text{religious} = 0 \text{ & } \text{class size} = 0) \end{equation}`

`\begin{multline} \color{#d90502}{b_1} = \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1+1}) - \\ \mathbb{E}(\text{average math score} | \text{religious} \in \{0,1\} \text{ & } \color{#d90502}{\text{class size} = n_1}) \end{multline}`

`\begin{multline} \color{#027D83}{b_2} = \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} = 1} \text{ & } \text{class size} \in \mathbb{N}) - \\ \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} =0} \text{ & } \text{class size} \in \mathbb{N}) \end{multline}`

`\begin{equation} \color{#d96502}{b_0} + \color{#027D83}{b_2} = \mathbb{E}(\text{average math score} | \text{religious} = 1 \text{ & } \text{class size} = 0) \end{equation}`

---

# A Numeric and a Dummy Regressor: Visually

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

<img src="chapter_mlr_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# A Numeric and a Dummy Regressor: Visually

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

<img src="chapter_mlr_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# A Numeric and a Dummy Regressor: Visually

$$
\text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i
$$

<img src="chapter_mlr_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# A Numeric and a Dummy Regressor: Visually

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

<img src="chapter_mlr_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# A Numeric and a Dummy Regressor: Visually

$$ \text{average math score}_i = \color{#d96502}{b_0} + \color{#d90502}{b_1}\text{class size}_i + \color{#027D83}{b_2}\text{religious}_i +e_i $$

<img src="chapter_mlr_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# No Perfect Collinearity

There is one condition to satisfy when adding regressors to the model:

> Any additional variable needs to add **at least *some* new information**.

--

In other words, regressors **cannot be perfectly collinear**, i.e. they cannot be linear combinations of one another:

$$ x_2 \neq ax_1 + b $$

--

Even when regressors are not perfectly collinear, the individual effects of highly correlated regressors are hard to disentangle.

--

Note that this implies that the number of observations has to be greater than the number of independent variables.

---

# No Perfect Collinearity: Dummy Variable Trap

This condition is particularly relevant for ***categorical variables***, i.e. variables that take a limited number of possible "levels" (e.g. gender, seasons, race, education levels, etc.)

--

Let's go back to our `religious` school regression:

```
## 
## Call:
## lm(formula = avgmath ~ classize + religious, data = grades)
## 
## Coefficients:
## (Intercept)     classize    religious  
##     61.3092       0.2311      -3.7800
```

--

What if I create an `is_religious` and an `is_notreligious` variable and regress `avgmath` on both (and `classize`)?

---

# No Perfect Collinearity: Dummy Variable Trap

What if I create an `is_religious` and an `is_notreligious` variable and regress `avgmath` on both (and `classize`)?

```r
grades <- grades %>%
  mutate(is_religious = (religious == 1),
         is_notreligious = (religious == 0))
lm(avgmath ~ classize + is_religious + is_notreligious, grades)
```

```
## 
## Call:
## lm(formula = avgmath ~ classize + is_religious + is_notreligious, 
##     data = grades)
## 
## Coefficients:
##         (Intercept)             classize     is_religiousTRUE  
##             61.3092               0.2311              -3.7800  
## is_notreligiousTRUE  
##                  NA
```

Only one of the two has a coefficient! Why?

--

.pull-left[
```r
grades %>%
  count(is_religious == 1 - is_notreligious)
```
]

.pull-right[
```
## # A tibble: 1 × 2
##   `is_religious == 1 - is_notreligious`     n
##   <lgl>                                 <int>
## 1 TRUE                                   2019
```
]

---

# No Perfect Collinearity: Dummy Variable Trap

`\(\rightarrow\)` `R` automatically detects perfect collinearity between variables and drops one of them.

--

⚠️ You have to pay attention to the ***omitted/reference category***: the "baseline" category from which the coefficients are interpreted. Remember:

`\begin{multline} \color{#027D83}{b_2} = \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} = 1} \text{ & } \text{class size} \in \mathbb{N}) - \\ \mathbb{E}(\text{average math score} | \color{#027D83}{\text{religious} =0} \text{ & } \text{class size} \in \mathbb{N}) \end{multline}`

--

This also applies when you have more than 2 categories, as the sketch below illustrates.
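--

For instance, here is a minimal sketch with made-up data (the `df`, `score` and `region` names and values are hypothetical, not part of the `grades` data): `lm()` expands a three-level categorical variable into two dummies and drops one level as the reference.

```r
# sketch: a categorical variable with 3 levels is expanded into 2 dummies
# by lm(); the alphabetically first level ("center") is the omitted category
df <- data.frame(
  score  = c(60, 65, 70, 55, 72, 68),
  region = c("north", "south", "center", "north", "south", "center")
)
lm(score ~ region, data = df)
# coefficients: (Intercept), regionnorth, regionsouth
# -> each coefficient is an expected difference relative to "center"
```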
You don't need to create a dummy variable for each possibility: `R` will detect the categorical variable(s) (as long as they are stored as `character` or `factor`) and do that for you.

--

***But*** you have to check which category has been chosen as the ***omitted category***.

---

class: inverse

# Task 2: Dummy Variable Trap
Let's run a regression where there is perfect linear dependence between regressors.

1. Load the *STAR* data from [here](https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1), using `read.csv`, and assign it to an object called `star_df`. Keep only cases with no `NA`s with the following code: `star_df <- star_df[complete.cases(star_df),]`
1. Create three dummy variables: (i) `small` equal to `TRUE` if students are in a small class and `FALSE` otherwise; (ii) `regular` equal to `TRUE` if students are in a regular class and `FALSE` otherwise; (iii) `regular_plus` equal to `TRUE` if students are in a regular+aide class and `FALSE` otherwise. (*Hint: To create a dummy, write* `dummy = (variable=="value")` *inside the appropriate dplyr verb.*) Create a last variable, `sum`, equal to the sum of `small`, `regular` and `regular_plus`. What is `sum` equal to? What does this mean?
1. Regress `math` on `regular_plus`. What is the average predicted `math` score of students in a regular+aide class?
1. Regress `math` on `small`, `regular` and `regular_plus`. What do you notice? What's the omitted (reference) category? Does this match the previous question?
1. Regress `math` on `star`. What do you notice? What's the omitted category? Interpret the coefficient.

---

# Omitted Variable Bias (OVB)

***Omitted variable bias***: omitting important control variables from the regression model. This renders the coefficient on your regressor of interest unreliable (*biased*).

--

Let's call `\(y\)` our outcome variable, `\(x\)` our regressor and `\(z\)` our omitted variable.

--

We could run regressions for the following models:

--

1. ***Simple linear model***: `\(y = b_0 + b_1x + e\)`

--

1. ***Multiple linear model***: `\(y = c_0 + c_1x + \color{#d90502}{c_2}z + e\)`

--

1. ***Omitted variable on regressor***: `\(z = d_0 + \color{#d96502}{d_1}x + e\)`

--

The [formula for the OVB](https://www.youtube.com/watch?v=9-lPES4e0n8) is:

`$$\text{OVB} = \color{#d90502}{c_2} \times \color{#d96502}{d_1}$$`

--

In other words, `\(b_1 = c_1 + OVB\)`

---

# Omitted Variable Bias (OVB)

`$$\text{OVB} = \underbrace{\text{multiple regression coefficient on omitted variable}}_{\color{#d90502}{c_2}} \times \underbrace{\frac{Cov(x,z)}{Var(x)}}_{\color{#d96502}{d_1}}$$`

From this formula you obtain:

* the OVB's ***magnitude*** (only if you observe `\(z\)`),

* the OVB's ***sign*** (positive/negative): since in practice `\(z\)` is not observed (otherwise you could include it in the regression), this is the most relevant case.

--

***Question:*** Imagine you want to uncover the relationship between income and years of education. Why might a simple regression of income on years of education not yield reliable estimates? What could be an omitted variable? What's the expected sign of the OVB?

---

# Omitted Variable Bias (OVB): In Practice

Let's go back to our class size and student performance example.
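--

A minimal sketch first (assuming the `grades` data from before): the three regressions can be run in `R` and the identity `\(b_1 = c_1 + \color{#d90502}{c_2} \times \color{#d96502}{d_1}\)` checked directly.

```r
b <- coef(lm(avgmath ~ classize, grades))                 # short regression: b1
c <- coef(lm(avgmath ~ classize + disadvantaged, grades)) # long regression: c1, c2
d <- coef(lm(disadvantaged ~ classize, grades))           # omitted variable on regressor: d1

ovb <- c["disadvantaged"] * d["classize"]   # OVB = c2 * d1
b["classize"] - (c["classize"] + ovb)       # should be (numerically) 0
```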
We had:

***Simple linear model:*** `\(\text{average math score} = b_0 + b_1\text{class size} + e\)`

```
## (Intercept)    classize 
##  57.7939158   0.3174906
```

--

***Multiple linear model:*** `\(\text{average math score} = c_0 + c_1\text{class size} + \color{#d90502}{c_2}\text{% disadvantaged} + e\)`

```
##   (Intercept)      classize disadvantaged 
##   69.94438332    0.07167819   -0.33957877
```

--

***Omitted variable on regressor***: `\(\text{% disadvantaged} = d_0 + \color{#d96502}{d_1}\text{class size} + e\)`

```
## (Intercept)    classize 
##  35.7809990  -0.7238744
```

--

We obtain:

`$$b_1 = 0.317 = \underbrace{0.072}_{c_1} + \underbrace{(-0.34)}_{\color{#d90502}{c_2}} \times \underbrace{(-0.724)}_{\color{#d96502}{d_1}} = c_1 + OVB$$`

---

# Adjusted `\(R^2\)`

* Not of great importance in itself, but it is so widely reported that you need to know about it.

--

* By construction, `\(R^2\)` will always increase when a new regressor is added to the regression.

--

* The *adjusted `\(R^2\)`* imposes a penalty for adding regressors to the model. The details are not crucial, as in the vast majority of cases the `\(R^2\)` and the adjusted `\(R^2\)` are pretty similar.

---

class: inverse

# Task 3: Recap
Let's use the *STAR* data (used in the previous task) to review the main concepts covered.

1. Using the filtered data from the previous task, regress `math` on `school` (tabulate the variable to know what it contains). Interpret the coefficients. What's the omitted category? Do you find them surprising? Why? What might be an omitted variable?
1. Compute the share of students qualifying for free lunch (i.e. `lunch` equals "free") by school location category. What do you observe? Add `free` to the previous question's regression. How do the coefficients change?
1. Regress `math` on `star`. After interpreting the coefficients, regress `math` on `star`, `gender`, `ethnicity`, `lunch`, `degree`, `experience` and `school`. Recalling that this is a randomized experiment, does it look like the randomization was well done?
1. What's the adjusted `\(R^2\)` from the previous multiple regression? How do you interpret it? What might you deduce about the importance of observable individual, teacher and school characteristics in explaining educational outcomes?
1. (Optional) Regress `math` on `gender` and `experience` (the teacher's experience). Interpret the coefficients. What would these regression results look like visually?

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it...

🚧 **How to summarise relationships between variables?** Simple and multiple linear regression... to be continued

✅ What is causality?

❌ What if we don't observe an entire population?

❌ Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

<br>
<br>

.center[
<img src="../img/photos/confounding_variables_funny.png" width="1000px" style="display: block; margin: auto;" />
]

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                           |                              |
| :-------------------------------------------------------------------------------------------------------- | :--------------------------- |
| <a href="mailto:florian.oswald@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]           | florian.oswald@sciencespo.fr |
| <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>]  | Slides                       |
| <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>]          | Book                         |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                       | @ScPoEcon                    |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                         | @ScPoEcon                    |