class: center, middle, inverse, title-slide

.title[
# ScPoEconometrics
]
.subtitle[
## Simple Linear Regression
]
.author[
### Gustave Kenedi, Mylène Feuillade, Florian Oswald and Junnan He
]
.date[
### SciencesPo Paris 2022-09-20
]

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Today - Real 'metrics finally ✌️

* Introduction to the ***Simple Linear Regression Model*** and ***Ordinary Least Squares (OLS)*** *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

# Class size and student performance

* What policies *lead* to improved student learning?

* Class size reduction has been at the heart of policy debates for *decades*.

--

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* The dataset consists of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were scaled from 1 to 100.

---

class: inverse

# Task 1: Getting to know the data
1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) as `grades`. *Hint: use the `read_dta` function from the `haven` package to import the file, which is in .dta format.* (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/).)

1. Describe the dataset:
  * What is the unit of observation, i.e. what does each row correspond to?
  * How many observations are there?
  * View the dataset. What variables do we have? What do the variables `avgmath` and `avgverb` correspond to?
  * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. (*Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) into `skim()`.*)

1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?

1. Compute the correlation between class size and math and verbal scores. Is the relationship positive/negative, strong/weak?
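---

# Task 1: One possible approach (sketch)

Here is a minimal sketch of how Task 1 could be tackled in `R`. It assumes the `haven`, `dplyr` and `skimr` packages are installed; if `read_dta` cannot read directly from the URL on your machine, download the file first and pass the local path instead.

```r
library(haven)  # read_dta() imports Stata .dta files
library(dplyr)  # data manipulation verbs and the %>% pipe
library(skimr)  # skim() for quick summary statistics

# import the Stata file
grades <- read_dta("https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")

# describe the dataset
nrow(grades)  # number of observations

grades %>%
  select(classize, avgmath, avgverb) %>%
  skim()

# correlation between class size and test scores
cor(grades$classize, grades$avgmath)
cor(grades$classize, grades$avgverb)
```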
---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

--

* Somewhat positive association, as suggested by the correlations. Let's compute the average score by class size to see things more clearly!

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores, and for visual simplicity we'll zoom in.

<img src="chapter_slr_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

--

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size 😩
]

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" />
]

.right-thin[
<br>

* **That** one?

* Slightly better! It has a **slope** and an **intercept** 😐

* We need a rule to decide!
]

---

# Simple Linear Regression

Let's formalise what we have been doing so far.

* We are interested in the relationship between two variables:

--

  * an __outcome variable__ (also called __dependent variable__): *average mathematics score* `\((y)\)`

--

  * an __explanatory variable__ (also called __independent variable__ or __regressor__): *class size* `\((x)\)`

--

* For each class `\(i\)` we observe both `\(x_i\)` and `\(y_i\)`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

--

* We summarise this relationship with a line (for now). The equation for such a line with an intercept `\(b_0\)` and a slope `\(b_1\)` is:

$$ \widehat{y}_i = b\_0 + b\_1 x\_i $$

--

* `\(\widehat{y}_i\)` is our *prediction* for `\(y\)` at observation `\(i\)` `\((y_i)\)` given our model (i.e. the line).

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

<img src="chapter_slr_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

* However, since in most cases the *dependent variable* `\((y)\)` is not *only* explained by the chosen *independent variable* `\((x)\)`, we have `\(\widehat{y}_i \neq y_i\)`, i.e. we make an __error__. This __error__ is called the __residual__.

--

* At point `\((x_i,y_i)\)`, we denote this residual `\(e_i\)`.

--

* The *actual data* `\((x_i,y_i)\)` can thus be written as *prediction + residual*:

$$ y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i $$

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-19-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-thin[
<br>
<br>
<p style="text-align: center; font-weight: bold; font-size: 35px; color: #d90502;">Which "minimisation" criterion should (can) be used?</p>
]

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

--

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

<img src="chapter_slr_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/reg_simple/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/SSR_cone/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>
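---

# **O**rdinary **L**east **S**quares (OLS) Estimation: The objective in code

To see that the OLS criterion is something we can actually compute, here is a minimal sketch. It assumes the `grades` data from Task 1; the function name `ssr` is ours, not a built-in.

```r
# sum of squared residuals implied by a candidate line (b0, b1)
ssr <- function(b0, b1, x = grades$classize, y = grades$avgmath) {
  e <- y - (b0 + b1 * x)  # residuals: actual minus predicted
  sum(e^2)                # the quantity OLS minimises
}

ssr(b0 = 65, b1 = 0)    # a flat line
ssr(b0 = 61, b1 = 0.2)  # a sloped line: does it fit better?

# trace the objective over a grid of slopes, holding b0 fixed
slopes <- seq(-1, 1, by = 0.01)
plot(slopes, sapply(slopes, function(b) ssr(b0 = 61, b1 = b)), type = "l")
```

OLS simply picks the `\((b_0, b_1)\)` pair at the bottom of this "cone", as the app on the previous slide illustrates.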
---

# **O**rdinary **L**east **S**quares (OLS): Coefficient Formulas

* **OLS**: *estimation* method that consists in minimising the sum of squared residuals.

* It yields a __unique__ solution to this minimisation problem.

* So what are the formulas for `\(b_0\)` (intercept) and `\(b_1\)` (slope)?

--

* In our single independent variable case:

> ### __Slope: `\(b_1^{OLS} = \frac{cov(x,y)}{var(x)}\)` `\(\hspace{2cm}\)` Intercept: `\(b_0^{OLS} = \bar{y} - b_1\bar{x}\)`__

--

* These formulas do not appear by magic. They are found by solving the minimisation of the sum of squared errors. The maths can be found [here](https://www.youtube.com/watch?v=Hi5EJnBHFB4) for those who are interested.

---

# **O**rdinary **L**east **S**quares (OLS): Interpretation

For now, assume both the dependent variable `\((y)\)` and the independent variable `\((x)\)` are numeric.

--

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if `\(x = 0\)`.**

--

> Slope `\((b_1)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x\)`.**

--

* ⚠️ Note that we use the term *associated*, **clearly avoiding interpreting `\(b_1\)` as the causal impact of `\(x\)` on `\(y\)`**. To make such a claim, we need some specific conditions to be met. (Next week!)

--

* Also notice that the units of `\(x\)` will matter for the interpretation (and magnitude!) of `\(b_1\)`.

--

* **You need to be explicit about what the unit of `\(x\)` is!**

---

# OLS with `R`

* In `R`, OLS regressions are estimated using the `lm` function.

* This is how it works:

```r
lm(formula = dependent variable ~ independent variable,
   data = data.frame containing the data)
```

--

## Class size and student performance

Let's estimate the following model by OLS: `\(\textrm{average math score}_i = b_0 + b_1 \textrm{class size}_i + e_i\)`

.pull-left[
```r
# OLS regression of average maths score on class size
lm(avgmath_cs ~ classize, grades_avg_cs)
```
]

.pull-right[
```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```
]

---

# **O**rdinary **L**east **S**quares (OLS): Prediction

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

--

This implies (abstracting from the `\(_i\)` subscript for simplicity):

$$
`\begin{aligned}
\widehat y &= b_0 + b_1 x \\
\widehat {\text{average math score}} &= b_0 + b_1 \cdot \text{class size} \\
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot \text{class size}
\end{aligned}`
$$

--

What's the predicted average score for a class of 26 students? (Using the *exact* coefficients.)

$$
`\begin{aligned}
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot 26 \\
\widehat {\text{average math score}} &= 66.08
\end{aligned}`
$$
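---

# **O**rdinary **L**east **S**quares (OLS): Prediction in `R`

Rather than plugging numbers in by hand, we can let `R` do the arithmetic. A minimal sketch, assuming `grades_avg_cs` contains the class-size-level averages (it is constructed in Task 2 below):

```r
# fit the model and store the result
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# prediction by hand, using the exact coefficients
coef(math_reg)[1] + coef(math_reg)[2] * 26

# the same prediction with predict()
predict(math_reg, newdata = data.frame(classize = 26))
# both give about 66.08, as on the previous slide
```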
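---

# **O**rdinary **L**east **S**quares (OLS): Coefficient formulas by hand

As a warm-up for the next task, here is a minimal sketch applying the slope and intercept formulas to the mathematics scores, again assuming `grades_avg_cs` exists. It should reproduce the `lm()` coefficients from before.

```r
# slope: b1 = cov(x, y) / var(x)
b1 <- cov(grades_avg_cs$classize, grades_avg_cs$avgmath_cs) /
  var(grades_avg_cs$classize)

# intercept: b0 = mean(y) - b1 * mean(x)
b0 <- mean(grades_avg_cs$avgmath_cs) - b1 * mean(grades_avg_cs$classize)

c(b0 = b0, b1 = b1)  # compare with (Intercept) 61.1092 and classize 0.1913
```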
---

class: inverse

# Task 2: OLS Regression

Run the following code to aggregate the data at the class-size level:

```r
grades_avg_cs <- grades %>%
  group_by(classize) %>%
  summarise(avgmath_cs = mean(avgmath),
            avgverb_cs = mean(avgverb))
```

1. Regress average verbal score (dependent variable) on class size (independent variable). Interpret the coefficients.

1. Compute the OLS coefficients `\(b_0\)` and `\(b_1\)` of the previous regression using the formulas on the *Coefficient Formulas* slide. (*Hint:* you need to use the `cov`, `var`, and `mean` functions.)

1. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)

1. What is the predicted average verbal score when the class size is equal to 30 students?

---

# Predictions and Residuals: Properties

.pull-left[

* __The average of `\(\widehat{y}_i\)` is equal to `\(\bar{y}\)`.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N \widehat{y}_i &= \frac{1}{N} \sum_{i=1}^N (b_0 + b_1 x_i) \\ &= b_0 + b_1 \bar{x} = \bar{y} \end{align}$$`
(The last equality uses the intercept formula `\(b_0 = \bar{y} - b_1\bar{x}\)`.)

* __The average (or sum) of residuals is 0.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N e_i &= \frac{1}{N} \sum_{i=1}^N (y_i - \widehat y_i) \\ &= \bar{y} - \frac{1}{N} \sum_{i=1}^N \widehat{y}_i \\ &= \bar{y} - \bar{y} = 0 \end{align}$$`

]

--

.pull-right[

* __Regressor and residuals are uncorrelated (by construction).__

`$$Cov(x_i, e_i) = 0$$`

* __Prediction and residuals are uncorrelated.__
`$$\begin{align} Cov(\widehat y_i, e_i) &= Cov(b_0 + b_1x_i, e_i) \\ &= b_1Cov(x_i,e_i) \\ &= 0 \end{align}$$`

Since `\(Cov(a + bx, y) = bCov(x,y)\)`.

]
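---

# Predictions and Residuals: Checking the properties in `R`

These properties are easy to verify numerically. A minimal sketch, assuming the `math_reg` regression on `grades_avg_cs` from above:

```r
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# average prediction equals the average outcome
mean(fitted(math_reg))
mean(grades_avg_cs$avgmath_cs)

# residuals average to 0 (up to floating-point error)
mean(resid(math_reg))

# regressor and residuals, prediction and residuals: uncorrelated
cov(grades_avg_cs$classize, resid(math_reg))
cov(fitted(math_reg), resid(math_reg))
```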
---

# Linearity Assumption: Visualize your Data!

* It's important to keep in mind that covariance, correlation and simple OLS regression only measure **linear relationships** between two variables.

* Two datasets with *identical* correlations and regression lines could look *vastly* different.

--

* Is that even possible?

<img src="https://media.giphy.com/media/5aLrlDiJPMPFS/giphy.gif" height = "400" align = "middle" />

---

# Linearity Assumption: Anscombe

* Francis Anscombe (1973) came up with 4 datasets with identical summary statistics. But look!

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />
]

--

.right-thin[
</br>
</br>
<table class="table table-striped" style="font-size: 20px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> dataset </th>
<th style="text-align:right;"> cov </th>
<th style="text-align:right;"> var(y) </th>
<th style="text-align:right;"> var(x) </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5.501 </td> <td style="text-align:right;"> 4.127 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5.500 </td> <td style="text-align:right;"> 4.128 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5.497 </td> <td style="text-align:right;"> 4.123 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5.499 </td> <td style="text-align:right;"> 4.123 </td> <td style="text-align:right;"> 11 </td> </tr>
</tbody>
</table>
]

---

# Nonlinear Relationships in Data?

.pull-left[

* We can accommodate nonlinear relationships in regressions.

* Just add a *higher-order* term like this:
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i $$

* This is __multiple regression__ (in 2 weeks!)
]

--

.pull-right[

* For example, suppose we had this data and fit the previous regression model:

<img src="chapter_slr_files/figure-html/non-line-cars-ols2-1.svg" style="display: block; margin: auto;" />
]

---

# Analysis of Variance

* Remember that `\(y_i = \widehat{y}_i + e_i\)`.

* We have the following decomposition:
`$$\begin{align} Var(y) &= Var(\widehat{y} + e)\\&= Var(\widehat{y}) + Var(e) + 2 Cov(\widehat{y},e)\\&= Var(\widehat{y}) + Var(e)\end{align}$$`

* Because:
  * `\(Var(x+y) = Var(x) + Var(y) + 2Cov(x,y)\)`
  * `\(Cov(\widehat{y},e)=0\)`

* __Total variation (SST) = Explained variation (SSE) + Unexplained variation (SSR)__

---

# Goodness of Fit

* The __ `\(R^2\)` __ measures how well the __model fits the data__.

--

$$ R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] $$

--

* `\(R^2\)` close to `\(1\)` indicates a __very ***high*** explanatory power__ of the model.

* `\(R^2\)` close to `\(0\)` indicates a __very ***low*** explanatory power__ of the model.

--

* *Interpretation:* an `\(R^2\)` of 0.5, for example, means that the variation in `\(x\)` "explains" 50% of the variation in `\(y\)`.

--

* ⚠️ A low `\(R^2\)` does __NOT__ mean the model is useless! Remember that econometrics is interested in causal mechanisms, not prediction!

--

* ⚠️ The `\(R^2\)` is __NOT__ an indicator of whether a relationship is causal!
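---

# Goodness of Fit: Checking the decomposition in `R`

The variance decomposition and the `\(R^2\)` can be verified directly. A minimal sketch, assuming the `math_reg` regression from above:

```r
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# Var(y) = Var(y_hat) + Var(e)
var(grades_avg_cs$avgmath_cs)
var(fitted(math_reg)) + var(resid(math_reg))

# R^2 as the share of explained variance...
var(fitted(math_reg)) / var(grades_avg_cs$avgmath_cs)

# ...which matches the "Multiple R-squared" reported by summary()
summary(math_reg)$r.squared
```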
---

class: inverse

# Task 3: `\(R^2\)` and goodness of fit

1. Regress `avgmath_cs` on `classize`. Assign it to an object called `math_reg`.

1. Pass `math_reg` to the `summary()` function. What is the (multiple) `\(R^2\)` for this regression? How can you interpret it?

1. Compute the squared correlation between `classize` and `avgmath_cs`. What does this tell you about the relationship between `\(R^2\)` and the correlation in a regression with only one regressor?

1. Repeat steps 1 and 2 for `avgverb_cs`. For which exam does the variance in class size explain more of the variance in students' scores?

1. (Optional) Install and load the `broom` package. Pass `math_reg` to the `augment()` function and assign the result to a new object. Use the variance in `avgmath_cs` (SST) and the variance in `.fitted` (the predicted values; SSE) to find the `\(R^2\)` using the formula on the *Goodness of Fit* slide.

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it...

🚧 **How to summarise relationships between variables?** Simple linear regression... to be continued

❌ What is causality?

❌ What if we don't observe an entire population?

❌ Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                           |                                   |
| :-------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:florian.oswald@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]            | florian.oswald@sciencespo.fr      |
| <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>]   | Slides                            |
| <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>]           | Book                              |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                        | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                          | @ScPoEcon                         |