class: center, middle, inverse, title-slide

# Applied Data Analysis for Public Policy Studies
## Simple Linear Regression
### Michele Fioretti
### SciencesPo Paris
2022-08-29

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Recap from past weeks

* `R` basics, importing data

* Exploratory data analysis:
  * Summary statistics: *mean*, *median*, *variance*, *standard deviation*
  * Data visualization: base `R` and `ggplot2`
  * Data wrangling: `dplyr`

--

## Today - Real 'metrics finally ✌️

* Introduction to the __Simple Linear Regression Model__ and __Ordinary Least Squares__ *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

# Class size and student performance

* What policies *lead* to improved student learning?

* Class size reduction has been at the heart of policy debates for *decades*.

--

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* The data consist of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were scaled from 1 to 100.

---

class: inverse

# Task 1: Getting to know the data (7 minutes)

1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1). You need to find the function that enables importing *.dta* files. (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/).)
1. Describe the dataset:
    * What is the unit of observation, i.e. what does each row correspond to?
    * How many observations are there?
    * What variables do we have? `View` the dataset to see what the variables correspond to.
    * What do the variables `avgmath` and `avgverb` correspond to?
    * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) into `skim()`.
1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?
1. Compute the correlation between class size and the math and verbal scores. Is the relationship positive/negative, strong/weak?

---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

--

* Hard to see much because all the data points are aligned vertically. Let's add a bit of `jitter` to disperse the data slightly.

---

# Class size and student performance: `jitter` scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]

--

* A somewhat positive association, as suggested by the correlations. Let's compute the average score by class size to see things more clearly!

---

class: inverse

# Task 2: Binned scatter plot (7 minutes)
1. Create a new dataset (`grades_avg_cs`) where math and verbal scores are averaged by class size. Let's call these new average scores `avgmath_cs` and `avgverb_cs`. *N.B.: the "raw" scores are already averages at the class level. Here we average these averages by class size.*
1. Redo the same plots as before. Is the sign of the relationship more apparent?
1. Compute the correlation between class size and the new aggregated math and verbal score variables. Why is the (linear) association so much stronger?

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores, and for visual simplicity we'll zoom in.

<img src="chapter3_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

--

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size 😩
]

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* **That** one?

* Slightly better! It has a **slope** and an **intercept** 😐

* We need a rule to decide!
]

---

# Simple Linear Regression

Let's formalise a bit what we have been doing so far.

* We are interested in the relationship between two variables:

--

  * an __outcome variable__ (also called __dependent variable__): *average mathematics score* `\((y)\)`

--

  * an __explanatory variable__ (also called __independent variable__ or __regressor__): *class size* `\((x)\)`

--

* For each class `\(i\)` we observe both `\(x_i\)` and `\(y_i\)`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

--

* We summarise this relationship with a line (for now). The equation for such a line with an intercept `\(b_0\)` and a slope `\(b_1\)` is:

$$ \widehat{y}_i = b\_0 + b\_1 x\_i $$

--

* `\(\widehat{y}_i\)` is our *prediction* for `\(y\)` at observation `\(i\)` `\((y_i)\)` given our model (i.e. the line).

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

--

<img src="chapter3_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

<img src="chapter3_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

* However, since in most cases the *dependent variable* `\((y)\)` is not ***only*** explained by the chosen *independent variable* `\((x)\)`, `\(\widehat{y}_i \neq y_i\)`, i.e. we make an __error__. This __error__ is called the __error term__.
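* To make this concrete, here is a minimal `R` sketch (the values of `b0` and `b1` below are made up for illustration, and the sketch assumes the `grades_avg_cs` data from Task 2 is loaded) that computes a prediction and the corresponding error for every class size:

```r
library(dplyr)

b0 <- 60   # hypothetical intercept, chosen by eye
b1 <- 0.2  # hypothetical slope, chosen by eye

grades_avg_cs %>%
  mutate(
    avgmath_hat = b0 + b1 * classize,       # prediction for each class size
    error       = avgmath_cs - avgmath_hat  # error: actual minus predicted score
  ) %>%
  select(classize, avgmath_cs, avgmath_hat, error) %>%
  head()
```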
--

* At point `\((x_i,y_i)\)`, we denote this error by `\(e_i\)`.

--

* The *actual data* `\((x_i,y_i)\)` can thus be written as *prediction + error*:

$$ y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i $$

--

* **Goals**
  1. Find the values of `\(b_0\)` and `\(b_1\)` that **make the errors as small as possible**.
  1. Check whether these values **give a reasonable description of the data**.

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" />

---

class: inverse

# App Time! (5 minutes)

Intuitively, one might want to simply minimize the absolute value of the sum of all the errors, `\(\mid \sum^N_{i=1}{e_i} \mid\)`, that is, one might want the sum of errors to be as close to 0 as possible.

Let's try to find the best line by minimizing the absolute value of the sum of errors. The line *won't* turn green, so don't spend all day waiting for it.

```r
library(ScPoApps) # load our library
launchApp('reg_simple_arrows')
aboutApp('reg_simple_arrows') # explainer about app
```

---

# Minimizing the Absolute Value of the Sum of Errors

<img src="chapter3_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

---

# Minimizing the Absolute Value of the Sum of Errors

<img src="chapter3_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" />

---

# Minimizing the Absolute Value of the Sum of Errors

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<br>
<br>

* This line minimizes the absolute value of the sum of errors since the data points are symmetric around `\(y = 5\)`.

* Yet it clearly does not fit the data well!

* Note also that many other lines would yield a sum of errors of 0, since the data are symmetric. A unique solution is not guaranteed!
]

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so let's consider **squared residuals**
`$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

--

<img src="chapter3_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so let's consider **squared residuals**
`$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

<img src="chapter3_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" />

---

class: inverse

# App Time! #2 (3 minutes)

Let's minimize some squared errors!
```r
launchApp('reg_simple')
aboutApp('reg_simple')
```

---

# **O**rdinary **L**east **S**quares (OLS): Coefficient Formulas

* **OLS**: an *estimation* method that consists in minimizing the sum of squared residuals.

* It yields __unique__ solutions to this minimization problem.

* So what are the formulas for `\(b_0\)` (intercept) and `\(b_1\)` (slope)?

--

* In our single independent variable case:

> ### __Slope: `\(b_1^{OLS} = \frac{cov(x,y)}{var(x)}\)` `\(\hspace{2cm}\)` Intercept: `\(b_0^{OLS} = \bar{y} - b_1\bar{x}\)`__

--

* ❗ You should know these formulas, especially the one for `\(b_1^{OLS}\)` ❗

--

* These formulas do not appear by magic. They can be derived by solving the minimisation of the sum of squared errors. The maths can be found [here](https://www.youtube.com/watch?v=Hi5EJnBHFB4) for those who are interested.

---

class: inverse

# App Time! #3 (3 minutes)

How does OLS actually perform the minimization? Some intuition without maths.

```r
launchApp('SSR_cone')
aboutApp('SSR_cone') # after
```

---

# **O**rdinary **L**east **S**quares (OLS): Interpretation

For now assume both the dependent variable `\((y)\)` and the independent variable `\((x)\)` are numeric.

--

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if `\(x = 0\)`.**

--

> Slope `\((b_1)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x\)`.**

--

* ⚠️ Note that we use the term *associated*, **clearly avoiding interpreting `\(b_1\)` as the causal impact of `\(x\)` on `\(y\)`**. To make such a claim, we need some specific conditions to be met. (Next week!)

--

* Also notice that the units of `\(x\)` will matter for the interpretation (and magnitude!) of `\(b_1\)`.

---

# OLS with `R`

* In `R`, OLS regressions are estimated using the `lm` function.

* This is how it works:

```r
lm(formula = dependent variable ~ independent variable,
   data = data.frame containing the data)
```

--

## Class size and student performance

Let's estimate the following model by OLS: `\(\textrm{avgmath_cs}_i = b_0 + b_1 \textrm{classize}_i + e_i\)`

```r
# OLS regression of average maths score on class size
lm(avgmath_cs ~ classize, grades_avg_cs)
```

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

---

# **O**rdinary **L**east **S**quares (OLS): Prediction

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

--

This implies (abstracting from the `\(_i\)` subscript for simplicity):

$$
`\begin{aligned}
\widehat y &= b_0 + b_1 x \\
\widehat {\text{avgmath_cs}} &= b_0 + b_1 \cdot \text{classize} \\
\widehat {\text{avgmath_cs}} &= 61.11 + 0.19 \cdot \text{classize}
\end{aligned}`
$$

--

What's the predicted average score for a class of 26 students? (Using the *exact* coefficients.)

$$
`\begin{aligned}
\widehat {\text{avgmath_cs}} &= 61.11 + 0.19 \cdot 26 \\
\widehat {\text{avgmath_cs}} &= 66.08
\end{aligned}`
$$

---

class: inverse

# Task 3: OLS Regression (7 minutes)

1. Compute the OLS coefficients using the formulas on slide 28.
1. Regress average verbal score (dependent variable) on class size (independent variable).
1. Is the slope coefficient similar to the one found for the average math score? Was it expected based on the graphical evidence?
1. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)
1. What is the predicted average verbal score when the class size is equal to 30 students?

---

# OLS variations / restrictions

* All are described [in the book](https://michelefioretti.github.io/ScPoEconometrics/linreg.html#OLS). Optional 🤓.

* There is an app for each of them:

<br>
<br>

type | App
-------- | --------
No Intercept, No regressors | `launchApp('reg_constrained')`
Centered Regression | `launchApp('demeaned_reg')`
Standardized Regression | `launchApp('reg_standardized')`

---

# Predictions and Residuals: Properties

.pull-left[

* __The average of `\(\widehat{y}_i\)` is equal to `\(\bar{y}\)`.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N \widehat{y}_i &= \frac{1}{N} \sum_{i=1}^N (b_0 + b_1 x_i) \\ &= b_0 + b_1 \bar{x} = \bar{y} \end{align}$$`

* __The average (or sum) of errors is 0.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N e_i &= \frac{1}{N} \sum_{i=1}^N (y_i - \widehat y_i) \\ &= \bar{y} - \frac{1}{N} \sum_{i=1}^N \widehat{y}_i \\ &= 0 \end{align}$$`
]

--

.pull-right[

* __Regressor and errors are uncorrelated (by construction).__
`$$Cov(x_i, e_i) = 0$$`

* __Prediction and errors are uncorrelated.__
`$$\begin{align} Cov(\widehat y_i, e_i) &= Cov(b_0 + b_1x_i, e_i) \\ &= b_1Cov(x_i,e_i) \\ &= 0 \end{align}$$`

This follows since `\(Cov(a + bx, y) = bCov(x,y)\)`.
]

---

# Linearity Assumption: Visualize your Data!

* It's important to keep in mind that covariance, correlation and simple OLS regression only measure **linear relationships** between two variables.

* Two datasets with *identical* correlations and regression lines could look *vastly* different.

--

* Is that even possible?

<img src="https://media.giphy.com/media/5aLrlDiJPMPFS/giphy.gif" height = "400" align = "middle" />

---

# Linearity Assumption: Anscombe

* Francis Anscombe (1973) came up with 4 datasets with identical summary statistics. But look!

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-33-1.svg" style="display: block; margin: auto;" />
]

--

.right-thin[
<br>
<br>

<table class="table table-striped" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> dataset </th>
   <th style="text-align:right;"> cov </th>
   <th style="text-align:right;"> var(y) </th>
   <th style="text-align:right;"> var(x) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 5.501 </td>
   <td style="text-align:right;"> 4.127 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 5.500 </td>
   <td style="text-align:right;"> 4.128 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5.497 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5.499 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
</tbody>
</table>
]

---

# Nonlinear Relationships in Data?

.pull-left[

* We can accommodate non-linear relationships in regressions.

* Just add a *higher order* term like this:
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i $$

* This is __multiple regression__ (in 2 weeks!)
]

--

.pull-right[

* For example, suppose we had this data and fit the previous regression model:

<img src="chapter3_files/figure-html/non-line-cars-ols2-1.svg" style="display: block; margin: auto;" />
]

---

# Analysis of Variance (ANOVA)

* Remember that `\(y_i = \widehat{y}_i + e_i\)`.
* We have the following decomposition:
`$$\begin{align} Var(y) &= Var(\widehat{y} + e)\\&= Var(\widehat{y}) + Var(e) + 2 Cov(\widehat{y},e)\\&= Var(\widehat{y}) + Var(e)\end{align}$$`

* Because:
  * `\(Var(x+y) = Var(x) + Var(y) + 2Cov(x,y)\)`
  * `\(Cov(\hat{y},e)=0\)`

* __Total variation (SST) = Model explained (SSE) + Unexplained (SSR)__
  * SST = sum of squares total
  * SSE = sum of squares explained (by the model)
  * SSR = sum of squares residual (unexplained)

---

# Goodness of Fit

* The __`\(R^2\)`__ measures how well the __model fits the data__.

--

$$ R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] $$

--

* `\(R^2\)` close to `\(1\)` indicates a __very ***high*** explanatory power__ of the model.

* `\(R^2\)` close to `\(0\)` indicates a __very ***low*** explanatory power__ of the model.

--

* *Interpretation:* an `\(R^2\)` of 0.5, for example, means that the variation in `\(x\)` "explains" 50% of the variation in `\(y\)`.

--

* ⚠️ A low `\(R^2\)` does __NOT__ mean it's a useless model! Remember that econometrics is interested in causal mechanisms, not prediction!

---

class: inverse

# Task 4: `\(R^2\)` and goodness of fit (10 minutes)

1. Regress `avgmath_cs` (dependent variable) on `classize` (independent variable). Assign the result to an object called `math_reg`.
1. Pass `math_reg` to the `summary()` function. What is the (multiple) `\(R^2\)` for this regression? How can you interpret it?
1. Compute the squared correlation between `classize` and `avgmath_cs`. What does this tell you about the relationship between `\(R^2\)` and the correlation in a regression with only one regressor?
1. Install and load the `broom` package. Pass `math_reg` to the `broom::augment()` function and assign the result to a new object. Use the variance in `avgmath_cs` (SST) and the variance in `.fitted` (the predicted values; SSE) to find the `\(R^2\)` using the formula in the previous slide.
1. Repeat steps 1 and 2 for `avgverb_cs`. For which exam does the variance in class size explain more of the variance in students' scores?

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                                  |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]               | michele.fioretti@sciencespo.fr    |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides                            |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>]        | Book                              |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                             | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                               | @ScPoEcon                         |
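
---

# Appendix: OLS Coefficients and `\(R^2\)` by Hand

A minimal sketch (it assumes the `grades_avg_cs` data frame from Task 2 is still in memory): it checks the slope and intercept formulas against `lm()` and computes the `\(R^2\)` directly from the fitted values.

```r
# slope and intercept from the formulas: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar
b1 <- cov(grades_avg_cs$classize, grades_avg_cs$avgmath_cs) /
  var(grades_avg_cs$classize)
b0 <- mean(grades_avg_cs$avgmath_cs) - b1 * mean(grades_avg_cs$classize)
c(intercept = b0, slope = b1)

# the same coefficients via lm()
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)
coef(math_reg)

# R^2 two ways: squared correlation, and SSE / SST = var(fitted) / var(y)
cor(grades_avg_cs$classize, grades_avg_cs$avgmath_cs)^2
var(fitted(math_reg)) / var(grades_avg_cs$avgmath_cs)
```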