class: center, middle, inverse, title-slide

# Applied Data Analysis for Public Policy Studies
## Simple Linear Regression
### Michele Fioretti
### SciencesPo Paris
2022-08-29

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Recap from past weeks

* `R` basics, importing data

* Exploratory data analysis:
  * Summary statistics: *mean*, *median*, *variance*, *standard deviation*
  * Data visualization: base `R` and `ggplot2`
  * Data wrangling: `dplyr`

--

## Today - Real 'metrics finally ✌️

* Introduction to the __Simple Linear Regression Model__ and __Ordinary Least Squares__ *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

# Class size and student performance

* What policies *lead* to improved student learning?

* Class size reduction has been at the heart of policy debates for *decades*.

--

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* The data consist of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were scaled from 1 to 100.

---

class: inverse

# Task 1: Getting to know the data (7 minutes)

1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1). You need to find the function that enables importing *.dta* files. (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/).)
1. Describe the dataset:
    * What is the unit of observation, i.e. what does each row correspond to?
    * How many observations are there?
    * What variables do we have? `View` the dataset to see what the variables correspond to.
    * What do the variables `avgmath` and `avgverb` correspond to?
    * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) into `skim()`.
1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?
1. Compute the correlation between class size and the math and verbal scores. Is the relationship positive/negative, strong/weak?

---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

--

* Hard to see much because all the data points are aligned vertically. Let's add a bit of `jitter` to disperse the data slightly.

---

# Class size and student performance: `jitter` scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]

--

* A somewhat positive association, as suggested by the correlations. Let's compute the average score by class size to see things more clearly!

---

class: inverse

# Task 2: Binned scatter plot (7 minutes)
1. Create a new dataset (`grades_avg_cs`) where math and verbal scores are averaged by class size. Let's call these new average scores `avgmath_cs` and `avgverb_cs`. *N.B.: the "raw" scores are already averages at the class level. Here we average these averages by class size.*
1. Redo the same plots as before. Is the sign of the relationship more apparent?
1. Compute the correlation between class size and the new aggregated math and verbal score variables. Why is the (linear) association so much stronger?

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter3_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores, and for visual simplicity we'll zoom in.

<img src="chapter3_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

--

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size 😩
]

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* **That** one?

* Slightly better! It has a **slope** and an **intercept** 😐

* We need a rule to decide!
]

---

# Simple Linear Regression

Let's formalise a bit what we have been doing so far.

* We are interested in the relationship between two variables:

--

  * an __outcome variable__ (also called __dependent variable__): *average mathematics score* `\((y)\)`

--

  * an __explanatory variable__ (also called __independent variable__ or __regressor__): *class size* `\((x)\)`

--

* For each class `\(i\)` we observe both `\(x_i\)` and `\(y_i\)`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

--

* We summarise this relationship with a line (for now). The equation for such a line with an intercept `\(b_0\)` and a slope `\(b_1\)` is:

$$ \widehat{y}_i = b\_0 + b\_1 x\_i $$

--

* `\(\widehat{y}_i\)` is our *prediction* for `\(y\)` at observation `\(i\)` `\((y_i)\)` given our model (i.e. the line).

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

--

<img src="chapter3_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

<img src="chapter3_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Error term

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

* However, since in most cases the *dependent variable* `\((y)\)` is not ***only*** explained by the chosen *independent variable* `\((x)\)`, `\(\widehat{y}_i \neq y_i\)`, i.e. we make an __error__. This __error__ is called the __error term__.
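* To make this concrete, here is a minimal `R` sketch (the values of `b0` and `b1` below are made up for illustration, and the sketch assumes the `grades_avg_cs` data from Task 2 is loaded) that computes a prediction and the corresponding error for every class size:

```r
library(dplyr)

b0 <- 60   # hypothetical intercept, chosen by eye
b1 <- 0.2  # hypothetical slope, chosen by eye

grades_avg_cs %>%
  mutate(
    avgmath_hat = b0 + b1 * classize,       # prediction for each class size
    error       = avgmath_cs - avgmath_hat  # error: actual minus predicted score
  ) %>%
  select(classize, avgmath_cs, avgmath_hat, error) %>%
  head()
```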
--

* At point `\((x_i,y_i)\)`, we denote this error by `\(e_i\)`.

--

* The *actual data* `\((x_i,y_i)\)` can thus be written as *prediction + error*:

$$ y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i $$

--

* **Goals**
  1. Find the values of `\(b_0\)` and `\(b_1\)` that **make the errors as small as possible**.
  1. Check whether these values **give a reasonable description of the data**.

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter3_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" />

---

class: inverse

# App Time! (5 minutes)

Intuitively, one might want to simply minimize the absolute value of the sum of all the errors, `\(\mid \sum^N_{i=1}{e_i} \mid\)`, that is, one might want the sum of errors to be as close to 0 as possible.

Let's try to find the best line by minimizing the absolute value of the sum of errors. The line *won't* turn green, so don't spend all day waiting for it.

```r
library(ScPoApps) # load our library
launchApp('reg_simple_arrows')
aboutApp('reg_simple_arrows') # explainer about app
```

---

# Minimizing the Absolute Value of the Sum of Errors

<img src="chapter3_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

---

# Minimizing the Absolute Value of the Sum of Errors

<img src="chapter3_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" />

---

# Minimizing the Absolute Value of the Sum of Errors

.pull-left[
<img src="chapter3_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<br>
<br>

* This line minimizes the absolute value of the sum of errors since the data points are symmetric around `\(y = 5\)`.

* Yet it clearly does not fit the data well!

* Note also that many other lines would yield a sum of errors of 0, since the data are symmetric. A unique solution is not guaranteed!
]

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so let's consider **squared residuals**
`$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

--

<img src="chapter3_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so let's consider **squared residuals**
`$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

<img src="chapter3_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" />

---

class: inverse

# App Time! #2 (3 minutes)

Let's minimize some squared errors!
```r
launchApp('reg_simple')
aboutApp('reg_simple')
```

---

# **O**rdinary **L**east **S**quares (OLS): Coefficient Formulas

* **OLS**: an *estimation* method that consists in minimizing the sum of squared residuals.

* It yields __unique__ solutions to this minimization problem.

* So what are the formulas for `\(b_0\)` (intercept) and `\(b_1\)` (slope)?

--

* In our single independent variable case:

> ### __Slope: `\(b_1^{OLS} = \frac{cov(x,y)}{var(x)}\)` `\(\hspace{2cm}\)` Intercept: `\(b_0^{OLS} = \bar{y} - b_1\bar{x}\)`__

--

* ❗ You should know these formulas, especially the one for `\(b_1^{OLS}\)` ❗

--

* These formulas do not appear by magic. They can be derived by solving the minimisation of the sum of squared errors. The maths can be found [here](https://www.youtube.com/watch?v=Hi5EJnBHFB4) for those who are interested.

---

class: inverse

# App Time! #3 (3 minutes)

How does OLS actually perform the minimization? Some intuition without maths.

```r
launchApp('SSR_cone')
aboutApp('SSR_cone') # after
```

---

# **O**rdinary **L**east **S**quares (OLS): Interpretation

For now assume both the dependent variable `\((y)\)` and the independent variable `\((x)\)` are numeric.

--

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if `\(x = 0\)`.**

--

> Slope `\((b_1)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x\)`.**

--

* ⚠️ Note that we use the term *associated*, **clearly avoiding interpreting `\(b_1\)` as the causal impact of `\(x\)` on `\(y\)`**. To make such a claim, we need some specific conditions to be met. (Next week!)

--

* Also notice that the units of `\(x\)` will matter for the interpretation (and magnitude!) of `\(b_1\)`.

---

# OLS with `R`

* In `R`, OLS regressions are estimated using the `lm` function.

* This is how it works:

```r
lm(formula = dependent variable ~ independent variable,
   data = data.frame containing the data)
```

--

## Class size and student performance

Let's estimate the following model by OLS: `\(\textrm{avgmath_cs}_i = b_0 + b_1 \textrm{classize}_i + e_i\)`

```r
# OLS regression of average maths score on class size
lm(avgmath_cs ~ classize, grades_avg_cs)
```

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

---

# **O**rdinary **L**east **S**quares (OLS): Prediction

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

--

This implies (abstracting from the `\(_i\)` subscript for simplicity):

$$
`\begin{aligned}
\widehat y &= b_0 + b_1 x \\
\widehat {\text{avgmath_cs}} &= b_0 + b_1 \cdot \text{classize} \\
\widehat {\text{avgmath_cs}} &= 61.11 + 0.19 \cdot \text{classize}
\end{aligned}`
$$

--

What's the predicted average score for a class of 26 students? (Using the *exact* coefficients.)

$$
`\begin{aligned}
\widehat {\text{avgmath_cs}} &= 61.11 + 0.19 \cdot 26 \\
\widehat {\text{avgmath_cs}} &= 66.08
\end{aligned}`
$$

---

class: inverse

# Task 3: OLS Regression (7 minutes)

1. Compute the OLS coefficients using the formulas on slide 28.
1. Regress average verbal score (dependent variable) on class size (independent variable).
1. Is the slope coefficient similar to the one found for the average math score? Was it expected based on the graphical evidence?
1. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)
1. What is the predicted average verbal score when the class size is equal to 30 students?

---

# OLS variations / restrictions

* All are described [in the book](https://michelefioretti.github.io/ScPoEconometrics/linreg.html#OLS). Optional 🤓.

* There is an app for each of them:

<br>
<br>

type | App
-------- | --------
No Intercept, No regressors | `launchApp('reg_constrained')`
Centered Regression | `launchApp('demeaned_reg')`
Standardized Regression | `launchApp('reg_standardized')`

---

# Predictions and Residuals: Properties

.pull-left[

* __The average of `\(\widehat{y}_i\)` is equal to `\(\bar{y}\)`.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N \widehat{y}_i &= \frac{1}{N} \sum_{i=1}^N (b_0 + b_1 x_i) \\ &= b_0 + b_1 \bar{x} = \bar{y} \end{align}$$`

* __The average (or sum) of errors is 0.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N e_i &= \frac{1}{N} \sum_{i=1}^N (y_i - \widehat y_i) \\ &= \bar{y} - \frac{1}{N} \sum_{i=1}^N \widehat{y}_i \\ &= 0 \end{align}$$`
]

--

.pull-right[

* __Regressor and errors are uncorrelated (by construction).__
`$$Cov(x_i, e_i) = 0$$`

* __Prediction and errors are uncorrelated.__
`$$\begin{align} Cov(\widehat y_i, e_i) &= Cov(b_0 + b_1x_i, e_i) \\ &= b_1Cov(x_i,e_i) \\ &= 0 \end{align}$$`

This follows since `\(Cov(a + bx, y) = bCov(x,y)\)`.
]

---

# Linearity Assumption: Visualize your Data!

* It's important to keep in mind that covariance, correlation and simple OLS regression only measure **linear relationships** between two variables.

* Two datasets with *identical* correlations and regression lines could look *vastly* different.

--

* Is that even possible?

<img src="https://media.giphy.com/media/5aLrlDiJPMPFS/giphy.gif" height = "400" align = "middle" />

---

# Linearity Assumption: Anscombe

* Francis Anscombe (1973) came up with 4 datasets with identical summary statistics. But look!

.left-wide[
<img src="chapter3_files/figure-html/unnamed-chunk-33-1.svg" style="display: block; margin: auto;" />
]

--

.right-thin[
<br>
<br>

<table class="table table-striped" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> dataset </th>
   <th style="text-align:right;"> cov </th>
   <th style="text-align:right;"> var(y) </th>
   <th style="text-align:right;"> var(x) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 5.501 </td>
   <td style="text-align:right;"> 4.127 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 5.500 </td>
   <td style="text-align:right;"> 4.128 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5.497 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5.499 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
</tbody>
</table>
]

---

# Nonlinear Relationships in Data?

.pull-left[

* We can accommodate non-linear relationships in regressions.

* Just add a *higher order* term like this:
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i $$

* This is __multiple regression__ (in 2 weeks!)
]

--

.pull-right[

* For example, suppose we had this data and fit the previous regression model:

<img src="chapter3_files/figure-html/non-line-cars-ols2-1.svg" style="display: block; margin: auto;" />
]

---

# Analysis of Variance (ANOVA)

* Remember that `\(y_i = \widehat{y}_i + e_i\)`.
* We have the following decomposition:
`$$\begin{align} Var(y) &= Var(\widehat{y} + e)\\&= Var(\widehat{y}) + Var(e) + 2 Cov(\widehat{y},e)\\&= Var(\widehat{y}) + Var(e)\end{align}$$`

* Because:
  * `\(Var(x+y) = Var(x) + Var(y) + 2Cov(x,y)\)`
  * `\(Cov(\hat{y},e)=0\)`

* __Total variation (SST) = Model explained (SSE) + Unexplained (SSR)__
  * SST = sum of squares total
  * SSE = sum of squares explained (by the model)
  * SSR = sum of squares residual (unexplained)

---

# Goodness of Fit

* The __`\(R^2\)`__ measures how well the __model fits the data__.

--

$$ R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] $$

--

* `\(R^2\)` close to `\(1\)` indicates a __very ***high*** explanatory power__ of the model.

* `\(R^2\)` close to `\(0\)` indicates a __very ***low*** explanatory power__ of the model.

--

* *Interpretation:* an `\(R^2\)` of 0.5, for example, means that the variation in `\(x\)` "explains" 50% of the variation in `\(y\)`.

--

* ⚠️ A low `\(R^2\)` does __NOT__ mean it's a useless model! Remember that econometrics is interested in causal mechanisms, not prediction!

---

class: inverse

# Task 4: `\(R^2\)` and goodness of fit (10 minutes)

1. Regress `avgmath_cs` (dependent variable) on `classize` (independent variable). Assign the result to an object called `math_reg`.
1. Pass `math_reg` to the `summary()` function. What is the (multiple) `\(R^2\)` for this regression? How can you interpret it?
1. Compute the squared correlation between `classize` and `avgmath_cs`. What does this tell you about the relationship between `\(R^2\)` and the correlation in a regression with only one regressor?
1. Install and load the `broom` package. Pass `math_reg` to the `broom::augment()` function and assign the result to a new object. Use the variance in `avgmath_cs` (SST) and the variance in `.fitted` (the predicted values; SSE) to find the `\(R^2\)` using the formula in the previous slide.
1. Repeat steps 1 and 2 for `avgverb_cs`. For which exam does the variance in class size explain more of the variance in students' scores?

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                                  |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]               | michele.fioretti@sciencespo.fr    |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides                            |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>]        | Book                              |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                             | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                               | @ScPoEcon                         |
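
---

# Appendix: OLS Coefficients and `\(R^2\)` by Hand

A minimal sketch (it assumes the `grades_avg_cs` data frame from Task 2 is still in memory): it checks the slope and intercept formulas against `lm()` and computes the `\(R^2\)` directly from the fitted values.

```r
# slope and intercept from the formulas: b1 = cov(x, y) / var(x), b0 = ybar - b1 * xbar
b1 <- cov(grades_avg_cs$classize, grades_avg_cs$avgmath_cs) /
  var(grades_avg_cs$classize)
b0 <- mean(grades_avg_cs$avgmath_cs) - b1 * mean(grades_avg_cs$classize)
c(intercept = b0, slope = b1)

# the same coefficients via lm()
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)
coef(math_reg)

# R^2 two ways: squared correlation, and SSE / SST = var(fitted) / var(y)
cor(grades_avg_cs$classize, grades_avg_cs$avgmath_cs)^2
var(fitted(math_reg)) / var(grades_avg_cs$avgmath_cs)
```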