class: center, middle, inverse, title-slide

.title[
# ScPoEconometrics
]
.subtitle[
## Simple Linear Regression
]
.author[
### Gustave Kenedi, Mylène Feuillade, Florian Oswald and Junnan He
]
.date[
### SciencesPo Paris 2022-09-20
]

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Today - Real 'metrics finally ✌️

* Introduction to the ***Simple Linear Regression Model*** and ***Ordinary Least Squares (OLS)*** *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

# Class size and student performance

* What policies *lead* to improved student learning?

* Class size reduction has been at the heart of policy debates for *decades*.

--

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* The dataset consists of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were scaled from 1 to 100.

---

class: inverse

# Task 1: Getting to know the data
1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) as `grades`. *Hint: use the `read_dta` function from the `haven` package to import the file, which is in .dta format.* (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/).)

1. Describe the dataset:
  * What is the unit of observation, i.e. what does each row correspond to?
  * How many observations are there?
  * View the dataset. What variables do we have? What do the variables `avgmath` and `avgverb` correspond to?
  * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. (*Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) into `skim()`.*)

1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?

1. Compute the correlation between class size and math and verbal scores. Is the relationship positive/negative, strong/weak?
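---

# Task 1: One possible approach (sketch)

Here is a minimal sketch of how Task 1 could be tackled in `R`. It assumes the `haven`, `dplyr` and `skimr` packages are installed; if `read_dta` cannot read directly from the URL on your machine, download the file first and pass the local path instead.

```r
library(haven)  # read_dta() imports Stata .dta files
library(dplyr)  # data manipulation verbs and the %>% pipe
library(skimr)  # skim() for quick summary statistics

# import the Stata file
grades <- read_dta("https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")

# describe the dataset
nrow(grades)  # number of observations

grades %>%
  select(classize, avgmath, avgverb) %>%
  skim()

# correlation between class size and test scores
cor(grades$classize, grades$avgmath)
cor(grades$classize, grades$avgverb)
```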
---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

--

* Somewhat positive association, as suggested by the correlations. Let's compute the average score by class size to see things more clearly!

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores, and for visual simplicity we'll zoom in.

<img src="chapter_slr_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

--

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size 😩
]

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" />
]

.right-thin[
<br>

* **That** one?

* Slightly better! It has a **slope** and an **intercept** 😐

* We need a rule to decide!
]

---

# Simple Linear Regression

Let's formalise what we have been doing so far.

* We are interested in the relationship between two variables:

--

  * an __outcome variable__ (also called __dependent variable__): *average mathematics score* `\((y)\)`

--

  * an __explanatory variable__ (also called __independent variable__ or __regressor__): *class size* `\((x)\)`

--

* For each class `\(i\)` we observe both `\(x_i\)` and `\(y_i\)`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

--

* We summarise this relationship with a line (for now). The equation for such a line with an intercept `\(b_0\)` and a slope `\(b_1\)` is:

$$ \widehat{y}_i = b\_0 + b\_1 x\_i $$

--

* `\(\widehat{y}_i\)` is our *prediction* for `\(y\)` at observation `\(i\)` `\((y_i)\)` given our model (i.e. the line).

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

<img src="chapter_slr_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

* However, since in most cases the *dependent variable* `\((y)\)` is not *only* explained by the chosen *independent variable* `\((x)\)`, we have `\(\widehat{y}_i \neq y_i\)`, i.e. we make an __error__. This __error__ is called the __residual__.

--

* At point `\((x_i,y_i)\)`, we denote this residual `\(e_i\)`.

--

* The *actual data* `\((x_i,y_i)\)` can thus be written as *prediction + residual*:

$$ y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i $$

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-19-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-thin[
<br>
<br>
<p style="text-align: center; font-weight: bold; font-size: 35px; color: #d90502;">Which "minimisation" criterion should (can) be used?</p>
]

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

--

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i = 1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

<img src="chapter_slr_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/reg_simple/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/SSR_cone/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>
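---

# **O**rdinary **L**east **S**quares (OLS) Estimation: The objective in code

To see that the OLS criterion is something we can actually compute, here is a minimal sketch. It assumes the `grades` data from Task 1; the function name `ssr` is ours, not a built-in.

```r
# sum of squared residuals implied by a candidate line (b0, b1)
ssr <- function(b0, b1, x = grades$classize, y = grades$avgmath) {
  e <- y - (b0 + b1 * x)  # residuals: actual minus predicted
  sum(e^2)                # the quantity OLS minimises
}

ssr(b0 = 65, b1 = 0)    # a flat line
ssr(b0 = 61, b1 = 0.2)  # a sloped line: does it fit better?

# trace the objective over a grid of slopes, holding b0 fixed
slopes <- seq(-1, 1, by = 0.01)
plot(slopes, sapply(slopes, function(b) ssr(b0 = 61, b1 = b)), type = "l")
```

OLS simply picks the `\((b_0, b_1)\)` pair at the bottom of this "cone", as the app on the previous slide illustrates.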
---

# **O**rdinary **L**east **S**quares (OLS): Coefficient Formulas

* **OLS**: *estimation* method that consists in minimising the sum of squared residuals.

* It yields a __unique__ solution to this minimisation problem.

* So what are the formulas for `\(b_0\)` (intercept) and `\(b_1\)` (slope)?

--

* In our single independent variable case:

> ### __Slope: `\(b_1^{OLS} = \frac{cov(x,y)}{var(x)}\)` `\(\hspace{2cm}\)` Intercept: `\(b_0^{OLS} = \bar{y} - b_1\bar{x}\)`__

--

* These formulas do not appear by magic. They are found by solving the minimisation of the sum of squared errors. The maths can be found [here](https://www.youtube.com/watch?v=Hi5EJnBHFB4) for those who are interested.

---

# **O**rdinary **L**east **S**quares (OLS): Interpretation

For now, assume both the dependent variable `\((y)\)` and the independent variable `\((x)\)` are numeric.

--

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if `\(x = 0\)`.**

--

> Slope `\((b_1)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x\)`.**

--

* ⚠️ Note that we use the term *associated*, **clearly avoiding interpreting `\(b_1\)` as the causal impact of `\(x\)` on `\(y\)`**. To make such a claim, we need some specific conditions to be met. (Next week!)

--

* Also notice that the units of `\(x\)` will matter for the interpretation (and magnitude!) of `\(b_1\)`.

--

* **You need to be explicit about what the unit of `\(x\)` is!**

---

# OLS with `R`

* In `R`, OLS regressions are estimated using the `lm` function.

* This is how it works:

```r
lm(formula = dependent variable ~ independent variable,
   data = data.frame containing the data)
```

--

## Class size and student performance

Let's estimate the following model by OLS: `\(\textrm{average math score}_i = b_0 + b_1 \textrm{class size}_i + e_i\)`

.pull-left[
```r
# OLS regression of average maths score on class size
lm(avgmath_cs ~ classize, grades_avg_cs)
```
]

.pull-right[
```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```
]

---

# **O**rdinary **L**east **S**quares (OLS): Prediction

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

--

This implies (abstracting from the `\(_i\)` subscript for simplicity):

$$
`\begin{aligned}
\widehat y &= b_0 + b_1 x \\
\widehat {\text{average math score}} &= b_0 + b_1 \cdot \text{class size} \\
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot \text{class size}
\end{aligned}`
$$

--

What's the predicted average score for a class of 26 students? (Using the *exact* coefficients.)

$$
`\begin{aligned}
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot 26 \\
\widehat {\text{average math score}} &= 66.08
\end{aligned}`
$$
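---

# **O**rdinary **L**east **S**quares (OLS): Prediction in `R`

Rather than plugging numbers in by hand, we can let `R` do the arithmetic. A minimal sketch, assuming `grades_avg_cs` contains the class-size-level averages (it is constructed in Task 2 below):

```r
# fit the model and store the result
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# prediction by hand, using the exact coefficients
coef(math_reg)[1] + coef(math_reg)[2] * 26

# the same prediction with predict()
predict(math_reg, newdata = data.frame(classize = 26))
# both give about 66.08, as on the previous slide
```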
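---

# **O**rdinary **L**east **S**quares (OLS): Coefficient formulas by hand

As a warm-up for the next task, here is a minimal sketch applying the slope and intercept formulas to the mathematics scores, again assuming `grades_avg_cs` exists. It should reproduce the `lm()` coefficients from before.

```r
# slope: b1 = cov(x, y) / var(x)
b1 <- cov(grades_avg_cs$classize, grades_avg_cs$avgmath_cs) /
  var(grades_avg_cs$classize)

# intercept: b0 = mean(y) - b1 * mean(x)
b0 <- mean(grades_avg_cs$avgmath_cs) - b1 * mean(grades_avg_cs$classize)

c(b0 = b0, b1 = b1)  # compare with (Intercept) 61.1092 and classize 0.1913
```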
---

class: inverse

# Task 2: OLS Regression

Run the following code to aggregate the data at the class-size level:

```r
grades_avg_cs <- grades %>%
  group_by(classize) %>%
  summarise(avgmath_cs = mean(avgmath),
            avgverb_cs = mean(avgverb))
```

1. Regress average verbal score (dependent variable) on class size (independent variable). Interpret the coefficients.

1. Compute the OLS coefficients `\(b_0\)` and `\(b_1\)` of the previous regression using the formulas on the *Coefficient Formulas* slide. (*Hint:* you need to use the `cov`, `var`, and `mean` functions.)

1. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)

1. What is the predicted average verbal score when the class size is equal to 30 students?

---

# Predictions and Residuals: Properties

.pull-left[

* __The average of `\(\widehat{y}_i\)` is equal to `\(\bar{y}\)`.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N \widehat{y}_i &= \frac{1}{N} \sum_{i=1}^N (b_0 + b_1 x_i) \\ &= b_0 + b_1 \bar{x} = \bar{y} \end{align}$$`
(The last equality uses the intercept formula `\(b_0 = \bar{y} - b_1\bar{x}\)`.)

* __The average (or sum) of residuals is 0.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N e_i &= \frac{1}{N} \sum_{i=1}^N (y_i - \widehat y_i) \\ &= \bar{y} - \frac{1}{N} \sum_{i=1}^N \widehat{y}_i \\ &= \bar{y} - \bar{y} = 0 \end{align}$$`

]

--

.pull-right[

* __Regressor and residuals are uncorrelated (by construction).__

`$$Cov(x_i, e_i) = 0$$`

* __Prediction and residuals are uncorrelated.__
`$$\begin{align} Cov(\widehat y_i, e_i) &= Cov(b_0 + b_1x_i, e_i) \\ &= b_1Cov(x_i,e_i) \\ &= 0 \end{align}$$`

Since `\(Cov(a + bx, y) = bCov(x,y)\)`.

]
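---

# Predictions and Residuals: Checking the properties in `R`

These properties are easy to verify numerically. A minimal sketch, assuming the `math_reg` regression on `grades_avg_cs` from above:

```r
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# average prediction equals the average outcome
mean(fitted(math_reg))
mean(grades_avg_cs$avgmath_cs)

# residuals average to 0 (up to floating-point error)
mean(resid(math_reg))

# regressor and residuals, prediction and residuals: uncorrelated
cov(grades_avg_cs$classize, resid(math_reg))
cov(fitted(math_reg), resid(math_reg))
```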
---

# Linearity Assumption: Visualize your Data!

* It's important to keep in mind that covariance, correlation and simple OLS regression only measure **linear relationships** between two variables.

* Two datasets with *identical* correlations and regression lines could look *vastly* different.

--

* Is that even possible?

<img src="https://media.giphy.com/media/5aLrlDiJPMPFS/giphy.gif" height = "400" align = "middle" />

---

# Linearity Assumption: Anscombe

* Francis Anscombe (1973) came up with 4 datasets with identical summary statistics. But look!

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />
]

--

.right-thin[
</br>
</br>
<table class="table table-striped" style="font-size: 20px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> dataset </th>
<th style="text-align:right;"> cov </th>
<th style="text-align:right;"> var(y) </th>
<th style="text-align:right;"> var(x) </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5.501 </td> <td style="text-align:right;"> 4.127 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 5.500 </td> <td style="text-align:right;"> 4.128 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 5.497 </td> <td style="text-align:right;"> 4.123 </td> <td style="text-align:right;"> 11 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5.499 </td> <td style="text-align:right;"> 4.123 </td> <td style="text-align:right;"> 11 </td> </tr>
</tbody>
</table>
]

---

# Nonlinear Relationships in Data?

.pull-left[

* We can accommodate nonlinear relationships in regressions.

* Just add a *higher-order* term like this:
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i $$

* This is __multiple regression__ (in 2 weeks!)
]

--

.pull-right[

* For example, suppose we had this data and fit the previous regression model:

<img src="chapter_slr_files/figure-html/non-line-cars-ols2-1.svg" style="display: block; margin: auto;" />
]

---

# Analysis of Variance

* Remember that `\(y_i = \widehat{y}_i + e_i\)`.

* We have the following decomposition:
`$$\begin{align} Var(y) &= Var(\widehat{y} + e)\\&= Var(\widehat{y}) + Var(e) + 2 Cov(\widehat{y},e)\\&= Var(\widehat{y}) + Var(e)\end{align}$$`

* Because:
  * `\(Var(x+y) = Var(x) + Var(y) + 2Cov(x,y)\)`
  * `\(Cov(\widehat{y},e)=0\)`

* __Total variation (SST) = Explained variation (SSE) + Unexplained variation (SSR)__

---

# Goodness of Fit

* The __ `\(R^2\)` __ measures how well the __model fits the data__.

--

$$ R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] $$

--

* `\(R^2\)` close to `\(1\)` indicates a __very ***high*** explanatory power__ of the model.

* `\(R^2\)` close to `\(0\)` indicates a __very ***low*** explanatory power__ of the model.

--

* *Interpretation:* an `\(R^2\)` of 0.5, for example, means that the variation in `\(x\)` "explains" 50% of the variation in `\(y\)`.

--

* ⚠️ A low `\(R^2\)` does __NOT__ mean the model is useless! Remember that econometrics is interested in causal mechanisms, not prediction!

--

* ⚠️ The `\(R^2\)` is __NOT__ an indicator of whether a relationship is causal!
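---

# Goodness of Fit: Checking the decomposition in `R`

The variance decomposition and the `\(R^2\)` can be verified directly. A minimal sketch, assuming the `math_reg` regression from above:

```r
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)

# Var(y) = Var(y_hat) + Var(e)
var(grades_avg_cs$avgmath_cs)
var(fitted(math_reg)) + var(resid(math_reg))

# R^2 as the share of explained variance...
var(fitted(math_reg)) / var(grades_avg_cs$avgmath_cs)

# ...which matches the "Multiple R-squared" reported by summary()
summary(math_reg)$r.squared
```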
---

class: inverse

# Task 3: `\(R^2\)` and goodness of fit

1. Regress `avgmath_cs` on `classize`. Assign it to an object called `math_reg`.

1. Pass `math_reg` to the `summary()` function. What is the (multiple) `\(R^2\)` for this regression? How can you interpret it?

1. Compute the squared correlation between `classize` and `avgmath_cs`. What does this tell you about the relationship between `\(R^2\)` and the correlation in a regression with only one regressor?

1. Repeat steps 1 and 2 for `avgverb_cs`. For which exam does the variance in class size explain more of the variance in students' scores?

1. (Optional) Install and load the `broom` package. Pass `math_reg` to the `augment()` function and assign the result to a new object. Use the variance in `avgmath_cs` (SST) and the variance in `.fitted` (the predicted values; SSE) to find the `\(R^2\)` using the formula on the *Goodness of Fit* slide.

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it...

🚧 **How to summarise relationships between variables?** Simple linear regression... to be continued

❌ What is causality?

❌ What if we don't observe an entire population?

❌ Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                           |                                   |
| :-------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:florian.oswald@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]            | florian.oswald@sciencespo.fr      |
| <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>]   | Slides                            |
| <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>]           | Book                              |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                        | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                          | @ScPoEcon                         |