class: center, middle, inverse, title-slide

.title[
# Econometrics
]

.subtitle[
## Simple Linear Regression
]

.author[
### Florian Oswald
]

.date[
### UniTo ESOMAS 2025-11-05
]

---

layout: true

<div class="my-footer"><img src="../img/logo/unito-shield.png" style="height: 60px;"/></div>

---

# Today

- Real 'metrics finally ✌️

* Introduction to the ***Simple Linear Regression Model*** and ***Ordinary Least Squares (OLS)*** *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

# Class size and student performance

* What policies *lead* to improved student learning?

* Class size reduction has been at the heart of policy debates for *decades*.

--

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* Consists of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were scaled from 1-100.

---

class: inverse

# Task 1: Getting to know the data
⏳ 07:00
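*A minimal starter sketch for step 1 below, assuming the `haven`, `dplyr` and `skimr` packages are installed:*

``` r
library(haven)    # read_dta() imports Stata .dta files
library(dplyr)
library(skimr)

# download the .dta file and import it as `grades`
url <- "https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1"
download.file(url, destfile = "grade5.dta", mode = "wb")
grades <- read_dta("grade5.dta")
```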
1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) as `grades`. *Hint: Use the `read_dta` function from the `haven` library to import the file, which is in the .dta format.* (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/))

1. Describe the dataset:
    * What is the unit of observation, i.e. what does each row correspond to?
    * How many observations are there?
    * View the dataset. What variables do we have? What do the variables `avgmath` and `avgverb` correspond to?
    * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. (*Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) `skim()`.*)

1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?

1. Compute the correlation between class size and math and verbal scores. Is the relationship positive/negative, strong/weak?

---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

--

* Somewhat positive association, as suggested by the correlations. Let's compute the average score by class size to see things more clearly!

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores and, for visual simplicity, we'll zoom in

<img src="chapter_slr_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

--

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto auto auto 0;" />
]

--

.right-thin[
<br>

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size 😩
]

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" />
]

.right-thin[
<br>

* **That** one?

* Slightly better! Has a **slope** and an **intercept** 😐

* We need a rule to decide!
]

---

# Simple Linear Regression

Let's formalise a bit what we have been doing so far.

* We are interested in the relationship between two variables:

--

* an __outcome variable__ (also called __dependent variable__): *average mathematics score* `\((y)\)`

--

* an __explanatory variable__ (also called __independent variable__ or __regressor__): *class size* `\((x)\)`

--

* For each class `\(i\)` we observe both `\(x_i\)` and `\(y_i\)`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

--

* We summarise this relationship with a line (for now).
The equation for such a line with an intercept `\(b_0\)` and a slope `\(b_1\)` is:

$$ \widehat{y}_i = b\_0 + b\_1 x\_i $$

--

* `\(\widehat{y}_i\)` is our *prediction* for `\(y\)` at observation `\(i\)` `\((y_i)\)` given our model (i.e. the line).

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# What's A Line: A Refresher

<img src="chapter_slr_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

<img src="chapter_slr_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `\(\widehat{y}_i = y_i\)`.

* However, since in most cases the *dependent variable* `\((y)\)` is not *only* explained by the chosen *independent variable* `\((x)\)`, `\(\widehat{y}_i \neq y_i\)`, i.e. we make an __error__. This __error__ is called the __residual__.

--

* At point `\((x_i,y_i)\)`, we denote this residual by `\(e_i\)`.

--

* The *actual data* `\((x_i,y_i)\)` can thus be written as *prediction + residual*:

$$ y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i $$

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

<img src="chapter_slr_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# Simple Linear Regression: Graphically

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-19-1.svg" width="100%" style="display: block; margin: auto;" />
]

.right-thin[
<br>
<br>
<p style="text-align: center; font-weight: bold; font-size: 35px; color: #d90502;">Which "minimisation" criterion should (can) be used?</p>
]

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

--

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i=1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.

--

<img src="chapter_slr_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

* Errors of different sign `\((+/-)\)` cancel out, so we consider **squared residuals** `$$\forall i \in [1,N], e_i^2 = (y_i - \widehat y_i)^2 = (y_i - b_0 - b_1 x_i)^2$$`

* Choose `\((b_0,b_1)\)` such that the sum of squared residuals `\(\sum_{i=1}^N e_i^2 = e_1^2 + \dots + e_N^2\)` is **as small as possible**.
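To fix ideas, here is a minimal `R` sketch of the quantity being minimised (all numbers below are made up for illustration):

``` r
# sum of squared residuals implied by a candidate line (b0, b1) -- toy sketch
ssr <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

x <- c(20, 25, 30, 35, 40)     # made-up class sizes
y <- c(62, 64, 65, 68, 69)     # made-up average scores
ssr(b0 = 60, b1 = 0,    x, y)  # a flat line: large SSR
ssr(b0 = 55, b1 = 0.35, x, y)  # an upward-sloping line: much smaller SSR
```

OLS picks the `\((b_0,b_1)\)` pair for which this sum is smallest.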
<img src="chapter_slr_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/reg_simple/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>

---

# **O**rdinary **L**east **S**quares (OLS) Estimation

<iframe src="https://gustavek.shinyapps.io/SSR_cone/" width="100%" height="400px" data-external="1" style="border: none;"></iframe>

---

# **O**rdinary **L**east **S**quares (OLS): Coefficient Formulas

* **OLS**: *estimation* method consisting in minimising the sum of squared residuals.

* Yields __unique__ solutions to this minimisation problem.

* So what are the formulas for `\(b_0\)` (intercept) and `\(b_1\)` (slope)?

--

* In our single independent variable case:

> ### __Slope: `\(b_1^{OLS} = \frac{cov(x,y)}{var(x)}\)` `\(\hspace{2cm}\)` Intercept: `\(b_0^{OLS} = \bar{y} - b_1\bar{x}\)`__

--

* These formulas do not appear by magic. They can be found by solving the minimisation of squared errors. The maths can be found [here](https://www.youtube.com/watch?v=Hi5EJnBHFB4) for those who are interested.

---

# **O**rdinary **L**east **S**quares (OLS): Interpretation

For now assume both the dependent variable `\((y)\)` and the independent variable `\((x)\)` are numeric.

--

> Intercept `\((b_0)\)`: **The predicted value of `\(y\)` `\((\widehat{y})\)` if `\(x = 0\)`.**

--

> Slope `\((b_1)\)`: **The predicted change, on average, in the value of `\(y\)` *associated* with a one-unit increase in `\(x\)`.**

--

* ⚠️ Note that we use the term *associated*, **clearly avoiding interpreting `\(b_1\)` as the causal impact of `\(x\)` on `\(y\)`**. To make such a claim, we need some specific conditions to be met. (Next week!)

--

* Also notice that the units of `\(x\)` will matter for the interpretation (and magnitude!) of `\(b_1\)`.

--

* **You need to be explicit about what the unit of `\(x\)` is!**

---

# OLS with `R`

* In `R`, OLS regressions are estimated using the `lm` function.

* This is how it works:

``` r
lm(formula = dependent variable ~ independent variable, data = data.frame containing the data)
```

--

## Class size and student performance

Let's estimate the following model by OLS: `\(\textrm{average math score}_i = b_0 + b_1 \textrm{class size}_i + e_i\)`

.pull-left[

``` r
# OLS regression of average maths score on class size
lm(avgmath_cs ~ classize, grades_avg_cs)
```
]

.pull-right[

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```
]

---

# **O**rdinary **L**east **S**quares (OLS): Prediction

```
## 
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
## 
## Coefficients:
## (Intercept)     classize  
##     61.1092       0.1913
```

--

This implies (abstracting from the `\(_i\)` subscript for simplicity):

$$
`\begin{aligned}
\widehat y &= b_0 + b_1 x \\
\widehat {\text{average math score}} &= b_0 + b_1 \cdot \text{class size} \\
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot \text{class size}
\end{aligned}`
$$

--

What's the predicted average score for a class of 26 students? (Using the *exact* coefficients.)

$$
`\begin{aligned}
\widehat {\text{average math score}} &= 61.11 + 0.19 \cdot 26 \\
\widehat {\text{average math score}} &= 66.08
\end{aligned}`
$$

---

class: inverse

# Task 2: OLS Regression
⏳ 10:00
Run the following code to aggregate the data at the class size level:

``` r
grades_avg_cs <- grades %>%
  group_by(classize) %>%
  summarise(avgmath_cs = mean(avgmath),
            avgverb_cs = mean(avgverb))
```

1. Regress average verbal score (dependent variable) on class size (independent variable). Interpret the coefficients.

1. Compute the OLS coefficients `\(b_0\)` and `\(b_1\)` of the previous regression using the formulas from the *OLS: Coefficient Formulas* slide. (*Hint:* you need to use the `cov`, `var`, and `mean` functions.)

1. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)

1. What is the predicted average verbal score when the class size is equal to 30 students?

---

# Predictions and Residuals: Properties

.pull-left[

* __The average of `\(\widehat{y}_i\)` is equal to `\(\bar{y}\)`.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N \widehat{y}_i &= \frac{1}{N} \sum_{i=1}^N (b_0 + b_1 x_i) \\ &= b_0 + b_1 \bar{x} = \bar{y} \end{align}$$`

* __The average (or sum) of residuals is 0.__
`$$\begin{align} \frac{1}{N} \sum_{i=1}^N e_i &= \frac{1}{N} \sum_{i=1}^N (y_i - \widehat y_i) \\ &= \bar{y} - \frac{1}{N} \sum_{i=1}^N \widehat{y}_i \\ &= 0 \end{align}$$`

]

--

.pull-right[

* __Regressor and residuals are uncorrelated (by definition).__
`$$Cov(x_i, e_i) = 0$$`

* __Prediction and residuals are uncorrelated.__
`$$\begin{align} Cov(\widehat y_i, e_i) &= Cov(b_0 + b_1x_i, e_i) \\ &= b_1Cov(x_i,e_i) \\ &= 0 \end{align}$$`

Since `\(Cov(a + bx, y) = bCov(x,y)\)`.

]

---

# Linearity Assumption: Visualize your Data!

* It's important to keep in mind that covariance, correlation and simple OLS regression only measure **linear relationships** between two variables.

* Two datasets with *identical* correlations and regression lines could look *vastly* different.

--

* Is that even possible?

<img src="https://media.giphy.com/media/5aLrlDiJPMPFS/giphy.gif" height = "400" align = "middle" />

---

# Linearity Assumption: Anscombe

* Francis Anscombe (1973) came up with 4 datasets with identical stats. But look!

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />
]

--

.right-thin[
<br>
<br>
<table class="table table-striped" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> dataset </th>
   <th style="text-align:right;"> cov </th>
   <th style="text-align:right;"> var(y) </th>
   <th style="text-align:right;"> var(x) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 5.501 </td>
   <td style="text-align:right;"> 4.127 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 5.500 </td>
   <td style="text-align:right;"> 4.128 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 5.497 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 5.499 </td>
   <td style="text-align:right;"> 4.123 </td>
   <td style="text-align:right;"> 11 </td>
  </tr>
</tbody>
</table>
]

---

# Nonlinear Relationships in Data?

.pull-left[

* We can accommodate non-linear relationships in regressions.

* Just add a *higher order* term like this:
$$ y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i $$

* This is __multiple regression__ (in 2 weeks!). A short code sketch is in the appendix at the end of this deck.

* Notice that we can **not** have non-linearities in the `\(b\)`'s!!
]

--

.pull-right[

* For example, suppose we had this data and fit the previous regression model:

<img src="chapter_slr_files/figure-html/non-line-cars-ols2-1.svg" style="display: block; margin: auto;" />
]

---

# Analysis of Variance

* Remember that `\(y_i = \widehat{y}_i + e_i\)`.

* We have the following decomposition:
`$$\begin{align} Var(y) &= Var(\widehat{y} + e)\\&= Var(\widehat{y}) + Var(e) + 2 Cov(\widehat{y},e)\\&= Var(\widehat{y}) + Var(e)\end{align}$$`

* Because:
    * `\(Var(x+y) = Var(x) + Var(y) + 2Cov(x,y)\)`
    * `\(Cov(\hat{y},e)=0\)`

* __Total variation (SST) = Model explained (SSE) + Unexplained (SSR)__

---

# Goodness of Fit

* The __`\(R^2\)`__ measures how well the __model fits the data__.

--

$$ R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] $$

--

* `\(R^2\)` close to `\(1\)` indicates a __very ***high*** explanatory power__ of the model.

* `\(R^2\)` close to `\(0\)` indicates a __very ***low*** explanatory power__ of the model.

--

* *Interpretation:* an `\(R^2\)` of 0.5, for example, means that the variation in `\(x\)` "explains" 50% of the variation in `\(y\)`.

--

* ⚠️ Low `\(R^2\)` does __NOT__ mean it's a useless model! Remember that econometrics is interested in causal mechanisms, not prediction!

--

* ⚠️ The `\(R^2\)` is __NOT__ an indicator of whether a relationship is causal!

---

class: inverse

# Task 3: `\(R^2\)` and goodness of fit
⏳ 10:00
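*A minimal sketch of steps 1 and 2 below, assuming `grades_avg_cs` from Task 2 is still in memory:*

``` r
math_reg <- lm(avgmath_cs ~ classize, data = grades_avg_cs)
summary(math_reg)   # look for the "Multiple R-squared" line in the output
```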
1. Regress `avgmath_cs` on `classize`. Assign it to an object `math_reg`.

1. Pass `math_reg` to the `summary()` function. What is the (multiple) `\(R^2\)` for this regression? How can you interpret it?

1. Compute the squared correlation between `classize` and `avgmath_cs`. What does this tell you about the relationship between `\(R^2\)` and the correlation in a regression with only one regressor?

1. Repeat steps 1 and 2 for `avgverb_cs`. For which exam does the variance in class size explain more of the variance in students' scores?

1. (Optional) Install and load the `broom` package. Pass `math_reg` to the `augment()` function and assign the result to a new object. Use the variance in `avgmath_cs` (SST) and the variance in `.fitted` (predicted values; SSE) to find the `\(R^2\)` using the formula on the previous slide.

---

# Least Squares Assumptions

We need 3 conditions for OLS to yield unbiased estimates:

1. `\(E[u | X] = 0\)`: The error term `\(u\)` has a conditional mean of zero given `\(X\)`. In other words, all unobserved factors - absent from the model and lumped into `\(u\)` - are *not* systematically related to the explanatory variable `\(X\)`.

2. `\(\left\{X_i,Y_i\right\}_{i=1}^n\)` is an i.i.d. sample. There is no dependence between observations (as there would be, for example, in time series data).

3. There are no large outliers in the data.

---

# Large Sample Distribution of OLS Estimators

Under the assumptions on the previous slide, we have:

1. Both coefficients are unbiased: `\(E(b_0) = \beta_0, E(b_1) = \beta_1\)`.

2. `\(b_0,b_1\)` are *jointly* normally distributed (if the sample is sufficiently large). We have
`$$\sigma_{b_1}^2 = \frac{1}{n} \frac{var\left((X_i - \mu_X) u_i \right)}{var\left(X_i \right)^2}$$`
and
`$$\sigma_{b_0}^2 = \frac{1}{n} \frac{var\left(H_i u_i \right)}{\left[E(H_i^2)\right]^2}, H_i = 1 - \left[ \frac{\mu_X}{E(X_i^2)} \right] X_i$$`

---

# Large Sample Distribution of OLS Estimators

`$$\sigma_{b_1}^2 = \frac{1}{n} \frac{var\left((X_i - \mu_X) u_i \right)}{var\left(X_i \right)^2}$$`

and

`$$\sigma_{b_0}^2 = \frac{1}{n} \frac{var\left(H_i u_i \right)}{\left[E(H_i^2)\right]^2}, H_i = 1 - \left[ \frac{\mu_X}{E(X_i^2)} \right] X_i$$`

👉 Both variances here *decrease* with `\(n\)`!

👉 Both variances here *increase* with `\(var(u)\)`!

👉 Both variances here *decrease* with `\(var(X)\)`!

---

# How does `\(Var(X)\)` impact OLS?

<img src="chapter_slr_files/figure-html/unnamed-chunk-31-1.svg" style="display: block; margin: auto;" />

<!-- ] -->
<!-- .right-thin[ -->
<!-- <br> -->
<!-- <br> -->
<!-- * larger `\(var(X)\)` makes it easier to draw the line! -->
<!-- * Look at those error bands! -->
<!-- ] -->

---

# Task 4: Mean Reversion

Sir Francis Galton examined the relationship between the height of children and their parents towards the end of the 19th century. You decide to update his findings by collecting data from 110 ESOMAS students. Your data is [here](https://www.dropbox.com/scl/fi/yrv6m41sgpa46997uj4om/galton_students_110.csv?rlkey=udp2uy6mdicdul7n5jn0bgbzq&dl=1).

1. Load the data and make a scatter plot of `child_height_cm` vs `mid_parent_height_cm`. Ideally add a regression line to your plot.

2. Estimate the model `$$\text{child_height_cm} = \beta_0 + \beta_1 \text{mid_parent_height_cm} + u$$` and save it as object `m`.

3. What is the meaning of the coefficients and the `\(R^2\)` measure?

4. What is the prediction for the height of a child whose parents have an average height of 180 cm? What if they have an average height of 160 cm?
5. Type `summary(m)` to get the model summary and tell us what the Residual Standard Error in this regression is. What is its interpretation?

6. Show that in a single linear regression model, the formula for the slope estimate `\(b_1\)` can be written as `$$b_1 = r \frac{s_y}{s_x}$$` where `\(r\)` is `\(corr(x,y)\)`, and `\(s_k\)` is the standard deviation of variable `\(k\)`. Compute the slope coefficient in this way.

---

# Task 4: Mean Reversion

<span>7.</span> Given the positive intercept and the fact that the slope lies between zero and one, what can you say about the height of students who have quite tall parents? Those who have quite short parents?

<span>8.</span> Galton was concerned about the height of the English aristocracy and referred to the above result as "regression towards mediocrity." Can you figure out what his concern was? Why do you think that we refer to this result today as "Galton's Fallacy"?

---

# Task 5

We now know that we can write the slope estimator also as `$$b_1 = r \frac{s_y}{s_x}$$`.

Show that if we *standardize* both `\(x\)` and `\(y\)`, then `\(b_1\)` is *exactly equal* to `\(r\)`!

---

# Task 6: No intercept

What is a regression without any intercept? In other words, `\(b_0 = 0\)`. To generate some intuition, let's get one of R's built-in datasets and have a look:

.pull-left[

``` r
plt(mpg ~ disp, mtcars)
plt_add(type = "lm")
```

<img src="chapter_slr_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />
]

.pull-right[

``` r
lm(mpg ~ disp, mtcars)
```

```
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Coefficients:
## (Intercept)         disp  
##    29.59985     -0.04122
```

How does this look if we **force** `\(b_0 = 0\)`?
]

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it...

🚧 **How to summarise relationships between variables?** Simple and multiple linear regression... to be continued

❌ What is causality?

❌ What if we don't observe an entire population?

❌ Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

class: title-slide-final, middle
background-image: url(../img/logo/esomas.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!
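---

# Appendix: Adding a Squared Term in `R`

A minimal sketch of how the *higher order* term from the *Nonlinear Relationships in Data?* slide can be included in an `lm()` formula. The data below are simulated for illustration only (they are **not** the course dataset):

``` r
# simulate a curved relationship (made-up numbers, for illustration only)
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(100)

# I() protects the squared term inside the formula
quad_reg <- lm(y ~ x + I(x^2))
coef(quad_reg)
```

The estimated coefficients on `x` and `I(x^2)` should be close to the values used to simulate the data (1.5 and -0.1), while a straight-line fit `lm(y ~ x)` would miss the curvature.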