.title[
# Econometrics
]
.subtitle[
## Simple Linear Regression
]
.author[
### Mustapha Douch based on <br> Florian Oswald’s slides
]
.date[
### UniTo ESOMAS <br> 2025-10-22
]

---

## Where We Are Now: Building the Foundation 🏗️

<br>

### Covered Concepts (Stock & Watson Chapters 1–3)

Up until now, we've focused on the core tools of **descriptive statistics** and **probability** that form the language of econometrics:

<br>

- **Descriptive Statistics:**  
  Summarizing data using the *mean*, *median*, *variance*, and *standard deviation*.

- **Probability Theory:**  
  Understanding *random variables*, their *distributions* (especially the *normal distribution*), and the concept of *covariance* and *correlation* to measure association.

- **Asymptotics:**  
  Grasping the crucial role of the *Law of Large Numbers (LLN)* and the *Central Limit Theorem (CLT)* in ensuring our sample estimates are reliable.

---

## Next Steps: From Description to Causation 🚀

This week, we begin our journey into **the Simple Regression Model** —  
using data to explain **how one variable affects another**.

We’ll build on the descriptive and probabilistic foundations from before to estimate relationships and test hypotheses.

Next week, we move further into **causal inference** with **Difference-in-Differences (DiD)** — comparing before-and-after outcomes
to identify policy or treatment effects.

---

# Today - Real 'metrics finally ✌️

* Introduction to the ***Simple Linear Regression Model*** and ***Ordinary Least Squares (OLS)*** *estimation*.

* Empirical application: *class size* and *student performance*

* Keep in mind that we are interested in uncovering **causal** relationships

---

##  How Does One Variable Affect Another? 🎯

> A state implements tough new penalties on drunk drivers — what is the effect on highway fatalities?  
> A school district cuts the size of its elementary school classes — what is the effect on its students’ standardized test scores?  
> You successfully complete one more year of college classes — what is the effect on your future earnings?

All three questions are about the **unknown effect of changing one variable**, $ X $ (penalties, class size, or years of schooling), on another variable, $ Y $ (highway deaths, test scores, or earnings).

This week, we introduce the **linear regression model** relating $ X $ to $ Y $. It postulates a **linear relationship** between $ X $ and $ Y $:  the **slope** represents the effect of a one-unit change in $ X $ on $ Y $.

Just as the **mean of $ Y $** is an unknown population characteristic, 
the **slope of the line** relating $ X $ and $ Y $ is an unknown feature of their **joint distribution**.

---

## How Does One Variable Affect Another? 🎯

Our econometric task:  
Estimate this slope — that is, estimate **the effect of a unit change in $ X $ on $ Y $** —  using a random sample of data.

Finally, we’ll see how this is done using **Ordinary Least Squares (OLS)**,  which allows us to test hypotheses and construct confidence intervals for the slope.

---

# Student performance

* What policies *lead* to improved student learning?

--
* Class size reduction has been at the heart of policy debates for *decades*.

* We will be using data from a famous paper by [Joshua Angrist and Victor Lavy (1999)](https://economics.mit.edu/files/8273), obtained from [Raj Chetty and Greg Bruich's course](https://opportunityinsights.org/course/).

* Consists of test scores and class/school characteristics for fifth graders (10-11 years old) in Jewish public elementary schools in Israel in 1991.

* National tests measured *mathematics* and (Hebrew) *reading* skills. The raw scores were rescaled to range from 1 to 100.

---

class: inverse

# Task 1: Getting to know the data

 
1. Load the data from [here](https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1) as `grades`. *Hint: Use the `read_dta` function from the `haven` package to import the file, which is in .dta format.* (FYI: *.dta* is the extension for data files used in [*Stata*](https://www.stata.com/))

1. Describe the dataset:

  * What is the unit of observation, i.e. what does each row correspond to?
  * How many observations are there?
  * View the dataset. What variables do we have? What do the variables `avgmath` and `avgverb` correspond to?
  * Use the `skim` function from the `skimr` package to obtain common summary statistics for the variables `classize`, `avgmath` and `avgverb`. (*Hint: use `dplyr` to `select` the variables and then simply pipe (`%>%`) `skim()`.*)

1. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?

1. Compute the correlation between class size and math and verbal scores. Is the relationship positive/negative, strong/weak?
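
One possible way to start on this task is sketched below. It assumes the `haven`, `dplyr` and `skimr` packages are installed and that the Dropbox link above is still reachable. The helper names `load_grades()`, `summarise_grades()` and `score_correlations()` are made up for illustration; only the column names `classize`, `avgmath` and `avgverb` come from the task itself.

```r
# Hypothetical helpers: nothing is downloaded until load_grades() is called.
grades_url <- "https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1"

load_grades <- function(url = grades_url) {
  # Download to a temporary file first, then read the Stata file
  tmp <- tempfile(fileext = ".dta")
  download.file(url, tmp, mode = "wb")
  haven::read_dta(tmp)
}

summarise_grades <- function(grades) {
  # Keep the three variables of interest and pass them to skim()
  vars <- dplyr::select(grades, classize, avgmath, avgverb)
  skimr::skim(vars)
}

score_correlations <- function(grades) {
  # Correlation of class size with each test score (rows with NAs dropped)
  c(
    math = cor(grades$classize, grades$avgmath, use = "complete.obs"),
    verb = cor(grades$classize, grades$avgverb, use = "complete.obs")
  )
}

# Usage (requires an internet connection):
# grades <- load_grades()
# dim(grades)               # number of observations and variables
# summarise_grades(grades)
# score_correlations(grades)
```

The sign of each correlation answers the positive/negative question; how close it is to -1 or 1 tells you whether the linear association is strong or weak.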

---

# Class size and student performance: Scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

* Somewhat positive association as suggested by the correlations. Let's compute the average score by class size to see things more clearly!
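
Averaging the score within each class size can be sketched in base R as follows. The toy data frame and its values are invented for illustration and only borrow the column names `classize` and `avgmath` from the real data; with the tidyverse, `group_by()` followed by `summarise()` would do the same job.

```r
# Toy data standing in for the real `grades` data (values are invented)
toy <- data.frame(
  classize = c(20, 20, 25, 25, 30, 30),
  avgmath  = c(60, 64, 63, 67, 66, 70)
)

# One row per class size, holding the mean math score of classes of that size
binned <- aggregate(avgmath ~ classize, data = toy, FUN = mean)
binned
```

Plotting `binned$avgmath` against `binned$classize` gives one point per class size, which is the idea behind a binned scatter plot.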

---

# Class size and student performance: Binned scatter plot

.pull-left[
<img src="chapter_slr_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_slr_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# Class size and student performance: Binned scatter plot

* We'll first focus on the mathematics scores and for visual simplicity we'll zoom in

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto auto auto 0;" />
]

* A *line*! Great. But **which** line? This one?

* That's a *flat* line. But the average mathematics score is somewhat *increasing* with class size.

---

# Class size and student performance: Regression line

How to visually summarize the relationship: **a line through the scatter plot**

.left-wide[
<img src="chapter_slr_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto auto auto 0;" />
]

* **That** one?

* Slightly better! Has a **slope** and an **intercept** 😐

* We need a rule to decide!

---

# Simple Linear Regression

Let's formalise what we have been doing so far.

* We are interested in the relationship between two variables:

  * an __outcome variable__ (also called __dependent variable__):  
  *average mathematics score* `$(y)$`
  
--
  
  * an __explanatory variable__ (also called __independent variable__ or __regressor__):  
  *class size* `$(x)$`
  
--

* For each class `$i$` we observe both `$x_i$` and `$y_i$`, and therefore we can plot the *joint distribution* of class size and average mathematics score.

* We summarise this relationship with a line (for now). The equation for such a line with an intercept `$b_0$` and a slope `$b_1$` is:
    $$
    \widehat{y}_i = b_0 + b_1 x_i
    $$

* `$\widehat{y}_i$` is our *prediction* for `$y$` at observation `$i$` `$(y_i)$` given our model (i.e. the line).
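
To make the prediction concrete, here is a small sketch in R. The intercept, slope and class sizes are invented numbers, not estimates from the data; the point is only to show how a line turns each `$x_i$` into a `$\widehat{y}_i$`.

```r
# Invented example values, not estimates from the data
b0 <- 50              # intercept: predicted score when class size is 0
b1 <- 0.5             # slope: change in predicted score per extra student
x  <- c(20, 30, 40)   # three hypothetical class sizes

# Prediction on the line for each class size
y_hat <- b0 + b1 * x
y_hat
```

Each extra student raises the predicted score by `b1`, so moving from a class of 20 to a class of 30 raises the prediction by `10 * b1`.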

---

# What's A Line: A Refresher

---

# Simple Linear Regression: Residual

* If all the data points were __on__ the line then `$\widehat{y}_i = y_i$`.

* In most cases, however, the *dependent variable* `$(y)$` is not explained *only* by the chosen *independent variable* `$(x)$`, so `$\widehat{y}_i \neq y_i$`: our prediction makes an __error__.  
This __error__ is called the __residual__.

* At point `$(x_i,y_i)$`, we denote this residual by `$e_i = y_i - \widehat{y}_i$`.

* The *actual data* `$(x_i,y_i)$` can thus be written as *prediction + residual*:

$$
  y_i = \widehat y_i + e_i = b_0 + b_1 x_i + e_i
  $$
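
The decomposition can be checked numerically with a short R sketch. All numbers below are invented, and the line is an arbitrary candidate line rather than one estimated from the data.

```r
# Invented data and an arbitrary candidate line (not estimated)
x  <- c(20, 30, 40)   # class sizes
y  <- c(62, 64, 71)   # observed average scores
b0 <- 50              # intercept
b1 <- 0.5             # slope

y_hat <- b0 + b1 * x  # predictions on the line
e     <- y - y_hat    # residuals: observed minus predicted

# Actual data = prediction + residual
all.equal(y, y_hat + e)
```

A positive residual means the point lies above the line, a negative one means it lies below.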

---