class: center, middle, inverse, title-slide

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Regression Inference
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

# Today

- Statistical inference in the regression framework

  * Fully understand a ***regression table***

  * Compare ***theory-based*** and ***simulation-based*** inference

  * ***Classical Regression Model*** assumptions

  * Empirical applications:

    * Class size and student performance

    * Returns to education by gender

---

# Back to class size and student performance

* Let's go back to the ***STAR*** experiment data and focus on *small* and *regular* classes for students in *kindergarten*.

* We consider the following regression model and estimate it by OLS:

$$ \text{math score}_i = b_0 + b_1 \text{small}_i + e_i$$

.pull-left[

``` r
library(tidyverse)

star_df = read.csv("~/Library/CloudStorage/Dropbox/Clemson/Econometrics Course/data for tasks/star_data.csv")
star_df <- star_df %>%
  filter(complete.cases(.)) %>%
  filter(star %in% c("small","regular") & grade == "k") %>%
  mutate(small = (star == "small"))
```
]

.pull-right[

``` r
reg_star = lm(math ~ small, star_df)
reg_star
```

```
## 
## Call:
## lm(formula = math ~ small, data = star_df)
## 
## Coefficients:
## (Intercept)    smallTRUE  
##     484.446        8.895
```
]

--

* What if we drew another random sample of schools from Tennessee and redid the experiment: would we find a different value for `\(b_1\)`?

* We know the answer is *yes*, but how different is this estimate likely to be?
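* As a quick illustration (a sketch, not part of the original analysis), we can mimic redrawing the sample by resampling our own data with replacement and re-estimating `\(b_1\)` each time:

``` r
# Re-estimate b1 on five resampled versions of our sample (illustrative sketch)
set.seed(42)
replicate(5, {
  resample = star_df[sample(nrow(star_df), replace = TRUE), ]
  coef(lm(math ~ small, resample))["smallTRUE"]
})
```

* The estimate moves around from draw to draw; quantifying that variation is the goal of this lecture.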
---

# Regression Inference: `\(b_k\)` vs `\(\beta_k\)`

* `\(b_0, b_1\)` are ***point estimates*** computed from our sample.

* Just like the sample proportion `\(\hat{p}\)` from our pasta example!

.pull-left[
* In fact, our model's prediction...

`$$\hat{y} = b_0 + b_1 x_1$$`
]

.pull-right[
... is an **estimate** of an unknown, **true population line**

`$$y = \beta_0 + \beta_1 x_1$$`
]

where `\(\beta_0, \beta_1\)` are the ***population parameters*** of interest.

* You will often find `\(\hat{\beta_k}\)` rather than `\(b_k\)`; both refer to the sample estimate of `\(\beta_k\)`.

* Let's bring what we know about ***confidence intervals***, ***hypothesis testing*** and ***standard errors*** to bear on those `\(\hat{\beta_k}\)`!

---

# Understanding Regression Tables

``` r
summary(lm(math ~ small, star_df))
```

```
## 
## Call:
## lm(formula = math ~ small, data = star_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -164.446  -34.342   -4.342   28.554  141.554 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  484.446      1.150   421.1  < 2e-16 ***
## smallTRUE      8.895      1.678     5.3 1.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48.55 on 3357 degrees of freedom
## Multiple R-squared:  0.008298,	Adjusted R-squared:  0.008003 
## F-statistic: 28.09 on 1 and 3357 DF,  p-value: 1.233e-07
```

* There are 3 new columns here: the standard error, the t-value (test statistic), and Pr(>|t|) (p-value).

---

# Understanding Regression Tables

``` r
# alternative
library(broom)
tidy(lm(math ~ small, star_df))
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   484.        1.15    421.   0           
## 2 smallTRUE       8.90      1.68      5.30 0.000000123
```

Entry | Meaning
----- | ----
`std.error` | Standard error of `\(b_k\)`
`statistic` | Observed test statistic associated with `\(H_0:\beta_k = 0, H_A:\beta_k \neq 0\)`
`p.value` | p-value associated with `\(H_0:\beta_k = 0, H_A:\beta_k \neq 0\)`

* Let's focus on the `small` coefficient and make sense of each entry.

---

# Standard Error of `\(b_k\)`

> ***Standard Error of `\(b_k\)`:*** Standard deviation of the sampling distribution of `\(b_k\)`.

* Imagine we could redo the experiment on 1,000 different samples: we would run 1,000 regressions and obtain 1,000 estimates `\(b_k\)` of `\(\beta_k\)`.

* The standard error tells us how much these estimates would vary across samples. A smaller standard error means that our estimate is more precise, while a larger standard error suggests more uncertainty.

--

## Standard Error of `\(b_\textrm{small}\)`

* From the table, we get `\(\hat{\textrm{SE}}(b_\textrm{small}) = 1.68\)`.

* The true standard error (SE) is unknown because we only have one sample.

* This estimate of the standard error allows us to quantify uncertainty in `\(b_\textrm{small}\)` and conduct statistical inference.

* Let's simulate the sampling distribution of `\(b_\textrm{small}\)` to see where it comes from.

---

# Standard Error from Bootstrap Distribution

.pull-left[

``` r
library(infer)
bootstrap_distrib = star_df %>%
  mutate(small = as.numeric(small)) %>%
  specify(formula = math ~ small) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")

se_simul = round(sd(bootstrap_distrib$stat), 3)

# ggplot(bootstrap_distrib, aes(x = stat)) +
#   geom_histogram(boundary = 9, binwidth = 0.5,
#                  col = "white", fill = "#d90502") +
#   labs(x = "Bootstrap sample slope estimate",
#        y = "Frequency") +
#   theme_bw(base_size = 14)
```
]

.pull-right[
<img src="reg_inference_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />

***standard error:*** 1.68 `\(\rightarrow\)` very close to the one in the table (1.68)! Not exactly equal, because we used bootstrapping instead of the theory-based approach used by `R`.
]

---

# Back to our regression results

``` r
library(broom)
tidy(lm(math ~ small, star_df))
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)   484.        1.15    421.   0           
## 2 smallTRUE       8.90      1.68      5.30 0.000000123
```

* We have made sense of the `std.error` column.

* The next two columns in our regression table are `statistic` and `p.value`.

* We know those terms from our previous class on hypothesis testing.

* But which hypothesis test do they correspond to?

---

# Testing `\(\beta_k = 0\)` vs `\(\beta_k \neq 0\)`

By default, the regression output provides the results associated with the following hypothesis test:

`$$\begin{align}H_0:& \beta_k = 0\\H_A:& \beta_k \neq 0\end{align}$$`

* It allows us to test statistically whether there is a true relationship between the outcome and our regressor.

  * If `\(H_0\)` is true, there is **no** relationship between the outcome and our regressor.

  * In that case, observing `\(b_1 \neq 0\)` was just due to chance.

  * If `\(H_0\)` is false, then there **is** a true relationship.

* ***Important:*** This is a ***two-sided*** test!
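* In `R`, the results of this default test for every coefficient can also be pulled out as a plain matrix (a quick sketch using the `reg_star` fit from earlier):

``` r
# Same four columns as the regression table: estimate, SE, t value, p-value
coef(summary(reg_star))
```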
---

# Test statistic and p-value

* As we saw in the previous lecture, to conduct such a test we need to:

  * Derive the sampling distribution of our **test statistic** (`statistic`) assuming `\(H_0\)` is true, i.e. the *null distribution*.

  * Quantify how extreme the **observed test statistic** is in this hypothetical world.

* Our *observed test statistic* (`statistic`) equals `\(\frac{b}{\hat{\textrm{SE}}(b)}\)`.

  * Why not just `\(b\)`? We'll come back and explain this formula later.

.pull-left[

``` r
observed_stat = reg_star$coefficients[2] /
  sd(bootstrap_distrib$stat)
round(observed_stat, 2)
```

```
## smallTRUE 
##      5.29
```
]

.pull-right[
* Quite close to the observed test statistic we got in the table: `statistic` = 5.3.
]

* The **p-value** measures the area outside of `\(\pm\)` the *observed test statistic* under the *null distribution*.

* Finally, we check whether we can reject `\(H_0\)` at the usual **significance levels**: `\(\alpha\)` = 0.1, 0.05, 0.01.

---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

* We will approximate the null distribution of `\(\frac{b_\textrm{small}}{\hat{\textrm{SE}}(b_\textrm{small})}\)` through a simulation exercise.

* If there is no relationship between math score and class size, i.e. `\(H_0\)` is true and `\(\beta_\textrm{small} = 0\)`, then *reshuffling* / *permuting* the values of `small` across students should play no role.

.pull-left[
* Let's generate 1,000 permuted samples and compute `\(b_\textrm{small}\)` for each.

``` r
null_distribution <- star_df %>%
  mutate(small = as.numeric(small)) %>%
  specify(formula = math ~ small) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")
```
]

--

.pull-right[
* We can compute the distribution of our test statistic `\(\frac{b_\textrm{small}}{\hat{\textrm{SE}}(b_\textrm{small})}\)` under the null:

``` r
null_distribution <- null_distribution %>%
  mutate(test_stat = stat / sd(bootstrap_distrib$stat))
```

* Remember we got `\(\hat{\textrm{SE}}(b_\textrm{small})\)` = 1.68 from our bootstrap distribution.
]

---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

<img src="reg_inference_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

<img src="reg_inference_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

Obtaining `\(b_\textrm{small}\)` = 8.9 would be very unlikely if `\(H_0\)` were true, as the associated test statistic (5.3) lies far in the null distribution's tail. This comes from 8.9/1.68 = 5.3: the estimated effect of small classes (8.9) is 5.3 standard errors away from 0.
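The null-distribution figures above are produced by plotting code not shown in the slides; here is a minimal sketch that reproduces their flavor (assuming the `null_distribution` and `observed_stat` objects created earlier):

``` r
# Histogram of the test statistic under the null, with the observed value marked
ggplot(null_distribution, aes(x = test_stat)) +
  geom_histogram(binwidth = 0.25, col = "white", fill = "#d90502") +
  geom_vline(xintercept = c(-1, 1) * observed_stat, linetype = "dashed") +
  labs(x = "Test statistic under the null", y = "Frequency") +
  theme_bw(base_size = 14)
```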
---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

* To decide whether we reject `\(H_0\)`, recall that we are considering a **two-sided test** here: *more extreme* means less than -5.29 **or** greater than 5.29.

<img src="reg_inference_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

What does the p-value correspond to?

---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

* To decide whether we reject `\(H_0\)`, recall that we are considering a **two-sided test** here: *more extreme* means less than -5.29 **or** greater than 5.29.

* Computing the *p-value* we get:

``` r
p_value = mean(abs(null_distribution$test_stat) >= observed_stat)
p_value
```

```
## [1] 0
```

* This is the same value as in the regression table.

* ***Question:*** Can we reject the null hypothesis at the 5% level?

---

# Testing `\(\beta_\textrm{small} = 0\)` vs `\(\beta_\textrm{small} \neq 0\)`

* To decide whether we reject `\(H_0\)`, recall that we are considering a **two-sided test** here: *more extreme* means less than -5.29 **or** greater than 5.29.

* Computing the *p-value* we get:

``` r
p_value = mean(abs(null_distribution$test_stat) >= observed_stat)
p_value
```

```
## [1] 0
```

* This is the same value as in the regression table.

* ***Answer:***

  * Since the *p-value* is equal to 0, we would reject `\(H_0\)` at any significance level: the p-value would always be below `\(\alpha\)`.

  * In other words, we can say that `\(b_\textrm{small}\)` is **statistically different from 0** at any significance level.

  * We also say that `\(b_\textrm{small}\)` is *statistically significant* (at any significance level).

---

# Decision rules and how to interpret the p-value (Hypothetical Examples)

| Term | Estimate | P-value | Interpretation |
|-----------|----------|------------|-----------------------------------------------------------|
| smallTRUE | 8.90 | 0.000000123 | **Reject the null hypothesis**; the effect is very unlikely to be due to chance/random variation. Put another way, the probability of observing an effect as extreme (or more extreme) as the one found in the data, assuming the null hypothesis is true, is essentially zero. |
| smallTRUE | 8.90 | 0.25 | **Fail to reject the null hypothesis**; assuming the null hypothesis is true, there is a 25% chance of observing an effect at least as extreme as this one. The p-value indicates weak evidence against the null hypothesis, meaning we do not have enough evidence to confidently assert that the effect is real. |
| smallTRUE | 8.90 | 0.57 | **Fail to reject the null hypothesis**; the probability of observing an effect as extreme (or more extreme) as the one found in the data, assuming the null hypothesis is true, is relatively high (57%), indicating that the data do not provide strong enough evidence to rule out the null hypothesis. |

---

layout: false
class: title-slide-section-red, middle

# Regression Inference: Theory

---

# Regression Inference: Theory

* Up to now we have presented simulation-based inference.

* The values reported by statistical packages in `R` are instead obtained from theory.

* Theoretical inference is based on **large sample approximations**.

* One can show that sampling distributions *converge* to suitable distributions `\(\rightarrow\)` ***Central Limit Theorem***

* Let's briefly look into the theory-based approach.

---

# Regression Inference: Theory

* The theory-based approach relies on one fundamental result: the sampling distribution of the sample statistic `\(\frac{b - \beta}{\hat{\textrm{SE}}(b)}\)` *converges* to a ***standard normal distribution*** as the sample size gets larger and larger.

* `\(\hat{\textrm{SE}}(b)\)` is the sample estimate of the standard deviation of `\(b\)`.

  * It is also obtained through a theoretical formula (which you can find in the [book](https://scpoecon.github.io/ScPoEconometrics/std-errors.html#se-theory)!) but we'll leave it aside.

* A ***standard normal distribution*** is a *normal distribution* with *mean* 0 and *standard deviation* 1.

* We don't need to simulate any sampling distribution here: we derive it from theory and use it to construct confidence intervals or to conduct hypothesis tests.

* Note that if `\(\frac{b - \beta}{\hat{\textrm{SE}}(b)}\)` *converges* to a ***standard normal distribution***, then `\(b\)` converges to a ***normal distribution*** with mean `\(\beta\)` and standard deviation `\(\hat{\textrm{SE}}(b)\)`.
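To connect this result to the numbers in our regression table, here is a minimal sketch using the values reported there (note that `summary(lm())` actually uses a Student-t null with n − 2 degrees of freedom, which is nearly identical at this sample size):

``` r
b  = 8.895   # estimate from the table
se = 1.678   # standard error from the table
z  = b / se  # observed test statistic, about 5.30

2 * pnorm(-abs(z))           # large-sample (normal) two-sided p-value
2 * pt(-abs(z), df = 3357)   # what summary(lm) reports, df = n - 2
```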
---

# Normal Distribution: A Refresher

.center[
<img src="../img/photos/standard_normal_distrib.png" width="850px" style="display: block; margin: auto;" />
]

---

# Theory-Based Inference: Confidence Interval

* Let's take the example of a 95% confidence interval.

* Since the sampling distribution of `\(b\)` is assumed to be normally shaped, we can use the ***95% rule of thumb*** about normal distributions.

* Indeed, 95% of the values of a normal distribution lie within approximately 2 standard deviations of the mean (exactly 1.96).

* So, we can compute a 95% CI for `\(\beta\)` as: `\(\textrm{CI}_{95\%} = [ b \pm 1.96 \times \hat{\textrm{SE}}(b)]\)`

--

.pull-left[

``` r
tidy(lm(math ~ small, star_df),
     conf.int = TRUE, conf.level = 0.95) %>%
  filter(term == "smallTRUE") %>%
  select(term, conf.low, conf.high)
```

```
## # A tibble: 1 × 3
##   term      conf.low conf.high
##   <chr>        <dbl>     <dbl>
## 1 smallTRUE     5.60      12.2
```
]

.pull-right[

``` r
bootstrap_distrib %>%
  summarise(
    lower_bound = 8.895 - 1.96 * sd(stat),
    upper_bound = 8.895 + 1.96 * sd(stat))
```

```
## # A tibble: 1 × 2
##   lower_bound upper_bound
##         <dbl>       <dbl>
## 1        5.60        12.2
```
]

* This can easily be generalized to any confidence level by taking the appropriate quantile of the normal distribution.

---

# Confidence Intervals: Visually

<img src="reg_inference_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---

# Theory-Based Inference: Hypothesis Testing

* Theory tells us that `\(\frac{b - \beta_k}{\hat{\textrm{SE}}(b)}\)` converges to a standard normal distribution.

* As we already mentioned, the default test conducted by any statistical software is:

`$$\begin{align}H_0:& \beta_k = 0\\H_A:& \beta_k \neq 0\end{align}$$`

* So, **under the null hypothesis**, `\(\beta_k = 0\)`, and we get from theory that the sampling distribution of `\(\frac{b}{\hat{\textrm{SE}}(b)}\)` will be a standard normal distribution.

* So the *standard normal distribution* is the **null distribution** of our test statistic.

--

* The ***p-value*** associated with our test is then equal to the area of the *standard normal distribution* outside `\(\pm\)` the observed value of `\(\frac{b}{\hat{\textrm{SE}}(b)}\)`.

* Common rule of thumb: if the *estimate* is ***twice the size of the standard error***, then it is significant at the 5% level. Why?

---

# Formatting a regression table

* Now that we have learned about all the components of a regression table, let's finally learn how to create and read one!

``` r
library(jtools)  # provides export_summs()

reg_simple_math <- lm(math ~ small, data = star_df)
reg_gender_math <- lm(math ~ small + gender, data = star_df)
reg_simple_read <- lm(read ~ small, data = star_df)
reg_gender_read <- lm(read ~ small + gender, data = star_df)

export_summs(reg_simple_math, reg_gender_math,
             reg_simple_read, reg_gender_read,
             model.names = c("Math score", "Math Score",
                             "Reading score", "Reading score"),
             coefs = c("Intercept" = "(Intercept)",
                       "Small class" = "smallTRUE",
                       "Male gender" = "gendermale"))
```
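If `jtools` is unavailable, comparable tables can be produced with other packages; for instance, a sketch using `modelsummary` (an alternative, not the package used in these slides):

``` r
library(modelsummary)
modelsummary(list("Math score"    = reg_simple_math,
                  "Math Score"    = reg_gender_math,
                  "Reading score" = reg_simple_read,
                  "Reading score" = reg_gender_read),
             stars = TRUE)
```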
---

# Formatting a regression table

| | Math score | Math Score | Reading score | Reading score |
|-------------|------------|------------|---------------|---------------|
| Intercept | 484.45 *** | 488.85 *** | 435.76 *** | 439.62 *** |
| | (1.15) | (1.43) | (0.75) | (0.93) |
| Small class | 8.90 *** | 8.94 *** | 5.37 *** | 5.41 *** |
| | (1.68) | (1.67) | (1.09) | (1.09) |
| Male gender | | -8.56 *** | | -7.49 *** |
| | | (1.67) | | (1.09) |
| N | 3359 | 3359 | 3359 | 3359 |
| R2 | 0.01 | 0.02 | 0.01 | 0.02 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
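As a quick sanity check on the stars (a sketch using the large-sample normal approximation and the values reported above):

``` r
t_small = 8.90 / 1.68     # observed test statistic for "Small class"
2 * pnorm(-abs(t_small))  # two-sided p-value, far below 0.001 -> ***
```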
---

# Reading a regression table

.center[
<img src="reg_inference_files/figure-html/reg_table.png" width="500px" style="display: block; margin: auto;" />
]

* Each column corresponds to a regression. For the first regression we have:

  * the **name of the outcome variable** in <span style="color: #2F528F;">blue</span>

  * the **estimated coefficient associated with being in a small class** `\(\hat{\beta_\textrm{small}}\)` in <span style="color: #70AD47;">green</span>

  * its **estimated standard error** in <span style="color: #FFC000;">yellow</span>

  * the **number of observations** in <span style="color: #7030A0;">purple</span>

  * the **R-squared** in <span style="color: #C00000;">red</span>

  * the interpretation of the stars at the bottom

---

layout: false
class: title-slide-section-red, middle

# Classical Regression Model

---

# Classical Regression Model

* Whether the inference is made from theory or from simulations, some assumptions have to be met for this inference to be valid.

* The set of assumptions needed defines the *Classical Regression Model* (CRM).

* Before delving into these assumptions, let's see the small but important modifications we apply to our model (back to [*lecture SLR*](https://raw.githack.com/adamsoliman/IntroEconometrics/master/chapter_slr/chapter_slr.html)):

  * We already mentioned the distinction between the sample estimate `\(b_k\)` (or `\(\hat{\beta_k}\)`) and the population parameter `\(\beta_k\)`.

  * In the same way, we distinguish `\(e\)`, the sample error (*residual*), from `\(\varepsilon\)`, the error term from the true population model:

`$$y_i = \beta_0 + \beta_1 x_{1,i} + ... + \beta_k x_{k,i} + \varepsilon_i$$`

* The classical regression model applies to **correctly specified linear regressions**: the model needs to be linear in parameters, include all relevant variables, and variables cannot be collinear.

---

# CRM Assumptions

1. ***Mean Independence:*** the mean of the errors conditional on `\(x\)` should be zero, `\(E[\varepsilon|x] = 0\)`. Notice that this also means that `\(Cov(\varepsilon,x) = 0\)`, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*.

  * Violating this assumption will lead to **biased** estimates of `\(\beta_k\)`.

---

# Mean Independence of Error: E[u | small] = ?

<img src="reg_inference_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" />

---

# Mean Independence of Error: E[u | small] = ?

<img src="reg_inference_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" />

---

# Exogeneity Assumption

CRM assumption #1 is also known as the (strict) **exogeneity assumption**.

* When this assumption is violated, our estimate `\(b\)` will be a ***biased*** estimate of `\(\beta\)`, i.e. `\(\mathop{\mathbb{E}}[b] \neq \beta\)`

--

* For example, imagine you are interested in the effect of education on wage:

`$$\text{wage}_i = \beta_0 + \beta_1 \text{education}_i + \varepsilon_i$$`

* Under the exogeneity assumption, `\(\beta_1\)` denotes the causal effect of education in the population.

--

* Suppose there is *unobserved* ability `\(a_i\)`.

  * High ability means higher wage.

  * It *also* means school is easier, and so `\(i\)` selects into more schooling.

---

# Exogeneity Assumption

* Given that ability is *unobserved*, `\(a_i\)` goes into the error `\(\varepsilon_i\)`.

--

* Our *ceteris paribus* assumption (all else equal) does not hold.

--

* Then, regressing wage on education, we will attribute to `education` part of the effect on wages that is actually *caused* by ability `\(a_i\)`!

--

* Remember the formula of the **omitted variable bias**:

`$$\text{OVB} = \text{multiple regression coefficient on omitted variable} \times \frac{Cov(x,z)}{Var(x)}$$`

--

* Thus, we have:

$$ \mathbb{E}(b_1) = \beta_1 + OVB > \beta_1$$

* *Interpretation*: taking repeated samples from the population and computing `\(b_1\)` each time, we would **systematically overestimate** the effect of education on wage.
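A small simulation makes this bias concrete (a sketch with made-up parameters; `ability` plays the role of the unobserved `\(a_i\)`, and the true effect of education is 1):

``` r
set.seed(1)
n         = 10000
ability   = rnorm(n)
education = 12 + 2 * ability + rnorm(n)                 # high ability -> more schooling
wage      = 5 + 1 * education + 3 * ability + rnorm(n)  # true beta_1 = 1

coef(lm(wage ~ education))["education"]  # systematically above 1 (upward bias)
```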
---

# CRM Assumptions

1\. ***Mean Independence:*** the mean of the errors conditional on `\(x\)` should be zero, `\(E[\varepsilon|x] = 0\)`. Notice that this also means that `\(Cov(\varepsilon,x) = 0\)`, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*.

* Violating this assumption will lead to **biased** estimates of `\(\beta_k\)`.

--

2\. ***Independently and identically distributed:*** the data are drawn from a **random sample** of size `\(n\)`: observation `\((x_i,y_i)\)` comes from the exact same distribution, and is **independent** of observation `\((x_j,y_j)\)`, for all `\(i \neq j\)`.

* Violating this assumption would make your sample less representative of the underlying population. It will lead to **biased** estimates of `\(\beta_k\)`.

--

3\. ***Homoskedasticity:*** the variance of the error term `\(\varepsilon\)` is the same for each value of `\(x\)`: `\(Var(\varepsilon|x) = \sigma^2\)`.

* If this assumption is violated, you can still obtain unbiased estimates of `\(\beta_k\)`. However, your estimate of `\(\hat{\textrm{SE}}(b_k)\)` will be biased, which will affect your test statistic and p-value.

--

4\. ***Normally distributed errors:*** the error term is normally distributed, i.e. `\(\varepsilon \sim \mathcal{N}(0,\sigma^2)\)`.

* Not strictly necessary, but it makes inference possible even with small sample sizes.

👉 Takeaway: **if these assumptions are violated, inference is invalid!**
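When homoskedasticity (assumption 3) is in doubt, a common remedy is to report heteroskedasticity-robust standard errors; a sketch (assuming the `sandwich` and `lmtest` packages are installed, and reusing the `reg_star` fit from earlier):

``` r
library(sandwich)
library(lmtest)

# Same point estimates, but standard errors robust to heteroskedasticity
coeftest(reg_star, vcov = vcovHC(reg_star, type = "HC1"))
```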
---

# Illustrating Uncertainty

.pull-left[
Let's repeat the procedure you just saw 100 times!

``` r
library(AER)
data("CPS1985")
cps = CPS1985 %>%
  mutate(log_wage = log(wage))

set.seed(1)
bootstrap_sample = cps %>%
  rep_sample_n(size = nrow(cps), reps = 100, replace = TRUE)

ggplot(data = cps, aes(y = log_wage, x = education, colour = gender)) +
  geom_point(size = 1, alpha = 0.7) +
  geom_smooth(method = "lm", alpha = 2) +
  geom_smooth(data = bootstrap_sample, size = 0.2,
              aes(y = log_wage, x = education, group = replicate),
              method = "lm", se = FALSE) +
  facet_wrap(~gender) +
  scale_colour_manual(values = c("darkblue", "darkred")) +
  labs(x = "Education", y = "Log wage") +
  guides(colour = FALSE) +
  theme_bw(base_size = 20)
```
]

--

.pull-right[
<img src="reg_inference_files/figure-html/unnamed-chunk-27-1.svg" style="display: block; margin: auto;" />
]

---

# Illustrating Uncertainty

.pull-left[

]

.pull-right[
</br>

Even better: [`ungeviz`](https://github.com/wilkelab/ungeviz) and `gganimate` bring you moving lines!

* We took 20 bootstrap samples from our data.

* You can see how different data points are included in each bootstrap sample.

* Those different points imply different regression lines.

* On average, 95% of these lines should fall into the shaded area.

* You should remember those moving lines when looking at the shaded area!
]

---

# On the way to causality

✅ How to manage data? Read it, tidy it, visualise it!

✅ How to summarise relationships between variables? Simple and multiple linear regression, non-linear regressions, interactions...

✅ What is causality?

✅ What if we don't observe an entire population? Sampling!

✅ **Are our findings just due to randomness?** Confidence intervals and hypothesis testing, regression inference.

❌ How to find exogeneity in practice?

---

class: title-slide-final, middle

# THANKS

To the amazing [moderndive](https://moderndive.com/) team!

Big thanks 🎉 to [ungeviz](https://github.com/wilkelab/ungeviz) and 🎊 [gganimate](https://github.com/thomasp85/gganimate) for their awesome packages!