class: center, middle, inverse, title-slide

.title[
# .b[Linear Regression: Inference]
]
.subtitle[
## .b[.green[EC 339]]
]
.author[
### Marcio Santetti
]
.date[
### Fall 2022
]

---

class: inverse, middle

# Motivation

---

# A critique

<br>

Here, we are dealing with the so-called .hi[frequentist] approach to Statistics/Econometrics.

--

It assumes that there exists an underlying .hi-orange[true population parameter] in nature.

--

Therefore, while this .hi[population parameter] value is fixed in nature, .hi-orange[samples] are variable.

--

And .hi[using samples] is the best we can do.

--

- But this is .hi[not] the only approach!

---

# There are more ways to think about inference

- .hi[Bayesian] inference is a completely different animal!

.right[<img src="stat_ret.jpg" style="width: 25%" />]

---

layout: false
class: inverse, middle

# Confidence Intervals

---

# Confidence intervals

<br><br>

In practical terms, a regression returns a .hi[point estimate] of our desired parameter(s).

--

Supposedly, it .hi-orange[represents], to the best of our efforts, the "true" population parameter.

--

But wouldn't it be better if we could have a .hi[range] of values for `\(\beta_i\)`?

--

Given a .hi[confidence level] `\((1-\alpha)\)`, we can easily construct a .hi-orange[confidence interval] for `\(\beta_i\)`.

---

# Confidence intervals

From .hi[Stats], we know:

--

$$ \text{CI} = \bar{x} \pm z_c \cdot \frac{\sigma}{\sqrt{n}} \quad \text{(population variance known)} $$

--

$$ \text{CI} = \bar{x} \pm t_c \cdot \frac{s}{\sqrt{n}} \quad \text{(population variance unknown)} $$

--

And now:

$$
`\begin{aligned}
\text{CI} = \hat{\beta}_k \pm t_c \cdot SE(\hat{\beta}_k)
\end{aligned}`
$$

--

where `\(t_c = t_{1-\alpha/2, \ n-k-1}\)`.

<br>

It denotes the `\(1-\alpha/2\)` .hi[quantile] of a *t* distribution with *n - k - 1* .hi[degrees of freedom].

---

# Confidence intervals

- The .hi-orange[standard error] (SE) of an estimate (here, of the simple regression slope):

`$$\mathop{\text{SE}} \left( \hat{\beta}_2 \right) = \sqrt{ \frac{s^2_u}{\sum_{i=1}^n (x_i - \bar{x})^2} }$$`

--

where `\(s^2_u = \dfrac{\sum_i \hat{u}_i^2}{n - k - 1}\)` is the estimated variance of `\(u_i\)`.

--

<br><br>

The standard error of an estimate is nothing but the estimated .hi[standard deviation] of its sampling distribution.

---

# Confidence intervals

- .hi[Informal interpretation:]

  - The confidence interval is a region in which we are able to place some .hi[trust] for containing the parameter of interest.

--

- .hi-orange[Formal interpretation:]

  - With .hi-orange[repeated sampling] from the population, we can construct confidence intervals for each of these samples. Then `\((1-\alpha) \cdot 100\)` percent of our intervals (*e.g.,* 95%) will contain the population parameter _.hi[somewhere in the interval]_.

---

# Confidence intervals - An example

.center[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                               lsalary          
#> -----------------------------------------------
#> age                           -0.001           
#>                               (0.005)          
#> lsales                       0.225***          
#>                               (0.028)          
#> Constant                     5.005***          
#>                               (0.303)          
#> -----------------------------------------------
#> Observations                    177            
#> R2                             0.281           
#> Adjusted R2                    0.273           
#> Residual Std. Error      0.517 (df = 174)      
#> F Statistic           34.004*** (df = 2; 174)  
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

]

---

# Confidence intervals - An example

<br>

From the previous regression output, we have:

- `\(\hat{\beta}_{lsales_{i}}\)`: 0.225

- `\(SE(\hat{\beta}_{lsales_{i}})\)`: 0.0277

<br>

In addition, the sample size (*n*) is 177.
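
--

As a sanity check, here is a minimal sketch of this interval in base .mono[R], using only the estimate and standard error reported above (the object names are illustrative):

```r
b_lsales  <- 0.225            # point estimate from the output above
se_lsales <- 0.0277           # its standard error
df        <- 177 - 2 - 1      # n - k - 1 = 174

t_c <- qt(1 - 0.05/2, df = df)   # the 97.5% quantile of t(174)

round(c(lower = b_lsales - t_c * se_lsales,
        upper = b_lsales + t_c * se_lsales), 4)
#>  lower  upper 
#> 0.1703 0.2797
```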
---

# Confidence intervals - An example

- Then, we can calculate a 95% confidence interval for `\(\beta_{lsales_{i}}\)`:

$$
`\begin{align}
\text{CI} = \hat{\beta}_{lsales_{i}} \pm t_c \cdot SE(\hat{\beta}_{lsales_{i}})
\end{align}`
$$

--

$$
`\begin{align}
\text{CI} = 0.225 \ \pm \ t_{1-0.05/2, \ 177 - 2 - 1} \ \cdot \ 0.0277
\end{align}`
$$

--

$$
`\begin{align}
\text{CI} = 0.225 \ \pm \ t_{1-0.05/2, \ 174} \ \cdot \ 0.0277
\end{align}`
$$

--

<br>

- `\(t_{1-0.05/2, \ 174} =\)` `1.973691`

- The interval is `[0.17, 0.28]`.

---

# Confidence intervals - An example

<img src="004-inference_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

With .hi-orange[repeated sampling] from the population, 95% of the intervals constructed this way will contain the population parameter; _.hi[[0.17, 0.28] is one such interval]_.

---

# Confidence intervals - An example

<img src="004-inference_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

- If we estimate a 99% confidence interval, we have:

$$
`\begin{align}
\text{CI} = 0.225 \ \pm \ t_{1-0.01/2, \ 174} \ \cdot \ 0.0277
\end{align}`
$$

- `\(t_{1-0.01/2, \ 174} =\)` `2.604379`

- The interval is `[0.15, 0.30]`.

---

layout: false
class: inverse, middle

# Hypothesis Testing

---

# Hypothesis testing

- When doing *hypothesis testing*, our aim is to determine whether there is enough .hi[statistical evidence] to reject a hypothesized value or range of values.

--

- In Econometrics, we usually run .hi-orange[two-sided (two-tailed)] tests about *regression parameters*:

  - `\(H_0: \beta_i = 0\)`
  - `\(H_a: \beta_i \neq 0\)`

<br>

- The above testing procedure is a test of .hi[statistical significance]:

  - If we .hi-orange[do not reject] `\(H_0\)`, the coefficient is not statistically significant.
  - If we .hi[reject] `\(H_0\)`, we have enough evidence to support the coefficient's statistical significance.

---

# Hypothesis testing

In .mono[R]...

```r
wage_model <- lm(wage ~ educ + exper + tenure, data = wage2)

wage_model %>% tidy()
```

```
#> # A tibble: 4 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)  -276.      107.       -2.59 9.78e- 3
#> 2 educ           74.4       6.29     11.8  3.28e-30
#> 3 exper          14.9       3.25      4.58 5.33e- 6
#> 4 tenure          8.26      2.50      3.31 9.83e- 4
```

---

# Hypothesis testing

In .mono[Stata]...

<img src="stata.png" style="width: 120%" />

---

# Hypothesis testing

- Where does the `11.8` *t* value come from?

$$
`\begin{align}
t = \dfrac{\hat{\beta}_k - \beta_{H_0}}{SE(\hat{\beta}_k)} = \dfrac{74.4 - 0}{6.29} = 11.8283
\end{align}`
$$

<br><br>

--

- Where does the `4.58` *t* value come from?

$$
`\begin{align}
t = \dfrac{\hat{\beta}_k - \beta_{H_0}}{SE(\hat{\beta}_k)} = \dfrac{14.9 - 0}{3.25} = 4.584615
\end{align}`
$$

---

# Hypothesis testing

What are we supposed to do with these test statistics?

.pull-left[
- t.sub[educ] = `11.8`
- t.sub[exper] = `4.58`
- t.sub[tenure] = `3.31`
]

.pull-right[
- t.sub[critical value] = t.sub[1-.05/2, 931] = `1.962515`
]

<img src="004-inference_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---

# Hypothesis testing

### Interpretation

<br>

At the 5% significance level, we have enough evidence to .hi[reject the null hypothesis] that `educ` is not statistically significant.

At the 5% significance level, we have enough evidence to .hi[reject the null hypothesis] that `exper` is not statistically significant.

At the 5% significance level, we have enough evidence to .hi[reject the null hypothesis] that `tenure` is not statistically significant.

--

Therefore, all coefficients are (individually) .hi-orange[statistically significant].
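
--

A minimal sketch of this decision rule in base .mono[R], using the *t* statistics reported above (the object names are illustrative):

```r
t_stats <- c(educ = 11.8, exper = 4.58, tenure = 3.31)
t_crit  <- qt(1 - 0.05/2, df = 931)   # two-sided 5% critical value, ~1.96

abs(t_stats) > t_crit   # TRUE = reject H0: beta = 0
#>   educ  exper tenure 
#>   TRUE   TRUE   TRUE
```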
---

# Hypothesis testing

## The F-test

Sometimes, a coefficient on a .hi[specific variable] may not be *statistically significant*. However, it may still be of use in the .hi-orange[model's context].

Thus, a test of .hi[joint] significance is appropriate to evaluate whether a .hi[group of slope coefficients] is *jointly* significant within the model.

--

<br>

$$
`\begin{align}
F = \dfrac{R^2_{\text{unr}} - R^2_{\text{rest}}}{1 - R^2_{\text{unr}}} \cdot \dfrac{(n-k-1)}{q}
\end{align}`
$$

where `\(R^2_{\text{unr}}\)` and `\(R^2_{\text{rest}}\)` are the unrestricted and restricted models' R-squared values, and *q* is the number of restrictions.

---

# The F-test

Still with our .hi[wage] model:

Suppose we want to test whether `educ` and `exper` are .hi-orange[jointly] significant.

--

For the purpose of this test, our previous model is the .hi[unrestricted] (full) model.

--

Then, we estimate a .hi-orange[restricted] model, excluding `educ` and `exper`.

- Its R-squared is .b[0.0165], while the unrestricted model's is .b[0.146].

--

<br>

We have imposed .hi[2] restrictions on the full model. Thus, *q = 2*.

--

And the .hi-orange[sample size] is *n = 935*, which gives *n - k - 1 = 931* for the full model.

---

# The F-test

$$
`\begin{align}
F = \dfrac{R^2_{\text{unr}} - R^2_{\text{rest}}}{1 - R^2_{\text{unr}}} \cdot \dfrac{(n-k-1)}{q} \\ \\
= \dfrac{0.146 - 0.0165}{1 - 0.146} \cdot \dfrac{935-3-1}{2} = 70.588
\end{align}`
$$

--

- 70.588 is the .hi[test statistic] for the F-test.

--

- Then, we compare the above value with the .hi-orange[critical value] given by the F-distribution table.

--

- Since the F-test is .hi[one-tailed] (right tail), the 5% critical value is:

  - `\(F_{1-.05, \ 2, \ 931} \approx\)` 3.005

- Thus, we .hi[reject the null hypothesis], meaning that we have enough evidence to infer that `educ` and `exper` are .hi-orange[jointly significant] in this model.
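---

# The F-test

A minimal sketch of this calculation in base .mono[R], reusing the numbers above (in practice, passing the restricted and unrestricted `lm()` fits to `anova()` performs the same test):

```r
r2_unr  <- 0.146                  # unrestricted (full) model R-squared
r2_rest <- 0.0165                 # restricted model R-squared
n <- 935; k <- 3; q <- 2          # sample size, slopes, restrictions

f_stat <- (r2_unr - r2_rest) / (1 - r2_unr) * (n - k - 1) / q
f_crit <- qf(1 - 0.05, df1 = q, df2 = n - k - 1)   # right-tail 5% critical value

round(c(F = f_stat, critical = f_crit), 3)
#>        F critical 
#>   70.588    3.005
```

Since the test statistic exceeds the critical value, we reject `\(H_0\)`.

---

layout: false
class: inverse, middle

# Next time: Inference in practice

---
exclude: true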