Regression Logic

class: center, middle, inverse, title-slide

# Regression Logic
## EC 320: Introduction to Econometrics
### Winter 2022

---

class: inverse, middle

# Prologue

---
# Housekeeping

- Lab 3 & Exercise 3 & Extra office hour 
  - Lab 3 @ 4 p.m.
  - Exercise 3 due today 
  - Extra office hour @ 7 p.m.

- Problem Set 1 solution will be available later today.

- Problem Set due dates have changed
   - Extra three days
   - Due Monday instead of Friday starting Problem Set 2

- Midterm 1 next week (Wednesday)

- Midterm review on Monday

---
# Last Time

1. Fundamental problem of econometrics

2. Selection bias

3. Randomized controlled trials

---
class: inverse, middle

# Regression Logic

---
# Regression

Economists often rely on (linear) regression for statistical comparisons.

- *"Linear"* is more flexible than you think.

Regression analysis helps us make *other things equal* comparisons.

- We can model the effect of `$X$` on `$Y$` while .hi[controlling] .pink[for potential confounders].
- Forces us to be explicit about the potential sources of selection bias.
- Failure to control for confounding variables leads to .hi[omitted-variable bias], a close cousin of selection bias.

---
# Returns to Private College

**Research Question:** Does going to a private college instead of a public college increase future earnings?

- **Outcome variable:** earnings
- **Treatment variable:** going to a private college (binary)

**Q:** How might a private school education increase earnings?

**Q:** Does a comparison of the average earnings of private college graduates with those of public college graduates .pink[isolate the economic returns to private college education]? Why or why not?

---
# Returns to Private College

**How might we estimate the causal effect of private college on earnings?**

**Approach 1:** Compare average earnings of private college graduates with those of public college graduates.

- Prone to selection bias.

**Approach 2:** Use a matching estimator that compares the earnings of individuals with the same admissions profiles.

- Cleaner comparison than a simple difference-in-means.
- Somewhat difficult to implement.
- Throws away data (inefficient).

**Approach 3:** Estimate a regression that compares the earnings of individuals with the same admissions profiles.
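
The three approaches can be compared on simulated data. The sketch below is illustrative only: the three admissions-profile groups, the attendance probabilities, and the "true" private-college effect of 5 are all hypothetical numbers chosen for the example, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(320)  # arbitrary seed
n = 6_000

# Hypothetical data: stronger admissions profiles (0, 1, 2) raise both
# the chance of attending a private college and future earnings.
profile = rng.integers(0, 3, size=n)
p_private = np.array([0.2, 0.5, 0.8])[profile]
private = (rng.random(n) < p_private).astype(float)
earnings = 40 + 10 * profile + 5 * private + rng.normal(0, 5, size=n)

# Approach 1: simple difference in means (prone to selection bias).
naive = earnings[private == 1].mean() - earnings[private == 0].mean()

# Approach 2: compare within each admissions profile, then average
# the within-profile gaps, weighting by group size.
gaps, weights = [], []
for g in np.unique(profile):
    in_g = profile == g
    gap = (earnings[in_g & (private == 1)].mean()
           - earnings[in_g & (private == 0)].mean())
    gaps.append(gap)
    weights.append(in_g.sum())
matched = np.average(gaps, weights=weights)

# Approach 3: regression of earnings on private plus profile dummies.
Z = np.column_stack([np.ones(n), private, profile == 1, profile == 2]).astype(float)
coef, *_ = np.linalg.lstsq(Z, earnings, rcond=None)
regression = coef[1]  # coefficient on the private-college indicator

print(naive, matched, regression)
```

Under these assumptions, the simple difference in means overstates the effect because private college graduates have stronger profiles on average, while the matching and regression estimates should both land near the assumed effect of 5.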

---
# The Regression Model

We can estimate the effect of `$X$` on `$Y$` by estimating a .hi[regression model]:

`$$Y_i = \beta_0 + \beta_1 X_i + u_i$$`

- `$Y_i$` is the outcome variable.

- `$X_i$` is the treatment variable (continuous).

- `$u_i$` is an error term that includes all other (omitted) factors affecting `$Y_i$`.

- `$\beta_0$` is the **intercept** parameter.

- `$\beta_1$` is the **slope** parameter.
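
To make the pieces of the model concrete, here is a minimal sketch that simulates data from it. The parameter values (`beta_0 = 2.0`, `beta_1 = 0.5`) and the distributions of `X` and `u` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(320)  # arbitrary seed

n = 1_000
beta_0, beta_1 = 2.0, 0.5  # hypothetical population parameters

X = rng.normal(loc=10, scale=2, size=n)  # continuous treatment variable
u = rng.normal(loc=0, scale=1, size=n)   # error term: all other factors affecting Y
Y = beta_0 + beta_1 * X + u              # outcome generated by the regression model
```

In real data we observe only `X` and `Y`; the parameters and the error term are unknown.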

---
# Running Regressions

The intercept and slope are population parameters.

Using an estimator with data on `$X_i$` and `$Y_i$`, we can estimate a .hi[fitted regression line]:

`$$\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$$`

- `$\hat{Y_i}$` is the **fitted value** of `$Y_i$`.

- `$\hat{\beta}_0$` is the **estimated intercept**.

- `$\hat{\beta}_1$` is the **estimated slope**.

The estimation procedure produces misses called .hi[residuals], defined as `$Y_i - \hat{Y_i}$`.
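
A minimal sketch of fitted values and residuals, using simulated data with assumed parameters (intercept 2.0, slope 0.5). Here `np.polyfit` with `deg=1` stands in for "an estimator": it fits a straight line by least squares.

```python
import numpy as np

rng = np.random.default_rng(320)  # arbitrary seed
X = rng.normal(10, 2, size=500)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=500)  # hypothetical population line plus noise

# np.polyfit returns coefficients from highest degree to lowest:
# [slope, intercept] for a degree-1 fit.
b1_hat, b0_hat = np.polyfit(X, Y, deg=1)

Y_hat = b0_hat + b1_hat * X  # fitted values
residuals = Y - Y_hat        # the "misses": actual minus fitted

print(b0_hat, b1_hat, residuals.mean())
```

The estimates should be close to, but not exactly equal to, the assumed population values, and the residuals average out to zero up to rounding error.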

---
# Running Regressions

In practice, we estimate the regression coefficients using an estimator called .hi[Ordinary Least Squares] (OLS).

- Picks estimates that make `$\hat{Y_i}$` as close as possible to `$Y_i$` given the information we have on `$X$` and `$Y$`.
 
- We will dive into the weeds after the midterm.
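
As a preview, "as close as possible" means minimizing the sum of squared residuals. With a single regressor, the minimizing estimates have a well-known closed form: the slope is the sample covariance of `$X$` and `$Y$` divided by the sample variance of `$X$`. The sketch below uses hypothetical helper names (`ols_simple`, `ssr`) and simulated data with assumed parameters.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates (intercept, slope) for one regressor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    b1 = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

def ssr(b0, b1, x, y):
    """Sum of squared residuals for a candidate line."""
    return np.sum((y - b0 - b1 * x) ** 2)

rng = np.random.default_rng(320)  # arbitrary seed
X = rng.normal(10, 2, size=500)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=500)  # hypothetical data

b0_hat, b1_hat = ols_simple(X, Y)

# Nudging the slope away from the OLS estimate should not lower the SSR.
print(ssr(b0_hat, b1_hat, X, Y), ssr(b0_hat, b1_hat + 0.1, X, Y))
```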

---
# Running Regressions

OLS picks `$\hat{\beta}_0$` and `$\hat{\beta}_1$` that trace out the line of best fit. Ideally, we would like to interpret the slope of the line as the causal effect of `$X$` on `$Y$`.

---
# Confounders

However, the data are grouped by a third variable `$W$`. How would omitting `$W$` from the regression model affect the slope estimator?

---
# Confounders

The problem with `$W$` is that it affects both `$Y$` and `$X$`. Without adjusting for `$W$`, we cannot isolate the causal effect of `$X$` on `$Y$`.
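
A small simulation illustrates the point. Everything here is assumed for the example: `W` is a binary grouping variable, the true effect of `X` on `Y` is 0.5, and `W` shifts `X` by 2 and `Y` by 3. The helper `ols` is a hypothetical convenience wrapper around `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(320)  # arbitrary seed
n = 5_000

W = rng.integers(0, 2, size=n).astype(float)        # confounder (two groups)
X = 2.0 * W + rng.normal(size=n)                    # W affects X
Y = 1.0 + 0.5 * X + 3.0 * W + rng.normal(size=n)    # W also affects Y; true slope on X is 0.5

def ols(y, *regressors):
    """Least-squares coefficients, intercept first."""
    Z = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

slope_omitting_W = ols(Y, X)[1]        # regression of Y on X only
slope_controlling_W = ols(Y, X, W)[1]  # regression of Y on X and W

print(slope_omitting_W, slope_controlling_W)
```

Under these assumptions, the slope from the regression that omits `W` converges to `$0.5 + 3 \times 0.5 / 2 = 1.25$` (the true effect plus the effect of `W` times `$\text{Cov}(X, W) / \text{Var}(X)$`), while the slope that controls for `W` should be close to 0.5.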