class: center, middle, inverse, title-slide

# Regression Logic
## EC 320: Introduction to Econometrics
### Philip Economides
### Winter 2022

---
class: inverse, middle

# Prologue

---
# Housekeeping

.hi-pink[Problem Set 2 due 01/24 at 5pm], which covers review content, fundamental concepts, and today's material.

Problem sets also include a computational portion, so make sure you've got **R** working **before Monday**.

--

Please do not hesitate to reach out if you are having trouble. Zoom office hours:

* Tuesday 3pm
* Thursday 10am

If these times do not suit, email me at peconomi@uoregon.edu

---

<br>

.hi-pink[So far] we've identified the fundamental problem econometricians face. How do we proceed? **Regressions!**

--

- Running models
- Confounders
- Omitted Variable Bias

---
class: inverse, middle

# Regression Logic

---
# Regression

Modeling is about reducing something really complicated into something simple that still represents some part of the complicated reality.

- It's about telling stories that are easy to understand, and thus, easy to learn from

Economists often rely on .hi-pink[(linear) regression] for statistical comparisons.

- *"Linear"* is more flexible than you think
- Describes the relationship between a dependent (endogenous) variable and one or more explanatory (exogenous) variable(s)

We will focus on the .hi-pink[simple univariate] case today.

---
# Regression

<br>

Regression analysis helps us make *all else equal* comparisons.

- We can model the effect of `\(X\)` on `\(Y\)` while .hi[controlling] .pink[for potential confounders]

- Forces us to be explicit about the potential sources of selection bias

- Failure to control for confounding variables leads to .hi[omitted-variable bias], a close cousin of selection bias

- Why? The omitted variable, correlated with our covariate of interest, is sitting inside the error term causing chaos

---
# Returns to Private College

<br>

**Research Question:** Does going to a private college instead of a public college increase future earnings?

- **Outcome variable:** earnings
- **Treatment variable:** going to a private college (binary)

--

**Q:** How might a private school education increase earnings?

--

**Q:** Does a comparison of the average earnings of private college graduates with those of public school graduates .pink[isolate the economic returns to private college education]? Why or why not?

---
# Returns to Private College

**How might we estimate the causal effect of private college on earnings?**

**Approach 1:** Compare average earnings of private college graduates with those of public college graduates.

- Prone to selection bias.

**Approach 2:** Use a matching estimator that compares the earnings of individuals with the same admissions profiles.

- Cleaner comparison than a simple difference-in-means.
- Somewhat difficult to implement.
- Throws away data (inefficient).

**Approach 3:** Estimate a regression that compares the earnings of individuals with the same admissions profiles.
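---
# Returns to Private College: A Sketch in R

As a rough preview, here is a minimal **R** sketch of Approaches 1 and 3. The data are simulated and the variable names (`ability`, `private`, `earnings`) are hypothetical, so treat this as an illustration rather than a real analysis.

```r
# Simulated example: earnings depend on ability, and high-ability students
# are more likely to attend a private college (selection into treatment)
set.seed(320)
n        <- 1000
ability  <- rnorm(n)                           # stand-in for the admissions profile
private  <- rbinom(n, 1, plogis(ability))      # high ability -> more likely private
earnings <- 40000 + 5000 * private + 10000 * ability + rnorm(n, sd = 5000)

# Approach 1: naive difference in means (prone to selection bias)
mean(earnings[private == 1]) - mean(earnings[private == 0])

# Approach 3: regression that controls for the admissions profile
lm(earnings ~ private + ability)
```

The naive comparison should overstate the return, because high-ability students are over-represented among private-college graduates; the regression coefficient on `private` should land much closer to the true effect of 5,000 built into the simulation.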
<!-- --- -->
<!-- # Difference-in-Means, Take 2 -->
<!-- ## Example: Returns to private college -->
<!-- show same data with groupings based on application profile; what are the differences/similarities within/across groups?; calculate within-group diff-in-means; take average of these (unweighted, then weighted); show and discuss causal diagram -->

<!-- --- -->
<!-- # Difference-in-Means, Regression style -->
<!-- ## Example: Returns to private college -->
<!-- write pop model, describe coefficients and regression lingo; hand wave about OLS and estimated pop model; run regression of example data -->

---
# The Regression Model

We can estimate the effect of `\(X\)` on `\(Y\)` by estimating a .hi[regression model]:

`$$Y_i = \beta_0 + \beta_1 X_i + u_i$$`

- `\(Y_i\)` is the outcome variable.

--

- `\(X_i\)` is the treatment variable (continuous).

--

- `\(\beta_0\)` is the **intercept** parameter. `\(\mathop{\mathbb{E}}\left[ {Y_i | X_i=0} \right] = \beta_0\)`

--

- `\(\beta_1\)` is the **slope** parameter, which, under the correct causal setting, represents the marginal effect of `\(X_i\)` on `\(Y_i\)`. `\(\frac{\partial Y_i}{\partial X_i} = \beta_1\)`

--

- `\(u_i\)` is an error (disturbance) term that includes all other (omitted) factors affecting `\(Y_i\)`.

---
# The Error Term

<br>

`\(u_i\)` is quite special. If we consider the data generating process of variable `\(Y_i\)`, `\(u_i\)` captures all the unobserved variables that explain variation in `\(Y_i\)`.

- There is always some error in our models; we just aim for it to be small relative to the variation we are trying to explain

- Some of the observed data may also have been recorded incorrectly (measurement error)

The error term is the price we are willing to accept for a simpler model.

---
# The Error Term

To be explicit, five factors contribute to the existence of this disturbance term.

--

* Omission of Explanatory Variables

--

* Aggregation of Variables

--

* Model Misspecification

--

* Functional Misspecification

--

* Measurement Error

---
# Running Regressions

The intercept and slope are population parameters. Using an estimator with data on `\(X_i\)` and `\(Y_i\)`, we can estimate a .hi[fitted regression line]:

`$$\hat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i = b_0 + b_1 X_i$$`

- `\(\hat{Y_i}\)` is the **fitted value** of `\(Y_i\)`.
- `\(\hat{\beta}_0\)` is the **estimated intercept**.
- `\(\hat{\beta}_1\)` is the **estimated slope**.

--

The estimation procedure produces misses, called .hi[residuals], defined as `\(Y_i - \hat{Y_i}\)`.

---
# Running Regressions

<br>

In practice, we estimate the regression coefficients using an estimator called .hi[Ordinary Least Squares] (OLS).

- Picks estimates that make `\(\hat{Y_i}\)` as close as possible to `\(Y_i\)` given the information we have on `\(X\)` and `\(Y\)`.

- The residual sum of squares (RSS), `\(\sum_{i=1}^n (Y_i - \hat{Y_i})^2\)`, gives us an idea of how accurate our model is.

- .hi[OLS] minimizes this sum.

- We will dive into the details next class; a short numerical sketch follows in a couple of slides.

---
# Running Regressions

OLS picks `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` that trace out the line of best fit. Ideally, we would like to interpret the slope of the line as the causal effect of `\(X\)` on `\(Y\)`.

<img src="05-Regression_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
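---
# Running Regressions: RSS in Practice

A small **R** sketch of the idea above, using simulated data and hypothetical names (`x`, `y`): `lm()` fits the line, and its residual sum of squares is no larger than that of any other candidate line.

```r
# Simulated data generating process: Y = 2 + 0.5 X + u
set.seed(320)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)

# OLS fit; coef() returns the estimated intercept and slope
fit <- lm(y ~ x)
coef(fit)

# Residual sum of squares for any candidate line (b0, b1)
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

rss(coef(fit)[1], coef(fit)[2])  # RSS at the OLS estimates (the minimum)
rss(2, 0.5)                      # even the true parameters do slightly worse in-sample
rss(0, 1)                        # an arbitrary line does much worse
```

Next class derives the formulas that `lm()` uses to find this minimizing line.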
---
# Confounders

However, the data are grouped by a third variable `\(W\)`. How would omitting `\(W\)` from the regression model affect the slope estimator?

<img src="05-Regression_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
# Confounders

The problem with `\(W\)` is that it affects both `\(Y\)` and `\(X\)`. Without adjusting for `\(W\)`, we cannot isolate the causal effect of `\(X\)` on `\(Y\)`.

<img src="05-Regression_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders

We can control for `\(W\)` by specifying it in the regression model:

`$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i$$`

- `\(W_i\)` is a **control variable**.

- By including `\(W_i\)` in the regression, OLS can difference out the confounding effect of `\(W\)`.

- **Note:** OLS doesn't care whether a right-hand side variable is a treatment or control variable, but we do.

---
# Controlling for Confounders

.center[]

---
# Controlling for Confounders

Controlling for `\(W\)` "adjusts" the data by **differencing out** the group-specific means of `\(X\)` and `\(Y\)`.

.hi-purple[The slope of the estimated regression line changes!]

<img src="05-Regression_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders

Can we interpret the estimated slope parameter as the causal effect of `\(X\)` on `\(Y\)` now that we've adjusted for `\(W\)`?

<img src="05-Regression_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders
## Example: Returns to schooling

Last class:

> **Q:** Could we simply compare the earnings of those with more education to those with less?
> <br> **A:** If we want to measure the causal effect, probably not.

.hi-green[What omitted variables should we worry about?]

---
# Controlling for Confounders
## Example: Returns to schooling

Three regressions ***of*** wages ***on*** schooling.
<table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 5.571 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.581 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.039) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.066) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 0.052 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.026 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.003) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.004 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> -0.127 </td> </tr> <tr> <td 
style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.019) </td> </tr> </tbody> </table> --- count: false # Controlling for Confounders ## Example: Returns to schooling Three regressions ***of*** wages ***on*** schooling. <table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.571 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 5.581 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.039) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.066) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.052 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 0.026 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.003) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.005) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 0.004 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 
!important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.001) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> -0.127 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.019) </td> </tr> </tbody> </table> --- count: false # Controlling for Confounders ## Example: Returns to schooling Three regressions ***of*** wages ***on*** schooling. <table> <caption>Outcome variable: log(Wage)</caption> <thead> <tr> <th style="text-align:left;"> Explanatory variable </th> <th style="text-align:center;"> 1 </th> <th style="text-align:center;"> 2 </th> <th style="text-align:center;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.571 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 5.581 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 5.695 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.039) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.066) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.068) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> Education </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.052 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.026 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 0.027 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.003) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.005) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.005) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: 
italic;color: black !important;"> IQ Score </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> 0.004 </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> 0.003 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> (0.001) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.001) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;line-height: 110%;font-style: italic;color: black !important;"> South </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;line-height: 110%;font-weight: bold;"> -0.127 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.019) </td> </tr> </tbody> </table> --- # Omitted-Variable Bias The presence of omitted-variable bias (OVB) precludes causal interpretation of our slope estimates. We can back out the sign and magnitude of OVB by subtracting the .pink[slope estimate from a ***long*** regression] from the .purple[slope estimate from a ***short*** regression]: `$$\text{OVB} = \color{#9370DB}{\hat{\beta}_1^{\text{Short}}} - \color{#e64173}{\hat{\beta}_1^{\text{Long}}}$$` -- __Dealing with potential sources of OVB is one of the main objectives of econometric analysis!__ <!-- Find example RCT data and run through R example w/ diff-in-means and regression --> <!-- https://www.povertyactionlab.org/evaluation/summer-jobs-reduce-violence-among-disadvantaged-youth-united-states --> --- # OVB vs. Irrelevant Variables <br> So if we risk bias as a result of excluding a variable, why not throw every possible variable and transformation of variables (log-linearized, squared, inverted) at the model? - Time consuming - Data not always available - Irrelevant variables actually make matters **worse** --- # OVB vs. Irrelevant Variables How can more variables cause trouble? .hi-pink[Loss of efficiency] in estimator while still unbiased. 
- This is the classic .hi-pink[multicollinearity] problem

- If an irrelevant variable is highly correlated with your explanatory variable of interest, the standard error will increase

- Inference about the coefficient's significance becomes muddled by the higher standard error

- More details on what this looks like statistically next week

---
# Summary

.hi-orange[What to remember]

- Regressions are models of how we imagine the data generating process plays out

- They are usually simplifications of real-life observations

- A linear regression fits a line through the data to reveal the relationship between treatment and outcome

- Confounders, omitted variables, and irrelevant variables all threaten our ability to identify the population parameter of interest in our regression model (a short simulated OVB example appears in the appendix slide at the end)

- OLS is the most common algorithm for estimating regressions, and that is what our next lecture will focus on

---
exclude: true
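---
# Appendix: OVB in a Simulated Example

A minimal **R** sketch of the short-versus-long comparison, `\(\text{OVB} = \hat{\beta}_1^{\text{Short}} - \hat{\beta}_1^{\text{Long}}\)`. The data are simulated and the parameter values are made up for illustration.

```r
# Data generating process in which W confounds the X-Y relationship
set.seed(320)
n <- 1000
w <- rnorm(n)
x <- 0.8 * w + rnorm(n)              # X is correlated with the confounder W
y <- 1 + 2 * x + 3 * w + rnorm(n)    # true effect of X on Y is 2

beta_short <- coef(lm(y ~ x))["x"]      # short regression: omits W
beta_long  <- coef(lm(y ~ x + w))["x"]  # long regression: controls for W

beta_short - beta_long                  # implied OVB in the short regression
```

Because `\(W\)` raises both `\(X\)` and `\(Y\)` here, the short regression overstates `\(\beta_1\)`; controlling for `\(W\)` pulls the estimate back toward the true value of 2.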