class: center, middle, inverse, title-slide .title[ # Topic 10: Regression Modeling ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Data Science for Economists”
by Grant R. McDermott, used under the
MIT License*
.] ] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of contents 1. [Basic regression in R](#basics) 1. [Indicator and interaction terms](#indicators) 1. [Handling missing data](#missing) 1. [Modeling nonlinear relationships](#nonlinear) 1. [Using regression models for prediction](#prediction) 1. [Econometrics packages in R](#functions) 1. [Appendix: Interpreting coefficients](#interpret) --- class: inverse, middle name: basics # Regression in R --- # Prep work Load the packages we'll be using: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load(tidyverse, estimatr, broom, summarytools, fixest, binsreg) ``` We'll mostly be working with the `starwars` data frame that comes with `dplyr`. ```r head(starwars) ``` ``` ## # A tibble: 6 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… ## 4 Darth Va… 202 136 none white yellow 41.9 male mascu… ## 5 Leia Org… 150 49 brown light brown 19 fema… femin… ## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- # Basic regression in R Base R's command for running regression models is `lm()` ("linear models"). ```r lm(y ~ x1 + x2 + x3, data = df) ``` It estimates this regression: `$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i$$` Using the variables (columns) `y` and `x` from the object `df`. --- # Basic regression in R Let's run a simple bivariate regression of mass on height using our dataset of Star Wars characters. ```r ols1 = lm(mass ~ height, data = starwars) ols1 ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Coefficients: ## (Intercept) height ## -11.487 0.624 ``` Most valuable information is buried within this object's internal list structure. Try `View(ols1)`. --- # Basic regression in R To summarise the key pieces of information, we can use the generic summary() function. This will look pretty similar to the default regression output from Stata. ```r summary(ols1) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -60.95 -29.51 -20.83 -17.65 1260.29 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -11.4868 111.3842 -0.103 0.918 ## height 0.6240 0.6262 0.997 0.323 ## ## Residual standard error: 169.5 on 57 degrees of freedom ## (28 observations deleted due to missingness) ## Multiple R-squared: 0.01712, Adjusted R-squared: -0.0001194 ## F-statistic: 0.9931 on 1 and 57 DF, p-value: 0.3232 ``` --- # Basic regression in R To just show the coefficients: ```r summary(ols1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -11.4868157 111.3841576 -0.1031279 0.9182234 ## height 0.6240033 0.6261744 0.9965328 0.3232031 ``` --- # Get tidy regression coefficients Note advantage over Stata: the regression is an object, not an action. - Results are stored for later use -- they don't just disappear. 
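For example, you can keep pulling pieces out of the stored object whenever you need them. A quick sketch using standard base-R accessors (all of these work on any `lm` object):

```r
coef(ols1)       # named vector of estimated coefficients
fitted(ols1)     # fitted values for the observations actually used
residuals(ols1)  # residuals
nobs(ols1)       # number of observations used in the estimation
```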
In practice you'll often want to convert your regression output to a tidy data frame (aka tibble). You can do this using `tidy()` from the `broom` package. ```r tidy(ols1, conf.int = TRUE) ``` ``` ## # A tibble: 2 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -11.5 111. -0.103 0.918 -235. 212. ## 2 height 0.624 0.626 0.997 0.323 -0.630 1.88 ``` --- # Subsetting data for a regression What data analysis mistake have we made already? -- **We didn't plot our data.** 😱 ```r ggplot(starwars, aes(x=height, y=mass)) + geom_point() ``` <img src="10-Regression_files/figure-html/jabba-1.png" width="80%" style="display: block; margin: auto;" /> --- # Subsetting data for a regression If we think Jabba is not part of the same data generating process we want to study, we can filter him out. ```r no_jabba = starwars |> filter(!str_detect(name, "Jabba")) ols2 = lm(mass ~ height, no_jabba) summary(ols2) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -31.3 12.8 -2.44 1.79e- 2 ## 2 height 0.613 0.0720 8.51 1.14e-11 ``` --- class: inverse, middle name: indicators # Indicator and interaction terms --- # Indicator variables For the next example, let's use only the human characters. ```r humans = filter(starwars, species=="Human") freq(humans$gender) ``` ``` ## Frequencies ## humans$gender ## Type: Character ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## --------------- ------ --------- -------------- --------- -------------- ## feminine 9 25.71 25.71 25.71 25.71 ## masculine 26 74.29 100.00 74.29 100.00 ## <NA> 0 0.00 100.00 ## Total 35 100.00 100.00 100.00 100.00 ``` --- # Indicator variables ```r humans = humans |> mutate(masculine = as.integer(gender == "masculine")) lm(mass ~ masculine, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56.3 9.52 5.91 0.0000134 ## 2 masculine 29.4 10.3 2.84 0.0108 ``` How do we interpret the `masculine` coefficient? -- - The variable is coded as `\(0\)` for `feminine` and `\(1\)` for `masculine`. - So the intercept gives the average mass of `feminine` characters: `\(\alpha = 56.3\)` kg. - Relative to that, the average effect of being `masculine` on mass is `\(29.4\)` kg. - So the average mass of `masculine` characters is `\(56.3 + 29.4 = 85.7\)` kg. --- # Regressions just estimate means If we don't want to have to add up terms, we can estimate the mean for each group separately! (Just omit the intercept.) ```r humans = humans |> mutate(feminine = 1 - masculine) lm(mass ~ feminine + masculine + 0, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 feminine 56.3 9.52 5.91 1.34e- 5 ## 2 masculine 85.7 4.00 21.4 2.93e-14 ``` ```r group_by(humans, gender) |> summarize(mean(mass, na.rm=T)) ``` ``` ## # A tibble: 2 × 2 ## gender `mean(mass, na.rm = T)` ## <chr> <dbl> ## 1 feminine 56.3 ## 2 masculine 85.7 ``` --- # Factor variables as group indicators We actually didn't need to specifically create new binary variables. R will automatically interpret any factor (or string!) variables as indicators: ```r lm(mass ~ gender, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56.3 9.52 5.91 0.0000134 ## 2 gendermasculine 29.4 10.3 2.84 0.0108 ```
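If you ever want a different omitted category, you can re-order the factor levels. A quick sketch using base R's `relevel()` (wrapping `gender` in `factor()` first, since it is stored as a character):

```r
# Make "masculine" the reference category instead of "feminine"
lm(mass ~ relevel(factor(gender), ref = "masculine"), data = humans) |> tidy()
```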
--- # Interaction terms To find the effect of height separately for each gender, we can estimate this regression: `$$Mass_i = \beta_0 + \beta_1 Masc_i + \beta_2 Height_i + \beta_3 Masc_i \times Height_i + u_i$$` Two equivalent syntaxes: ```r lm(mass ~ gender * height, data = humans) ``` ```r lm(mass ~ height + gender + height:gender, data = humans) ``` Using `*` is a good default. In most cases, you need to include all "parent" terms in order to interpret the interaction correctly. --- # Interaction terms To find the effect of height separately for each gender, we can estimate this regression: `$$Mass_i = \beta_0 + \beta_1 Masc_i + \beta_2 Height_i + \beta_3 Masc_i \times Height_i + u_i$$` ```r lm(mass ~ gender * height, data = humans) |> tidy() ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 87.9 103. 0.852 0.407 ## 2 gendermasculine -177. 130. -1.36 0.193 ## 3 height -0.189 0.616 -0.307 0.763 ## 4 gendermasculine:height 1.15 0.755 1.52 0.148 ``` How do we interpret the two height coefficients? -- - For each 1-cm increase in height, mass changes for feminine characters by `\(-0.19\)` kg (a small, imprecisely estimated decrease). - For each 1-cm increase in height, mass increases by `\(1.15\)` kg *more* (or *faster*) for masculine characters than it does for feminine characters. - Or: For each 1-cm increase in height, mass increases for masculine characters by `\(-0.19 + 1.15 = 0.96\)` kg. --- # Transformations Conveniently, you can make transformations inline, without having to `mutate()`. - Arithmetic transformations: `x^2`, `I(x/3)`, `I((x - mean(x))/sd(x))` - Log/exponential transformations: `log(x)`, `exp(x)` - Indicators: `I(x < 100)`, `I(x == "Montana")` --- # Challenge Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` (Remember the variable `birth_year` encodes age, for some reason.) <!-- !!! Don't look at the next slide until you try coding it yourself !!! --> -- ``` ## # A tibble: 6 × 4 ## estimate std.error statistic p.value ## <dbl> <dbl> <dbl> <dbl> ## 1 -0.688 64.4 -0.0107 9.92e- 1 ## 2 -0.114 0.699 -0.163 8.71e- 1 ## 3 8.44 9.00 0.937 3.56e- 1 ## 4 0.00432 0.00232 1.86 7.25e- 2 ## 5 1309. 22.5 58.2 2.07e-32 ## 6 -0.104 0.0641 -1.63 1.14e- 1 ``` --- # Challenge Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` ```r reg_challenge = lm( mass ~ height * log(birth_year) + I(height^2) + I(str_detect(name, "Jabba")), data = starwars ) tidy(summary(reg_challenge)) ``` ``` ## # A tibble: 6 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 "(Intercept)" -0.688 64.4 -0.0107 9.92e- 1 ## 2 "height" -0.114 0.699 -0.163 8.71e- 1 ## 3 "log(birth_year)" 8.44 9.00 0.937 3.56e- 1 ## 4 "I(height^2)" 0.00432 0.00232 1.86 7.25e- 2 ## 5 "I(str_detect(name, \"Jabba\"))TRUE" 1309.
22.5 58.2 2.07e-32 ## 6 "height:log(birth_year)" -0.104 0.0641 -1.63 1.14e- 1 ``` --- # More challenge 1. Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` 2. Run the same regression, but instead of creating an indicator just for Jabba, filter him out beforehand. Do you get the same results? 3. There was a lot of missing data for `birth_year` (i.e., age). How many observations did `lm()` actually use? (Hint: Explore the resulting object some more.) --- class: inverse, middle name: missing # Handling missing data --- # Missing data Best option: **Find it.** (The alternatives are not great...) -- Try to figure out why the values are missing. Then choose one of these options: **Delete** the observations with missing values. - Most common choice in economics research. - Caveat results as valid *for the nonmissing sample*. - No problem at all under a "Missing Completely at Random" assumption. (Plausible?) - Usually the only option if your outcome variable is missing. -- **Impute** the values (with means/medians, group means, or "multiple imputation" methods). - Fill in the missing value, predicting it based on the values of other variables. - Still relies on a "Missing at Random" assumption. - Common in prediction problems. - Rare in economics. For causal inference, can introduce new problems. --- # Missing data **Interpolate** the values between its neighbors in time or space. - Time: Assume values change linearly (or exponentially) between observations. - Useful in panel data for control variables / predictors. - Example: Control for population in each year, but only have data every 10 years. - Space: Assume values change smoothly in two spatial dimensions. - Inverse distance weighting; kriging (can be done using GIS tools) - Example: Fill in groundwater levels at every farm, but only have measurements at certain places. -- **Create a flag** (binary variable) to indicate observations with missing values. - Then replace the `NA`'s in the original variable with a consistent value, such as `0` or the sample mean. This works so long as you always include both variables in the analysis. - Allows you to keep the observation and not lose the rest of its information. - Easy. Works well for both prediction and causal inference. --- # More challenge Let's not throw out all those observations just because we can't control for age. Use the flag approach to deal with this variable: 1. Create a new binary indicator variable that =1 if `birth_year` is missing and =0 otherwise. 1. Create a new variable that equals `log(birth_year)` except a specific value is filled in for all rows with missing values (choose a value outside the existing range of `log(birth_year)`). 1. Verify that `mass ~ log(birth_year)` and `mass ~ new_variable + flag_variable` yield the same coefficient on the continuous variable. 1. Re-run the same regression on the previous slide, using this flag approach. (Make sure to treat the flag variable *exactly* the same way as the continuous variable -- if you were interacting both with X before, interact both with X now.) 
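---

# More challenge

One minimal sketch of steps 1-3 (the object and variable names here are just illustrative, and the fill value of `-1` is an arbitrary choice below the observed range of `log(birth_year)`):

```r
flagged = starwars |>
  mutate(
    age_flag    = as.integer(is.na(birth_year)),                  # 1. missingness indicator
    log_age_fix = ifelse(is.na(birth_year), -1, log(birth_year))  # 2. filled-in log(age)
  )

# 3. The coefficient on the continuous term should match across the two models
lm(mass ~ log(birth_year), data = starwars) |> tidy()
lm(mass ~ log_age_fix + age_flag, data = flagged) |> tidy()
```

The point estimates on `log(birth_year)` and `log_age_fix` should agree, because the flag absorbs the mean of the filled-in group.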
--- class: inverse, middle name: nonlinear # Modeling nonlinear relationships --- # Some preliminaries Load the London Airbnb data we used in the Exploratory Analysis lecture, and take a 5% random sample: ```r set.seed(123) london = read_csv("https://osf.io/ey5p7/download") london_sample = london |> mutate(ln_price = log(price)) |> filter(!is.na(price)) |> slice_sample(prop = 0.05) ``` --- # Modeling nonlinear relationships **How should we model the relationship between longitude and price?** -- First, we need to check the CEF with a quick binscatter: ```r binsreg(y = london_sample$ln_price, x = london_sample$longitude) ``` <img src="10-Regression_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> ``` ## Call: binsreg ## ## Binscatter Plot ## Bin/Degree selection method (binsmethod) = IMSE direct plug-in (select # of bins) ## Placement (binspos) = Quantile-spaced ## Derivative (deriv) = 0 ## ## Group (by) = Full Sample ## Sample size (n) = 2690 ## # of distinct values (Ndist) = 2690 ## # of clusters (Nclust) = NA ## dots, degree (p) = 0 ## dots, smoothness (s) = 0 ## # of bins (nbins) = 21 ``` --- # Modeling nonlinear relationships So then... **what's wrong with running the following regression?** ```r linear = feols(ln_price ~ longitude, data = london_sample) tidy(linear) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.15 0.0231 180. 0 ## 2 longitude -1.13 0.147 -7.67 2.35e-14 ``` -- The plot showed us that a linear specification of longitude is a bad fit. * It still tells us an average derivative. * But that's not an average that's very useful. Instead, let's choose a **nonlinear specification** that gets us closer to the true CEF. -- 1. Bin regression (step functions). 2. Polynomial regression. 3. Piecewise functions (splines). --- # 1. Bin regression (step functions) Just like a binscatter. Use a regression to estimate means by quantile: ```r binned = feols(ln_price ~ factor(ntile(longitude, 10)), data = london_sample) tidy(binned) ``` ``` ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.14 0.0408 102. 0 ## 2 factor(ntile(longitude, 10))2 0.263 0.0577 4.55 5.51e- 6 ## 3 factor(ntile(longitude, 10))3 0.394 0.0577 6.83 1.06e-11 ## 4 factor(ntile(longitude, 10))4 0.444 0.0577 7.70 1.96e-14 ## 5 factor(ntile(longitude, 10))5 0.322 0.0577 5.58 2.62e- 8 ## 6 factor(ntile(longitude, 10))6 0.206 0.0577 3.57 3.60e- 4 ## 7 factor(ntile(longitude, 10))7 0.150 0.0577 2.60 9.46e- 3 ## 8 factor(ntile(longitude, 10))8 0.0692 0.0577 1.20 2.30e- 1 ## 9 factor(ntile(longitude, 10))9 -0.115 0.0577 -1.99 4.63e- 2 ## 10 factor(ntile(longitude, 10))10 -0.213 0.0577 -3.70 2.23e- 4 ``` **How do we interpret these coefficients?** --- # 1. Bin regression (step functions) To visualize the estimated regression model, generate **fitted values** using `predict`: ```r binned_predict = predict(binned, data = london_sample, interval = "confidence") binned_fitted = cbind(london_sample, binned_predict) head(binned_predict) ``` ``` ## fit se.fit ci_low ci_high ## 1 4.213814 0.04077622 4.133858 4.293770 ## 2 3.931437 0.04077622 3.851481 4.011394 ## 3 4.466514 0.04077622 4.386558 4.546470 ## 4 4.538370 0.04077622 4.458414 4.618326 ## 5 4.213814 0.04077622 4.133858 4.293770 ## 6 4.144617 0.04077622 4.064661 4.224573 ``` --- # 1. 
Bin regression (step functions) ```r ggplot(binned_fitted, aes(x = longitude)) + theme_light() + geom_point(aes(y = ln_price), alpha = 0.1) + geom_ribbon(aes(ymin = ci_low, ymax = ci_high), fill = "coral") + geom_line(aes(y = fit), size = 1, color = "orangered4") ``` <img src="10-Regression_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" /> --- # 1. Bin regression (step functions) **Pros:** - Extremely flexible: Gets us close to the true CEF without having to specify a functional form. **Cons:** (Why might they not meet our needs?) -- - Estimates may be too noisy (high variance). - Fitted values have sudden jumps even if the true DGP is smooth. --- # 2. Polynomial regression Another approach is **polynomial regression.** The simplest is quadratic -- just two terms for longitude. ```r poly2 = feols(ln_price ~ longitude + longitude^2, data = london_sample) tidy(poly2) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.05 0.0245 165. 0 ## 2 longitude -3.67 0.280 -13.1 3.82e-38 ## 3 I(longitude^2) -9.24 0.872 -10.6 9.56e-26 ``` These terms are hard to interpret until we visualize the model. --- # 2. Polynomial regression The quadratic model: <img src="10-Regression_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression Cubic: <img src="10-Regression_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression Quartic: <img src="10-Regression_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression **Pros:** - Flexible -- easily captures many types of smooth nonlinear relationships. - Parsimonious -- fewer coefficients to estimate = lower variance. **Cons:** (What do you think?) -- - Not so good if CEF is not smooth. - Poorly behaved in the tails. --- # 3. Piecewise functions (splines) Piecewise linear (1 knot at `\(-0.17\)`): <img src="10-Regression_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" /> --- # 3. Piecewise functions (splines) Piecewise cubic (7 knots): <img src="10-Regression_files/figure-html/unnamed-chunk-19-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle name: prediction # Using regression models for prediction --- # Using regression models for prediction Once we estimate a regression model, what might we want to do with it? 1. Learn about the **relationship** between variables *(descriptive analysis, causal inference)*. 2. Use it to **predict** the outcome variable in new data *(predictive analysis)*. Both of these activities can involve calculating *predicted values* from the model... but we need to be careful. --- # Point 1: Counterfactuals != prediction **Common point of confusion:** "Isn't the point of estimating a causal effect so that we can use it to *predict* what will happen?" - The effect of legalizing cannabis on police arrests. - The effect of climate change on economic growth. Yes! But: In causal inference, we are only making a prediction **holding everything else constant.** - We are never claiming to predict what the future looks like (the overall level of the outcome).
We only want to make a *counterfactual projection:* - How different the outcome will be **relative to what it would be without the treatment.** In predictive analysis, we *do* want to predict the value of the outcome. --- # Point 2: Confidence/prediction intervals **Confidence intervals** characterize uncertainty about the **relationships.** - If we gathered more data, where would the "true" line of best fit lie? <img src="10-Regression_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" /> --- # Point 2: Confidence/prediction intervals **Prediction intervals** characterize uncertainty about **individual observations.** - If we use this model to predict the outcome for new data points, how close would those predictions be to the truth? <img src="10-Regression_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle name: functions # Econometrics packages in R --- # Robust standard errors `lm()` uses classical/homoskedastic standard errors, which you basically should **never** use. You can obtain heteroskedasticity-consistent "robust" standard errors using `estimatr::lm_robust()`. ```r ols1_robust = lm_robust(mass ~ height, data = starwars) ols1_robust ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower ## (Intercept) -11.4868157 24.43150661 -0.470164 6.400318e-01 -60.4100638 ## height 0.6240033 0.08824021 7.071644 2.417174e-09 0.4473053 ## CI Upper DF ## (Intercept) 37.4364324 57 ## height 0.8007013 57 ``` --- # Clustered standard errors If you haven't learned about clustering yet... You want to **cluster** your standard errors whenever your observations are correlated within unit/group/geography (i.e., not fully independent draws from a population).  Clustering is for cross-sectional or panel data. (Time series data calls for other solutions.) </br> .small[Image from xkcd ([source](https://xkcd.com/2533/)) and is not included in this resource's CC license.] --- # Clustered standard errors To cluster by `homeworld`: ```r lm_robust(mass ~ height, data = starwars, clusters = homeworld) ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower ## (Intercept) -6.4571556 30.4270813 -0.2122174 0.8376015708 -77.4711944 ## height 0.5959504 0.1066112 5.5899430 0.0004845487 0.3508593 ## CI Upper DF ## (Intercept) 64.5568831 7.485115 ## height 0.8410415 8.144022 ``` --- # Fixed effect regressions Most "real" regressions you run in economics will involve fixed effects of some kind (i.e., many group- or unit-specific indicators). `fixest::feols()` **should be your new best friend.** - It can handle large numbers of fixed effects quickly and quietly. - It has tons of options and is *orders of magnitude* faster than alternatives (`lfe` in R or `reghdfe` in Stata). - It can also handle simpler regressions, so it deserves to be your default everyday regression function. 
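For instance, even a plain bivariate regression runs fine in `feols()`. A small sketch (assuming a recent version of `fixest`, where the `vcov` argument selects robust or clustered standard errors):

```r
# Everyday regression with heteroskedasticity-robust SEs
feols(mass ~ height, data = starwars, vcov = "hetero")

# Same regression with SEs clustered by homeworld
feols(mass ~ height, data = starwars, vcov = ~homeworld)
```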
Syntax to include species FE: `$$Mass_{i} = \beta_0 + \beta_1 Height_{i} + \lambda_{species} + u_i$$` ```r ols_fe = feols(mass ~ height | species, data = starwars) ``` --- # Regression tables Get **quick regression tables** with `fixest::etable()`: ```r etable(ols_fe, ols_fe) ``` ``` ## ols_fe ols_fe.1 ## Dependent Var.: mass mass ## ## height 0.9256*** (0.0191) 0.9256*** (0.0191) ## Fixed-Effects: ------------------ ------------------ ## species Yes Yes ## _______________ __________________ __________________ ## S.E.: Clustered by: species by: species ## Observations 56 56 ## R2 0.99645 0.99645 ## Within R2 0.61556 0.61556 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` </br> Try package `modelsummary` for publication-quality tables that you can output with formatting to LaTeX, Markdown, etc. --- # Other models in R Instrumental variables (IV) regressions: - `fixest::feols()` (see, your new best friend!) - Syntax: `y ~ controls | fe | endogenous ~ instrument`. - `estimatr::iv_robust()` - `ivreg::ivreg()` Logit regression: - `glm(formula, data, family = binomial(link = "logit"))` - `feglm(formula, data, family = binomial(link = "logit"))` - To get marginal effects, try package `mfx`. "glm" stands for generalized linear models. --- # Summary 1. [Basic regression in R](#basics) 1. [Indicator and interaction terms](#indicators) 1. [Handling missing data](#missing) 1. [Modeling nonlinear relationships](#nonlinear) 1. [Using regression models for prediction](#prediction) 1. [Econometrics packages in R](#functions) 1. [Appendix: Interpreting coefficients](#interpret) --- class: inverse, middle name: interpret # Appendix: Interpreting coefficients --- # Interpreting regression coefficients `$$y_i = \alpha + \beta x_{i} + u_i$$` - `\(\beta\)`: The "effect" of `\(x\)` on `\(y\)`. - How much `\(y\)` changes in response to a one-unit change in `\(x\)`. - The average derivative: `\(dy/dx\)`. - Not necessarily causal! Only if you can assume away selection bias. * `\(\alpha\)`: The y-intercept. - The average value of `\(y\)` when `\(x = 0\)`. ```r lm(mass ~ height, no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -31.3 12.8 -2.44 1.79e- 2 ## 2 height 0.613 0.0720 8.51 1.14e-11 ``` Put into a complete sentence with units: -- - "For every **1 cm** increase in height, average mass increases by **0.6 kg.**" --- # Interpreting log-transformed variables `$$\log(y_i) = \alpha + \beta x_{i} + u_i$$` ```r lm(log(mass) ~ height, no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 2.30 0.164 14.0 5.49e-20 ## 2 height 0.0111 0.000921 12.1 3.45e-17 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** - Here: "A **1 cm** change in height is associated with a **0.01 log unit** change in mass." -- But log units have a convenient **proportional change** or **percentage change** interpretation: - "A **1-cm** change in height is associated with a **0.01 times** change in mass." - "A **1-cm** change in height is associated with a **1-percent** change in mass." --- # Interpreting log-transformed variables **Why?** Here's the algebra: Original values `\(x_0\)` and `\(y_0\)`. Percentage change in `\(y\)` is `\(\% \Delta y = \Delta y / y_0\)`. 
Model: `$$\ln(y_0) = \alpha + \beta x_0$$` `$$\ln(y_0 + \Delta y) = \alpha + \beta (x_0 + 1)$$` Solve for `\(\% \Delta y\)`: `$$\ln(y_0 + \Delta y) = \ln(y_0) + \beta$$` `$$\ln(\frac{y_0 + \Delta y}{y_0}) = \beta$$` `$$\ln(1 + \% \Delta y) = \beta$$` Using the approximation `\(\ln(a+1) \approx a\)` for small `\(a\)`: `$$\beta \approx \% \Delta y.$$` --- # Interpreting log-transformed variables `$$y_i = \alpha + \beta \log(x_{i}) + u_i$$` ```r lm(mass ~ log(height), no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -341. 53.1 -6.43 3.03e- 8 ## 2 log(height) 81.2 10.3 7.86 1.31e-10 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** -- - Here: "A **1 log unit** change in height is associated with an **81-kg** change in mass." -- - "A **100%** change in height is associated with an **81-kg** change in mass." - "A **10%** change in height is associated with an **8.1-kg** change in mass." --- # Interpreting log-transformed variables `$$\log(y_i) = \alpha + \beta \log(x_{i}) + u_i$$` ```r lm(log(mass) ~ log(height), no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -3.68 0.630 -5.84 2.73e- 7 ## 2 log(height) 1.54 0.123 12.6 5.97e-18 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** -- - Here: "A **1 log unit** change in height is associated with a **1.5 log unit** change in mass." -- - "A **100%** change in height is associated with a **1.5 times** change in mass." - "A **100%** change in height is associated with a **150 percent** change in mass." -- - "A **10%** change in height is associated with a **15 percent** change in mass." - "A **1%** change in height is associated with a **1.5 percent** change in mass."
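A quick numerical check of this approximation, using the log-log model above (a sketch; the 10% increase is just an example):

```r
# Exact vs. approximate percent change in mass for a 10% increase in height
b = coef(lm(log(mass) ~ log(height), data = no_jabba))[["log(height)"]]
c(exact_pct  = (1.10^b - 1) * 100,  # exact: mass is multiplied by 1.10^b
  approx_pct = b * 10)              # approximation: beta times the percent change in height
```

For small percentage changes the two are nearly identical; for a 100% change in height the approximation is much rougher.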