class: center, middle, inverse, title-slide .title[ # Topic 10: Regression Modeling ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Data Science for Economists”
by Grant R. McDermott, used under the
MIT License*
.] ] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of contents 1. [Basic regression in R](#basics) 1. [Indicator and interaction terms](#indicators) 1. [Handling missing data](#missing) 1. [Modeling nonlinear relationships](#nonlinear) 1. [Using regression models for prediction](#prediction) 1. [Econometrics packages in R](#functions) 1. [Appendix: Interpreting coefficients](#interpret) --- class: inverse, middle name: basics # Regression in R --- # Prep work Load the packages we'll be using: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load(tidyverse, estimatr, broom, summarytools, fixest, binsreg) ``` We'll mostly be working with the `starwars` data frame that comes with `dplyr`. ```r head(starwars) ``` ``` ## # A tibble: 6 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… ## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… ## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu… ## 4 Darth Va… 202 136 none white yellow 41.9 male mascu… ## 5 Leia Org… 150 49 brown light brown 19 fema… femin… ## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- # Basic regression in R Base R's command for running regression models is `lm()` ("linear models"). ```r lm(y ~ x1 + x2 + x3, data = df) ``` It estimates this regression: `$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i$$` Using the variables (columns) `y` and `x` from the object `df`. --- # Basic regression in R Let's run a simple bivariate regression of mass on height using our dataset of Star Wars characters. ```r ols1 = lm(mass ~ height, data = starwars) ols1 ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Coefficients: ## (Intercept) height ## -11.487 0.624 ``` Most valuable information is buried within this object's internal list structure. Try `View(ols1)`. --- # Basic regression in R To summarise the key pieces of information, we can use the generic summary() function. This will look pretty similar to the default regression output from Stata. ```r summary(ols1) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = starwars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -60.95 -29.51 -20.83 -17.65 1260.29 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -11.4868 111.3842 -0.103 0.918 ## height 0.6240 0.6262 0.997 0.323 ## ## Residual standard error: 169.5 on 57 degrees of freedom ## (28 observations deleted due to missingness) ## Multiple R-squared: 0.01712, Adjusted R-squared: -0.0001194 ## F-statistic: 0.9931 on 1 and 57 DF, p-value: 0.3232 ``` --- # Basic regression in R To just show the coefficients: ```r summary(ols1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -11.4868157 111.3841576 -0.1031279 0.9182234 ## height 0.6240033 0.6261744 0.9965328 0.3232031 ``` --- # Get tidy regression coefficients Note advantage over Stata: the regression is an object, not an action. - Results are stored for later use -- they don't just disappear. 
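For example, you can keep pulling pieces out of the stored object whenever you need them. A quick sketch using standard base-R accessors (all of these work on any `lm` object):

```r
coef(ols1)       # named vector of estimated coefficients
fitted(ols1)     # fitted values for the observations actually used
residuals(ols1)  # residuals
nobs(ols1)       # number of observations used in the estimation
```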
In practice you'll often want to convert your regression output to a tidy data frame (aka tibble). You can do this using `tidy()` from the `broom` package. ```r tidy(ols1, conf.int = TRUE) ``` ``` ## # A tibble: 2 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -11.5 111. -0.103 0.918 -235. 212. ## 2 height 0.624 0.626 0.997 0.323 -0.630 1.88 ``` --- # Subsetting data for a regression What data analysis mistake have we made already? -- **We didn't plot our data.** 😱 ```r ggplot(starwars, aes(x=height, y=mass)) + geom_point() ``` <img src="10-Regression_files/figure-html/jabba-1.png" width="80%" style="display: block; margin: auto;" /> --- # Subsetting data for a regression If we think Jabba is not part of the same data generating process we want to study, we can filter him out. ```r no_jabba = starwars |> filter(!str_detect(name, "Jabba")) ols2 = lm(mass ~ height, no_jabba) summary(ols2) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -31.3 12.8 -2.44 1.79e- 2 ## 2 height 0.613 0.0720 8.51 1.14e-11 ``` --- class: inverse, middle name: indicators # Indicator and interaction terms --- # Indicator variables For the next example, let's use only the human characters. ```r humans = filter(starwars, species=="Human") freq(humans$gender) ``` ``` ## Frequencies ## humans$gender ## Type: Character ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## --------------- ------ --------- -------------- --------- -------------- ## feminine 9 25.71 25.71 25.71 25.71 ## masculine 26 74.29 100.00 74.29 100.00 ## <NA> 0 0.00 100.00 ## Total 35 100.00 100.00 100.00 100.00 ``` --- # Indicator variables ```r humans = humans |> mutate(masculine = as.integer(gender == "masculine")) lm(mass ~ masculine, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56.3 9.52 5.91 0.0000134 ## 2 masculine 29.4 10.3 2.84 0.0108 ``` How do we interpret the `masculine` coefficient? -- - The variable is coded as `\(0\)` for `feminine` and `\(1\)` for `masculine`. - So the intercept gives the average mass of `feminine` characters: `\(\alpha = 56.3\)` kg. - Relative to that, the average effect of being `masculine` on mass is `\(29.4\)` kg. - So the average mass of `masculine` characters is `\(56.3 + 29.4 = 85.7\)` kg. --- # Regressions just estimate means If we don't want to have to add up terms, we can estimate the mean for each group separately! (Just omit the intercept.) ```r humans = humans |> mutate(feminine = 1 - masculine) lm(mass ~ feminine + masculine + 0, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 feminine 56.3 9.52 5.91 1.34e- 5 ## 2 masculine 85.7 4.00 21.4 2.93e-14 ``` ```r group_by(humans, gender) |> summarize(mean(mass, na.rm=T)) ``` ``` ## # A tibble: 2 × 2 ## gender `mean(mass, na.rm = T)` ## <chr> <dbl> ## 1 feminine 56.3 ## 2 masculine 85.7 ``` --- # Factor variables as group indicators We actually didn't need to specifically create new binary variables. R will automatically interpret any factor (or string!) variables as indicators: ```r lm(mass ~ gender, data = humans) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 56.3 9.52 5.91 0.0000134 ## 2 gendermasculine 29.4 10.3 2.84 0.0108 ```
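If you ever want a different omitted category, you can re-order the factor levels. A quick sketch using base R's `relevel()` (wrapping `gender` in `factor()` first, since it is stored as a character):

```r
# Make "masculine" the reference category instead of "feminine"
lm(mass ~ relevel(factor(gender), ref = "masculine"), data = humans) |> tidy()
```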
--- # Interaction terms To find the effect of height separately for each gender, we can estimate this regression: `$$Mass_i = \beta_0 + \beta_1 Masc_i + \beta_2 Height_i + \beta_3 Masc_i \times Height_i + u_i$$` Two equivalent syntaxes: ```r lm(mass ~ gender * height, data = humans) ``` ```r lm(mass ~ height + gender + height:gender, data = humans) ``` Using `*` is a good default. In most cases, you need to include all "parent" terms in order to interpret the interaction correctly. --- # Interaction terms To find the effect of height separately for each gender, we can estimate this regression: `$$Mass_i = \beta_0 + \beta_1 Masc_i + \beta_2 Height_i + \beta_3 Masc_i \times Height_i + u_i$$` ```r lm(mass ~ gender * height, data = humans) |> tidy() ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 87.9 103. 0.852 0.407 ## 2 gendermasculine -177. 130. -1.36 0.193 ## 3 height -0.189 0.616 -0.307 0.763 ## 4 gendermasculine:height 1.15 0.755 1.52 0.148 ``` How do we interpret the two height coefficients? -- - For each 1-cm increase in height, mass changes for feminine characters by `\(-0.19\)` kg (a small, imprecisely estimated decrease). - For each 1-cm increase in height, mass increases by `\(1.15\)` kg *more* (or *faster*) for masculine characters than it does for feminine characters. - Or: For each 1-cm increase in height, mass increases for masculine characters by `\(-0.19 + 1.15 = 0.96\)` kg. --- # Transformations Conveniently, you can make transformations inline, without having to `mutate()`. - Arithmetic transformations: `x^2`, `I(x/3)`, `I((x - mean(x))/sd(x))` - Log/exponential transformations: `log(x)`, `exp(x)` - Indicators: `I(x < 100)`, `I(x == "Montana")` --- # Challenge Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` (Remember the variable `birth_year` encodes age, for some reason.) <!-- !!! Don't look at the next slide until you try coding it yourself !!! --> -- ``` ## # A tibble: 6 × 4 ## estimate std.error statistic p.value ## <dbl> <dbl> <dbl> <dbl> ## 1 -0.688 64.4 -0.0107 9.92e- 1 ## 2 -0.114 0.699 -0.163 8.71e- 1 ## 3 8.44 9.00 0.937 3.56e- 1 ## 4 0.00432 0.00232 1.86 7.25e- 2 ## 5 1309. 22.5 58.2 2.07e-32 ## 6 -0.104 0.0641 -1.63 1.14e- 1 ``` --- # Challenge Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` ```r reg_challenge = lm( mass ~ height * log(birth_year) + I(height^2) + I(str_detect(name, "Jabba")), data = starwars ) tidy(summary(reg_challenge)) ``` ``` ## # A tibble: 6 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 "(Intercept)" -0.688 64.4 -0.0107 9.92e- 1 ## 2 "height" -0.114 0.699 -0.163 8.71e- 1 ## 3 "log(birth_year)" 8.44 9.00 0.937 3.56e- 1 ## 4 "I(height^2)" 0.00432 0.00232 1.86 7.25e- 2 ## 5 "I(str_detect(name, \"Jabba\"))TRUE" 1309.
22.5 58.2 2.07e-32 ## 6 "height:log(birth_year)" -0.104 0.0641 -1.63 1.14e- 1 ``` --- # More challenge 1. Estimate this regression in the original `starwars` dataset: `$$\begin{eqnarray} Mass_{i} &=& \beta_0 + \beta_1 Height_i + \beta_2 Height^2_i + \beta_3 \log(Age_i) + \\ & & \beta_4 Height_i \times \log(Age_i) + \beta_5 1(Jabba_i) + u_i \end{eqnarray}$$` 2. Run the same regression, but instead of creating an indicator just for Jabba, filter him out beforehand. Do you get the same results? 3. There was a lot of missing data for `birth_year` (i.e., age). How many observations did `lm()` actually use? (Hint: Explore the resulting object some more.) --- class: inverse, middle name: missing # Handling missing data --- # Missing data Best option: **Find it.** (The alternatives are not great...) -- Try to figure out why the values are missing. Then choose one of these options: **Delete** the observations with missing values. - Most common choice in economics research. - Caveat results as valid *for the nonmissing sample*. - No problem at all under a "Missing Completely at Random" assumption. (Plausible?) - Usually the only option if your outcome variable is missing. -- **Impute** the values (with means/medians, group means, or "multiple imputation" methods). - Fill in the missing value, predicting it based on the values of other variables. - Still relies on a "Missing at Random" assumption. - Common in prediction problems. - Rare in economics. For causal inference, can introduce new problems. --- # Missing data **Interpolate** the values between its neighbors in time or space. - Time: Assume values change linearly (or exponentially) between observations. - Useful in panel data for control variables / predictors. - Example: Control for population in each year, but only have data every 10 years. - Space: Assume values change smoothly in two spatial dimensions. - Inverse distance weighting; kriging (can be done using GIS tools) - Example: Fill in groundwater levels at every farm, but only have measurements at certain places. -- **Create a flag** (binary variable) to indicate observations with missing values. - Then replace the `NA`'s in the original variable with a consistent value, such as `0` or the sample mean. This works so long as you always include both variables in the analysis. - Allows you to keep the observation and not lose the rest of its information. - Easy. Works well for both prediction and causal inference. --- # More challenge Let's not throw out all those observations just because we can't control for age. Use the flag approach to deal with this variable: 1. Create a new binary indicator variable that =1 if `birth_year` is missing and =0 otherwise. 1. Create a new variable that equals `log(birth_year)` except a specific value is filled in for all rows with missing values (choose a value outside the existing range of `log(birth_year)`). 1. Verify that `mass ~ log(birth_year)` and `mass ~ new_variable + flag_variable` yield the same coefficient on the continuous variable. 1. Re-run the same regression on the previous slide, using this flag approach. (Make sure to treat the flag variable *exactly* the same way as the continuous variable -- if you were interacting both with X before, interact both with X now.) 
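---

# More challenge

One minimal sketch of steps 1-3 (the object and variable names here are just illustrative, and the fill value of `-1` is an arbitrary choice below the observed range of `log(birth_year)`):

```r
flagged = starwars |>
  mutate(
    age_flag    = as.integer(is.na(birth_year)),                  # 1. missingness indicator
    log_age_fix = ifelse(is.na(birth_year), -1, log(birth_year))  # 2. filled-in log(age)
  )

# 3. The coefficient on the continuous term should match across the two models
lm(mass ~ log(birth_year), data = starwars) |> tidy()
lm(mass ~ log_age_fix + age_flag, data = flagged) |> tidy()
```

The point estimates on `log(birth_year)` and `log_age_fix` should agree, because the flag absorbs the mean of the filled-in group.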
--- class: inverse, middle name: nonlinear # Modeling nonlinear relationships --- # Some preliminaries Load the London Airbnb data we used in the Exploratory Analysis lecture, and take a 5% random sample: ```r set.seed(123) london = read_csv("https://osf.io/ey5p7/download") london_sample = london |> mutate(ln_price = log(price)) |> filter(!is.na(price)) |> slice_sample(prop = 0.05) ``` --- # Modeling nonlinear relationships **How should we model the relationship between longitude and price?** -- First, we need to check the CEF with a quick binscatter: ```r binsreg(y = london_sample$ln_price, x = london_sample$longitude) ``` <img src="10-Regression_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> ``` ## Call: binsreg ## ## Binscatter Plot ## Bin/Degree selection method (binsmethod) = IMSE direct plug-in (select # of bins) ## Placement (binspos) = Quantile-spaced ## Derivative (deriv) = 0 ## ## Group (by) = Full Sample ## Sample size (n) = 2690 ## # of distinct values (Ndist) = 2690 ## # of clusters (Nclust) = NA ## dots, degree (p) = 0 ## dots, smoothness (s) = 0 ## # of bins (nbins) = 21 ``` --- # Modeling nonlinear relationships So then... **what's wrong with running the following regression?** ```r linear = feols(ln_price ~ longitude, data = london_sample) tidy(linear) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.15 0.0231 180. 0 ## 2 longitude -1.13 0.147 -7.67 2.35e-14 ``` -- The plot showed us that a linear specification of longitude is a bad fit. * It still tells us an average derivative. * But that's not an average that's very useful. Instead, let's choose a **nonlinear specification** that gets us closer to the true CEF. -- 1. Bin regression (step functions). 2. Polynomial regression. 3. Piecewise functions (splines). --- # 1. Bin regression (step functions) Just like a binscatter. Use a regression to estimate means by quantile: ```r binned = feols(ln_price ~ factor(ntile(longitude, 10)), data = london_sample) tidy(binned) ``` ``` ## # A tibble: 10 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.14 0.0408 102. 0 ## 2 factor(ntile(longitude, 10))2 0.263 0.0577 4.55 5.51e- 6 ## 3 factor(ntile(longitude, 10))3 0.394 0.0577 6.83 1.06e-11 ## 4 factor(ntile(longitude, 10))4 0.444 0.0577 7.70 1.96e-14 ## 5 factor(ntile(longitude, 10))5 0.322 0.0577 5.58 2.62e- 8 ## 6 factor(ntile(longitude, 10))6 0.206 0.0577 3.57 3.60e- 4 ## 7 factor(ntile(longitude, 10))7 0.150 0.0577 2.60 9.46e- 3 ## 8 factor(ntile(longitude, 10))8 0.0692 0.0577 1.20 2.30e- 1 ## 9 factor(ntile(longitude, 10))9 -0.115 0.0577 -1.99 4.63e- 2 ## 10 factor(ntile(longitude, 10))10 -0.213 0.0577 -3.70 2.23e- 4 ``` **How do we interpret these coefficients?** --- # 1. Bin regression (step functions) To visualize the estimated regression model, generate **fitted values** using `predict`: ```r binned_predict = predict(binned, data = london_sample, interval = "confidence") binned_fitted = cbind(london_sample, binned_predict) head(binned_predict) ``` ``` ## fit se.fit ci_low ci_high ## 1 4.213814 0.04077622 4.133858 4.293770 ## 2 3.931437 0.04077622 3.851481 4.011394 ## 3 4.466514 0.04077622 4.386558 4.546470 ## 4 4.538370 0.04077622 4.458414 4.618326 ## 5 4.213814 0.04077622 4.133858 4.293770 ## 6 4.144617 0.04077622 4.064661 4.224573 ``` --- # 1. 
Bin regression (step functions) ```r ggplot(binned_fitted, aes(x = longitude)) + theme_light() + geom_point(aes(y = ln_price), alpha = 0.1) + geom_ribbon(aes(ymin = ci_low, ymax = ci_high), fill = "coral") + geom_line(aes(y = fit), size = 1, color = "orangered4") ``` <img src="10-Regression_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" /> --- # 1. Bin regression (step functions) **Pros:** - Extremely flexible: Gets us close to the true CEF without having to specify a functional form. **Cons:** (Why might they not meet our needs?) -- - Estimates may be too noisy (high variance). - Fitted values have sudden jumps even if the true DGP is smooth. --- # 2. Polynomial regression Another approach is **polynomial regression.** The simplest is quadratic -- just two terms for longitude. ```r poly2 = feols(ln_price ~ longitude + longitude^2, data = london_sample) tidy(poly2) ``` ``` ## # A tibble: 3 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.05 0.0245 165. 0 ## 2 longitude -3.67 0.280 -13.1 3.82e-38 ## 3 I(longitude^2) -9.24 0.872 -10.6 9.56e-26 ``` These terms are hard to interpret until we visualize the model. --- # 2. Polynomial regression The quadratic model: <img src="10-Regression_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression Cubic: <img src="10-Regression_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression Quartic: <img src="10-Regression_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> --- # 2. Polynomial regression **Pros:** - Flexible -- easily captures many types of smooth nonlinear relationships. - Parsimonious -- fewer coefficients to estimate = lower variance. **Cons:** (What do you think?) -- - Not so good if CEF is not smooth. - Poorly behaved in the tails. --- # 3. Piecewise functions (splines) Piecewise linear (1 knot at `\(-0.17\)`): <img src="10-Regression_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" /> --- # 3. Piecewise functions (splines) Piecewise cubic (7 knots): <img src="10-Regression_files/figure-html/unnamed-chunk-19-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle name: prediction # Using regression models for prediction --- # Using regression models for prediction Once we estimate a regression model, what might we want to do with it? 1. Learn about the **relationship** between variables *(descriptive analysis, causal inference)*. 2. Use it to **predict** the outcome variable in new data *(predictive analysis)*. Both of these activities can involve calculating *predicted values* from the model... but we need to be careful. --- # Point 1: Counterfactuals != prediction **Common point of confusion:** "Isn't the point of estimating a causal effect so that we can use it to *predict* what will happen?" - The effect of legalizing cannabis on police arrests. - The effect of climate change on economic growth. Yes! But: In causal inference, we are only making a prediction **holding everything else constant.** - We are never claiming to predict what the future looks like (the overall level of the outcome).
We only want to make a *counterfactual projection:* - How different the outcome will be **relative to what it would be without the treatment.** In predictive analysis, we *do* want to predict the value of the outcome. --- # Point 2: Confidence/prediction intervals **Confidence intervals** characterize uncertainty about the **relationships.** - If we gathered more data, where would the "true" line of best fit lie? <img src="10-Regression_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" /> --- # Point 2: Confidence/prediction intervals **Prediction intervals** characterize uncertainty about **individual observations.** - If we use this model to predict the outcome for new data points, how close would those predictions be to the truth? <img src="10-Regression_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" /> --- class: inverse, middle name: functions # Econometrics packages in R --- # Robust standard errors `lm()` uses classical/homoskedastic standard errors, which you basically should **never** use. You can obtain heteroskedasticity-consistent "robust" standard errors using `estimatr::lm_robust()`. ```r ols1_robust = lm_robust(mass ~ height, data = starwars) ols1_robust ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower ## (Intercept) -11.4868157 24.43150661 -0.470164 6.400318e-01 -60.4100638 ## height 0.6240033 0.08824021 7.071644 2.417174e-09 0.4473053 ## CI Upper DF ## (Intercept) 37.4364324 57 ## height 0.8007013 57 ``` --- # Clustered standard errors If you haven't learned about clustering yet... You want to **cluster** your standard errors whenever your observations are correlated within unit/group/geography (i.e., not fully independent draws from a population).  Clustering is for cross-sectional or panel data. (Time series data calls for other solutions.) </br> .small[Image from xkcd ([source](https://xkcd.com/2533/)) and is not included in this resource's CC license.] --- # Clustered standard errors To cluster by `homeworld`: ```r lm_robust(mass ~ height, data = starwars, clusters = homeworld) ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower ## (Intercept) -6.4571556 30.4270813 -0.2122174 0.8376015708 -77.4711944 ## height 0.5959504 0.1066112 5.5899430 0.0004845487 0.3508593 ## CI Upper DF ## (Intercept) 64.5568831 7.485115 ## height 0.8410415 8.144022 ``` --- # Fixed effect regressions Most "real" regressions you run in economics will involve fixed effects of some kind (i.e., many group- or unit-specific indicators). `fixest::feols()` **should be your new best friend.** - It can handle large numbers of fixed effects quickly and quietly. - It has tons of options and is *orders of magnitude* faster than alternatives (`lfe` in R or `reghdfe` in Stata). - It can also handle simpler regressions, so it deserves to be your default everyday regression function. 
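For instance, even a plain bivariate regression runs fine in `feols()`. A small sketch (assuming a recent version of `fixest`, where the `vcov` argument selects robust or clustered standard errors):

```r
# Everyday regression with heteroskedasticity-robust SEs
feols(mass ~ height, data = starwars, vcov = "hetero")

# Same regression with SEs clustered by homeworld
feols(mass ~ height, data = starwars, vcov = ~homeworld)
```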
Syntax to include species FE: `$$Mass_{i} = \beta_0 + \beta_1 Height_{i} + \lambda_{species} + u_i$$` ```r ols_fe = feols(mass ~ height | species, data = starwars) ``` --- # Regression tables Get **quick regression tables** with `fixest::etable()`: ```r etable(ols_fe, ols_fe) ``` ``` ## ols_fe ols_fe.1 ## Dependent Var.: mass mass ## ## height 0.9256*** (0.0191) 0.9256*** (0.0191) ## Fixed-Effects: ------------------ ------------------ ## species Yes Yes ## _______________ __________________ __________________ ## S.E.: Clustered by: species by: species ## Observations 56 56 ## R2 0.99645 0.99645 ## Within R2 0.61556 0.61556 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` </br> Try package `modelsummary` for publication-quality tables that you can output with formatting to LaTeX, Markdown, etc. --- # Other models in R Instrumental variables (IV) regressions: - `fixest::feols()` (see, your new best friend!) - Syntax: `y ~ controls | fe | endogenous ~ instrument`. - `estimatr::iv_robust()` - `ivreg::ivreg()` Logit regression: - `glm(formula, data, family = binomial(link = "logit"))` - `feglm(formula, data, family = binomial(link = "logit"))` - To get marginal effects, try package `mfx`. "glm" stands for generalized linear models. --- # Summary 1. [Basic regression in R](#basics) 1. [Indicator and interaction terms](#indicators) 1. [Handling missing data](#missing) 1. [Modeling nonlinear relationships](#nonlinear) 1. [Using regression models for prediction](#prediction) 1. [Econometrics packages in R](#functions) 1. [Appendix: Interpreting coefficients](#interpret) --- class: inverse, middle name: interpret # Appendix: Interpreting coefficients --- # Interpreting regression coefficients `$$y_i = \alpha + \beta x_{i} + u_i$$` - `\(\beta\)`: The "effect" of `\(x\)` on `\(y\)`. - How much `\(y\)` changes in response to a one-unit change in `\(x\)`. - The average derivative: `\(dy/dx\)`. - Not necessarily causal! Only if you can assume away selection bias. * `\(\alpha\)`: The y-intercept. - The average value of `\(y\)` when `\(x = 0\)`. ```r lm(mass ~ height, no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -31.3 12.8 -2.44 1.79e- 2 ## 2 height 0.613 0.0720 8.51 1.14e-11 ``` Put into a complete sentence with units: -- - "For every **1 cm** increase in height, average mass increases by **0.6 kg.**" --- # Interpreting log-transformed variables `$$\log(y_i) = \alpha + \beta x_{i} + u_i$$` ```r lm(log(mass) ~ height, no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 2.30 0.164 14.0 5.49e-20 ## 2 height 0.0111 0.000921 12.1 3.45e-17 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** - Here: "A **1 cm** change in height is associated with a **0.01 log unit** change in mass." -- But log units have a convenient **proportional change** or **percentage change** interpretation: - "A **1-cm** change in height is associated with a **0.01 times** change in mass." - "A **1-cm** change in height is associated with a **1-percent** change in mass." --- # Interpreting log-transformed variables **Why?** Here's the algebra: Original values `\(x_0\)` and `\(y_0\)`. Percentage change in `\(y\)` is `\(\% \Delta y = \Delta y / y_0\)`. 
Model: `$$\ln(y_0) = \alpha + \beta x_0$$` `$$\ln(y_0 + \Delta y) = \alpha + \beta (x_0 + 1)$$` Solve for `\(\% \Delta y\)`: `$$\ln(y_0 + \Delta y) = \ln(y_0) + \beta$$` `$$\ln(\frac{y_0 + \Delta y}{y_0}) = \beta$$` `$$\ln(1 + \% \Delta y) = \beta$$` Using the approximation `\(\ln(a+1) \approx a\)` for small `\(a\)`: `$$\beta \approx \% \Delta y.$$` --- # Interpreting log-transformed variables `$$y_i = \alpha + \beta \log(x_{i}) + u_i$$` ```r lm(mass ~ log(height), no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -341. 53.1 -6.43 3.03e- 8 ## 2 log(height) 81.2 10.3 7.86 1.31e-10 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** -- - Here: "A **1 log unit** change in height is associated with an **81-kg** change in mass." -- - "A **100%** change in height is associated with an **81-kg** change in mass." - "A **10%** change in height is associated with an **8.1-kg** change in mass." --- # Interpreting log-transformed variables `$$\log(y_i) = \alpha + \beta \log(x_{i}) + u_i$$` ```r lm(log(mass) ~ log(height), no_jabba) |> tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -3.68 0.630 -5.84 2.73e- 7 ## 2 log(height) 1.54 0.123 12.6 5.97e-18 ``` Still true: **"A one-unit change in `\(x\)` is associated with a `\(\beta\)`-unit change in `\(y\)`."** -- - Here: "A **1 log unit** change in height is associated with a **1.5 log unit** change in mass." -- - "A **100%** change in height is associated with a **1.5 times** change in mass." - "A **100%** change in height is associated with a **150 percent** change in mass." -- - "A **10%** change in height is associated with a **15 percent** change in mass." - "A **1%** change in height is associated with a **1.5 percent** change in mass."
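A quick numerical check of this approximation, using the log-log model above (a sketch; the 10% increase is just an example):

```r
# Exact vs. approximate percent change in mass for a 10% increase in height
b = coef(lm(log(mass) ~ log(height), data = no_jabba))[["log(height)"]]
c(exact_pct  = (1.10^b - 1) * 100,  # exact: mass is multiplied by 1.10^b
  approx_pct = b * 10)              # approximation: beta times the percent change in height
```

For small percentage changes the two are nearly identical; for a 100% change in height the approximation is much rougher.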