class: center, middle, inverse, title-slide # Regression with .mono[R] ## EC 425/525, Lab 4 ### Edward Rubin ### 03 May 2019 --- class: inverse, middle # Prologue --- name: schedule # Schedule ## Last time 1. .mono[RStudio] basics 2. Getting data in and out of .mono[R]. ## Today Regression! --- name: review # Review ## Data i/o `readr` and `haven` have you covered for most of your i/o needs. ## Best practices 1. Write code in .mono[R] scripts. Troubleshoot in .mono[RStudio]. Then run the scripts. 1. Comment your code. (`# This is a comment`) 1. Name objects and variables with intelligible, standardized names. - .hi-purple[BAD] `ALLCARS`, `Vl123a8`, `a.fun`, `cens.12931`, `cens.12933` - .hi-pink[GOOD] `unique_cars`, `health_df`, `sim_fun`, `is_female`, `age` 1. Set seeds when generating randomness, _e.g._, `set.seed(123)`. 1. Parallelize when possible. (Packages: `parallel`, `purrr`, `foreach`, *etc.*) 1. Use projects in .mono[RStudio] (next). And organize your projects. --- name: reg class: inverse, middle # Regression in .mono[R] --- name: comics class: clear, middle, center <img src="coffee_economics.jpeg" width="85%" style="display: block; margin: auto;" /> --- class: clear, middle, center <img src="coffee_original.jpeg" width="85%" style="display: block; margin: auto;" /> --- class: clear, middle .b[Original credit] Tommy Siegel [@TommySiegel](https://twitter.com/TommySiegel) .b[Econ update] David Clingingsmith [@dclingi](https://twitter.com/dclingi) --- layout: true # Regression in .mono[R] --- name: lm ## The default option: `lm()` .mono[R]'s `base`.super[.pink[†]] option for estimating .hi[l]inear regression .hi[m]odels is `lm()`. .footnote[.pink[†] `base` is .mono[R]'s basic (default) package—loaded on opening. <br>.pink[††] You can remove the intercept by adding `-1` into the formula, _e.g._, `lm(y ~ -1 + x)`.] 
You use a formula to specify your linear regression model in `lm()`, _e.g._, `lm(y ~ x)`

- Estimates `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` (.mono[R] automatically includes an intercept).super[.pink[††]]
- using the data stored in the objects `y` and `x`.

`lm(y ~ x, data = amazing_df)`

- Estimates `\(y_i = \beta_0 + \beta_1 x_i + u_i\)`
- using the variables (columns) `y` and `x` from the object `amazing_df`.

---

## More `lm()`

Need to include more variables? Easy.

`lm(y ~ x1 + x2 + x3, data = some_df)`

- Estimates `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i\)`
- referencing the object `some_df` for the named variables.

---

## Even more `lm()`

Do you want to transform/interact variables? Also easy: use `I()`.

`lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2), data = poly_df)`

- Estimates `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i}^2 + \beta_4 x_{2i}^2 + \beta_5 x_{1i} x_{2i} + u_i\)`
- using variables named in object `poly_df`
- or created via `I()`

.note[Note] The following are equivalent

- `lm(y ~ x1 + x2 + I(x1*x2))`
- `lm(y ~ x1 + x2 + x1:x2)`
- `lm(y ~ x1*x2)`

---

name: transformations

## Transforming variables with `lm()`

Notice that in our call

`lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2), data = poly_df)`

we did not need to create `\(x_1^2\)`, `\(x_2^2\)`, and `\(x_1\times x_2\)` in the dataset.

.mono[R] will do the calculation (as long as `x1` and `x2` exist somewhere).

This is true for any transformation of variables/objects.

- Math./stat. transformations: `I(x^2)`, `I(x/3)`, `I((x - mean(x))/sd(x))`
- Log/exponential transformations: `log(x)`, `exp(x)`
- Indicators: `I(x < 100)`, `I(x == "Oregon")`

---

name: lalonde

## Need data

Before we can go any further, we need data.

Let's use data from .hi[[LaLonde (1986)](http://people.hbs.edu/nashraf/LaLonde_1986.pdf)].
This (famous) paper compared experimental and non-experimental estimates of the effect of a randomized jobs program called the National Supported Work Demonstration (NSW).

The data are [available online](http://users.nber.org/~rdehejia/data/nswdata3.html) as a `.dta` file.

```r
# Load 'pacman' (provides p_load()), in case it is not already loaded
library(pacman)
# Load the 'haven' package
p_load(haven)
# Load treatment data
lalonde_df <- read_dta("http://users.nber.org/~rdehejia/data/nsw.dta")
```

---
layout: false
class: clear, center, middle
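This slide appears to have lost its figure. A minimal sketch that could fill it — a histogram of 1978 real earnings (`re78`) in `lalonde_df` — assuming `ggplot2` is available (the original figure is unknown; this is an illustrative stand-in):

```r
# Load 'ggplot2' (assumes 'pacman' is loaded)
p_load(ggplot2)
# Histogram of 1978 real earnings
ggplot(data = lalonde_df, aes(x = re78)) +
  geom_histogram(bins = 40) +
  labs(x = "Real earnings in 1978", y = "Count")
```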
--- layout: false class: clear, center, middle
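This slide also appears to have lost its figure. A hypothetical sketch comparing the distribution of `re78` across treatment groups, assuming `ggplot2` is available via `p_load()` (illustrative only; not necessarily the original plot):

```r
# Load 'ggplot2' (assumes 'pacman' is loaded)
p_load(ggplot2)
# Density of 1978 real earnings by treatment status
ggplot(data = lalonde_df, aes(x = re78, fill = as.factor(treat))) +
  geom_density(alpha = 0.5) +
  labs(x = "Real earnings in 1978", fill = "Treated")
```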
---
layout: true

# Regression in .mono[R]

---

## Output from `lm()`

Back to `lm()`. The information that `lm()` prints to screen is underwhelming.

```r
lm(re78 ~ treat, data = lalonde_df)
```

```
#> 
#> Call:
#> lm(formula = re78 ~ treat, data = lalonde_df)
#> 
#> Coefficients:
#> (Intercept)        treat
#>      5090.0        886.3
```

But there's a lot more under the hood.

---

## Hidden information

Let's save the output of our call to `lm()`.

```r
est_lalonde <- lm(re78 ~ treat, data = lalonde_df)
```

What .hi-slate[class] is `est_lalonde`?

```r
est_lalonde %>% class()
```

```
#> [1] "lm"
```

which means there's probably a lot more going on than what was printed.

---

Does it have .hi-slate[names]?

```r
est_lalonde %>% names()
```

```
#>  [1] "coefficients"  "residuals"     "effects"       "rank"
#>  [5] "fitted.values" "assign"        "qr"            "df.residual"
#>  [9] "xlevels"       "call"          "terms"         "model"
```

Can we .hi-slate[tidy] it?

```r
est_lalonde %>% tidy()
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    5090.      303.     16.8  9.54e-54
#> 2 treat           886.      472.      1.88 6.09e- 2
```

---
layout: false
class: clear, middle

You'll generally see folks take the `summary()` of an `lm` object.

---
class: clear

```r
est_lalonde %>% summary()
```

```
#> 
#> Call:
#> lm(formula = re78 ~ treat, data = lalonde_df)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -5976  -5090  -1519   3361  54332 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   5090.0      302.8  16.811   <2e-16 ***
#> treat          886.3      472.1   1.877   0.0609 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6242 on 720 degrees of freedom
#> Multiple R-squared:  0.004872, Adjusted R-squared:  0.003489
#> F-statistic: 3.525 on 1 and 720 DF,  p-value: 0.06086
```

---
layout: true

# Regression in .mono[R]

---

## `summary()`

Interestingly, `summary()` contains additional information.
```r
est_lalonde %>% names()
```

```
#>  [1] "coefficients"  "residuals"     "effects"       "rank"
#>  [5] "fitted.values" "assign"        "qr"            "df.residual"
#>  [9] "xlevels"       "call"          "terms"         "model"
```

```r
est_lalonde %>% summary() %>% names()
```

```
#>  [1] "call"          "terms"         "residuals"     "coefficients"
#>  [5] "aliased"       "sigma"         "df"            "r.squared"
#>  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
```

---

## `tidy()`

That said, `summary()`'s output is a bit overwhelming.

As we saw, the output from `tidy()` contained everything we generally want.

```r
est_lalonde %>% tidy()
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    5090.      303.     16.8  9.54e-54
#> 2 treat           886.      472.      1.88 6.09e- 2
```

---

## Non-standard standard errors

.qa[Q] Which estimator does `lm()` use for its standard errors?

.qa[A] Spherical/classical/homoskedastic.

.qa[Q] What if we want something else?

---

name: outside

## Outside options

.qa[Q] What if we want something else?

.qa[A] This is why you'll end up replacing `lm()` with one of two different linear-regression packages.

1. `felm()` from the [`lfe`](https://cran.r-project.org/web/packages/lfe/index.html) package is incredibly fast with high-dimensional fixed effects, easily implements IV/2SLS, and offers heteroskedasticity- and cluster-robust standard-error estimators.

1. `lm_robust()` from the [`estimatr`](https://declaredesign.org/r/estimatr/) package is fast, has a sister function named `iv_robust()` for IV/2SLS, and allows for a wide range of heteroskedasticity- and cluster-robust standard-error estimators.

Both packages keep the same `y ~ x` formula syntax (plus additional features).

---

name: estimatr

## `estimatr`

Let's check out `lm_robust()` from `estimatr`.

We still need to input a `formula` and `data`, but now we have options for the type of standard-error estimator (`se_type`).

> The sort of standard error sought.
> If `clusters` is not specified the options are "HC0", "HC1" (or "stata", the equivalent), "HC2" (default), "HC3", or "classical". If `clusters` is specified the options are "CR0", "CR2" (default), or "stata". Can also specify "none", which may speed up estimation of the coefficients.

.grey-light[`estimatr` documentation for `lm_robust()`]

---

## `estimatr`

Now for the magic.

```r
# Load 'estimatr' package
p_load(estimatr)
# Estimate
robust_lalonde <- lm_robust(re78 ~ treat, data = lalonde_df)
# Tidy results
tidy(robust_lalonde)
```

```
#>          term  estimate std.error statistic      p.value   conf.low
#> 1 (Intercept) 5090.0483  277.3680 18.351243 5.335543e-62 4545.50153
#> 2       treat  886.3037  488.2045  1.815435 6.987292e-02  -72.17078
#>   conf.high  df outcome
#> 1  5634.595 720    re78
#> 2  1844.778 720    re78
```

---

name: predict

## Prediction

So what if we want to get the predictions from a regression?

You have *a lot* of options. Here are two (for different scenarios).

.note[Scenario 1] You want predictions for the original `\(X\)`,

```r
robust_lalonde$fitted.values
```

.note[Scenario 2] You want predictions (with prediction intervals) for new data,

```r
predict(
  object = robust_lalonde,
  newdata = new_df,
  interval = "prediction"
)
```

---

name: logit

## Other regressions

Ordinary least squares (OLS) is only one of many types of regressions.

If you can run one regression in .mono[R], you can run any regression in .mono[R]..super[.pink[†]]

.footnote[.pink[†] This is not to say that you should or that you'll know what you're doing. <br>.pink[††] Logit models are a popular nonlinear regression model for binary outcomes.]

_E.g._, logistic regression ("logit").super[.pink[††]] in .mono[R] uses the `glm()` function.
To estimate the probability of treatment on observables in LaLonde's data,

```r
glm(
  treat ~ age + education + black + hispanic + married,
* family = "binomial",
  data = lalonde_df
)
```

---
layout: false
class: clear

```r
# Estimate logit model
trt_logit <- glm(
  treat ~ age + education + black + hispanic + married,
  family = "binomial",
  data = lalonde_df
)
# Tidy logit results
trt_logit %>% tidy()
```

```
#> # A tibble: 6 x 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept) -0.888      0.611     -1.45    0.146
#> 2 age          0.00229    0.0119     0.193   0.847
#> 3 education    0.0606     0.0456     1.33    0.183
#> 4 black       -0.163      0.259     -0.629   0.529
#> 5 hispanic    -0.289      0.346     -0.835   0.404
#> 6 married      0.0600     0.210      0.285   0.775
```

(Not too surprising, since treatment was randomly assigned.)

---
layout: false
name: resources

# Additional resources

## There's more

General resources

- The [`swirl` package](https://swirlstats.com/) will teach you .mono[R] in .mono[R].

Regression with `estimatr`

- Getting started with `estimatr`
- A [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/estimatr.pdf) for `estimatr`

Logit models (logistic regression)

- [Examples and discussion](https://stats.idre.ucla.edu/r/dae/logit-regression/)
- [Interpreting results from logit models](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/)

---
layout: false

# Table of contents

.pull-left[

### Regression in .mono[R]

.smaller[

1. [Schedule](#schedule)
1. [Review](#review)
1. [Regression](#reg)
  - [Coffee comics](#comics)
  - [The `base` option: `lm`](#lm)
  - [Transformations](#transformations)
  - [LaLonde (1986)](#lalonde)
  - [Other options](#outside)
  - [`estimatr` and `lm_robust()`](#estimatr)
  - [Prediction](#predict)
  - [Logistic regression](#logit)
1. [More resources](#resources)

]]

---
exclude: true