MKT615: Data Storytelling for Marketers

class: center, middle, inverse, title-slide

.title[
# MKT615: Data Storytelling for Marketers
]
.subtitle[
## Lecture 6: Regressions
]
.author[
### Davide Proserpio
]
.institute[
### Marshall School of Business
]

---

# Table of contents

1. [Checklist](#check)

2. [Basics](#basics)

3. [Robust SE](#robust)

4. [Clustered SE](#cluster)

5. [Dummy variables](#dummy)

6. [Interactions](#interactions)

7. [Panel data](#fixed)
---
name: check

# Checklist

We will need  **data.table**, **tidyverse**, **ggplot2**
---
class: inverse, center, middle
name: basics

# Regression basics

---
# Linear models with `lm()`

- `lm()` is the main command to run linear regressions 
- Syntax: lm(y ~ x_1 + x_2 + ... + x_n, data = yourdataset)
- To summarize the output, you should use `summary`. Let's see an example the ggplot2's `diamonds` dataset.

---
# Linear models with `lm()` (cont.)

```r
  summary(lm(price ~ carat, data = diamonds))
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ carat, data = diamonds)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -18585.3   -804.8    -18.9    537.4  12731.7 
  ## 
  ## Coefficients:
  ##             Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
  ## carat        7756.43      14.07   551.4   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1549 on 53938 degrees of freedom
  ## Multiple R-squared:  0.8493,	Adjusted R-squared:  0.8493 
  ## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
  ```
---
# Linear models with `lm()` (cont.)

```r
  ols = lm(price ~ carat, data = diamonds) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ carat, data = diamonds)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -18585.3   -804.8    -18.9    537.4  12731.7 
  ## 
  ## Coefficients:
  ##             Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
  ## carat        7756.43      14.07   551.4   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1549 on 53938 degrees of freedom
  ## Multiple R-squared:  0.8493,	Adjusted R-squared:  0.8493 
  ## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
  ```

---
# Linear models with `lm()` (cont.)

You can apply operators to the variables, subset the data within `lm`, e.g.:

```r
  summary(lm(log(price) ~ log(carat), data = diamonds))
  summary(lm(log(price) ~ log(carat), data = diamonds %>% filter(cut == "Good")))
  ```
  
---
name: robust
# Robust standard errors

- `lm()` does not produce heteroskedasticity-consistent (HC) “robust” standard errors
- but `estimatr::lm_robust()` does:
  
  ```r
  ols_robust = lm_robust(price ~ carat, data = diamonds) # save the model
  summary(ols_robust)
  ```
  
  ```
  ## 
  ## Call:
  ## lm_robust(formula = price ~ carat, data = diamonds)
  ## 
  ## Standard error type:  HC2 
  ## 
  ## Coefficients:
  ##             Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper    DF
  ## (Intercept)    -2256      16.13  -139.9        0    -2288    -2225 53938
  ## carat           7756      25.40   305.3        0     7707     7806 53938
  ## 
  ## Multiple R-squared:  0.8493 ,	Adjusted R-squared:  0.8493 
  ## F-statistic: 9.323e+04 on 1 and 53938 DF,  p-value: < 2.2e-16
  ```

---
name: cluster
# Clustered standard errors

Useful when we have serial correlation (panel data). (Note that I am using a smaller dataset, starwars)

```r
  ols_cluster = lm_robust(mass ~ height, 
                data = starwars %>% filter(!is.na(homeworld)), clusters = homeworld)
  summary(ols_cluster)
  ```
  
  ```
  ## 
  ## Call:
  ## lm_robust(formula = mass ~ height, data = starwars %>% filter(!is.na(homeworld)), 
  ##     clusters = homeworld)
  ## 
  ## Standard error type:  CR2 
  ## 
  ## Coefficients:
  ##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper    DF
  ## (Intercept)   -6.457    30.4271 -0.2122 0.8376016 -77.4712   64.557 7.485
  ## height         0.596     0.1066  5.5899 0.0004845   0.3509    0.841 8.144
  ## 
  ## Multiple R-squared:  0.01295 ,	Adjusted R-squared:  -0.005328 
  ## F-statistic: 31.25 on 1 and 38 DF,  p-value: 2.079e-06
  ```
---
name: dummy
# Dummy variables

Strings are automatically converted to factors; otherwise you need to use `factors()`, e.g.:

```r
  diamonds_dt = data.table(diamonds)
  diamonds_dt[, cut := as.character(cut)]
  ols = lm(price ~ cut, data = diamonds_dt) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ cut, data = diamonds_dt)
  ## 
  ## Residuals:
  ##    Min     1Q Median     3Q    Max 
  ##  -4258  -2741  -1494   1360  15348 
  ## 
  ## Coefficients:
  ##              Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept)   4358.76      98.79  44.122  < 2e-16 ***
  ## cutGood       -429.89     113.85  -3.776 0.000160 ***
  ## cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
  ## cutPremium     225.50     104.40   2.160 0.030772 *  
  ## cutVery Good  -377.00     105.16  -3.585 0.000338 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 3964 on 53935 degrees of freedom
  ## Multiple R-squared:  0.01286,	Adjusted R-squared:  0.01279 
  ## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
  ```
  
---

# Dummy variables (cont.)

Change factor base level or levels order

```r
  diamonds_dt[, cut:=relevel(factor(cut), ref = "Ideal")]
  ols = lm(price ~ cut, data = diamonds_dt) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ cut, data = diamonds_dt)
  ## 
  ## Residuals:
  ##    Min     1Q Median     3Q    Max 
  ##  -4258  -2741  -1494   1360  15348 
  ## 
  ## Coefficients:
  ##              Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept)   3457.54      27.00 128.051  < 2e-16 ***
  ## cutFair        901.22     102.41   8.800  < 2e-16 ***
  ## cutGood        471.32      62.70   7.517  5.7e-14 ***
  ## cutPremium    1126.72      43.22  26.067  < 2e-16 ***
  ## cutVery Good   524.22      45.05  11.636  < 2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 3964 on 53935 degrees of freedom
  ## Multiple R-squared:  0.01286,	Adjusted R-squared:  0.01279 
  ## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
  ```

---

# Dummy variables (cont.)

Change factor base level or levels order

```r
  diamonds_dt[, cut := factor(cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))]
  ols = lm(price ~ cut, data = diamonds_dt) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ cut, data = diamonds_dt)
  ## 
  ## Residuals:
  ##    Min     1Q Median     3Q    Max 
  ##  -4258  -2741  -1494   1360  15348 
  ## 
  ## Coefficients:
  ##              Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept)   4358.76      98.79  44.122  < 2e-16 ***
  ## cutGood       -429.89     113.85  -3.776 0.000160 ***
  ## cutVery Good  -377.00     105.16  -3.585 0.000338 ***
  ## cutPremium     225.50     104.40   2.160 0.030772 *  
  ## cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 3964 on 53935 degrees of freedom
  ## Multiple R-squared:  0.01286,	Adjusted R-squared:  0.01279 
  ## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
  ```
  
---
name: interactions
# Interactions

- Full interaction: `*`
- Given vars `a` and `b`, `*` creates 3 vars: `a, b, a x b`

```r
  ols = lm(price ~ carat*depth, data = diamonds_dt) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ carat * depth, data = diamonds_dt)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -15039.2   -799.3    -18.1    539.6  12666.5 
  ## 
  ## Coefficients:
  ##              Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept) -7823.738    592.049 -13.215   <2e-16 ***
  ## carat       20742.600    567.672  36.540   <2e-16 ***
  ## depth          90.043      9.588   9.391   <2e-16 ***
  ## carat:depth  -210.075      9.187 -22.868   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1534 on 53936 degrees of freedom
  ## Multiple R-squared:  0.8521,	Adjusted R-squared:  0.8521 
  ## F-statistic: 1.036e+05 on 3 and 53936 DF,  p-value: < 2.2e-16
  ```

---
# Interactions (cont.)

- Simple interaction: `:`
- Given vars `a` and `b`, `:` creates 1 var: `a x b`

```r
  ols = lm(price ~ carat:depth, data = diamonds_dt) # save the model
  summary(ols)
  ```
  
  ```
  ## 
  ## Call:
  ## lm(formula = price ~ carat:depth, data = diamonds_dt)
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -20696.6   -802.8    -21.7    529.0  12805.6 
  ## 
  ## Coefficients:
  ##               Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept) -2215.1687    13.3261  -166.2   <2e-16 ***
  ## carat:depth   124.7268     0.2323   537.0   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1584 on 53938 degrees of freedom
  ## Multiple R-squared:  0.8424,	Adjusted R-squared:  0.8424 
  ## F-statistic: 2.883e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
  ```

---
name: fixed

# Panel data

- If you have panel data (especially large ones) and want to run a regression with one (or multiple) fixed effects, `lm()` is not ideal and slow (it requires a dummy for each fixed effect)
- There are packages dedicated to faster panel data analysis such as `lfe` and `fixest`
- In this class, we are going to see how `lfe` works, so let's install it and load it:
  
  ```r
  #install.packages("lfe")
  library(lfe)
  ```
  
  ```
  ## Loading required package: Matrix
  ```
  
  ```
  ## 
  ## Attaching package: 'Matrix'
  ```
  
  ```
  ## The following objects are masked from 'package:tidyr':
  ## 
  ##     expand, pack, unpack
  ```

---

# lfe::felm

- The function `felm` fits a linear model with multiple group fixed effects
- (It does additional things, such as Instrumental Variables regressions but we won't cover them)

- Syntax:

`felm(formula | fixed effects | IV part | which var to cluster on)`
  
---

# lfe::felm (cont.)

1. Simple linear model w/o fixed effects:
  
  ```r
  felm1 = felm(price ~ carat | 0 | 0 | 0, data = diamonds)
  summary(felm1)
  ```
  
  ```
  ## 
  ## Call:
  ##    felm(formula = price ~ carat | 0 | 0 | 0, data = diamonds) 
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -18585.3   -804.8    -18.9    537.4  12731.7 
  ## 
  ## Coefficients:
  ##             Estimate Std. Error t value Pr(>|t|)    
  ## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
  ## carat        7756.43      14.07   551.4   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1549 on 53938 degrees of freedom
  ## Multiple R-squared(full model): 0.8493   Adjusted R-squared: 0.8493 
  ## Multiple R-squared(proj model): 0.8493   Adjusted R-squared: 0.8493 
  ## F-statistic(full model):3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16 
  ## F-statistic(proj model): 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
  ```

---

# lfe::felm (cont.)

2. Simple linear model w/o fixed effects and robust standard errors:
  
  ```r
  felm2 = felm(price ~ carat | 0 | 0 | 0, data = diamonds)
  summary(felm2, robust = T)
  ```
  
  ```
  ## 
  ## Call:
  ##    felm(formula = price ~ carat | 0 | 0 | 0, data = diamonds) 
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -18585.3   -804.8    -18.9    537.4  12731.7 
  ## 
  ## Coefficients:
  ##             Estimate Robust s.e t value Pr(>|t|)    
  ## (Intercept) -2256.36      16.13  -139.9   <2e-16 ***
  ## carat        7756.43      25.40   305.4   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1549 on 53938 degrees of freedom
  ## Multiple R-squared(full model): 0.8493   Adjusted R-squared: 0.8493 
  ## Multiple R-squared(proj model): 0.8493   Adjusted R-squared: 0.8493 
  ## F-statistic(full model, *iid*):3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16 
  ## F-statistic(proj model): 9.326e+04 on 1 and 53938 DF, p-value: < 2.2e-16
  ```

---
# lfe::felm (cont.)

3. Adding fixed effects:
  
  ```r
  felm3 = felm(price ~ carat | cut | 0 | 0, data = diamonds)
  summary(felm3, robust = T)
  ```
  
  ```
  ## 
  ## Call:
  ##    felm(formula = price ~ carat | cut | 0 | 0, data = diamonds) 
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -17540.7   -791.6    -37.6    522.1  12721.4 
  ## 
  ## Coefficients:
  ##       Estimate Robust s.e t value Pr(>|t|)    
  ## carat  7871.08      24.88   316.3   <2e-16 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1511 on 53934 degrees of freedom
  ## Multiple R-squared(full model): 0.8565   Adjusted R-squared: 0.8565 
  ## Multiple R-squared(proj model): 0.8546   Adjusted R-squared: 0.8546 
  ## F-statistic(full model, *iid*):6.437e+04 on 5 and 53934 DF, p-value: < 2.2e-16 
  ## F-statistic(proj model): 1.001e+05 on 1 and 53934 DF, p-value: < 2.2e-16
  ```

--
Note that fixed effect are not reported!

---
# lfe::felm (cont.)

4. Adding fixed effects and cluster standard errors:
  
  ```r
  felm4 = felm(price ~ carat | cut | 0 | cut, data = diamonds)
  summary(felm4) # we don't need robust = T in this case
  ```
  
  ```
  ## 
  ## Call:
  ##    felm(formula = price ~ carat | cut | 0 | cut, data = diamonds) 
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -17540.7   -791.6    -37.6    522.1  12721.4 
  ## 
  ## Coefficients:
  ##       Estimate Cluster s.e. t value Pr(>|t|)    
  ## carat   7871.1        154.8   50.85 8.95e-07 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1511 on 53934 degrees of freedom
  ## Multiple R-squared(full model): 0.8565   Adjusted R-squared: 0.8565 
  ## Multiple R-squared(proj model): 0.8546   Adjusted R-squared: 0.8546 
  ## F-statistic(full model, *iid*):6.437e+04 on 5 and 53934 DF, p-value: < 2.2e-16 
  ## F-statistic(proj model):  2586 on 1 and 4 DF, p-value: 8.949e-07
  ```

---
# lfe::felm (cont.)

5. More fixed effects:
  
  ```r
  felm5 = felm(price ~ carat | cut + color + clarity | 0 | cut, data = diamonds)
  summary(felm5) # we don't need robust = T in this case
  ```
  
  ```
  ## 
  ## Call:
  ##    felm(formula = price ~ carat | cut + color + clarity | 0 | cut,      data = diamonds) 
  ## 
  ## Residuals:
  ##      Min       1Q   Median       3Q      Max 
  ## -16813.5   -680.4   -197.6    466.4  10394.9 
  ## 
  ## Coefficients:
  ##       Estimate Cluster s.e. t value Pr(>|t|)    
  ## carat   8886.1        169.7   52.38 7.95e-07 ***
  ## ---
  ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  ## 
  ## Residual standard error: 1157 on 53921 degrees of freedom
  ## Multiple R-squared(full model): 0.9159   Adjusted R-squared: 0.9159 
  ## Multiple R-squared(proj model):  0.91   Adjusted R-squared:  0.91 
  ## F-statistic(full model, *iid*):3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16 
  ## F-statistic(proj model):  2743 on 1 and 4 DF, p-value: 7.954e-07 
  ## *** Standard errors may be too high due to more than 2 groups and exactDOF=FALSE
  ```

---
# lfe:felm (cont.)

- `felm` uses an approximation algorithm to estimate a model with many fixed effects. The centering on the means is done with a tolerance which by default is 1e-8.
- So you can set the *tollerance* parameter epsilon (the smaller the epsilon, the more accurate are the estimates but the slower is the computation):
  
    `options(lfe.eps=1e-4) #faster results, less accurate`

- By default `lfe` use all the available cores, to limit them: `options(lfe.threads = 2)`

- If you are interested in learning more about `lfe`: https://cran.r-project.org/web/packages/lfe/lfe.pdf