The package fixest provides a family of functions to perform estimations with multiple fixed-effects, instrumental variables and provides clustered standard errors without the need to use a third-party package. The two main functions are feols for linear models and feglm for generalized linear models.

Installation and dataset ingestion

Install and load the package in order to use it.

# install.packages("fixest")
library(data.table)
library(fixest)

We will use for this class a dataset containing a sample of 12,834 individuals in the labor force extracted from the 2019 annual supplement of the 2019 US Current Population Survey.

dt <- fread("Data/cps_union_data.csv")

Cleaning

Before doing any analysis or regressions we must first check what kind of data we have been provided and make the appropriate pre-processing in order to have a “ready to regress” dataset. Usually, one would drop any columns that: i) have no variation among observations, ii) have too many missing values. Observations that eventually have a missing value for some variable should be treated case-by-case. Can you impute those values? Is that variable relevant for this regression? Should be allowed only complete cases in the dataset?

summary(dt)
##        V1             CPSID               CPSIDP               public_housing 
##  Min.   :    12   Min.   :2.017e+13   Min.   :20171200000302   Min.   :0.000  
##  1st Qu.: 40891   1st Qu.:2.017e+13   1st Qu.:20171204151101   1st Qu.:0.000  
##  Median : 84176   Median :2.018e+13   Median :20181200737602   Median :0.000  
##  Mean   : 86055   Mean   :2.018e+13   Mean   :20177009551114   Mean   :0.041  
##  3rd Qu.:131144   3rd Qu.:2.018e+13   3rd Qu.:20181204062606   3rd Qu.:0.000  
##  Max.   :179189   Max.   :2.019e+13   Max.   :20190307097802   Max.   :1.000  
##                                                                NA's   :8701   
##       age            female            race       marital_status 
##  Min.   :15.00   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :42.00   Median :0.0000   Median :1.000   Median :1.000  
##  Mean   :42.49   Mean   :0.4963   Mean   :1.557   Mean   :2.971  
##  3rd Qu.:54.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:6.000  
##  Max.   :85.00   Max.   :1.0000   Max.   :8.000   Max.   :6.000  
##                                                                  
##     veteran           employed   education     worked_last_year
##  Min.   :0.00000   Min.   :1   Min.   :  2.0   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:1   1st Qu.: 73.0   1st Qu.:1.0000  
##  Median :0.00000   Median :1   Median : 81.0   Median :1.0000  
##  Mean   :0.05756   Mean   :1   Mean   : 91.1   Mean   :0.9758  
##  3rd Qu.:0.00000   3rd Qu.:1   3rd Qu.:111.0   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1   Max.   :125.0   Max.   :1.0000  
##  NA's   :83                                                    
##  total_income_last_year wage_income_last_year own_farm_income_last_year
##  Min.   :      0        Min.   :      0       Min.   :   -987.0        
##  1st Qu.:  25009        1st Qu.:  23000       1st Qu.:      0.0        
##  Median :  44474        Median :  40000       Median :      0.0        
##  Mean   :  61237        Mean   :  55788       Mean   :    113.9        
##  3rd Qu.:  73005        3rd Qu.:  68000       3rd Qu.:      0.0        
##  Max.   :1755499        Max.   :1314999       Max.   :1099999.0        
##                                                                        
##  private_health_insurance    medicaid       class_of_worker
##  Min.   :0.0000           Min.   :0.00000   Min.   :21.00  
##  1st Qu.:1.0000           1st Qu.:0.00000   1st Qu.:21.00  
##  Median :1.0000           Median :0.00000   Median :21.00  
##  Mean   :0.8121           Mean   :0.08446   Mean   :21.98  
##  3rd Qu.:1.0000           3rd Qu.:0.00000   3rd Qu.:21.00  
##  Max.   :1.0000           Max.   :1.00000   Max.   :28.00  
##                                                            
##  class_of_worker_last_year     union           earnings     
##  Min.   : 0.00             Min.   :0.0000   Min.   :   1.0  
##  1st Qu.:22.00             1st Qu.:0.0000   1st Qu.: 480.0  
##  Median :22.00             Median :0.0000   Median : 794.0  
##  Mean   :22.18             Mean   :0.1152   Mean   : 995.6  
##  3rd Qu.:22.00             3rd Qu.:0.0000   3rd Qu.:1308.0  
##  Max.   :29.00             Max.   :1.0000   Max.   :2885.0  
##                                             NA's   :19

Taking a look at the summary table above you can readly see that public_housing has 8701 NA’s and we should drop it. Moreover, employed is useless for regression since it has no variation at all, drop it. Some other NA values are found in veteran and earnings but they are not numerous. In this case you must have in mind what kind of regression you are running. Suppose your goal is to estimate the causal effect of union membership/coverage (variable union) on weekly earnings (variable earnings). Obviously, you should drop any observations where earnings is missing. Also, there are two columns that according to the provided dictionary (dictionary.xlsx) are categorical: class_of_worker, class_of_worker_last_year (the name class should’ve hinted you) and both marital_status and race which you may want to recode to be binary. Let’s do all of that.

dt <- dt[!is.na(earnings)] # Keep only not NA in earnings
dt[, c("V1", "CPSID", "CPSIDP", "public_housing", "employed") := NULL] # drop whole columns
dt[, `:=`(
  marital_status = ifelse(marital_status %in% c(1, 2), 1, 0),
  race = ifelse(race == 1, 1, 0),
  age_2 = age^2
)]
cat_cols <- c("class_of_worker", "class_of_worker_last_year")
dt[, (cat_cols) := lapply(.SD, factor), .SDcols = cat_cols]

Linear Regressions

Simple regression

Now we are ready to perform a simple regression of earnings on union (a binary variable) and interpret it. Is this a causal regression? Why?

simple_reg <- feols(earnings~union, data = dt)

etable(simple_reg, cluster = ~class_of_worker_last_year)
##                       simple_reg
## Dependent Var.:         earnings
##                                 
## (Intercept)     976.5*** (19.40)
## union           165.7*** (15.80)
## _______________ ________________
## S.E.: Clustered by: class_of_w..
## Observations              12,815
## R2                       0.00544
## Adj. R2                  0.00537

Multiple regression

Now let’s add some control variables to our regression. The Mincerian equation stipulate that age, age squared, education and female should play a role in determining earnings, let’s also add race, marital_status and class_of_worker.

mult_reg <- feols(
  earnings~union+age+age_2+education+female+race+marital_status+class_of_worker,
  data = dt
)

etable(mult_reg, cluster = ~class_of_worker_last_year, drop = "class_of_worker")
##                            mult_reg
## Dependent Var.:            earnings
##                                    
## (Intercept)     -1,444.1*** (19.81)
## union              94.22*** (15.78)
## age                60.98*** (1.755)
## age_2           -0.6218*** (0.0211)
## education         13.11*** (0.3877)
## female            -334.7*** (11.27)
## race                 27.01* (9.142)
## marital_status      104.8** (23.24)
## _______________ ___________________
## S.E.: Clustered by: class_of_work..
## Observations                 12,815
## R2                          0.33918
## Adj. R2                     0.33866

Multiple estimations

The fixest package allows for multiple estimations (i.e. changing the dependent variable) at once. Suppose you want to regress not only earnings on union but also wage_income_last_year and total_income_last_year. You can do that in just one call to feols function.

setnames(dt, c("wage_income_last_year", "total_income_last_year"),
         c("wage", "total"))
mult_est <- feols(c(earnings, wage, total)~union, data = dt)
etable(mult_est, se = "hetero")
##                          model 1             model 2             model 3
## Dependent Var.:         earnings                wage               total
##                                                                         
## (Intercept)     976.5*** (6.816) 55,188.6*** (673.4) 60,643.3*** (735.1)
## union           165.7*** (17.82) 5,362.0** (1,650.7) 5,254.6** (1,766.1)
## _______________ ________________ ___________________ ___________________
## S.E. type       Heterosked.-rob. Heteroskedast.-rob. Heteroskedast.-rob.
## Observations              12,815              12,815              12,815
## R2                       0.00544             0.00059             0.00048
## Adj. R2                  0.00537             0.00052             0.00040

Step-wise estimations

You can also incrementally increase the complexity of your model by introducing new regressors. To estimate multiple RHS, you need to use a specific set of functions, the stepwise functions. There are four of them: sw, sw0, csw, csw0.

  • sw: this function is replaced sequentially by each of its arguments,
  • sw0: starts with the empty element,
  • csw: it stands for cumulative stepwise. It adds to the formula each of its arguments sequentially,
  • csw0: cumulative, but starting with the empty element.
step_wise <- feols(earnings~union+csw0(age, age_2),
                   data = dt)
etable(step_wise)
##                          model 1           model 2             model 3
## Dependent Var.:         earnings          earnings            earnings
##                                                                       
## (Intercept)     976.5*** (6.722)  598.0*** (19.13)   -852.2*** (48.65)
## union           165.7*** (19.79)  132.9*** (19.51)    100.2*** (18.80)
## age                              8.998*** (0.4269)    83.45*** (2.348)
## age_2                                              -0.8467*** (0.0263)
## _______________ ________________ _________________ ___________________
## S.E. type               Standard          Standard            Standard
## Observations              12,815            12,815              12,815
## R2                       0.00544           0.03878             0.11076
## Adj. R2                  0.00537           0.03863             0.11055

More details are provided in fixest’s vignette on multiple estimations.

Instrumental Variables

As we all know, if we were interested in the causal effect of education attainment on earnings, we would not be able to run a multiple regression like above and interpret it as causal, education is endogenous with relation to earnings. Thus, we may resort to instrumental variable - IV - estimation to overcome this pitfall. Suppose, just for the sake of this example, that veteran status is a potential instrument to education. How can you estimate a two-stage least squares regression with fixest?

First notice that veteran still have NA values in it. Now you have to decide wether to impute values on it or just drop those observations. Let’s proceed dropping the NA values.

dt2 <- dt[!is.na(veteran)]
iv_reg <- feols(earnings~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran,
  data = dt2)
etable(iv_reg, keep = c("education", "union"), fitstat = ~.+ivf+ivf.p+wh+wh.p)
##                                                 iv_reg
## Dependent Var.:                               earnings
##                                                       
## education                               -10.61 (19.36)
## union                                  81.91** (25.13)
## ______________________________________ _______________
## S.E. type                                     Standard
## Observations                                    12,732
## R2                                            -0.20462
## Adj. R2                                       -0.20556
## F-test (1st stage), education                   3.3921
## F-test (1st stage), p-value, education         0.06553
## Wu-Hausman                                      2.7669
## Wu-Hausman, p-value                            0.09626

So, what have we done here? First notice how the endogenous variable does not appear on the RHS of the formula, it goes into a formula of its own after the vertical | bar delimiter. The formula for endogenous variables and its instruments takes the form: endogenous ~ instrument.

On the table presented above we opt to show only two coefficents by using the argument keep, education, our instrumented variable and union. Whenever using an IV approach to estimate something, it’s useful to perform tests of weak instrument and exclusion restriction. Here we are showing the first-stage F-test to assess instrument weakness and the Wu-Hausman endogeneity test where H0 is the absence of endogeneity of the instrumented variables. No wonder veteran status is a weak instrument, its correlation to education is very small at 0.0170532.

Fixed-effects

Let’s change the dataset, enough of earnings! Who wants to make money after all … In International Trade we have what is called gravity models in which we are interested in finding out the negative effect of geographic distance on trade flows. Gravity models typically include many fixed effects to account for product, period, exporter country and importer country. A very simple gravity equation would take the form:

\[ \log(trade_{nipt})=\beta \log(\text{distance}_{ni})+\gamma_n+\gamma_i+\gamma_p+\gamma_t+\varepsilon_{nipt} \] where we have the indexes \(\{n, i, p, t\}\) respectively for the importer, exporter, product and period.

data("trade")
trade <- as.data.table(trade)
head(trade)
##    Destination Origin Product Year  dist_km    Euros
## 1:          LU     BE       1 2007 139.5719  2966697
## 2:          BE     LU       1 2007 139.5719  6755030
## 3:          LU     BE       2 2007 139.5719 57078782
## 4:          BE     LU       2 2007 139.5719  7117406
## 5:          LU     BE       3 2007 139.5719 17379821
## 6:          BE     LU       3 2007 139.5719  2622254
gravity <- feols(
  log(Euros) ~ log(dist_km) | Destination + Origin + Product + Year, 
  data = trade)
etable(gravity, cluster = ~Origin + Product)
##                            gravity
## Dependent Var.:         log(Euros)
##                                   
## log(dist_km)    -2.170*** (0.1603)
## Fixed-Effects:  ------------------
## Destination                    Yes
## Origin                         Yes
## Product                        Yes
## Year                           Yes
## _______________ __________________
## S.E.: Clustered  by: Orig. & Prod.
## Observations                38,325
## R2                         0.70558
## Within R2                  0.21932

Like the IV approach, we provide the fixed effects in a formula of its own, after the vertical bar. But notice one very important distinction, the formula for fixed-effects is one sided, that is, only the fixed-effects, put together by + signs, are included. Also, you are able to cluster your standard-errors by more than one variable.

Difference-in-Differences

The trade dataset is a panel of several european countries stacked along the years, from 2007 to 1016. Just for demonstration purposes let’s assume the following scenario: countries DE, FR and GB implement a reduction in tariffs, starting in 2010. With the given assumptions, it is possible to estimate the effect of such a measure on trade flows by DID. We have three countries in the treatment group and the treatment happens from 2010 onwards. Let’s create two dummy variables for group and period and then estimate our DID model1.

trade[, `:=`(
  d_g = ifelse(Destination %in% c("DE", "FR", "GB"), 1, 0),
  d_t = ifelse(Year >= 2010, 1, 0)
)]
trade[, d_treat := d_g*d_t]
did <- feols(log(Euros) ~ d_treat+log(dist_km)| Destination + Year,
             data = trade)
etable(did, cluster = ~Origin + Product)
##                                did
## Dependent Var.:         log(Euros)
##                                   
## d_treat            0.0078 (0.0397)
## log(dist_km)    -2.387*** (0.5411)
## Fixed-Effects:  ------------------
## Destination                    Yes
## Year                           Yes
## _______________ __________________
## S.E.: Clustered  by: Orig. & Prod.
## Observations                38,325
## R2                         0.27227
## Within R2                  0.18419

A DID specification can be interpreted and run as a canonical two-way fixed effects (TWFE). You can add control variables, like log(dist_km) either to make the unconfoundedness assumption more plausible or to improve the precision of your estimator.

Event Studies

What if we want to run the entire “event-study” where we estimate the treatment effect for the whole period at hand, even before the actual treatment takes place (placebo-like regression)? In a regression context, TWFE essentially amounts to an interaction between our treatment group dummy and Year variables. This is easily done using the i(fact_var, num_var, reference) syntax:

es <- feols(log(Euros)~log(dist_km) + i(Year, d_g, "2009")|Destination + Year,
            data = trade)
etable(es, cluster = ~Origin + Product)
##                                   es
## Dependent Var.:           log(Euros)
##                                     
## log(dist_km)      -2.387*** (0.5411)
## d_g x Year = 2007    0.0742 (0.0778)
## d_g x Year = 2008    0.1161 (0.0663)
## d_g x Year = 2010   -0.0918 (0.0600)
## d_g x Year = 2011    0.0323 (0.0552)
## d_g x Year = 2012   0.0750* (0.0320)
## d_g x Year = 2013   0.1080* (0.0488)
## d_g x Year = 2014  0.0980** (0.0310)
## d_g x Year = 2015   0.1535* (0.0587)
## d_g x Year = 2016  0.1240** (0.0334)
## Fixed-Effects:    ------------------
## Destination                      Yes
## Year                             Yes
## _________________ __________________
## S.E.: Clustered    by: Orig. & Prod.
## Observations                  38,325
## R2                           0.27235
## Within R2                    0.18428

Here we set the last year before treatment takes place, 2009, as the reference, thus, its coefficient is set to zero and is not reported. Before that year we have placebo regressions, where one should not expect any effect. The coefficients post-reference are treatment effect estimations for each of those years, and the event-study may uncover dynamic effects2.

The Year variable is taken as a factor (i.e. categorical) and interacted with the treatment group dummy, then the coefficient associated with that interaction is interpreted as the differential of the dependent variable between the treatment and control group on that specific year.

You can visualise the results with the command iplot to see only the interacted coefficients and their confidence intervals.

iplot(es, cluster = ~Origin + Product)

Tidying up multiple regressions

You can use data frames (and data.table in particular) to store models, lists, and other data frames in what is called a list-column. This will be a nested data.table, where some columns may be the usual vectors but the list-column will hold a list of more complex data structures. The table will store the products of your data analysis in an organized way, and you can manipulate the table with your familiar tools.

Imagine a data.table created to hold many parameterized models. One column to hold the dependent variable, another for regressors and one more for instrumental variables. Each row will have all the components related to a regression, therefore, the natural place to hold the results of such regression is in the same data.table, there enters the list-column.

r_ed <- "union+education+age+age_2+female+race+marital_status+class_of_worker"
r_iv <- sub("education\\+", "", r_ed)
models_dt <- CJ(
  y = c("earnings", "log(earnings)"),
  x = c("union", r_ed, r_iv))

models_dt[, iv := ifelse(x == r_iv, "|education~veteran", "")]
models_dt[, form := paste0(y, "~", x, iv)]
models_dt[, reg := lapply(form, function(x){
  feols(as.formula(x),
        data = dt2,
        se = "hetero",
        lean = TRUE)
})]
models_dt
##                y
## 1:      earnings
## 2:      earnings
## 3:      earnings
## 4: log(earnings)
## 5: log(earnings)
## 6: log(earnings)
##                                                                       x
## 1:                                                                union
## 2:           union+age+age_2+female+race+marital_status+class_of_worker
## 3: union+education+age+age_2+female+race+marital_status+class_of_worker
## 4:                                                                union
## 5:           union+age+age_2+female+race+marital_status+class_of_worker
## 6: union+education+age+age_2+female+race+marital_status+class_of_worker
##                    iv
## 1:                   
## 2: |education~veteran
## 3:                   
## 4:                   
## 5: |education~veteran
## 6:                   
##                                                                                          form
## 1:                                                                             earnings~union
## 2:      earnings~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran
## 3:              earnings~union+education+age+age_2+female+race+marital_status+class_of_worker
## 4:                                                                        log(earnings)~union
## 5: log(earnings)~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran
## 6:         log(earnings)~union+education+age+age_2+female+race+marital_status+class_of_worker
##             reg
## 1: <fixest[27]>
## 2: <fixest[38]>
## 3: <fixest[27]>
## 4: <fixest[27]>
## 5: <fixest[38]>
## 6: <fixest[27]>

So, now we have a data.table called models_dt which holds parameters and fitted models, everything in one place. The column reg is a list-column, and each element of this list (corresponding to one row) is a fixest regression result. Now it is quite easy to select only a subset of models and present them in a table, for example.

etable(models_dt[y == "log(earnings)", reg], keep = c("education", "union"), 
       fitstat = ~.+ivf+ivf.p+wh+wh.p)
##                                                   model 1            model 2
## Dependent Var.:                             log(earnings)      log(earnings)
##                                                                             
## union                                  0.2782*** (0.0183) 0.1787*** (0.0220)
## education                                                   -0.0027 (0.0181)
## ______________________________________ __________________ __________________
## S.E. type                              Heteroskedas.-rob. Heteroskedas.-rob.
## Observations                                       12,732             12,732
## R2                                                0.01159            0.14682
## Adj. R2                                           0.01152            0.14615
## F-test (1st stage), education                          --             3.3921
## F-test (1st stage), p-value, education                 --            0.06553
## Wu-Hausman                                             --            0.90495
## Wu-Hausman, p-value                                    --            0.34148
##                                                   model 3
## Dependent Var.:                             log(earnings)
##                                                          
## union                                  0.1868*** (0.0179)
## education                              0.0131*** (0.0003)
## ______________________________________ __________________
## S.E. type                              Heteroskedas.-rob.
## Observations                                       12,732
## R2                                                0.32650
## Adj. R2                                           0.32597
## F-test (1st stage), education                          --
## F-test (1st stage), p-value, education                 --
## Wu-Hausman                                             --
## Wu-Hausman, p-value                                    --

This is a very simple, and short, example but sometimes the number of models we want to estimate grows exponentially and, in those cases this approach of holding models inside a data frame proves useful. Would he choose one name for each regression run, Sala-I-Martin (1997) would be in trouble to come up with insightful names…

References

Callaway, Brantly, and Pedro HC Sant’Anna. 2020. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics.
De Chaisemartin, Clément, and Xavier d’Haultfoeuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.
Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics.
Sala-I-Martin, Xavier X. 1997. “I Just Ran Two Million Regressions.” The American Economic Review 87 (2): 178–83. http://www.jstor.org/stable/2950909.
Sun, Liyang, and Sarah Abraham. 2020. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics.

  1. Remember, this is just for demonstration purposes. The following exercise is not historically accurate nor should make economic sense. Moreover, I have no clue whether we will find any statistical significance and if so, this is just a data artifact and should not be interpreted in any way.↩︎

  2. There is a growing literature showing that the assumptions under this kind of regression is much stronger than the canonical DID (2x2), therefore, those event-study like estimations may be severely biased. See Callaway and Sant’Anna (2020), De Chaisemartin and d’Haultfoeuille (2020), Goodman-Bacon (2021), Sun and Abraham (2020).↩︎