The package fixest
The fixest package provides a family of functions to perform estimations with multiple fixed effects and instrumental variables, and it computes clustered standard errors without the need for a third-party package. The two main functions are feols for linear models and feglm for generalized linear models.
Installation and dataset ingestion
Install and load the package in order to use it.
# install.packages("fixest")
library(data.table)
library(fixest)
For this class we will use a dataset containing a sample of 12,834 individuals in the labor force, extracted from the 2019 annual supplement of the US Current Population Survey.
dt <- fread("Data/cps_union_data.csv")
Cleaning
Before doing any analysis or regressions we must first check what kind of data we have been given and do the appropriate pre-processing in order to have a “ready to regress” dataset. Usually, one would drop any columns that: i) have no variation among observations, or ii) have too many missing values. Observations that have a missing value for some variable should be treated case by case. Can you impute those values? Is that variable relevant for this regression? Should only complete cases be allowed in the dataset?
summary(dt)
## V1 CPSID CPSIDP public_housing
## Min. : 12 Min. :2.017e+13 Min. :20171200000302 Min. :0.000
## 1st Qu.: 40891 1st Qu.:2.017e+13 1st Qu.:20171204151101 1st Qu.:0.000
## Median : 84176 Median :2.018e+13 Median :20181200737602 Median :0.000
## Mean : 86055 Mean :2.018e+13 Mean :20177009551114 Mean :0.041
## 3rd Qu.:131144 3rd Qu.:2.018e+13 3rd Qu.:20181204062606 3rd Qu.:0.000
## Max. :179189 Max. :2.019e+13 Max. :20190307097802 Max. :1.000
## NA's :8701
## age female race marital_status
## Min. :15.00 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:1.000
## Median :42.00 Median :0.0000 Median :1.000 Median :1.000
## Mean :42.49 Mean :0.4963 Mean :1.557 Mean :2.971
## 3rd Qu.:54.00 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:6.000
## Max. :85.00 Max. :1.0000 Max. :8.000 Max. :6.000
##
## veteran employed education worked_last_year
## Min. :0.00000 Min. :1 Min. : 2.0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:1 1st Qu.: 73.0 1st Qu.:1.0000
## Median :0.00000 Median :1 Median : 81.0 Median :1.0000
## Mean :0.05756 Mean :1 Mean : 91.1 Mean :0.9758
## 3rd Qu.:0.00000 3rd Qu.:1 3rd Qu.:111.0 3rd Qu.:1.0000
## Max. :1.00000 Max. :1 Max. :125.0 Max. :1.0000
## NA's :83
## total_income_last_year wage_income_last_year own_farm_income_last_year
## Min. : 0 Min. : 0 Min. : -987.0
## 1st Qu.: 25009 1st Qu.: 23000 1st Qu.: 0.0
## Median : 44474 Median : 40000 Median : 0.0
## Mean : 61237 Mean : 55788 Mean : 113.9
## 3rd Qu.: 73005 3rd Qu.: 68000 3rd Qu.: 0.0
## Max. :1755499 Max. :1314999 Max. :1099999.0
##
## private_health_insurance medicaid class_of_worker
## Min. :0.0000 Min. :0.00000 Min. :21.00
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:21.00
## Median :1.0000 Median :0.00000 Median :21.00
## Mean :0.8121 Mean :0.08446 Mean :21.98
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:21.00
## Max. :1.0000 Max. :1.00000 Max. :28.00
##
## class_of_worker_last_year union earnings
## Min. : 0.00 Min. :0.0000 Min. : 1.0
## 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.: 480.0
## Median :22.00 Median :0.0000 Median : 794.0
## Mean :22.18 Mean :0.1152 Mean : 995.6
## 3rd Qu.:22.00 3rd Qu.:0.0000 3rd Qu.:1308.0
## Max. :29.00 Max. :1.0000 Max. :2885.0
## NA's :19
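Beyond summary(), a quick data.table one-liner counts the missing values in every column at once; a small sketch (not part of the original code, but handy for the checks discussed below):
dt[, lapply(.SD, function(col) sum(is.na(col)))] # one row: number of NA's per column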
Taking a look at the summary table above you can readily see that public_housing has 8,701 NA’s, so we should drop it. Moreover, employed is useless for regression since it has no variation at all; drop it too. Some other NA values appear in veteran and earnings, but they are not numerous. In this case you must keep in mind what kind of regression you are running. Suppose your goal is to estimate the causal effect of union membership/coverage (variable union) on weekly earnings (variable earnings). Obviously, you should drop any observations where earnings is missing. Also, according to the provided dictionary (dictionary.xlsx), two columns are categorical: class_of_worker and class_of_worker_last_year (the word “class” in the names should have hinted at this); and both marital_status and race, which you may want to recode as binary. Let’s do all of that.
dt <- dt[!is.na(earnings)] # Keep only not NA in earnings
dt[, c("V1", "CPSID", "CPSIDP", "public_housing", "employed") := NULL] # drop whole columns
dt[, `:=`(
  marital_status = ifelse(marital_status %in% c(1, 2), 1, 0),
  race = ifelse(race == 1, 1, 0),
  age_2 = age^2
)]
cat_cols <- c("class_of_worker", "class_of_worker_last_year")
dt[, (cat_cols) := lapply(.SD, factor), .SDcols = cat_cols]
Linear Regressions
Simple regression
Now we are ready to perform a simple regression of earnings
on union
(a binary variable) and interpret it. Is this a causal regression? Why?
simple_reg <- feols(earnings~union, data = dt)
etable(simple_reg, cluster = ~class_of_worker_last_year)
## simple_reg
## Dependent Var.: earnings
##
## (Intercept) 976.5*** (19.40)
## union 165.7*** (15.80)
## _______________ ________________
## S.E.: Clustered by: class_of_w..
## Observations 12,815
## R2 0.00544
## Adj. R2 0.00537
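etable() is handy for presentation, but you can also inspect a single model with the usual extractor generics, which all work on fixest objects. A brief sketch:
summary(simple_reg, cluster = ~class_of_worker_last_year) # full output with clustered SEs
coef(simple_reg)    # named vector of point estimates
confint(simple_reg) # confidence intervals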
Multiple regression
Now let’s add some control variables to our regression. The Mincerian equation stipulates that age, age squared, education and female should play a role in determining earnings; let’s also add race, marital_status and class_of_worker.
mult_reg <- feols(
  earnings~union+age+age_2+education+female+race+marital_status+class_of_worker,
  data = dt
)
etable(mult_reg, cluster = ~class_of_worker_last_year, drop = "class_of_worker")
## mult_reg
## Dependent Var.: earnings
##
## (Intercept) -1,444.1*** (19.81)
## union 94.22*** (15.78)
## age 60.98*** (1.755)
## age_2 -0.6218*** (0.0211)
## education 13.11*** (0.3877)
## female -334.7*** (11.27)
## race 27.01* (9.142)
## marital_status 104.8** (23.24)
## _______________ ___________________
## S.E.: Clustered by: class_of_work..
## Observations 12,815
## R2 0.33918
## Adj. R2 0.33866
Multiple estimations
The fixest package allows for multiple estimations (i.e. changing the dependent variable) at once. Suppose you want to regress not only earnings on union, but also wage_income_last_year and total_income_last_year. You can do that in just one call to the feols function.
setnames(dt, c("wage_income_last_year", "total_income_last_year"),
c("wage", "total"))
mult_est <- feols(c(earnings, wage, total)~union, data = dt)
etable(mult_est, se = "hetero")
## model 1 model 2 model 3
## Dependent Var.: earnings wage total
##
## (Intercept) 976.5*** (6.816) 55,188.6*** (673.4) 60,643.3*** (735.1)
## union 165.7*** (17.82) 5,362.0** (1,650.7) 5,254.6** (1,766.1)
## _______________ ________________ ___________________ ___________________
## S.E. type Heterosked.-rob. Heteroskedast.-rob. Heteroskedast.-rob.
## Observations 12,815 12,815 12,815
## R2 0.00544 0.00059 0.00048
## Adj. R2 0.00537 0.00052 0.00040
Step-wise estimations
You can also incrementally increase the complexity of your model by introducing new regressors. To estimate multiple RHS, you need to use a specific set of functions, the stepwise functions. There are four of them: sw, sw0, csw, csw0.
- sw: this function is replaced sequentially by each of its arguments,
- sw0: starts with the empty element,
- csw: it stands for cumulative stepwise. It adds to the formula each of its arguments sequentially,
- csw0: cumulative, but starting with the empty element.
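For intuition, here is a small sketch (not estimated in these notes) contrasting the non-cumulative and cumulative variants; each call fits several models at once:
sw_models  <- feols(earnings~union+sw(age, education), data = dt)  # union+age, then union+education
csw_models <- feols(earnings~union+csw(age, education), data = dt) # union+age, then union+age+education
etable(sw_models, csw_models)
The class example below uses csw0 to add age and age squared step by step.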
step_wise <- feols(earnings~union+csw0(age, age_2),
                   data = dt)
etable(step_wise)
## model 1 model 2 model 3
## Dependent Var.: earnings earnings earnings
##
## (Intercept) 976.5*** (6.722) 598.0*** (19.13) -852.2*** (48.65)
## union 165.7*** (19.79) 132.9*** (19.51) 100.2*** (18.80)
## age 8.998*** (0.4269) 83.45*** (2.348)
## age_2 -0.8467*** (0.0263)
## _______________ ________________ _________________ ___________________
## S.E. type Standard Standard Standard
## Observations 12,815 12,815 12,815
## R2 0.00544 0.03878 0.11076
## Adj. R2 0.00537 0.03863 0.11055
More details are provided in fixest
’s vignette on multiple estimations.
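Both features can be combined in a single call: several dependent variables together with a stepwise right-hand side. A hedged sketch (2 dependent variables x 3 specifications = 6 models):
combo <- feols(c(earnings, wage)~union+csw0(age, age_2), data = dt)
etable(combo)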
Instrumental Variables
As we all know, if we were interested in the causal effect of educational attainment on earnings, we would not be able to run a multiple regression like the one above and interpret it as causal: education is endogenous with respect to earnings. Thus, we may resort to instrumental variable (IV) estimation to overcome this pitfall. Suppose, just for the sake of this example, that veteran status is a potential instrument for education. How can you estimate a two-stage least squares regression with fixest?
First, notice that veteran still has NA values in it. Now you have to decide whether to impute those values or just drop the observations. Let’s proceed by dropping the NA values.
dt2 <- dt[!is.na(veteran)]
iv_reg <- feols(earnings~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran,
                data = dt2)
etable(iv_reg, keep = c("education", "union"), fitstat = ~.+ivf+ivf.p+wh+wh.p)
## iv_reg
## Dependent Var.: earnings
##
## education -10.61 (19.36)
## union 81.91** (25.13)
## ______________________________________ _______________
## S.E. type Standard
## Observations 12,732
## R2 -0.20462
## Adj. R2 -0.20556
## F-test (1st stage), education 3.3921
## F-test (1st stage), p-value, education 0.06553
## Wu-Hausman 2.7669
## Wu-Hausman, p-value 0.09626
So, what have we done here? First, notice how the endogenous variable does not appear on the RHS of the formula; it goes into a formula of its own after the vertical bar (|) delimiter. The formula linking endogenous variables to their instruments takes the form endogenous ~ instrument.
In the table above we opted to show only two coefficients by using the keep argument: education, our instrumented variable, and union. Whenever using an IV approach, it is useful to run diagnostics such as weak-instrument and endogeneity tests. Here we show the first-stage F-test to assess instrument weakness and the Wu-Hausman endogeneity test, whose H0 is the absence of endogeneity of the instrumented variables. No wonder veteran status is a weak instrument: its correlation with education is very small, at 0.0170532.
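That correlation can be checked directly on the estimation sample dt2 (where the missing veteran values have already been dropped):
cor(dt2$veteran, dt2$education) # correlation between the instrument and the endogenous regressor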
Fixed-effects
Let’s change the dataset; enough of earnings! Who wants to make money, after all… In international trade we have what are called gravity models, in which we are interested in estimating the negative effect of geographic distance on trade flows. Gravity models typically include many fixed effects to account for product, period, exporter country and importer country. A very simple gravity equation would take the form:
\[ \log(trade_{nipt})=\beta \log(\text{distance}_{ni})+\gamma_n+\gamma_i+\gamma_p+\gamma_t+\varepsilon_{nipt} \] where we have the indexes \(\{n, i, p, t\}\) respectively for the importer, exporter, product and period.
data("trade")
trade <- as.data.table(trade)
head(trade)
## Destination Origin Product Year dist_km Euros
## 1: LU BE 1 2007 139.5719 2966697
## 2: BE LU 1 2007 139.5719 6755030
## 3: LU BE 2 2007 139.5719 57078782
## 4: BE LU 2 2007 139.5719 7117406
## 5: LU BE 3 2007 139.5719 17379821
## 6: BE LU 3 2007 139.5719 2622254
gravity <- feols(
  log(Euros) ~ log(dist_km) | Destination + Origin + Product + Year,
  data = trade)
etable(gravity, cluster = ~Origin + Product)
## gravity
## Dependent Var.: log(Euros)
##
## log(dist_km) -2.170*** (0.1603)
## Fixed-Effects: ------------------
## Destination Yes
## Origin Yes
## Product Yes
## Year Yes
## _______________ __________________
## S.E.: Clustered by: Orig. & Prod.
## Observations 38,325
## R2 0.70558
## Within R2 0.21932
As in the IV approach, we provide the fixed effects in a formula of their own, after the vertical bar. But notice one very important distinction: the fixed-effects formula is one-sided, that is, only the fixed effects, joined by + signs, are included. Also, you are able to cluster your standard errors by more than one variable.
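Two side notes, sketched here rather than taken from the original class code: clustering can also be requested at estimation time through the cluster argument, and the same fixed-effects syntax carries over to the GLM interface mentioned at the start (feglm, or its Poisson shortcut fepois), which is often used to estimate gravity models with trade flows in levels:
# cluster at estimation time instead of inside etable()
gravity_cl <- feols(log(Euros) ~ log(dist_km) | Destination + Origin + Product + Year,
                    data = trade, cluster = ~Origin + Product)
# Poisson version of the gravity model (dependent variable in levels)
gravity_pois <- fepois(Euros ~ log(dist_km) | Destination + Origin + Product + Year,
                       data = trade)
etable(gravity_cl, gravity_pois)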
Difference-in-Differences
The trade dataset is a panel of several European countries stacked along the years, from 2007 to 2016. Just for demonstration purposes, let’s assume the following scenario: countries DE, FR and GB implement a reduction in tariffs starting in 2010. With the given assumptions, it is possible to estimate the effect of such a measure on trade flows by DID. We have three countries in the treatment group and the treatment happens from 2010 onwards. Let’s create two dummy variables for group and period and then estimate our DID model.¹
trade[, `:=`(
  d_g = ifelse(Destination %in% c("DE", "FR", "GB"), 1, 0),
  d_t = ifelse(Year >= 2010, 1, 0)
)]
trade[, d_treat := d_g*d_t]
did <- feols(log(Euros) ~ d_treat+log(dist_km) | Destination + Year,
             data = trade)
etable(did, cluster = ~Origin + Product)
## did
## Dependent Var.: log(Euros)
##
## d_treat 0.0078 (0.0397)
## log(dist_km) -2.387*** (0.5411)
## Fixed-Effects: ------------------
## Destination Yes
## Year Yes
## _______________ __________________
## S.E.: Clustered by: Orig. & Prod.
## Observations 38,325
## R2 0.27227
## Within R2 0.18419
A DID specification can be interpreted and run as a canonical two-way fixed effects (TWFE) regression. You can add control variables, like log(dist_km), either to make the unconfoundedness assumption more plausible or to improve the precision of your estimator.
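To see the connection with the textbook two-by-two setup, here is a hedged sketch of the same DID written with explicit group and period dummies instead of the Destination and Year fixed effects; the coefficient on d_treat plays the same role:
did_2x2 <- feols(log(Euros) ~ d_g + d_t + d_treat + log(dist_km), data = trade)
etable(did, did_2x2, cluster = ~Origin + Product)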
Event Studies
What if we want to run the entire “event study”, where we estimate the treatment effect over the whole period at hand, even before the actual treatment takes place (placebo-like regression)? In a regression context, this essentially amounts to interacting our treatment-group dummy with the Year variable. This is easily done using the i(factor_var, var, ref) syntax:
es <- feols(log(Euros)~log(dist_km) + i(Year, d_g, "2009") | Destination + Year,
            data = trade)
etable(es, cluster = ~Origin + Product)
## es
## Dependent Var.: log(Euros)
##
## log(dist_km) -2.387*** (0.5411)
## d_g x Year = 2007 0.0742 (0.0778)
## d_g x Year = 2008 0.1161 (0.0663)
## d_g x Year = 2010 -0.0918 (0.0600)
## d_g x Year = 2011 0.0323 (0.0552)
## d_g x Year = 2012 0.0750* (0.0320)
## d_g x Year = 2013 0.1080* (0.0488)
## d_g x Year = 2014 0.0980** (0.0310)
## d_g x Year = 2015 0.1535* (0.0587)
## d_g x Year = 2016 0.1240** (0.0334)
## Fixed-Effects: ------------------
## Destination Yes
## Year Yes
## _________________ __________________
## S.E.: Clustered by: Orig. & Prod.
## Observations 38,325
## R2 0.27235
## Within R2 0.18428
Here we set the last year before treatment takes place, 2009, as the reference; thus, its coefficient is set to zero and is not reported. Before that year we have placebo estimates, for which one should not expect any effect. The post-reference coefficients are treatment-effect estimates for each of those years, and the event study may uncover dynamic effects.²
The Year variable is taken as a factor (i.e. categorical) and interacted with the treatment-group dummy; the coefficient associated with each interaction is interpreted as the difference in the dependent variable between the treatment and control groups in that specific year.
You can visualise the results with the command iplot
to see only the interacted coefficients and their confidence intervals.
iplot(es, cluster = ~Origin + Product)
Tidying up multiple regressions
You can use data frames (and data.table in particular) to store models, lists, and other data frames in what is called a list-column. This will be a nested data.table
, where some columns may be the usual vectors but the list-column will hold a list of more complex data structures. The table will store the products of your data analysis in an organized way, and you can manipulate the table with your familiar tools.
Imagine a data.table created to hold many parameterized models: one column for the dependent variable, another for the regressors, and one more for the instrumental variables. Each row will have all the components related to a regression; therefore, the natural place to hold the results of each regression is in the same data.table, and that is where the list-column comes in.
<- "union+education+age+age_2+female+race+marital_status+class_of_worker"
r_ed <- sub("education\\+", "", r_ed)
r_iv <- CJ(
models_dt y = c("earnings", "log(earnings)"),
x = c("union", r_ed, r_iv))
:= ifelse(x == r_iv, "|education~veteran", "")]
models_dt[, iv := paste0(y, "~", x, iv)]
models_dt[, form := lapply(form, function(x){
models_dt[, reg feols(as.formula(x),
data = dt2,
se = "hetero",
lean = TRUE)
})] models_dt
## y
## 1: earnings
## 2: earnings
## 3: earnings
## 4: log(earnings)
## 5: log(earnings)
## 6: log(earnings)
## x
## 1: union
## 2: union+age+age_2+female+race+marital_status+class_of_worker
## 3: union+education+age+age_2+female+race+marital_status+class_of_worker
## 4: union
## 5: union+age+age_2+female+race+marital_status+class_of_worker
## 6: union+education+age+age_2+female+race+marital_status+class_of_worker
## iv
## 1:
## 2: |education~veteran
## 3:
## 4:
## 5: |education~veteran
## 6:
## form
## 1: earnings~union
## 2: earnings~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran
## 3: earnings~union+education+age+age_2+female+race+marital_status+class_of_worker
## 4: log(earnings)~union
## 5: log(earnings)~union+age+age_2+female+race+marital_status+class_of_worker|education~veteran
## 6: log(earnings)~union+education+age+age_2+female+race+marital_status+class_of_worker
## reg
## 1: <fixest[27]>
## 2: <fixest[38]>
## 3: <fixest[27]>
## 4: <fixest[27]>
## 5: <fixest[38]>
## 6: <fixest[27]>
So, now we have a data.table called models_dt
which holds parameters and fitted models, everything in one place. The column reg
is a list-column, and each element of this list (corresponding to one row) is a fixest
regression result. Now it is quite easy to select only a subset of models and present them in a table, for example.
etable(models_dt[y == "log(earnings)", reg], keep = c("education", "union"),
fitstat = ~.+ivf+ivf.p+wh+wh.p)
## model 1 model 2
## Dependent Var.: log(earnings) log(earnings)
##
## union 0.2782*** (0.0183) 0.1787*** (0.0220)
## education -0.0027 (0.0181)
## ______________________________________ __________________ __________________
## S.E. type Heteroskedas.-rob. Heteroskedas.-rob.
## Observations 12,732 12,732
## R2 0.01159 0.14682
## Adj. R2 0.01152 0.14615
## F-test (1st stage), education -- 3.3921
## F-test (1st stage), p-value, education -- 0.06553
## Wu-Hausman -- 0.90495
## Wu-Hausman, p-value -- 0.34148
## model 3
## Dependent Var.: log(earnings)
##
## union 0.1868*** (0.0179)
## education 0.0131*** (0.0003)
## ______________________________________ __________________
## S.E. type Heteroskedas.-rob.
## Observations 12,732
## R2 0.32650
## Adj. R2 0.32597
## F-test (1st stage), education --
## F-test (1st stage), p-value, education --
## Wu-Hausman --
## Wu-Hausman, p-value --
This is a very simple and short example, but sometimes the number of models we want to estimate grows very quickly, and in those cases this approach of holding models inside a data frame proves useful. Had he chosen one name for each regression he ran, Sala-i-Martin (1997) would have been in trouble coming up with insightful names…
¹ Remember, this is just for demonstration purposes. The following exercise is neither historically accurate nor meant to make economic sense. Moreover, I have no clue whether we will find any statistical significance; if we do, it is just a data artifact and should not be interpreted in any way.
² There is a growing literature showing that the assumptions underlying this kind of regression are much stronger than those of the canonical (2x2) DID; therefore, those event-study-like estimates may be severely biased. See Callaway and Sant’Anna (2020), De Chaisemartin and d’Haultfoeuille (2020), Goodman-Bacon (2021), and Sun and Abraham (2020).