R Introduction: Swirl package

Whether it is your first time using R, you have not used it in a long time, or you have only used it occasionally, I recommend using the Swirl package to learn or refresh the basic notions. Swirl offers a series of interactive tutorials which are easy to follow. First we install and then we load the package; an interactive menu will appear

install.packages("swirl")
library(swirl)

If you follow the instructions you will get to a menu of courses. The more you do the better, but I would definitely do the course “R Programming: The basics of programming in R”. Also, here you can find more courses. To download a course (for example, the “Getting and Cleaning Data” course mentioned below) just do the following:

library(swirl)
install_course("Getting and Cleaning Data")

I also find the course named “Getting and Cleaning Data” extremely useful. A final piece of advice is to get used to working with RStudio projects, which help you keep your workflow organized; you can learn how to use them here.

Some Statistics

We are going to use the Encuesta de Condiciones de Vida 2019 (ECV 2019) data to illustrate statistical and econometric concepts. This data is available at the website of the Instituto Nacional de Estadística (INE) and contains information about annual income in 2018. This survey is one of the main references for the analysis of the income distribution and poverty in Spain and it is part of the European survey EU-SILC. We use a cleaned version which is used in this paper and can be found here. In the years 2005, 2011 and 2019, the ECV includes a module on the intergenerational transmission of disadvantages, with information on circumstances outside the control of the individual such as sex, parental education/occupation, population of the municipality where they grew up, whether they grew up without a mother/father, etc. We measure income with equivalized household income and the level of observation is the individual. We restrict the sample to those aged between 25 and 59 years old to focus on the working-age population.

For the moment we are just going to focus on income and forget about the circumstances. The first thing to do is to load the data and do some summary statistics

ecv <- read.csv("../Data/ecv2019.csv")
summary(ecv$Y)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.11  10480.96  15937.38  17861.57  22769.91 163597.20

This already gives us some useful information. For instance, the average income in Spain is 17,861€ and the median is 15,937€, meaning that half of the sample has an income below 15,937€. We also see that the observation with the lowest income has 0.11€ and the one with the highest income has 163,597€. Finally, the poorest 25% of the sample have an income below 10,480€ and the richest 25% have an income above 22,769€. This already gives us an image of the distribution of income in our sample. One of the most useful tools to get a more complete picture of the distribution is the histogram

hist(ecv$Y)

The x-axis is split into bins and the y-axis tells us the frequency (how many observations fall) in each bin. Even though our data will always be discrete, it is reasonable to model income as a continuous random variable, meaning it has a density (a Lebesgue p.d.f.). We can estimate this p.d.f.:

# TeX() for the axis labels comes from the latex2exp package
library(latex2exp)
# Calculate density estimate
dy <- density(ecv$Y)
# Plot density
plot(dy, lwd = 2, col = "red", main = "Income pdf", ylab = TeX("$f_n(y)$"))

As could already be seen from the histogram, the income distribution is characterized by a fat right tail. This means that the probability of extreme positive values does not decay as fast as for instance in the normal distribution. In economic terms this means that the probability of encountering individuals much richer than the rest does not go down as fast as in other distributions such as the normal distribution.

Note that the y-axis of the density does not reflect the frequency or probability of the income values. This is because the p.d.f. has to integrate to 1, \(\int f(y) dy = 1\). We can plot the histogram and the density together if we force the histogram to integrate to 1 as well.

# Create a histogram
hist(ecv$Y, freq = FALSE, main = "Histogram and density", ylim = c(0, max(dy$y)))
# Add density
lines(dy, lwd = 2, col = "red")

Another key statistical concept is the c.d.f., \(F(x)\). Its sample analog is the empirical distribution function e.c.d.f. \(F_n(y) = n^{-1}\sum_{i=1}^n 1(Y_i \leq y)\). This we can also compute and visualize with R.

# Let n be the number of obs
n <- length(ecv$Y)
# Create step function object with the sorted Y on the x-axis and the
# accumulation of 1/n on the y-axis
step_ecdf <- stepfun(sort(ecv$Y), c(0, (1:n)/n))
# Plot e.c.d.f.
plot(step_ecdf, main = "ECDF", ylab = TeX("$F_n(y)$"))
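
Base R also ships a built-in ecdf() function that builds the same step function; it is a handy cross-check (a minimal sketch reusing ecv from above):

# Built-in empirical c.d.f.: ecdf() returns a function we can evaluate and plot
Fn <- ecdf(ecv$Y)
Fn(15937.38)  # share of observations at or below the median, should be close to 0.5
plot(Fn, main = "ECDF with ecdf()")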

Of course, the e.c.d.f. satisfies all the properties of a c.d.f. Since we have a lot of observations the e.c.d.f. looks almost continuous; however, it is a step function. If we do the same with a random subset of only 20 observations, the steps become clearly visible

#Create random sample
Yr <- ecv$Y[sample(1:n, 20, replace = TRUE)]
#Create step function object
step_ecdf20 <- stepfun(sort(Yr),c(0,(1:20)/20))
#Plot e.c.d.f
plot(step_ecdf20, main = "ECDF n = 20", ylab = TeX("$F_n(y)$"))

Quantiles are another key tool to understand the distribution of income, for \(\tau \in [0,1]\)

\[ F^{-1}(\tau) = \inf\{y: F(y) \geq \tau \} \] This is also called the generalized inverse. If \(F\) is continuous and strictly increasing, then \(F^{-1}(\tau)\) is the usual inverse. To find the quantile function in our data we can just look at the inverse of our e.c.d.f.

library(tibble)   # for as_tibble()
ecdf <- (1:n)/n
df_q <- as_tibble(cbind(Y = sort(ecv$Y), ecdf = ecdf))
plot(stepfun(ecdf, c(sort(ecv$Y), max(ecv$Y))),
     xlim = c(0,1),
     main = "Quantile function",
     ylab = TeX("$F_n^{-1}(\\tau)$"))
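
The built-in quantile() function returns the same values if we ask for the inverse-e.c.d.f. definition (type = 1); the result should be very close to the quartiles reported by summary() at the beginning (summary() uses a slightly different interpolation rule):

# type = 1 is the generalized inverse of the e.c.d.f.
quantile(ecv$Y, probs = c(0.25, 0.5, 0.75), type = 1)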

Popular inequality measures are quantile share ratios. For instance, the quintile share ratios \(S80/S20\), \(S80/S40\) and \(S80/S60\) compute the ratio of the total income received by the top 20% of the population to the total income received by the poorest 20%, 40% and 60%, respectively. Let's start using ggplot now

library(ggplot2)  # for ggplot()
quint_grid <- c(0.2,0.4,0.6)
s80 <- sum(df_q$Y[which.min(abs(ecdf - 0.8)):n])
s_quint <- vapply(quint_grid,
                  function (x)  sum(df_q$Y[1:which.min(abs(ecdf - x))]),
                  1)
s80_quints <- as_tibble(cbind(grid = quint_grid,
                              ratio = s80/s_quint))

ggplot(s80_quints, aes(grid,ratio)) +
  geom_point(color = "red", size = 3) + 
  labs(x = "X", y = "S80/SX", title = "Share Quintile ratios")  

So we can see that the income share of the top 20% is more than 6 times that of the poorest 20%. We can also look at the share of the top 10% against the share of the other deciles

dec_grid <- seq(0.1,0.8,0.1)
s90 <- sum(df_q$Y[which.min(abs(ecdf - 0.9)):n])
s_dec <- vapply(dec_grid,
                  function (x)  sum(df_q$Y[1:which.min(abs(ecdf - x))]),
                  1)
s90_decs <- as_tibble(cbind(grid = dec_grid,
                              ratio = s90/s_dec))

ggplot(s90_decs, aes(grid,ratio)) +
  geom_point(color = "red", size = 3) + 
  labs(x = "X", y = "S90/SX", title = "Share Decile ratios")  

Here we can see that the income share of the top 10% is more than 10 times that of the poorest 10%. Finally, we can also compare the top 5% with poorest 5%, 10%, 15%, etc.

f_grid <- seq(0.05,0.9,0.05)
s95 <- sum(df_q$Y[which.min(abs(ecdf - 0.95)):n])
s_f <- vapply(f_grid,
                  function(x)  sum(df_q$Y[1:which.min(abs(ecdf - x))]),
                  1)
s95_fs <- as_tibble(cbind(grid = f_grid,
                              ratio = s95/s_f))

ggplot(s95_fs, aes(grid,ratio)) +
  geom_point(color = "red", size = 3) + 
  labs(x = "X", y = "S95/SX", title = " 95% Share ratios")  

So the share of the top 5% is almost 20 times larger than the share of the bottom 5%. A fundamental tool in inequality analysis is the Lorenz curve, which tracks the share of total income held by the poorest fraction \(p\) of the population. Formally,

\[ L(p) = \mu^{-1} \int_0^p F^{-1}(u) \, du \] Using the empirical quantile function we can plot this Lorenz curve

shares <- vapply(ecdf,
                  function(x)  sum(df_q$Y[1:which.min(abs(ecdf - x))])/sum(df_q$Y),
                  1)

lorenz <- as_tibble(cbind(p = ecdf,
                          shares = shares))

ggplot(lorenz, aes(p,shares)) +
  geom_line(color = "red") + 
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "p", y = "L(p)", title = "Lorenz curve")  

Suppose that there was one individual who owned all the income; then the Lorenz curve would be equal to \(0\) and jump to one at \(p=1\). If income were equally distributed, the Lorenz curve would be equal to the \(45^o\) line. This motivates the use of the area between the \(45^o\) line and the Lorenz curve as a measure of inequality. If \(A\) is the area between the \(45^o\) line and the Lorenz curve and \(B\) is the area below the Lorenz curve, the Gini coefficient is

\[ G = \frac{A}{A+B} = 1 - 2 \int_0^1 L(p) \, dp \] Again, using the empirical counterparts we have that

#numerical integral
dp <- c(0,lorenz$p)
aux <- 0
n1 <- n-1
for (i in 1:n1){
  aux <- aux + lorenz$shares[i]*(dp[i+1] - dp[i])
}
G = 1 - 2*aux
print(paste0("The Gini coefficient is ", round(G,2)))
## [1] "The Gini coefficient is 0.32"
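
As a cross-check, the sample Gini coefficient can also be computed directly from the sorted incomes with the usual rank-based formula; small differences with respect to the numerical integral above are to be expected.

# Rank-based sample Gini: 2*sum(i*y_(i))/(n^2*mean(y)) - (n+1)/n
ys <- sort(ecv$Y)
G2 <- 2*sum((1:n)*ys)/(n^2*mean(ys)) - (n + 1)/n
round(G2, 2)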

LLN and CLT

In this part we are going to illustrate the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). For this we are going to forget about the data and simulate our own income data. A common parametric model for income is the log-normal model

\[ Y_i = e^{X_i}, \text{ with } X_i \sim \mathcal{N}(\mu,\sigma^2). \] This model is called log-normal because once you take logs the model is normal: \(\ln Y_i = X_i\). It is particularly useful for modelling variables with fat right tails, such as income. Let's generate \(3000\) observations for \((\mu,\sigma) = (0,1)\) and plot an estimate of the density

n <- 3000
mu <- 0
s <- 1
X <- rnorm(n,mu,s)
Y <- exp(X)
dy <- density(Y)
# Plot density estimate
plot(dy, lwd = 2, col = "red", main = "Simulated Y", ylab = TeX("$f_n(y)$"))

As we can see, the distribution does seem to have a fat right tail. A nice thing about simulations is that you know the true value of the parameters; in this case we have a log-normal distribution with parameters \((\mu,\sigma) = (0,1)\). Given these parameters we can compute the true expectation of \(Y\)

\[ \mathbb{E}(Y_i) = e^{\mu + \sigma^2/2} = e^{1/2} = 1.648721... \]
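
With the 3000 simulated draws above we can already compare the sample mean to this value (a quick check which the LLN below makes systematic):

# Sample mean vs. the true mean exp(mu + s^2/2)
c(sample = mean(Y), true = exp(mu + s^2/2))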

Let \(\bar{Y} = n^{-1} \sum_{i=1}^n Y_i\), the LLN tells us that for all \(\varepsilon > 0\)

\[ \lim_{n \to \infty} \mathbb{P} \biggl( |\bar{Y} - \mathbb{E}(Y_i)| > \varepsilon \biggr) = 0, \]

i.e. \(\bar{Y} \to_p \mathbb{E}(Y_i)\). We can compute the sample average for different sample sizes to check whether this is true

ngrid <- seq(10,10000,10)
EY <- exp(0.5)
lln <- vapply(ngrid, function(x){
  X <- rnorm(x,0,1)
  Y <- exp(X)
  mY <- mean(Y)},1)
lln <- as_tibble(cbind(n = ngrid, mean = lln))
ggplot(lln, aes(n,mean)) + 
  geom_line(color = "red") + 
  geom_abline(intercept = EY, slope = 0) + 
  labs(title = "LLN")

So we see that for increasing sample sizes the sample mean concentrates around the true value. How long it takes to concentrate around the true value can change a lot depending on the underlying distribution; the LLN is an asymptotic result, so it tells us what happens in the limit, not on the way to the limit. For instance, let's see what happens if we compare our simulated data with normally generated data with the same mean and a standard deviation of \(0.5\).

lln2 <- vapply(ngrid, function(x) mean(rnorm(x,EY,0.5)),1)
lln2 <- as_tibble(cbind(n = ngrid, mean = lln2, distr = rep(2,length(ngrid))))
lln1 <- as_tibble(cbind(lln, distr = rep(1,length(ngrid))))
lln_distr <- rbind(lln1,lln2)

ggplot(lln_distr, aes(n,mean)) +
  geom_line(aes(col = factor(distr))) +
  geom_abline(intercept = EY, slope = 0) +
  scale_color_discrete(name = "Distribution", labels = c("LN","N")) + 
  labs(title = "LLN")

We see that the sample mean when the DGP is \(\mathcal{N}(e^{1/2},0.5^2)\) concentrates much faster around the true value. This is not surprising: the variance of our log-normal is \((e-1)e \approx 4.67\), much larger than \(0.25\).

The CLT states that

\[ \sqrt{n}(\bar{Y} - \mathbb{E}(Y_i)) \to_d \mathcal{N}(0,\mathbb{V}ar(Y_i)) \]

This means that for large \(n\), \(\bar{Y}\) is approximately distributed as a \(\mathcal{N}(\mathbb{E}(Y_i), \mathbb{V}ar(Y_i)/n)\). We call \(\sqrt{\mathbb{V}ar(Y_i)/n}\) the standard error of \(\bar{Y}\). For the log-normal we have that \(\mathbb{V}ar(Y_i) = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}\). To see how this approximation works we can draw many random samples, say \(R\), of the same sample size \(n\) and compute the sample mean for each one of them. That is, in the end we end up with a vector of \(R\) sample means. These can be seen as \(R\) random draws from the distribution of \(\bar{Y}\) and hence they can be used to see how good the approximation is. If we do this for different sample sizes we can see how the CLT kicks in.

ngrid <- c("n100" = 100,"n3000" = 3000)
R <- 10000
Rgrid <- 1:R
mu <- 0
s <- 1

MC <- NULL
for (i in 1:length(ngrid)){
  MC <- cbind(MC, vapply(Rgrid, function(r){
    # each replication draws a sample of size ngrid[i]
    X <- rnorm(ngrid[i], mu, s)
    Y <- exp(X)
    mean(Y)}, 1))
}
colnames(MC) <- names(ngrid)
MC <- as_tibble(MC)
# true standard error: sqrt(Var(Y)/n) with Var(Y) = (e^{s^2}-1)e^{2mu+s^2}
se_true <- sqrt(((exp(s^2) - 1)*exp(2*mu + s^2))/ngrid)

c1 <- rgb(0,255,255,max = 255, alpha = 150, names = "lt.blue")
xx <- c(1.5,1.8)
yy <- c(0,15)
br <- 300
lsz <- 0.5
lpos <- "topright"

par(mfrow = c(1,2)) 

hist(MC$n100, breaks = br, xlim = xx, ylim = yy,
     xlab = "", ylab = "", freq = FALSE, col = c1, main = "")
curve(dnorm(x,mean = EY, sd = se_true[1]),
      col = "red", lwd = 2, add = TRUE)
      title(main = "n = 100")
      legend(lpos, legend = c("Simulation","Normal Distr."),
      cex = lsz, col = c(c1,"red"), pch = c(15,NA), lty = c(NA,1))
      abline(v=EY,col="red",lwd=2)


hist(MC$n3000, breaks = br, xlim = xx, ylim = yy,
     xlab = "", ylab = "", freq = FALSE, col = c1, main = "")
curve(dnorm(x,mean = EY, sd = se_true[2]),
      col = "red", lwd = 2, add = TRUE)
abline(v=EY,col="red",lwd=2)
title(main = "n = 3000")
legend(lpos, legend = c("Simulation","Normal Distr."),
       cex = lsz, col = c(c1,"red"), pch = c(15,NA), lty = c(NA,1))

We see that as theory predicts the sample mean is unbiased and the normal approximation improves as \(n\) grows.

Ordinary Least Squares

We can go back to our ECV data to run some regressions. To run a regression of log income on sex we write

m1 <- lm(log(Y) ~ sex, ecv) 
summary(m1)
## 
## Call:
## lm(formula = log(Y) ~ sex, data = ecv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7812  -0.3115   0.1079   0.4657   2.4513 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 9.553853   0.009395 1016.956   <2e-16 ***
## sexMale     0.030155   0.013432    2.245   0.0248 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8722 on 16872 degrees of freedom
## Multiple R-squared:  0.0002986,  Adjusted R-squared:  0.0002394 
## F-statistic:  5.04 on 1 and 16872 DF,  p-value: 0.02479

So we see that the expected income for males is about 3% higher than that for females. However, our standard errors are not robust to heteroskedasticity. Unfortunately, getting robust standard errors in R is a bit more of a hassle than in Stata; still, it is not very hard either if we use the \(\texttt{sandwich}\) and \(\texttt{AER}\) libraries

library(AER)  # attaches sandwich (vcovHC) and lmtest (coeftest)
vcv <- vcovHC(m1, type = "HC1")
coeftest(m1, vcv)
## 
## t test of coefficients:
## 
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 9.5538535  0.0093864 1017.8375  < 2e-16 ***
## sexMale     0.0301546  0.0134329    2.2448  0.02479 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So we see the coefficient is significant at the 5% level. A good exercise is to do the same without using any package. Let \(X_i = (1,sex_i)'\); we know that the OLS estimator is

\[ \hat{\beta} = \mathbb{E}_n(X_i X_i')^{-1}\mathbb{E}_n(X_i Y_i) = (\mathbb{X}'\mathbb{X})^{-1}\mathbb{X}'\mathbb{Y}, \]

where \(\mathbb{X} = (X_1,...,X_n)'\) is called the design or model matrix and in this case has two columns, the first one filled with ones and the second one equal to \((sex_1,...,sex_n)'\), and \(\mathbb{Y} = (Y_1,...,Y_n)'\). We can write our own function s_reg and use it instead of the built-in regression function. For now just look at the function s_reg up to where bhat is created; the rest is for the standard errors, which will be explained right after.

n <- length(ecv$Y)
Y <- log(ecv$Y)
sex <- as.numeric(as.factor(ecv$sex)) - 1
XX <- cbind(rep(1,n),sex)
#number of regressors (including constant)
k <- ncol(XX)

s_reg <- function(X, Y, se = NULL){
  n <- length(Y)
  k <- ncol(X)
  Y <- as.matrix(Y)
  # OLS coefficients: (X'X)^{-1} X'Y
  bhat <- solve(t(X)%*%X)%*%(t(X)%*%Y)
  # Residuals
  u <- Y - X%*%bhat
  if (is.null(se)){
    # Homoskedastic variance: E_n(X_i X_i')^{-1} * sigma2hat
    Shat <- solve((1/n)*(t(X) %*% X))*as.numeric((1/(n-k))*(t(u)%*%u))
  }
  else{
    # Heteroskedasticity-robust (sandwich) variance
    B <- (t(as.matrix(X)*as.numeric(u)))%*%(as.matrix(X)*as.numeric(u))
    Shat <- solve((1/n)*(t(X) %*% X)) %*% ((1/(n-k))*B) %*% solve((1/n)*(t(X) %*% X))
  }
  # Standard errors: square root of the diagonal of Shat/n
  sehat <- sqrt(diag(Shat)/n)
  return(list("b" = bhat, "se" = sehat))
}

m1 <- s_reg(XX,Y)
print(m1$b)
##           [,1]
##     9.55385345
## sex 0.03015465

As you see we get exactly the same as with the \(\texttt{lm()}\) function. From OLS theory we know that under homoskedasticity

\[ \sqrt{n}(\hat{\beta} - \beta_0) \to_d \mathcal{N}(0, \Sigma), \text{ } \Sigma = \mathbb{E}(X_i X_i')^{-1} \mathbb{E}(u_i^2) \] where \(u_i = Y_i - X_i' \beta_0\) and \(\beta_0 = \mathbb{E}(X_i X_i')^{-1}\mathbb{E}(X_i Y_i)\), an estimator of \(\Sigma\) is

\[ \hat{\Sigma} = \mathbb{E}_n(X_i X_i')^{-1} \mathbb{E}_n(\hat{u}_i^2) = (\mathbb{X}'\mathbb{X})^{-1} (\hat{\mathbb{u}}'\hat{\mathbb{u}}) \]

where \(\hat{\mathbb{u}} = (\hat{u}_1,...,\hat{u}_n)'\) and the standard errors are \(se = (\sqrt{\Sigma_{11}/n},\sqrt{\Sigma_{22}/n})\) with estimator \(\hat{se} = (\sqrt{\hat{\Sigma}_{11}/n},\sqrt{\hat{\Sigma}_{22}/n})\)

print(m1$se)
## [1] 0.009394559 0.013432387
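
These match the (homoskedastic) standard errors reported by lm() earlier; a quick check using the built-in vcov():

# Standard errors from the lm() fit
sqrt(diag(vcov(lm(log(Y) ~ sex, ecv))))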

More generally (without assuming homoskedasticity) we have

\[ \sqrt{n}(\hat{\beta} - \beta_0) \to_d \mathcal{N}(0, \Sigma), \text{ } \Sigma = \mathbb{E}(X_i X_i')^{-1} \mathbb{E}(X_i X_i' u_i^2) \mathbb{E}(X_i X_i')^{-1} \]

and

\[ \hat{\Sigma} = \mathbb{E}_n(X_i X_i')^{-1} \mathbb{E}_n(X_i X_i' \hat{u}_i^2) \mathbb{E}_n(X_i X_i')^{-1} = \biggl(\frac{1}{n}\mathbb{X}'\mathbb{X}\biggr)^{-1}\biggl(\frac{1}{n}\sum_{i=1}^n u_i^2 X_i X_i'\biggr)\biggl(\frac{1}{n}\mathbb{X}'\mathbb{X}\biggr)^{-1} \]

For the middle term the matrix form would be \(\mathbb{X}'B\mathbb{X}\), where \(B = diag(\hat{u}_1^2,...,\hat{u}_n^2)\). Since \(B\) is an \(n \times n\) matrix it is heavy on the memory of the computer, so it is better to either compute the sum with a loop or (even faster) use element-by-element multiplication: \(\frac{1}{n}(\mathbb{X} \odot \hat{\mathbb{u}})'(\mathbb{X} \odot \hat{\mathbb{u}})\), where \(\odot\) stands for element-by-element multiplication (the ordinary * in R, with \(\hat{\mathbb{u}}\) recycled across the columns of \(\mathbb{X}\)). Note that in the function s_reg, whenever we compute a mean involving residuals we divide by \(n-k\) to correct for the degrees of freedom taken up by the estimation of the coefficients; this correction is asymptotically negligible.
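
A minimal check that the element-by-element product gives the same middle matrix as the explicit sum (reusing XX, Y and the s_reg fit m1 from above):

# Residuals from the s_reg fit above
u <- Y - XX %*% m1$b
# Explicit sum over observations
M_loop <- matrix(0, k, k)
for (i in 1:n){
  M_loop <- M_loop + u[i]^2 * (XX[i,] %*% t(XX[i,]))
}
# Element-by-element version used inside s_reg
M_fast <- t(XX*as.numeric(u)) %*% (XX*as.numeric(u))
max(abs(M_loop - M_fast))  # should be numerically zero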

m2 <- s_reg(XX,Y,se = "r")
print(m2$b)
##           [,1]
##     9.55385345
## sex 0.03015465
print(m2$se)
## [1] 0.009386423 0.013432903

Bootstrapping standard errors

A common way to get standard errors when no explicit expression is available is to use the bootstrap. To illustrate, let us compute the standard errors of the above regression with the bootstrap (even though we do have an explicit expression for them). The key idea is to treat the empirical distribution of our data, \(F_n\), as the “true” distribution; we can then take samples from it and compute an estimate for each sample. In this way we approximate the distribution of our estimator.

B <- 1000

b_boot <- sapply(1:B, function(x){
  ecv_boot <- sample(1:n, n, replace = TRUE)
  ecv_boot <- ecv[ecv_boot,]
  Y <- log(ecv_boot$Y)
  sex <- as.numeric(as.factor(ecv_boot$sex)) - 1
  XX <- cbind(rep(1,n),sex)

  m <- s_reg(XX,Y,se = "r")
  return(m$b)})

print(c(sd(b_boot[1,]),sd(b_boot[2,])))
## [1] 0.009492263 0.013734838
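
The same bootstrap draws also give a percentile confidence interval for the coefficient on sex, an alternative to the normal approximation:

# 95% percentile interval from the bootstrap distribution
quantile(b_boot[2,], probs = c(0.025, 0.975))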

Bias and Coverage rate in Monte Carlo experiments

When an estimator is proposed together with its inferential theory it is common to do a Monte Carlo (MC) experiment showing the bias and the coverage rate of the estimators. Let us do a MC to see the bias of OLS and the coverage rate of different standard error estimators. In the MC we take many samples from the true DGP and for each one we estimate the coefficients and the standard errors. The bias will be the difference between the mean of all the estimates and the true value. The coverage rate is the proportion of times that the true value falls inside the confidence interval or, equivalently, the proportion of times that we do not reject that the coefficient is equal to the true value.

Let us use the following DGP with heteroskedasticity

\[ \begin{align*} Y_i &= 1 + 0.2\,X_i + u_i \\ u_i &= \varepsilon_i X_i \\ X_i, \varepsilon_i &\sim \mathcal{N}(0,1) , \quad X_i \perp \varepsilon_i \end{align*} \] The true slope is \(0.2\) (btr in the code below). So now \(\mathbb{E}[u_i^2|X_i] = \mathbb{E}[\varepsilon_i^2] X_i^2 = X_i^2\), so we have heteroskedasticity; note also that \(\mathbb{E}[X_i u_i] = \mathbb{E}[X_i^2] \mathbb{E}[\varepsilon_i] = 0\), so we do not have endogeneity.

In the simulation we report the bias, MSE and coverage for the (incorrect) homoskedastic standard errors, the heteroskedasticity robust standard errors and the bootstrap standard errors.

ngrid <- c(100,500,1000)
R <- 1000
B <- 100
Rgrid <- 1:R
btr <- 0.2
set.seed(123)

MC <- NULL
bias <- rep(0,length(ngrid))
mse <- rep(0,length(ngrid))
cov1 <- rep(0,length(ngrid))
cov2 <- rep(0,length(ngrid))
cov3 <- rep(0,length(ngrid))
for (i in 1:length(ngrid)){
  n <- ngrid[i]
  MC <- lapply(Rgrid, function(x){
    X <- rnorm(n)
    XX <- cbind(rep(1,n),X)
    u <- rnorm(n)*X
    Y <- 1 + btr*X + u
    # dff <- as_tibble(cbind(Y = Y, X = X))
    #Point estimate
    m1 <- s_reg(XX,Y)
    b1 <- m1$b[2]
    # b1 <- lm(Y ~ X, dff)
    # b1 <- b1$coefficients[2]
    #Standard errors
    #non heteroskedasticity robust
    se1 <- m1$se[2]
    rej1 <- (abs(b1 - btr)/se1) >= qnorm(0.975) 
    #heteroskedasticity robust
    m2 <- s_reg(XX,Y, se = "r")
    se2 <- m2$se[2]
    rej2 <- (abs(b1 - btr)/se2) >= qnorm(0.975) 
    #Bootstrap
    b_boot <- sapply(1:B, function(x){
      boot <- sample(1:n, n, replace = TRUE)
      boot <- as_tibble(cbind(Y = Y, X = X)[boot,])
      Yb <- boot$Y
      Xb <- boot$X
      XXb <- cbind(rep(1,n),Xb)
      m <- s_reg(XXb,Yb)
      return(m$b[2])})
    se3 <- sd(b_boot)
    rej3 <- (abs(b1 - btr)/se3) >= qnorm(0.975) 
    return(c(b1,rej1,rej2,rej3))})
  MC <- matrix(unlist(MC),4,R)
  bias[i] <- mean(MC[1,]) - btr
  mse[i] <- (mean(MC[1,]) - btr)^2 + var(MC[1,])
  cov1[i] <- 1 - mean(MC[2,])
  cov2[i] <- 1 - mean(MC[3,])
  cov3[i] <- 1 - mean(MC[4,])
}
res <- cbind(Bias = bias, MSE = mse, Homoskedastic = cov1,
             Heteroskedastic = cov2, Bootstrap = cov3)
row.names(res) <- paste("n = ",ngrid)
res
##                    Bias         MSE Homoskedastic Heteroskedastic Bootstrap
## n =  100  -0.0016502524 0.029549964         0.723           0.926     0.927
## n =  500  -0.0022743768 0.006338042         0.737           0.940     0.937
## n =  1000 -0.0006827952 0.003350155         0.721           0.933     0.933

As predicted by OLS theory, OLS is unbiased and its accuracy improves (the MSE decreases) as n increases. Also, the coverage of the heteroskedasticity-robust and bootstrap standard errors is close to the nominal 0.95, while the coverage of the standard errors computed under the assumption of homoskedasticity is far from 0.95.

Average Treatment Effects

Random Treatment

Suppose that we have some treatment \(D_i \in \{0,1\}\) and an observed outcome \(Y_i = D_i Y(1)_i + (1 - D_i) Y(0)_i\), where \(Y(d)_i\) is a potential outcome: the outcome individual \(i\) would obtain under treatment \(d\). Since in our data individuals are either treated or not, there is always an element of the pair \((Y(1)_i,Y(0)_i)\) which we do not observe. If the potential outcomes are independent of treatment, \(A1: (Y(1)_i,Y(0)_i) \perp D_i\) (e.g. if the treatment is randomized), then we can identify what we call the Average Treatment Effect (ATE):

\[ \tau_0 = \mathbb{E}[Y(1)_i - Y(0)_i] = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] , \] where the second equality follows from A1. Let us load the data used in (LaLonde 1986) and (Dehejia and Wahba 1999) from a randomized experiment, the National Supported Work (NSW) program, in which job training was assigned at random among a group of disadvantaged individuals. We also clean the workspace and load some packages we are going to need in this section (we download the package containing the data from github using devtools; check the link for more info).

rm(list = ls())
library(devtools)
library(dplyr)
library(lmtest)
library(matlib)
library(ranger)
#download package from github see https://github.com/jjchern/lalonde
devtools::install_github("jjchern/lalonde")

exp_df <- as_tibble(lalonde::nsw)
print(exp_df)
## # A tibble: 722 × 10
##    data_id    treat   age education black hispanic married nodegree  re75   re78
##    <chr>      <dbl> <dbl>     <dbl> <dbl>    <dbl>   <dbl>    <dbl> <dbl>  <dbl>
##  1 Lalonde S…     1    37        11     1        0       1        1     0  9930.
##  2 Lalonde S…     1    22         9     0        1       0        1     0  3596.
##  3 Lalonde S…     1    30        12     1        0       0        0     0 24909.
##  4 Lalonde S…     1    27        11     1        0       0        1     0  7506.
##  5 Lalonde S…     1    33         8     1        0       0        1     0   290.
##  6 Lalonde S…     1    22         9     1        0       0        1     0  4056.
##  7 Lalonde S…     1    23        12     1        0       0        0     0     0 
##  8 Lalonde S…     1    32        11     1        0       0        1     0  8472.
##  9 Lalonde S…     1    22        16     1        0       0        0     0  2164.
## 10 Lalonde S…     1    33        12     0        0       1        0     0 12418.
## # … with 712 more rows

We are interested in real earnings in 1978 (re78) since in 1978 the program had already finished. Since this data is experimental we can directly estimate the ATE by looking at the difference in sample means

\[ \hat{\tau} = \frac{\sum_{i = 1}^n D_i Y_i}{\sum_{i = 1}^n D_i} - \frac{\sum_{i = 1}^n (1-D_i) Y_i}{\sum_{i = 1}^n (1-D_i)} = \frac{1}{n_1}\sum_{i: D_i = 1} Y_i - \frac{1}{n_0}\sum_{i: D_i = 0} Y_i, \] where \(n_j = \sum_{i=1}^n 1(D_i = j)\) for \(j = 0,1\). We can also estimate the standard error as

\[ \hat{se}(\hat{\tau}) = \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_0^2}{n_0}}, \] where \(\hat{\sigma}_j^2 = n_j^{-1} \sum_{i: D_i = j} (Y_i - \bar{Y}_j)^2\) and \(\bar{Y}_j = n_j^{-1} \sum_{i: D_i = j} Y_i\). In R

tauhat <- mean(exp_df$re78[exp_df$treat == 1]) - mean(exp_df$re78[exp_df$treat == 0])
sehat <- sqrt(var(exp_df$re78[exp_df$treat == 1])/sum(exp_df$treat == 1) + 
              var(exp_df$re78[exp_df$treat == 0])/sum(exp_df$treat == 0))
c("tauhat" = tauhat, "se" = sehat)
##   tauhat       se 
## 886.3037 488.2045
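
The same comparison can be done with a Welch two-sample t-test, which is a convenient cross-check (note that t.test() reports the group 0 minus group 1 difference, so the sign of the point estimate is flipped):

# Welch t-test: same Welch standard error underlies the reported t statistic and CI
t.test(re78 ~ treat, data = exp_df)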

Alternatively, we can estimate the ATE by a regression. Letting \(\alpha = \mathbb{E}[Y_i | D_i = 0]\) and \(\varepsilon_i = Y_i - \mathbb{E}[Y_i|D_i]\), we have that

\[ Y_i = \mathbb{E}[Y_i |D_i] + Y_i - \mathbb{E}[Y_i |D_i] = \alpha + \tau D_i + \varepsilon_i, \] where by construction \(\mathbb{E}[\varepsilon_i |D_i] = 0\). Hence we can estimate \(\tau\) by OLS with the usual robust OLS standard errors, and we get the same result

m1exp <- lm(re78 ~ treat, exp_df)
lmtest::coeftest(m1exp, vcov=sandwich::vcovHC(m1exp, type='HC2'))[2,]
##     Estimate   Std. Error      t value     Pr(>|t|) 
## 886.30372245 488.20452854   1.81543527   0.06987292

The regression approach is useful since it allows us to adjust for covariates, i.e. to add other covariates \(X_i\) as regressors. Since \(D_i\) is randomly assigned, \(X_i\) will be independent of \(D_i\) and hence will not affect the parameter associated with \(D_i\); moreover, if \(X_i\) has explanatory power over \(Y_i\), including it will increase the precision of our ATE estimator by reducing the variance of the residuals. For instance, following (LaLonde 1986) and (Dehejia and Wahba 1999) we can adjust for age, age squared, education, race and high school degree.

m2exp <- lm(re78 ~ treat + poly(age,2) + education + black + hispanic + nodegree, exp_df)
lmtest::coeftest(m2exp, vcov=sandwich::vcovHC(m2exp, type='HC2'))[2,]
##    Estimate  Std. Error     t value    Pr(>|t|) 
## 798.3511545 488.0314384   1.6358601   0.1023096

The unadjusted ATE was 886 dollars, meaning that real earnings of treated individuals in 1978 were on average 886 dollars higher than those of the controls. When adjusting for covariates the ATE decreases slightly to almost 800 dollars. The adjusted estimate is slightly more precise, although less significant since it is closer to zero.

Selection on observables

The main idea in (LaLonde 1986) and (Dehejia and Wahba 1999) is to replace the experimental control group with a control group which is taken from a survey. This means that we go from an experimental setting to an observational one (one where the potential outcomes cannot be expected to be independent from the treatment). The goal is to see whether observational methods can get near to the experimental estimates. To construct this database we keep only the treated from the experimental data and take as control the Current Population Survey (CPS).

cps_lalonde <- lalonde::cps_controls2
obs_df <- exp_df %>%
  filter(treat == 1) %>% 
  bind_rows(cps_lalonde) %>% 
  dplyr::select(-data_id)

To be able to identify treatment effects we still need to assume selection on observables: \(A1': (Y(1)_i, Y(0)_i) \perp D_i | X_i\). That is, the distribution of the potential outcomes can now vary between treated and untreated individuals as long as they have different covariates; for those with the same covariates, however, the potential outcomes still have to be independent of treatment. Let us define an ATE which might depend on \(X_i\) and denote it \(ATE(X_i)\) (also called the CATE)

\[ ATE(X_i) = \mathbb{E}[Y(1)_i| X_i] - \mathbb{E}[Y(0)_i |X_i]. \] To retrieve the ATE we can just take \(\mathbb{E}[ATE(X_i)]\). By A1’ we have that

\[ ATE(X_i) = \mathbb{E}[Y_i | D_i = 1, X_i] - \mathbb{E}[Y_i | D_i = 0, X_i]. \]

By specifying some regression \(m_1(X_i) = \mathbb{E}[Y_i | D_i = 1, X_i]\) and \(m_0(X_i) = \mathbb{E}[Y_i | D_i = 0, X_i]\) and estimating the regressions we get what is sometimes called a direct estimator. If the ATE is constant across the support of \(X_i\), then \(ATE = ATE(X_i)\). One case in which this is true is under the assumption that \(\mathbb{E}[Y_i | D_i, X_i] = \delta_0 + \tau D_i + \delta_1' X_i\). Under this linearity assumption, the ATE can be estimated by OLS

m1obs <- lm(re78 ~ treat + poly(age,2) + education + black + hispanic + nodegree, obs_df)
lmtest::coeftest(m1obs, vcov=sandwich::vcovHC(m1obs, type='HC2'))[2,]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -2.340814e+03  5.453899e+02 -4.292000e+00  1.834070e-05

We see that the estimated ATE is very negative. This is because the treatment and control groups are hardly comparable: the treatment group is formed by disadvantaged individuals, while the control group is a representative sample of the US population. In the absence of treatment, disadvantaged individuals would already have much lower earnings. One way of allowing \(ATE(X_i)\) to vary with \(X_i\) is to assume two different linear regressions for treated and untreated. This is equivalent to specifying one linear regression with an interaction term.

\[ \mathbb{E}[Y_i | D_i, X_i] = \gamma_0 + \gamma_1 D_i + \gamma_2' X_i + \gamma_3' (D_i X_i). \] In this case \(ATE(X_i) = \gamma_1 + \gamma_3' X_i\) and \(ATE = \gamma_1 + \gamma_3' \mathbb{E}[X_i]\). An easier way to estimate the ATE (under the linearity assumption) is to reparametrize the above specification and run

\[ Y_i = \gamma_0^* + \gamma_1^* D_i + \gamma_2^{*'} X_i + \gamma_3^{*'} (D_i \tilde{X}_i) + \eta_i \]

where \(\tilde{X}_i\) is \(X_i\) minus its mean. Then it can be shown that \(ATE = \gamma_1^*\).

Y <- obs_df$re78
D <- obs_df$treat
X <- dplyr::select(obs_df, c("age", "education", "black", "hispanic", "nodegree"))
X$agesq <- X$age^2
DXc <- D*scale(X, center = TRUE, scale = FALSE)
colnames(DXc) <- paste("dx",names(X), sep = "_")
df3 <- cbind(Y = Y,D,X,DXc)
m3obs <- lm(Y ~., df3)
lmtest::coeftest(m3obs, vcov=sandwich::vcovHC(m3obs, type='HC2'))[2,]
##      Estimate    Std. Error       t value      Pr(>|t|) 
## -2.921629e+03  9.501247e+02 -3.074996e+00  2.126508e-03

Inverse propensity score weighting (IPW)

When the dimension of \(X_i\) is large it is handy to define the propensity score:

\[ p(X_i) = \mathbb{P}(D_i = 1 | X_i), \]

i.e. the probability of being treated given the covariates. This is useful thanks to two results in Rosenbaum and Rubin (1983). First, if selection on observables holds and \(A2: 0< p(X_i) < 1\) a.s. (Common Support), then \((Y(1)_i, Y(0)_i) \perp D_i | p(X_i)\), i.e. it is enough to control for \(p(X_i)\). Second, if \(p(X_i)\) is the propensity score then \(X_i \perp D_i | p(X_i)\), so the distribution of \(X_i\) does not vary across treatment status given the propensity score. One can show that under A1’ and A2

\[ \tau_0 = \mathbb{E}\biggl(\frac{D_i Y_i}{p(X_i)} - \frac{(1-D_i)Y_i}{1 - p(X_i)} \biggr). \] However, \(p(X_i)\) is generally unknown and has to be estimated. To this end we can impose a parametric model. A popular choice is

\[ p(X_i) = G(X_i' \beta_0), \] where \(G(\cdot)\) is the c.d.f of a distribution which is symmetric around 0 and \(\beta_0\) is a vector of parameters. Then the conditional likelihood of \(D_i|X_i\) is

\[ f(D_i|X_i; \beta) = G(X_i' \beta)^{D_i}[1-G(X_i' \beta)]^{1-D_i} \]

and the log-likelihood

\[ l_i(\beta) = D_i \log G(X_i' \beta) + (1 - D_i) \log[1 - G(X_i' \beta)] \] and \(\hat{\beta} = \arg \max_\beta \sum_{i=1}^n l_i(\beta)\). A popular choice of \(G(\cdot)\) is the logit model

\[ G(z) = \frac{e^z}{1 + e^z}, \]

for which it is simple to compute the score. To estimate the propensity score \(\hat{p}(X_i) = G(X_i' \hat{\beta})\) using a logit model in R we can do the following (following (Dehejia and Wahba 1999), we introduce as additional covariates age squared, education squared, married, and earnings in 1975 and its square)

obs_df$agesq <- obs_df$age^2
obs_df$educationsq <- obs_df$education^2
obs_df$re75sq <- obs_df$re75^2
logm = glm(treat ~ age + agesq + education + educationsq +
           married + re75 + re75sq + black + hispanic + nodegree,
             data = obs_df, family = "binomial")
obs_df$pscore <- logm$fitted.values
b <- as.matrix(logm$coefficients)

Now, we are ready to estimate \(\tau_0\) as

\[ \hat{\tau}^{IPW}_1 = \frac{1}{n}\sum_{i=1}^n \frac{D_i Y_i}{\hat{p}(X_i)} - \frac{1}{n}\sum_{i=1}^n\frac{(1-D_i)Y_i}{1-\hat{p}(X_i)} \] Noting that \(\mathbb{E}[D_i/p(X_i)] = 1\) by LIE and the same happens with \(\mathbb{E}[(1-D_i)/(1-p(X_i))]\), an asymptotically equivalent estimator is

\[ \hat{\tau}^{IPW}_2 = \biggl(\sum_{i=1}^n \frac{D_i}{\hat{p}(X_i)}\biggr)^{-1} \biggl(\sum_{i=1}^n \frac{D_i Y_i}{\hat{p}(X_i)}\biggr) - \biggl(\sum_{i=1}^n \frac{1-D_i}{1-\hat{p}(X_i)}\biggr)^{-1} \biggl(\sum_{i=1}^n\frac{(1-D_i)Y_i}{1-\hat{p}(X_i)}\biggr) \]

\(\hat{\tau}^{IPW}_1\) and \(\hat{\tau}^{IPW}_2\) are called Inverse Probability Weighting (IPW) estimators and are differences of weighted sums of \(Y_i\). In \(\hat{\tau}^{IPW}_2\) the weights sum up to \(1\), which gives better finite sample properties. In fact, both estimators are members of the broader class of consistent, semiparametric estimators studied in (Robins, Rotnitzky, and Zhao 1994). The estimator achieving the smallest asymptotic variance in this class is neither of the above but rather

\[ \hat{\tau}^{DR} = \frac{1}{n}\sum_{i=1}^n \frac{D_iY_i - \hat{m}_1(X_i)(D_i - \hat{p}(X_i))}{\hat{p}(X_i)} - \frac{1}{n}\sum_{i=1}^n \frac{(1-D_i)Y_i + \hat{m}_0(X_i)(D_i - \hat{p}(X_i))}{1 - \hat{p}(X_i)}, \]

where \(m_j(X_i) = \mathbb{E}[Y_i | D_i = j, X_i]\), \(j = 0,1\), and \(\hat{m}_j(X_i)\) is some estimator of it. This estimator is called doubly robust because it has the following attractive property: if we impose some parametric specification, say \(m_j(X_i) = m_j(X_i, \alpha_j)\), the estimator is still consistent if \(m_j\) is incorrectly specified but the specification of the propensity score is correct, and it is also consistent if the specification of the propensity score is incorrect but that of \(m_j\) is correct. This estimator is also called augmented IPW since it is “augmented” with the outcome regression. Now we are ready to do these estimations in R

Y <- obs_df$re78
D <- obs_df$treat
PS <- obs_df$pscore

ipw1 <- mean(D*Y/PS - (1-D)*Y/(1-PS))
ipw2 <- (1/(sum(D/PS)))*sum(D*Y/PS) - (1/(sum((1-D)/(1-PS))))*sum((1-D)*Y/(1-PS))

# Doubly Robust
m1 <- lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree, filter(obs_df,treat == 1))
m1hat <- predict(m1, obs_df)
m0 <- lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree, filter(obs_df,treat == 0))
m0hat <- predict(m0, obs_df)

ipwdr <- mean((D*Y - (D-PS)*m1hat)/PS) - mean(((1-D)*Y + (D-PS)*m0hat)/(1-PS))
c("IPW1" = ipw1, "IPW2" = ipw2, "DR" = ipwdr)
##     IPW1     IPW2       DR 
## 6769.465 1753.993 4111.064

We can see that the estimates are not very stable. This can happen when the common support assumption is violated, since estimated propensity scores close to zero in the denominators will produce instability. In our case it is very hard to maintain the common support assumption. To see this, let us do a histogram of the propensity scores by treatment status.

c1 <- rgb(173,216,230,max = 255, alpha = 180, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 180, names = "lt.pink")

par(mar = c(5, 5, 5, 5) + 0.3)
hist(filter(obs_df, treat == 1)$pscore, freq = TRUE, col = c1, axes = FALSE, xlab = "", ylab = "", main = "")
axis(side = 1, xlim = c(0,1))
axis(side = 4, ylab = "")
mtext(side = 4, text = "NSW", line = 2.5, col = "blue")
par(new=TRUE)
hist(filter(obs_df, treat == 0)$pscore, ylim = c(0,3000), freq = TRUE, axes = FALSE, col = c2, xlab = "", ylab = "", main = "Common Support")
axis(side = 2)
mtext(side = 2, text = "CPS", line = 2.5, col = "pink")

We see that the propensity scores of the control group (the CPS observations) are highly concentrated close to zero. This is because the NSW experiment chose disadvantaged individuals, who are not representative of the US population as the CPS is. Hence, we have a comparability issue. A practical solution is to trim the observations with very low (or high) propensity scores. For instance, if we trim observations with a propensity score lower than 0.1 or higher than 0.9, the Common Support assumption seems more plausible

obs_dftrim <- filter(obs_df, pscore > 0.1 & pscore < 0.9)

par(mar = c(5, 5, 5, 5) + 0.3)
hist(filter(obs_dftrim, treat == 1)$pscore, freq = TRUE, col = c1, axes = FALSE, xlab = "", ylab = "", main = "")
axis(side = 1, xlim = c(0,1))
axis(side = 4, ylab = "")
mtext(side = 4, text = "NSW", line = 2.5, col = "blue")
par(new=TRUE)
hist(filter(obs_dftrim, treat == 0)$pscore, freq = TRUE, axes = FALSE, col = c2, xlab = "", ylab = "", main = "Common Support")
axis(side = 2)
mtext(side = 2, text = "CPS", line = 2.5, col = "pink")

and the estimates become more stable.

Ytr <- obs_dftrim$re78
Dtr <- obs_dftrim$treat
PStr <- obs_dftrim$pscore

ipw1tr <- mean(Dtr*Ytr/PStr - (1-Dtr)*Ytr/(1-PStr))
ipw2tr <- (1/(sum(Dtr/PStr)))*sum(Dtr*Ytr/PStr) - (1/(sum((1-Dtr)/(1-PStr))))*sum((1-Dtr)*Ytr/(1-PStr))

# Doubly Robust
m1 <- lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree, filter(obs_dftrim,treat == 1))
m1hat <- predict(m1, obs_dftrim)
m0 <- lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree, filter(obs_dftrim,treat == 0))
m0hat <- predict(m0, obs_dftrim)

ipwdrtr <- mean((Dtr*Ytr - (Dtr-PStr)*m1hat)/PStr) - mean(((1-Dtr)*Ytr + (Dtr-PStr)*m0hat)/(1-PStr))
c("IPW1" = ipw1tr, "IPW2" = ipw2tr, "DR" = ipwdrtr)
##      IPW1      IPW2        DR 
## -865.2516 -631.5008 -847.0897

For the rest of this exercise let us restrict the sample to these observations

obs_df <- obs_dftrim
PS <- PStr
Y <- Ytr
D <- Dtr
ipw1 <- ipw1tr
ipw2 <- ipw2tr

Inference on IPW

\(\hat{\tau}_1^{IPW}\) and \(\hat{\tau}_2^{IPW}\) can be written as Z-estimators. Let us illustrate this for \(\hat{\tau}_1^{IPW}\): let \(Z_i = (Y_i,D_i,X_i)\) be the data and \(\theta_0 = (\tau_0, \beta')'\). Define

\[ q(Z_i, \theta) = \begin{pmatrix} \frac{D_i Y_i}{G(X_i' \beta)} - \frac{(1-D_i)Y_i}{1 - G(X_i' \beta)} - \tau \\ X_i(D_i - G(X_i'\beta)) \end{pmatrix} \] \(q(Z_i, \theta)\) is a \((k+1) \times 1\) vector function where \(k\) is the dimension of \(\beta\). The last \(k\) elements of \(q(Z_i, \theta)\) are the FOCs of the MLE from the Logit model (the score). Then, \(\mathbb{E}[q(Z_i, \theta_0)] = 0\) and \((\hat{\tau}^{IPW}_1, \hat{\beta})\) are chosen such that \(\sum_{i=1}^n q(Z_i, \hat{\theta}) = 0\). Let \(\dot{q}(Z_i, \theta)\) be the matrix of derivatives of \(q\) with respect to parameters in \(\theta\). By taking a mean value expansion around \(\theta_0\) and rearranging we can show that

\[ \sqrt{n}(\hat{\theta} - \theta_0) = - \biggl(n^{-1}\sum_{i=1}^n \dot{q}(Z_i, \bar{\theta})\biggr)^{-1} \biggl(n^{-1/2} \sum_{i=1}^n q(Z_i, \theta_0) \biggr) \to_d \mathcal{N}(0,\mathbb{E}[\dot{q}(Z_i,\theta_0)]^{-1} \mathbb{E}[q(Z_i, \theta_0)q(Z_i, \theta_0)']\mathbb{E}[\dot{q}(Z_i,\theta_0)]^{-1}), \]

where \(\bar{\theta}\) is some intermediate point between \(\hat{\theta}\) and \(\theta_0\). So we can estimate the variance as

\[ \hat{V}_1 = \biggl(n^{-1}\sum_{i=1}^n \dot{q}(Z_i,\hat{\theta})\biggr)^{-1} \biggl(n^{-1}\sum_{i=1}^nq(Z_i, \hat{\theta}) q(Z_i, \hat{\theta})' \biggr) \biggl(n^{-1}\sum_{i=1}^n \dot{q}(Z_i,\hat{\theta})\biggr)^{-1} \]

Noticing that \(\frac{d}{d u} G(u) = G(u)(1-G(u))\) we can compute the \((k + 1) \times (k +1)\) matrix \(\dot{q}\)

\[ \dot{q}(Z_i, \theta) = \begin{pmatrix} -1 & -X_i'\biggl(\frac{D_iY_i(1-G(X_i'\beta))}{G(X_i'\beta)} + \frac{(1-D_i)Y_iG(X_i'\beta)}{1-G(X_i' \beta)}\biggr) \\ 0 & -X_iX_i'G(X_i' \beta)(1-G(X_i'\beta)) \end{pmatrix} \]

So we can estimate the standard error of \(\hat{\tau}^{IPW}_1\)

obs_df$const <- rep(1,nrow(obs_df))
X <- as.matrix(dplyr::select(obs_df,c(const,age,agesq,education,educationsq,married,
                               re75,re75sq,black,hispanic,nodegree)))
A <- 0
B12 <- 0
B22 <- 0
G <- function(u){exp(u)/(1+exp(u))}
k <- ncol(X)

for (i in 1:nrow(X)){
  aux <-  matrix(c(D[i]*Y[i]/PS[i] - (1-D[i])*Y[i]/(1-PS[i]) - ipw1,
                    X[i,]*(D[i] - PS[i])),k+1,1)
  A <- A + aux%*%t(aux)
  B12 <- B12 - t(X[i,])*((D[i]*Y[i]*(1-PS[i]))/PS[i] + ((1-D[i])*Y[i]*PS[i])/(1-PS[i]))
  B22 <- B22 - (X[i,] %*% t(X[i,]))*(PS[i]*(1-PS[i]))
}
B <- (1/nrow(X))*rbind(cbind(-nrow(X),B12), cbind(rep(0,k),B22))
V <- inv(B) %*% ((1/nrow(X))*A) %*% inv(B)
se <- sqrt(V[1,1]/nrow(X))
c("IPW1" = ipw1, "SE" = se)
##      IPW1        SE 
## -865.2516  529.8508

Inference for \(\hat{\tau}_2^{IPW}\) and \(\hat{\tau}^{DR}\) follows similarly by framing the problem as a Z-estimator. With a little tweak in the doubly robust estimator we can get simpler inference.
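
Before introducing that tweak, note that a brute-force alternative is to bootstrap the whole DR procedure; the sketch below (an illustration, not the approach of the next section) re-estimates the logit and the two outcome regressions in every draw on the trimmed data.

# Bootstrap the doubly robust estimator (sketch, 200 draws)
set.seed(123)
Bdr <- 200
dr_boot <- sapply(1:Bdr, function(b){
  idx <- sample(1:nrow(obs_df), nrow(obs_df), replace = TRUE)
  db  <- obs_df[idx,]
  # re-estimate the propensity score
  psb <- glm(treat ~ age + agesq + education + educationsq + married +
               re75 + re75sq + black + hispanic + nodegree,
             data = db, family = "binomial")$fitted.values
  # re-estimate the outcome regressions
  m1b <- predict(lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree,
                    filter(db, treat == 1)), db)
  m0b <- predict(lm(re78 ~ poly(age,2) + education + black + hispanic + nodegree,
                    filter(db, treat == 0)), db)
  Db <- db$treat
  Yb <- db$re78
  mean((Db*Yb - (Db - psb)*m1b)/psb) - mean(((1-Db)*Yb + (Db - psb)*m0b)/(1 - psb))
})
sd(dr_boot)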

Locally Robust IPW

Note that we can rewrite \(\hat{\tau}^{DR}\) as

\[ \hat{\tau}^{DR} = \frac{1}{n}\sum_{i=1}^n \biggl( \hat{m}_1(X_i) - \hat{m}_0(X_i) + \frac{D_i}{\hat{p}(X_i)}[Y_i - \hat{m}_1(X_i)] - \frac{1-D_i}{1 - \hat{p}(X_i)}[Y_i - \hat{m}_0(X_i)] \biggr). \]

Note that this resembles the direct estimation of the ATE but with an added term; this added term is a correction which ensures local robustness. Now define

\[ \psi(Z_i, p, m_1, m_0, \tau) = m_1(X_i) - m_0(X_i) + \underbrace{\frac{D_i}{p(X_i)}[Y_i - m_1(X_i)] - \frac{(1-D_i)}{1 - p(X_i)}[Y_i - m_0(X_i)]}_{\phi(Z_i,p,m_1,m_0)} - \tau, \] and notice that \(\hat{\tau}^{DR}\) is the \(\tau\) which solves \(\sum_{i=1}^n \psi(Z_i, \hat{p}, \hat{m}_1, \hat{m}_0, \tau) = 0\). The \(\phi\) function is sometimes called the First Step Influence Function (FSIF, see (Chernozhukov et al. 2022)): it captures the effect of estimating the first-step nuisance functions and thereby makes the moment function locally robust. In other words, \(\psi\) is an orthogonal moment function, which means that, locally, estimating the nuisance functions \(p\), \(m_1\) and \(m_0\) has no impact. In fact we can even estimate them non-parametrically under mild rate conditions. Modifying the estimator slightly so that \(p\), \(m_1\) and \(m_0\) are estimated with observations different from the ones we use to estimate \(\tau_0\) (this is called cross-fitting) allows us to avoid over-fitting, to use many non-parametric estimators such as machine learners, and to do valid inference based only on the orthogonal moment function \(\psi\). Hence, we define the locally robust (LR) IPW estimator as a cross-fitted version of the above. Partition the observations into \(L\) sets denoted by \(I_1,...,I_L\); then

\[ \hat{\tau}^{LR} = \frac{1}{n} \sum_{l=1}^L \sum_{i \in I_l} \biggl(\hat{m}_{1,l}(X_i) - \hat{m}_{0,l}(X_i) + \frac{D_i}{\hat{p}_l(X_i)}[Y_i - \hat{m}_{1,l}(X_i)] - \frac{(1-D_i)}{1 - \hat{p}_l(X_i)}[Y_i - \hat{m}_{0,l}(X_i)] \biggr) \] To estimate it we split the data in two.

set.seed(123)
n <- nrow(obs_df)
L <- 2
ind <- split(seq(n), seq(n) %% L)

fv1 <- rep(0,n)
fv0 <- rep(0,n)
ps <- rep(0,n)
for (i in 1:L){ 
  mps <- ranger(treat ~ age + agesq + education + educationsq + married + re75 + re75sq + black +    
                   hispanic + nodegree, data = obs_df[ind[[i]],])
  mfv1 <- ranger(re78 ~ treat + age + agesq + education + black + hispanic + nodegree,
                data = filter(obs_df[ind[[i]],], treat == 1))
  mfv0 <- ranger(re78 ~ treat + age + agesq + education + black + hispanic + nodegree,
                data = filter(obs_df[ind[[i]],], treat == 0))
  
  fv1[-ind[[i]]] <- predict(mfv1, obs_df[-ind[[i]],])$predictions
  fv0[-ind[[i]]] <- predict(mfv0, obs_df[-ind[[i]],])$predictions
  ps[-ind[[i]]] <- predict(mps, obs_df[-ind[[i]],])$predictions

}

ipwlr_sc <- fv1 - fv0 + (D/ps)*(Y-fv1) - ((1-D)/(1-ps))*(Y-fv0)
ipwlr <- mean(ipwlr_sc)
se_lr <- sd(ipwlr_sc)/sqrt(n)
c(ipwlr = ipwlr, se = se_lr)
##     ipwlr        se 
## -703.2693  847.9069
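
Since the standard error comes directly from the orthogonal scores, a 95% confidence interval based on the normal approximation is immediate:

# Normal-approximation 95% confidence interval for the LR estimate
ipwlr + c(-1, 1)*qnorm(0.975)*se_lr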

We still get a negative effect and very large standard errors. In general this data seems to be very ill suited for estimating treatment effects since the controls are wildly different from the treated.

Packages and other sources

For a book on treatment effects one can read (Angrist and Pischke 2009); a good reference on IPW is (Lunceford and Davidian 2004). The theory behind the logistic regression is that of conditional MLE and can be found for instance in (Wooldridge 2002). A similar tutorial to this one can be found here. Also, there are several packages with functions to perform the estimations above. An example of such a package is PSweight (Zhou et al. 2020), where the package is compared to other similar packages.

References

Angrist, Joshua D, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton university press.
Chernozhukov, Victor, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M Robins. 2022. “Locally Robust Semiparametric Estimation.” Econometrica 90 (4): 1501–35.
Dehejia, Rajeev H, and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review, 604–20.
Lunceford, Jared K, and Marie Davidian. 2004. “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study.” Statistics in Medicine 23 (19): 2937–60.
Robins, James M, Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Zhou, Tianhui, Guangyu Tong, Fan Li, and Laine E Thomas. 2020. “PSweight: An r Package for Propensity Score Weighting Analysis.” arXiv Preprint arXiv:2010.08893.