
class: center, middle, inverse, title-slide

.title[
# Matching
]
.subtitle[
## EC 607, Set 8
]
.author[
### Edward Rubin
]

---

class: inverse, middle

$$
`\begin{align}
  \def\ci{\perp\mkern-10mu\perp}
\end{align}`
$$

# Prologue

---
name: schedule

# Schedule

## Last time(s)

- DAGs
- The conditional independence assumption: `$\left( \text{Y}_{0i},\, \text{Y}_{1i}\right) \ci \text{D}_{i}\big| \text{X}_{i}$`
- Omitted variable bias
- Good *vs.* bad controls

## Today

- Matching estimators (*MHE* 3.2 and Cameron and Trivedi 25.4).
- Probably time for another problem set

---
layout: true
# Matching

---
class: inverse, middle

---
name: gist
## The gist

Remember the .hi[conditional independence assumption].super[.pink[†]], _i.e._, that treatment is as-good-as random conditional on a known set of covariates?

.footnote[.pink[†] AKA "selection on observables"]

.hi[Matching estimators] take us at our word.

If we really believe `$\left(\text{Y}_{1i},\, \text{Y}_{0i} \right)\ci \text{D}_{i}|\text{X}_{i}$`, then we can just calculate a bunch of treatment effects conditional on `$\text{X}_{i}$`, _i.e._,
$$
`\begin{align}
  \tau(x) = \mathop{E}\left[ \text{Y}_{1i} - \text{Y}_{0i} \mid \text{X}_{i} = x \right]
\end{align}`
$$

.note[The idea:] Estimate a treatment effect only using observations with (nearly?) identical values of `$\text{X}_{i}$`.
--
 The CIA buys us causality within these groups.

---
name: goals
## Goals

Let's return to .b[the fundamental problem of causal inference] for a moment.

1. We want/need to know `$\tau_i = \text{Y}_{1i} - \text{Y}_{0i}$`.
2. We cannot simultaneously observe *both* `$\text{Y}_{1i}$` *and* `$\text{Y}_{0i}$`.

Most (all?) empirical strategies boil down to estimating `$\text{Y}_{0i}$` for treated individuals, _i.e._, the unobservable counterfactual for the treatment group.

Matching is no different.

We match untreated observations to treated observations using `$\text{X}_{i}$`, _i.e._, calculate a `$\widehat{\text{Y}_{0i}}$` for each `$\text{Y}_{1i}$`, based upon "matched" untreated individuals.

---
## More formally

We want to construct a counterfactual for each individual with `$\text{D}_{i}=1$`.

.note[CIA:] The counterfactual for `$i$` should only use individuals that match `$\text{X}_{i}$`.

Let there be `$N_T$` treated individuals and `$N_C$` control individuals. We want
- `$N_T$` sets of weights
- with `$N_C$` weights in each set
--
: `$w_i(j)\, \left( i = 1,\,\ldots,\, N_T;\, j=1,\,\ldots,\, N_C \right)$`

Assume `$\sum_j w_i(j) = 1$`. Our estimate for the counterfactual of treated `$i$` is
$$
`\begin{align}
  \widehat{\text{Y}_{0i}} = \sum_{j\in \left( D=0 \right)} w_i(j) \text{Y}_{j}
\end{align}`
$$
---
name: generic
## More formally

If our estimated counterfactual for treated individual `$i$` is
$$
`\begin{align}
  \widehat{\text{Y}_{0i}} = \sum_j w_i(j) \text{Y}_{j}
\end{align}`
$$
then our estimated treatment effect (for individual `$i$`) is
$$
`\begin{align}
  \hat{\tau}_i = \text{Y}_{1i} - \widehat{\text{Y}_{0i}} = \text{Y}_{1i} - \sum_j w_i(j) \text{Y}_{j}
\end{align}`
$$

∴ a generic matching estimator for the .pink[treatment effect on the treated] is
$$
`\begin{align}
  \hat{\tau}_M = \dfrac{1}{N_T} \sum_{i \in \left( \text{D}=1 \right)} \left( \text{Y}_{1i} - \widehat{\text{Y}_{0i}} \right) = \dfrac{1}{N_T} \sum_{i \in \left( \text{D}=1 \right)} \left( \text{Y}_{1i} - \sum_{j\in \left( D=0 \right)} w_i(j) \text{Y}_{j} \right)
\end{align}`
$$
---
name: weights
## Weight for it.super[.pink[†]]

So all we need is those weights and we're done..super[.pink[††]]

.footnote[
.pink[†] 🤦 .pink[††] Plus an interesting, policy-relevant setting with valid conditional independence. And data.
]

.qa[Q] Where does one find these handy weights?

.qa[A] You've got options, but you need to choose carefully/responsibly.

*E.g.*, if `$w_i(j) = \frac{1}{N_C}$` for all `$(i,j)$`, then we're back to a difference in means.
 This weighting doesn't abide by our conditional independence assumption.

.note[The plan] Choose weights `$w_i(j)$` that indicate .hi-slate[*how close*] `$\text{X}_{j}$` is to `$\text{X}_{i}$`.
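
.note[Sketch] The generic estimator and the uniform-weight special case can be illustrated in a few lines of Python. This is only a minimal sketch on made-up numbers; `matching_att` and the data are hypothetical names of ours, not from any package.

```python
# Generic matching estimator for the treatment effect on the treated.
# All names and data are hypothetical illustrations.

def matching_att(y_treated, y_control, weights):
    """Average of Y_1i minus the weighted control outcomes.

    weights[i][j] plays the role of w_i(j): the weight treated unit i
    places on control unit j. Each row is assumed to sum to one.
    """
    effects = []
    for y1, w_i in zip(y_treated, weights):
        y0_hat = sum(w * y for w, y in zip(w_i, y_control))  # estimated counterfactual
        effects.append(y1 - y0_hat)
    return sum(effects) / len(effects)

y_treated = [10.0, 12.0]
y_control = [7.0, 9.0, 11.0]

# Uniform weights w_i(j) = 1 / N_C for every (i, j)
uniform = [[1 / len(y_control)] * len(y_control) for _ in y_treated]
tau_uniform = matching_att(y_treated, y_control, uniform)
print(tau_uniform)
```

With uniform weights the result matches the raw difference in means (11 minus 9), which is why these weights ignore `$\text{X}_{i}$` entirely.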
---
name: discrete
## Proximity

Our weights `$w_i(j)$` should be a measure of .hi-slate[*how close*] `$\text{X}_{j}$` is to `$\text{X}_{i}$`.

If `$\text{X}$` is .hi-pink[discrete], then we can consider equality, _i.e._, `$w_i(j) = \mathbb{I}(\text{X}_{i} = \text{X}_{j})$`, scaling as necessary to get `$\sum_j w_i(j) = 1$`.
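
.note[Sketch] Exact matching on a discrete covariate might look like the following Python. The toy data and the function name `exact_match_att` are assumptions for illustration, and treated units without an exact match are simply skipped here (one of several possible choices).

```python
# Exact matching on a discrete covariate (hypothetical toy data).

def exact_match_att(treated, control):
    """treated and control are lists of (x, y) pairs with discrete x.

    For treated unit i, w_i(j) = 1[X_i == X_j], rescaled to sum to one,
    so the counterfactual is the mean outcome of controls sharing X_i.
    Treated units with no exact match are skipped (no overlap for them).
    """
    effects = []
    for x_i, y_i in treated:
        matches = [y_j for x_j, y_j in control if x_j == x_i]
        if not matches:
            continue
        effects.append(y_i - sum(matches) / len(matches))
    return sum(effects) / len(effects)

treated = [("a", 5.0), ("b", 9.0)]
control = [("a", 3.0), ("a", 4.0), ("b", 6.0)]
tau_exact = exact_match_att(treated, control)
print(tau_exact)
```

Here the two treated units have effects 5 − 3.5 and 9 − 6, so the estimate is their average, 2.25.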

---
name: nn-euclidean
## Proximity

Our weights `$w_i(j)$` should be a measure of .hi-slate[*how close*] `$\text{X}_{j}$` is to `$\text{X}_{i}$`.

If `$\text{X}$` is .hi-purple[continuous], then we need .it[proximity] rather than .it[equality].

.purple[*Nearest-neighbor* matching] chooses the single closest control observation using the (squared) Euclidean distance between `$\text{X}_{i}$` and `$\text{X}_{j}$`, _i.e._,

$$
`\begin{align}
  \text{d}_{i,j} = \left( \text{X}_{i} - \text{X}_{j} \right)'\left(\text{X}_{i} - \text{X}_{j}\right)
\end{align}`
$$

- `$\hat{\tau}_i = \text{Y}_{1i} - \text{Y}_{0j}^i$`, where `$\text{Y}_{0j}^i$` is the outcome of `$i$`'s nearest neighbor in the control group.
- .hi-slate[Estimator:] `$\hat{\tau}_M = \frac{1}{N_T} \sum_i \hat{\tau}_i$`
- Produces causal estimates if CIA is valid *and* we have sufficient overlap.
- Suffers from arbitrary choices of units.
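
.note[Sketch] The units problem is easy to see in a small Python example. The covariates (age, income) and all names below are hypothetical; the only change between the two runs is whether income is measured in dollars or in thousands of dollars.

```python
# Nearest-neighbor matching with squared Euclidean distance (toy data).

def sq_euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_control(x_i, controls):
    """Index of the control whose covariates are closest to x_i."""
    return min(range(len(controls)), key=lambda j: sq_euclidean(x_i, controls[j]))

def in_thousands(x):
    """Rescale income (second covariate) from dollars to thousands."""
    return (x[0], x[1] / 1000.0)

# Covariates: (age in years, income in dollars)
x_treated = (30.0, 50_000.0)
controls = [(31.0, 51_000.0), (50.0, 50_100.0)]

match_dollars = nearest_control(x_treated, controls)
match_thousands = nearest_control(
    in_thousands(x_treated), [in_thousands(c) for c in controls]
)
print(match_dollars, match_thousands)
```

In dollars, income dominates the distance and the 50-year-old (index 1) is the match; in thousands of dollars, the 31-year-old (index 0) is the match. Same data, different neighbor.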

---
name: nn-mahalanobis
## Proximity

Our weights `$w_i(j)$` should be a measure of .hi-slate[*how close*] `$\text{X}_{j}$` is to `$\text{X}_{i}$`.

If `$\text{X}$` is .hi-purple[continuous], then we need .it[proximity] rather than .it[equality].

.purple[*Nearest-neighbor* matching with Mahalanobis distance] chooses the single closest control using .purple[Mahalanobis] distance between `$\text{X}_{i}$` and `$\text{X}_{j}$`, _i.e._,

$$
`\begin{align}
  \text{d}_{i,j} = \left( \text{X}_{i} - \text{X}_{j} \right)' \Sigma_{X}^{-1} \left(\text{X}_{i} - \text{X}_{j}\right)
\end{align}`
$$
where `$\Sigma_{X}$` is the covariance matrix of `$\text{X}$` (so `$\Sigma_{X}^{-1}$` is its inverse).

- .hi-slate[Estimator:] `$\hat{\tau}_M = \frac{1}{N_T} \sum_i \hat{\tau}_i$` where `$\left(\hat{\tau}_i = \text{Y}_{1i} - \text{Y}_{0j}^i\right)$`
- Produces causal estimates if CIA is valid *and* we have sufficient overlap.
- Does not suffer from arbitrary choices of units.
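
.note[Sketch] A two-covariate Python illustration of the unit invariance. The data and helper names are hypothetical, the covariance is computed from the pooled toy sample (an assumption), and the `$2\times 2$` inverse is written out by hand to avoid any dependencies.

```python
# Nearest-neighbor matching with Mahalanobis distance, two covariates.

def covariance_2d(points):
    """Sample covariance terms (s00, s01, s11) for two covariates."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    s00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    s11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    s01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    return s00, s01, s11

def mahalanobis_sq(a, b, cov):
    """(a - b)' Sigma^{-1} (a - b), using the closed-form 2x2 inverse."""
    s00, s01, s11 = cov
    det = s00 * s11 - s01 ** 2
    d0, d1 = a[0] - b[0], a[1] - b[1]
    return (s11 * d0 ** 2 - 2 * s01 * d0 * d1 + s00 * d1 ** 2) / det

def nearest_control(x_i, controls, cov):
    return min(range(len(controls)),
               key=lambda j: mahalanobis_sq(x_i, controls[j], cov))

def in_thousands(x):
    return (x[0], x[1] / 1000.0)

# Covariates: (age in years, income in dollars)
x_treated = (30.0, 50_000.0)
controls = [(31.0, 51_000.0), (50.0, 50_100.0), (40.0, 60_000.0), (25.0, 40_000.0)]

cov_dollars = covariance_2d([x_treated] + controls)
match_dollars = nearest_control(x_treated, controls, cov_dollars)

scaled = [in_thousands(c) for c in controls]
cov_thousands = covariance_2d([in_thousands(x_treated)] + scaled)
match_thousands = nearest_control(in_thousands(x_treated), scaled, cov_thousands)
print(match_dollars, match_thousands)
```

Rescaling income changes `$\Sigma_{X}$` in exactly the way that offsets the change in `$\text{X}$`, so the chosen neighbor is the same in both runs.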
---
## More neighbors?

Why limit ourselves to a .b[single] "best" match?

If we're going to let a function/algorithm choose the *nearest* match, can't we also let the function/algorithm choose *how many* matches?

Furthermore, if `$N_C \gg N_T$`, then we're throwing away *a lot* of information.

We could instead use this information and be more efficient.

---
name: kernel
## More neighbors!

.purple[Kernel matching] gives positive weight to all control observations within some .hi-slate[bandwidth] `$h$`, with higher weight for closer matches determined by some .hi-slate[kernel function] `$K(\cdot)$`,
$$
`\begin{align}
  w_i(j) = \dfrac{K\!\!\left( \dfrac{\text{X}_{j} - \text{X}_{i}}{h} \right)}{\sum_{k\in(D=0)} K\!\!\left(\dfrac{\text{X}_{k} - \text{X}_{i}}{h} \right)}
\end{align}`
$$

.ex[Example] The *Epanechnikov kernel* is defined as
$$
`\begin{align}
 K(z) = \dfrac{3}{4} \left( 1 - z^2 \right) \times \mathbb{I}\!\left( |z| < 1 \right)
\end{align}`
$$
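
.note[Sketch] Kernel-matching weights for a single treated unit with a scalar covariate, written in plain Python. The data, bandwidth, and function names are assumptions chosen for illustration.

```python
# Kernel-matching weights with the Epanechnikov kernel (toy data).

def epanechnikov(z):
    return 0.75 * (1 - z ** 2) if abs(z) < 1 else 0.0

def kernel_weights(x_i, x_controls, h, kernel=epanechnikov):
    """Weights w_i(j) over controls, normalized to sum to one."""
    raw = [kernel((x_j - x_i) / h) for x_j in x_controls]
    total = sum(raw)
    if total == 0:
        raise ValueError("no control observations within the bandwidth")
    return [k / total for k in raw]

x_controls = [1.0, 1.5, 2.0, 4.0]
weights = kernel_weights(2.0, x_controls, h=1.5)
print(weights)
```

The control at 4.0 lies outside the bandwidth and gets zero weight; the remaining weights rise as controls get closer to the treated unit at 2.0.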

---
layout: false
class: clear

.hi-orange[The Epanechnikov kernel] `$K(z) = \frac{3}{4} \left( 1 - z^2 \right) \times \mathbb{I}\!\left( |z| < 1 \right)$`

<img src="08-matching_files/figure-html/epanechnikov-1.svg" style="display: block; margin: auto;" />
---
layout: false
class: clear

---
class: clear
count: false

<img src="08-matching_files/figure-html/ex_weights-1.svg" style="display: block; margin: auto;" />
---
layout: false
class: clear

.hi-orange[The Epanechnikov kernel] `$K(z) = \frac{3}{4} \left( 1 - z^2 \right) \times \mathbb{I}\!\left( |z| < 1 \right)$`

<img src="08-matching_files/figure-html/epanechnikov2-1.svg" style="display: block; margin: auto;" />
---
layout: false
class: clear

.hi-orange[The Triangle kernel] `$K(z) = \left( 1 - |z| \right) \times \mathbb{I}\!\left( |z| < 1 \right)$`

<img src="08-matching_files/figure-html/triangle-1.svg" style="display: block; margin: auto;" />
---
layout: false
class: clear

.hi-orange[The Uniform kernel] `$K(z) = \frac{1}{2} \times \mathbb{I}\!\left( |z| < 1 \right)$`

<img src="08-matching_files/figure-html/uniform-1.svg" style="display: block; margin: auto;" />
---
layout: false
class: clear

.hi-orange[The Gaussian kernel] `$K(z) = \left( 2\pi \right)^{-1/2}  \exp\left(-z^2/2 \right)$`

<img src="08-matching_files/figure-html/gaussian-1.svg" style="display: block; margin: auto;" />
---
# Kernels
## Aside

Kernel functions are good for more than just matching.

You will most commonly see/use them smoothing out densities—providing a smooth, moving-window average.

_E.g._, .mono[R]'s (`ggplot2`'s) smooth, density-plotting function `geom_density()`.

`geom_density()` defaults to `kernel = "gaussian"`, but you can specify many other kernel functions (including `"epanechnikov"`).

You can also change the bandwidth via the `bw` argument. The default (`bw = "nrd0"`) uses a bandwidth-choosing function called `bw.nrd0()`.
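
.note[Sketch] To see what a kernel density estimate is doing under the hood, here is a hand-rolled version in Python. It is not `geom_density()`: the bandwidth is fixed by hand, the data are made up, and `kde` and `integrate` are hypothetical helpers of ours.

```python
import math

# The kernels from the previous slides
def uniform(z):
    return 0.5 if abs(z) < 1 else 0.0

def triangle(z):
    return 1 - abs(z) if abs(z) < 1 else 0.0

def epanechnikov(z):
    return 0.75 * (1 - z ** 2) if abs(z) < 1 else 0.0

def gaussian(z):
    return math.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)

def kde(x, data, h, kernel=gaussian):
    """Kernel density estimate at x: an average of kernels centered on the data."""
    return sum(kernel((x - x_i) / h) for x_i in data) / (len(data) * h)

def integrate(f, lo, hi, steps=9000):
    """Midpoint-rule approximation of the integral of f on [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width

data = [1.0, 1.2, 2.5, 3.1, 3.3]
area = integrate(lambda x: kde(x, data, 0.5, epanechnikov), -2.0, 7.0)
print(area)
```

Because each kernel integrates to one, the estimated density does too (up to numerical error), whichever kernel and bandwidth we pick.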
---
layout: true
# Matching

---
## Adding neighbors

As we add more neighbors—either moving from `$1$` to `$n>1$` or increasing our bandwidth—we potentially increase the efficiency of our estimator.

We need to .hi[be careful not to add *too many* controls] for each treated `$i$`.

CIA requires that we're actually conditioning on the observables—it does not allow us to take a simple average across all control observations.
---
## The curse of dimensionality.super[.pink[†]]

.footnote[.pink[†] I'm not sure if this is a title for Harry Potter or Indiana Jones... crossover anyone?]

It turns out kernel- and bandwidth-selection are not our biggest enemies.

As the dimension of `$\text{X}$` expands (matching on more variables), it becomes .hi[harder and harder to find a nice, close control] for each treated unit.

We need a way to shrink the dimensionality of `$\text{X}$`.
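
.note[Sketch] A quick simulation makes the point. Everything here is an assumption for illustration: covariates are drawn uniformly on `$[0,1]$`, sample sizes are fixed at 50 treated and 200 controls, and `mean_nn_distance` is a hypothetical helper.

```python
import math
import random

def mean_nn_distance(dim, n_treated=50, n_control=200, seed=0):
    """Average Euclidean distance from each treated unit to its nearest control,
    with every covariate drawn independently and uniformly on [0, 1]."""
    rng = random.Random(seed)
    controls = [[rng.random() for _ in range(dim)] for _ in range(n_control)]
    treated = [[rng.random() for _ in range(dim)] for _ in range(n_treated)]
    nearest = [min(math.dist(t, c) for c in controls) for t in treated]
    return sum(nearest) / len(nearest)

distances = {dim: mean_nn_distance(dim) for dim in (1, 5, 20)}
print(distances)
```

Holding the sample size fixed, the "nearest" neighbor drifts farther away as we match on more variables.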
---
layout: true
# Propensity-score methods

---
class: inverse, middle
---
name: setup
## Setup

Let's begin with two assumptions—one old and one new.

1. .hi-purple[Conditional independence:] `$\left( \text{Y}_{0i},\, \text{Y}_{1i} \right) \ci \text{D}_{i}|\text{X}_{i}$`

2. .hi-purple[Overlap:] `$0 < \mathop{\text{Pr}}\left(\text{D}_{i} = 1 \mid \text{X}_{i}\right) < 1$`

We can estimate an average treatment effect by conditioning on `$\text{X}_{i}$`.

However, overlap may fail in practice if the dimension of `$\text{X}$` is large and `$N$` is finite.

.hi[Propensity scores] propose a solution to this mess.

---
name: magic
## The magic

It turns out that if `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|\text{X}_{i},\,$` then we actually only need to match/condition on `$p(\text{X}_{i}) = \mathop{E}\left[ \text{D}_{i} | \text{X}_{i} \right]$`.

`$p(\text{X}_{i})$` is the .attn[propensity score]
--
, the probability of treatment given `$\text{X}_{i}.$`

.attn[Propensity-score theorem] If `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|\text{X}_{i},\,$` then `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|p(\text{X}_{i}).$`

This theorem extends our CIA to a one-dimensional score, avoiding the curse of dimensionality.
---
layout: true
# Propensity-score methods

.note[Theorem] If `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|\text{X}_{i},\,$` then `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|p(\text{X}_{i}).$`

## Proof

---
name: proof

To prove this theorem, we will show `$\mathop{\text{Pr}}\left(\text{D}_{i}=1 \mid \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\right) = p(\text{X}_{i})$`, _i.e._, `$\text{D}_{i}$` is independent of `$\left( \text{Y}_{0i},\, \text{Y}_{1i} \right)$` after conditioning on `$p(\text{X}_{i})$`.

---
count: false

`$\mathop{\text{Pr}}\!\bigg[\text{D}_{i}=1 \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`

--
.pad-left[
`$=\mathop{E}\!\bigg[\text{D}_i \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`
]

--
.pad-left[
`$=\mathop{E}\!\bigg[ \mathop{E}\!\bigg(\text{D}_i \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i}),\, \text{X}_{i} \bigg) \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`
]

--
.pad-left[
`$=\mathop{E}\!\bigg[ \mathop{E}\!\bigg(\text{D}_i \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, \text{X}_{i} \bigg) \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`
]
---

`$\mathop{\text{Pr}}\!\bigg[\text{D}_{i}=1 \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]= \cdots =\mathop{E}\!\bigg[ \mathop{E}\!\bigg(\text{D}_i \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, \text{X}_{i} \bigg) \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`

--
.pad-left[
`$=\mathop{E}\!\bigg[ \mathop{E}\!\bigg(\text{D}_i \bigg| \text{X}_{i} \bigg) \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`
]

--
.pad-left[
`$=\mathop{E}\!\bigg[ p(\text{X}_{i}) \bigg| \text{Y}_{0i},\, \text{Y}_{1i},\, p(\text{X}_{i})\bigg]$`
]

--
.pad-left[
`$=p(\text{X}_{i})$`
]

∴ `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|\text{X}_{i} \implies \left( \text{Y}_{0i},\,\text{Y}_{1i} \right) \ci \text{D}_{i}|p(\text{X}_{i})$` .orange[✔]
---
layout: true
# Propensity-score methods

---
name: intuition
## Intuition

.qa[Q] What's going on here?

`$\text{X}_{i}$` carries way more information than `$p(\text{X}_{i})$`, so how can we still get conditional independence of treatment by only conditioning on `$p(\text{X}_{i})$`?

.qa[A].sub[.pink[1]] Conditional independence of treatment isn't about extracting all of the information possible from `$\text{X}_{i}$`. We actually only care about creating a situation in which `$\text{D}_{i}|$`something is independent of `$\left( \text{Y}_{0i},\,\text{Y}_{1i} \right)$`.

.qa[A].sub[.pink[2]] Back to our main concern: .hi[selection bias]. People select into treatment. If `$\text{X}$` says two people were equally likely to be treated, and if `$\text{X}_{i}$` explains all of selection (CIA), then there cannot be selection between these two people.
---
name: estimation
## Estimation

So where do propensity scores come from?

We estimate them—and there are a lot of ways to do that.

1. Flexible (_i.e._, interactions) logit specification
2. Kernel regression (remember kernel functions?)
3. Many others—machine learning, series-logit estimator, *etc.*

.qa[Q] Can we just use plain OLS (linear probability model)?

.qa[A] Sort of. Think about FWL. This route is going to be the same as a regression conditioning on `$\text{X}_{i}$`.
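
.note[Sketch] For intuition, a logit propensity score can be fit with nothing more than gradient ascent on the log-likelihood. This Python sketch uses made-up data, a single covariate, and a hand-picked step size; in practice you would use a statistics package and a more flexible specification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(rows, d, lr=0.1, steps=5000):
    """Maximize the logit log-likelihood by plain gradient ascent.

    rows: covariate rows that already include a leading 1 (the intercept).
    """
    beta = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, d_i in zip(rows, d):
            resid = d_i - sigmoid(sum(b * x for b, x in zip(beta, row)))
            for m, x in enumerate(row):
                grad[m] += resid * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

x = [-2, -1, -1, 0, 0, 1, 1, 2]
d = [0, 0, 1, 0, 1, 0, 1, 1]
# Add columns such as x_i**2 or interactions here for a more flexible model
rows = [[1.0, float(x_i)] for x_i in x]

beta_hat = fit_logit(rows, d)
p_hat = [sigmoid(sum(b * v for b, v in zip(beta_hat, row))) for row in rows]
print(beta_hat, p_hat)
```

The fitted `p_hat` values are the estimated propensity scores `$\hat{p}(\text{X}_{i})$` that the remaining slides take as given.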

---
## Estimation

From *MHE* (p. 83)

.qa[Question]

> A big question here is how to best model and estimate `$p(\text{X}_{i})$`...

.qa[Answer]

> The answer to this is inherently application-specific. A growing empirical literature suggests that a logit model for the propensity score with a few polynomial terms in continuous covariates works well in practice...
---
name: application
## Application

So you have some estimated propensity scores `$\hat{p}(\text{X}_{i})$`. What next?

.note[Option 1] Conditioning via regression

.note[Option 1a] Use a .b[regression to condition] on `$p(\text{X}_{i})$`, _i.e._,
$$
`\begin{align}
   \text{Y}_{i} = \alpha + \delta \text{D}_{i} + \beta p(\text{X}_{i}) + u_i \tag{1a}
\end{align}`
$$

.note[Option 1b] If we think treatment effects are heterogeneous and may covary with `$\text{X}$`, then we might want to also .b[interact] treatment with `$p(\text{X}_{i})$`, _i.e._,
$$
`\begin{align}
   \text{Y}_{i} = \alpha + \delta_1 \text{D}_{i} + \delta_2 \text{D}_{i} p(\text{X}_{i}) + \beta p(\text{X}_{i}) + u_i \tag{1b}
\end{align}`
$$
---
name: heterogeneity
## Heterogeneity with regression

Let's think a bit more about heterogeneous treatment effects in this setting.
$$
`\begin{align}
  \text{Y}_{0i} &= \alpha + \beta \text{X}_{i} + u_i \\
  \text{Y}_{1i} &= \text{Y}_{0i} + \delta_1 + \delta_2 \text{X}_{i}
\end{align}`
$$
_i.e._, the treatment effect depends upon `$\text{X}_{i}$`.

`$\text{Y}_{i} = \text{D}_{i}\text{Y}_{1i} + \left( 1 - \text{D}_{i} \right) \text{Y}_{0i}$`

--
.pad-left[
`$= \text{D}_{i}\bigg( \text{Y}_{0i} + \delta_1 + \delta_2 \text{X}_{i} \bigg) + \left( 1 - \text{D}_{i} \right) \text{Y}_{0i}$`
]

--
.pad-left[
`$= \text{Y}_{0i} + \delta_1 \text{D}_{i} + \delta_2 \text{D}_{i} \text{X}_{i}$`
]

--
.pad-left[
`$= \alpha + \delta_1 \text{D}_{i} + \delta_2 \text{D}_{i} \text{X}_{i} + \beta \text{X}_{i} + u_i$`
]

---
## Heterogeneity

This final equation
$$
`\begin{align}
  \text{Y}_{i} = \alpha + \delta_1 \text{D}_{i} + \delta_2 \text{D}_{i} \text{X}_{i} + \beta \text{X}_{i} + u_i
\end{align}`
$$

suggests that we want `$p(\text{X}_{i})$` *and* `$\text{D}_{i}p(\text{X}_{i})$`, _i.e._,
$$
`\begin{align}
   \text{Y}_{i} = \alpha + \delta_1 \text{D}_{i} + \delta_2 \text{D}_{i} p(\text{X}_{i}) + \beta p(\text{X}_{i}) + u_i \tag{1b}
\end{align}`
$$

which yields
1. a .hi-slate[group-specific treatment effect] `$\delta_1 + \delta_2 p(\text{X}_{i})$` for each `$\text{X}_{i}$`

2. an .hi-slate[average treatment effect] `$\delta_1 + \delta_2 \overline{p}(\text{X}_{i})$`
---
## More flexibility

We motivated propensity scores with a desire to reduce dimensionality and estimate/choose/assume fewer parameters.

Adding `$p(\text{X}_{i})$` and `$\text{D}_{i}p(\text{X}_{i})$` as covariates in a linear regression doesn't quite exhaust our potential for flexible/nonparametric estimation.

---
name: blocking
## Blocking

.note[Option 2] Block (stratify) on propensity scores.

1. Divide the range of `$\hat{p}(\text{X}_{i})$` into `$K$` blocks (_e.g._, 0.05-wide blocks).

1. Place each observation into a block via its `$\hat{p}(\text{X}_{i})$`.

1. Calculate `$\hat{\tau}_k$` for each block via difference in means.

1. Average the `$\hat{\tau}_k$` using their shares of the sample, _i.e._,

$$
`\begin{align}
  \hat{\tau}_\text{Block} = \sum_{k = 1}^K  \hat{\tau}_k \dfrac{N_{1k} + N_{0k}}{N}
\end{align}`
$$

.note[Note] Blocking is similar to NN/kernel matching using `$p(\text{X}_{i})$` as distance.
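
.note[Sketch] The four steps can be written out in Python as follows. The data are made up, the blocks are 0.5 wide (rather than 0.05) so that the toy sample fills them, and `blocking_estimate` is a hypothetical name.

```python
# Blocking on estimated propensity scores (toy data).

def blocking_estimate(p_hat, d, y, width=0.5):
    """Difference in means within propensity-score blocks, averaged using
    each block's share of the sample. Raises if a block lacks treated or
    control units, because tau_k is undefined there."""
    n_blocks = round(1 / width)
    blocks = {}
    for p, d_i, y_i in zip(p_hat, d, y):
        k = min(int(p / width), n_blocks - 1)  # block index for this unit
        blocks.setdefault(k, {0: [], 1: []})[d_i].append(y_i)
    estimate = 0.0
    for k, groups in blocks.items():
        if not groups[0] or not groups[1]:
            raise ValueError(f"block {k} has no overlap")
        tau_k = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        estimate += tau_k * (len(groups[0]) + len(groups[1])) / len(y)
    return estimate

p_hat = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
d = [0, 1, 0, 1, 0, 1, 1]
y = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0, 7.0]
tau_block = blocking_estimate(p_hat, d, y)
print(tau_block)
```

The low-score block has `$\hat{\tau}_k = 1.5$` with three observations and the high-score block has `$\hat{\tau}_k = 3$` with four, so the estimate is `$(1.5 \times 3 + 3 \times 4)/7 \approx 2.36$`.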
---
## Choosing blocks

Blocking on propensity scores requires defining blocks.

One common route involves some iteration.

1. .hi[Choose blocks].

1. Check the .hi[balance of the covariates] within each block..super[.pink[†]]

- If covariates are .pink[not balanced], then split your blocks and repeat.

- If covariates are .pink[balanced], then stop.

.footnote[.pink[†] Keep multiple-hypothesis testing in mind. With many covariates and many blocks, you are bound to find statistically significant relationships—even if you are balanced in truth.]

---
## Overlap

Blocking emphasizes our overlap assumption, _i.e._, `$0<\mathop{\text{Pr}}\left(\text{D}_{i} = 1 \mid \text{X}_{i}\right)<1$`.

If a block contains zero treated/control units, we cannot calculate `$\hat{\tau}_k$`.

.attn[Caution] Logit can hide violations—it forces `$0 < \hat{p}(\text{X}_{i}) < 1$`.

.note[Common practice] Empirically enforce overlap:

- Drop control units with `$\hat{p}(\text{X}_{i})$` below the minimum propensity score in the treatment group.

- Drop treated units with `$\hat{p}(\text{X}_{i})$` above the maximum propensity score in the control group.
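
.note[Sketch] The two trimming rules above, in Python. The scores are made up and `enforce_overlap` is a hypothetical helper that returns the indices of the observations we keep.

```python
# Trimming to enforce overlap in estimated propensity scores (toy data).

def enforce_overlap(p_hat, d):
    """Return indices of observations that survive both trimming rules."""
    min_treated = min(p for p, d_i in zip(p_hat, d) if d_i == 1)
    max_control = max(p for p, d_i in zip(p_hat, d) if d_i == 0)
    keep = []
    for i, (p, d_i) in enumerate(zip(p_hat, d)):
        if d_i == 0 and p < min_treated:
            continue  # control below the lowest treated score
        if d_i == 1 and p > max_control:
            continue  # treated above the highest control score
        keep.append(i)
    return keep

p_hat = [0.05, 0.20, 0.40, 0.30, 0.60, 0.95]
d = [0, 0, 0, 1, 1, 1]
keep = enforce_overlap(p_hat, d)
print(keep)
```

In this toy sample only the control at 0.40 and the treated unit at 0.30 remain, a reminder that enforcing overlap can cost a lot of data and changes the population the estimate describes.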
---
name: weighting
## Weighting

.note[Option 3] Weight observations by the inverse propensity score.

.qa[Q] How does weighting by `$1/\hat{p}(\text{X}_{i})$` make sense?

.qa[A] Consider our old (likely biased) friend the difference in means, _i.e._,
$$
`\begin{align}
  \hat{\tau}_\text{Diff} = \overline{\text{Y}}_\text{T} - \overline{\text{Y}}_\text{C} = \dfrac{\sum_i \text{D}_{i} \text{Y}_{i}}{\sum_i \text{D}_{i}} - \dfrac{\sum_i \left(1 - \text{D}_{i}\right) \text{Y}_{i}}{\sum_i \left(1 - \text{D}_{i}\right)}
\end{align}`
$$
--
which we've discussed is biased due to selection into treatment, _i.e._,
$$
`\begin{align}
  \mathop{E}\left[ \text{Y}_{0i} | \text{D}_{i} = 1 \right] \neq \mathop{E}\left[ \text{Y}_{0i} \right]
\end{align}`
$$
---
## Weighting, justified

Suppose we know `$p(\text{X}_{i})$` and we weight each .hi-pink[treated] individual by `$1/p(\text{X}_{i})$`.

`$\mathop{E}\left[ \dfrac{\text{D}_{i} \text{Y}_{i}}{p(\text{X}_{i})} \right]$`

--
 `$= \mathop{E}\left[ \dfrac{\text{D}_{i}\left(\text{D}_{i}\text{Y}_{1i} + (1-\text{D}_{i})\text{Y}_{0i}\right)}{p(\text{X}_{i})} \right]$`

--
 `$= \mathop{E}\left[ \dfrac{\text{D}_{i} \text{Y}_{1i}}{p(\text{X}_{i})} \right]$`

--
   `$= \mathop{E}\!\bigg( \mathop{E}\left[ \dfrac{\text{D}_{i}\text{Y}_{1i}}{p(\text{X}_{i})} \;\middle|\; \text{X}_{i} \right] \bigg)$`

--
 `$= \mathop{E}\!\bigg( \dfrac{\mathop{E}\left[ \text{D}_{i} \mid \text{X}_{i} \right] \mathop{E}\left[ \text{Y}_{1i} \mid \text{X}_{i} \right]}{p(\text{X}_{i})} \bigg)$`

--
   `$= \mathop{E}\!\bigg( \dfrac{p(\text{X}_{i}) \mathop{E}\left[ \text{Y}_{1i} \mid \text{X}_{i} \right]}{p(\text{X}_{i})} \bigg)$`

--
 `$= \mathop{E}\!\bigg( \mathop{E}\left[ \text{Y}_{1i} \mid \text{X}_{i} \right] \bigg)$`

--
 `$\color{#e64173}{= \mathop{E}\left[ \text{Y}_{1i} \right]}$`

Similarly, weighting .hi-purple[control] individuals by `$1/(1-p(\text{X}_{i}))$` yields
$$
`\begin{align}
  \mathop{E}\left[ \dfrac{(1-\text{D}_{i})\text{Y}_{i}}{1-p(\text{X}_{i})} \right] = \color{#6A5ACD}{\mathop{E}\left[ \text{Y}_{0i} \right]}
\end{align}`
$$
---
## Weighting: The estimator

Thus, we can estimate an unbiased treatment effect via
$$
`\begin{align}
  \hat{\tau}_{p\text{Weight}} = \dfrac{1}{N} \sum_{i=1}^N \left[ \dfrac{\text{D}_{i}\text{Y}_{i}}{p(\text{X}_{i})} - \dfrac{(1-\text{D}_{i})\text{Y}_{i}}{1 - p(\text{X}_{i})} \right]
\end{align}`
$$

.note[Intuition] We're trying to overcome selection bias, _i.e._, treated individuals were more likely to be treated as a function of `$\text{X}_{i}$`—producing higher `$p(\text{X}_{i})$`.

We want to get back to *as-good-as random* variation in treatment.

So we upweight .pink[(**1**) .hi-pink[treated] individuals with low] `$\color{#e64173}{p(\text{X}_{i})}$` and .purple[(**2**) .hi-purple[control] observations with high] `$\color{#6A5ACD}{p(\text{X}_{i})}$`.
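
.note[Sketch] A stylized Python example in which the weighting works exactly. Assumptions: two covariate cells, treatment shares in the sample equal to the true `$p(\text{X}_{i})$`, and a constant treatment effect of 3. All names are hypothetical.

```python
# Inverse-propensity weighting with known p(X) (stylized toy population).

def diff_in_means(y, d):
    treated = [y_i for y_i, d_i in zip(y, d) if d_i == 1]
    control = [y_i for y_i, d_i in zip(y, d) if d_i == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ipw_estimate(y, d, p):
    """(1/N) * sum of D*Y/p - (1 - D)*Y/(1 - p)."""
    total = sum(d_i * y_i / p_i - (1 - d_i) * y_i / (1 - p_i)
                for y_i, d_i, p_i in zip(y, d, p))
    return total / len(y)

# (x, d) pairs: treatment is rare when X = 0 and common when X = 1
units = [(0, 1)] * 2 + [(0, 0)] * 8   # p(X = 0) = 0.2
units += [(1, 1)] * 8 + [(1, 0)] * 2  # p(X = 1) = 0.8

p = [0.2 if x == 0 else 0.8 for x, _ in units]
d = [d_i for _, d_i in units]
# Y_0 = 1 + 2X and Y_1 = Y_0 + 3, so the true effect is 3 for everyone
y = [1 + 2 * x + 3 * d_i for x, d_i in units]

tau_diff = diff_in_means(y, d)
tau_ipw = ipw_estimate(y, d, p)
print(tau_diff, tau_ipw)
```

The difference in means gives 4.2 because treated units disproportionately have `$\text{X}_{i} = 1$` (and thus higher `$\text{Y}_{0i}$`); the weighted estimator recovers 3.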
---
## Weighting: The example

Suppose for some individual `$i$`, `$p(\text{X}_{i}) = 0.80$`.

This propensity score says someone with this set of `$\text{X}_{i}$` was four times as likely to be .hi-pink[treated] as .hi-purple[control].

Our weights fix this imbalance for each `$\text{X}_{i}$`.

- If `$i$` is .hi-pink[treated], then her weight is `$1/p(\text{X}_{i}) = 1/0.80 = 1.25$`

- If `$i$` is .hi-purple[control], then her weight is `$1/(1-p(\text{X}_{i})) = 1/(1-0.80) = 5$`

And guess what `$5/1.25$` is...

--
 `$4$`!

--
 This weighting scheme gets us back to equal representation for each set of `$\text{X}_{i}$`.
---
## Weighting: Last issue

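.note[Practical issue] Nothing guarantees that the estimated weights sum to one, _i.e._, that `$\frac{1}{N}\sum_i \text{D}_{i}/\hat{p}(\text{X}_{i}) = 1$` (and likewise for the control weights).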

.note[Solution] Normalize weights by their total sum.

Applying the normalized (and estimated) propensity scores
$$
`\begin{align}
  \hat{\tau}_{p\text{Weight}} = \sum_{i=1}^N \dfrac{ \dfrac{\text{D}_{i}\text{Y}_{i}}{\hat{p}(\text{X}_{i})} }{\sum_{i} \dfrac{\text{D}_{i}}{\hat{p}(\text{X}_{i})}} -
  \sum_{i=1}^N \dfrac{ \dfrac{(1-\text{D}_{i})\text{Y}_{i}}{1-\hat{p}(\text{X}_{i})} }{\sum_{i} \dfrac{(1-\text{D}_{i})}{1-\hat{p}(\text{X}_{i})}}
\end{align}`
$$

Hirano, Imbens, and Ridder (2003) suggest this estimator is efficient.
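
.note[Sketch] The normalized estimator in Python, next to the unnormalized version for comparison. The data and function names are made up for illustration.

```python
# Normalized vs. unnormalized inverse-propensity weighting (toy data).

def raw_ipw(y, d, p_hat):
    total = sum(d_i * y_i / p_i - (1 - d_i) * y_i / (1 - p_i)
                for y_i, d_i, p_i in zip(y, d, p_hat))
    return total / len(y)

def normalized_ipw(y, d, p_hat):
    """Weighted treated mean minus weighted control mean, with each set of
    weights rescaled by its own sum."""
    w_t = [d_i / p_i for d_i, p_i in zip(d, p_hat)]
    w_c = [(1 - d_i) / (1 - p_i) for d_i, p_i in zip(d, p_hat)]
    mean_t = sum(w * y_i for w, y_i in zip(w_t, y)) / sum(w_t)
    mean_c = sum(w * y_i for w, y_i in zip(w_c, y)) / sum(w_c)
    return mean_t - mean_c

y = [5.0, 7.0, 2.0, 4.0]
d = [1, 1, 0, 0]
p_hat = [0.5, 0.8, 0.2, 0.6]

tau_norm = normalized_ipw(y, d, p_hat)
tau_raw = raw_ipw(y, d, p_hat)

# Adding 10 to every outcome should not change a treatment effect
y_shifted = [y_i + 10 for y_i in y]
tau_norm_shifted = normalized_ipw(y_shifted, d, p_hat)
tau_raw_shifted = raw_ipw(y_shifted, d, p_hat)
print(tau_norm, tau_norm_shifted, tau_raw, tau_raw_shifted)
```

One way to see why normalizing matters: shifting every outcome by a constant leaves the normalized estimate unchanged, while the unnormalized estimate moves whenever the weights fail to sum to one.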
---
name: two
## Why choose one?

There's nothing special about weighted averages—regression can weight.

Thus, a .hi-slate[regression-based estimate]
$$
`\begin{align}
  \text{Y}_{i} = \alpha + \text{X}_{i}\beta + \tau \text{D}_{i} + u_i
\end{align}`
$$
--
with .hi-slate[weights]
$$
`\begin{align}
  w_i = \sqrt{\dfrac{\text{D}_{i}}{\hat{p}(\text{X}_{i})} + \dfrac{(1-\text{D}_{i})}{1-\hat{p}(\text{X}_{i})}}
\end{align}`
$$
--
offers a *doubly robust* property—you have two chances to be right: `$p(\text{X}_{i})$` or the regression specification.
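
.note[Sketch] A dependency-free Python version of the weighted regression, applying `$w_i$` by multiplying each row of the data by it and then running OLS. The data are made up so that the outcome model is exactly linear, the `p_hat` values are arbitrary, and `solve` and `weighted_regression` are hypothetical helpers.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def weighted_regression(x, d, y, p_hat):
    """OLS of w*Y on w*(1, X, D), with w_i = sqrt(D/p + (1 - D)/(1 - p)).
    Returns [alpha, beta, tau]."""
    rows, targets = [], []
    for x_i, d_i, y_i, p_i in zip(x, d, y, p_hat):
        w = math.sqrt(d_i / p_i + (1 - d_i) / (1 - p_i))
        rows.append([w, w * x_i, w * d_i])
        targets.append(w * y_i)
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
d = [0, 1, 0, 1, 1, 0]
p_hat = [0.2, 0.4, 0.3, 0.6, 0.7, 0.5]  # arbitrary scores in (0, 1)
y = [1 + 2 * x_i + 3 * d_i for x_i, d_i in zip(x, d)]  # exactly linear outcome

alpha_hat, beta_hat, tau_hat = weighted_regression(x, d, y, p_hat)
print(alpha_hat, beta_hat, tau_hat)
```

Here the regression specification is exactly right, so `tau_hat` recovers 3 even though the propensity scores are arbitrary. This illustrates only one of the "two chances"; the other (correct scores, wrong regression) is not shown.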
---
## Why choose one? Part two

An alternative, doubly robust method combines propensity-score blocking with regression.

.note[Step 1] For each block `$k$`, we run the regression
$$
`\begin{align}
  \text{Y}_{i} = \alpha_k + \text{X}_{i} \beta_k + \tau_k \text{D}_{i} + u_i
\end{align}`
$$

.note[Step 2] Aggregate block-level treatment-effect estimates
$$
`\begin{align}
  \hat{\tau} = \sum_{k=1}^K \hat{\tau}_k \dfrac{N_{1k} + N_{0k}}{N}
\end{align}`
$$
---
## Major requirements

Don't get (too) caught up in the bells and whistles.

We still have two .hi-slate[major] requirements for any of these methods to work.

1. Is the .hi-slate[conditional-independence assumption] true?

2. Do we have .hi-slate[overlap] between treatment and control units?

We can look for evidence of (.hi-slate[2]) in the data—particularly if we're using propensity-score methods..super[.pink[†]]

How? Plot the distributions of `$p(\text{X}_{i})$` for .hi-pink[T] and .hi-purple[C].

.footnote[.pink[†] Checking for overlap in `$\text{X}$`-space can be tough as the dimensions of `$\text{X}$` expand.
]
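
.note[Sketch] Before plotting, a binned count of scores by group gives a rough overlap check. This Python sketch uses made-up scores; `overlap_table` is a hypothetical helper and the number of bins is arbitrary.

```python
# Rough overlap check: count treated and control units per score bin.

def overlap_table(p_hat, d, n_bins=5):
    """counts[k] = [number of controls, number of treated] in bin k."""
    counts = [[0, 0] for _ in range(n_bins)]
    for p, d_i in zip(p_hat, d):
        k = min(int(p * n_bins), n_bins - 1)
        counts[k][d_i] += 1
    return counts

p_hat = [0.05, 0.15, 0.30, 0.45, 0.50, 0.70, 0.90, 0.95]
d = [0, 0, 0, 1, 0, 1, 1, 1]

counts = overlap_table(p_hat, d)
bins_with_both = [k for k, (n_c, n_t) in enumerate(counts) if n_c > 0 and n_t > 0]
print(counts, bins_with_both)
```

Only the middle bin contains both groups in this toy sample, which is the kind of pattern the plots on the next slides make visible.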
---
name: overlap
layout: false
class: clear, middle

Missing overlap in `$p(\text{X}_{i})$`
<img src="08-matching_files/figure-html/ex-no-overlap-1.svg" style="display: block; margin: auto;" />
---
class: clear, middle

Authentic (enforced) overlap in `$p(\text{X}_{i})$`
<img src="08-matching_files/figure-html/ex-overlap-p-1.svg" style="display: block; margin: auto;" />
---
class: clear, middle

Logit-based `$\hat{p}(\text{X}_{i})$` hiding some of the missing overlap in `$p(\text{X}_{i})$`
<img src="08-matching_files/figure-html/ex-no-overlap-logit-1.svg" style="display: block; margin: auto;" />
---
class: clear, middle

Overlap in one dimension does not guarantee overlap in two dimensions.
 .smallest[.note[Note] Shading denotes .hi-slate[share of treatment:] .gw[.grey-light[l]**white**.grey-light[l]]=0% and .hi-pink[pink]=100%.]
<img src="08-matching_files/figure-html/ex-overlap2-1.svg" style="display: block; margin: auto;" />
---
layout: false
# Table of contents

.pull-left[
### Admin
.smaller[

1. [Schedule](#schedule)
1. [Follow up](#follow-up)
]

### General matching
.smaller[

1. [The gist](#gist)
1. [Goals](#goals)
1. [Generic matching](#generic)
1. [Weights](#weights)
  - [Discrete `$\text{X}$`](#discrete)
  - [Nearest neighbor, Euclidean](#nn-euclidean)
  - [Nearest neighbor, Mahalanobis](#nn-mahalanobis)
  - [Kernel matching](#kernel)
]
]

.pull-right[

### Propensity-score methods
.smaller[

1. [Setup](#setup)
1. [Propensity-score theorem](#magic)
  - [The magic](#magic)
  - [The proof](#proof)
  - [Intuition](#intuition)
1. [Estimation](#estimation)
1. [Application](#application)
  - [Regression](#application)
  - [Heterogeneity](#heterogeneity)
  - [Blocking on `$p(\text{X}_{i})$`](#blocking)
  - [Weighting with `$p(\text{X}_{i})$`](#weighting)
  - [Doubly robust methods](#two)
1. [Overlap plots](#overlap)
]
]
---
exclude: true