class: center, middle, inverse, title-slide

.title[
# 09 - High-Dimensional Confounding Adjustment
]
.subtitle[
## ml4econ, HUJI 2024
]
.author[
### Itamar Caspi
]
.date[
### July 7, 2024 (updated: 2024-07-07)
]

---

# Replicating this presentation

Use the [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html) package to install and load packages:

```r
if (!require("pacman")) install.packages("pacman")

pacman::p_load(
  tidyverse,
  tidymodels,
  hdm,
  ggdag,
  knitr,
  xaringan
)
```

---

# Outline

- [Lasso and Variable Selection](#sel)

- [High Dimensional Confoundedness](#ate)

- [Empirical Illustration using `hdm`](#hdm)

---
class: title-slide-section-blue, center, middle
name: sel

# Lasso and Variable Selection

---

# Key Lasso Theory Resources

.pull-left[

- [_Statistical Learning with Sparsity - The Lasso and Generalizations_](https://web.stanford.edu/~hastie/StatLearnSparsity/) (Hastie, Tibshirani, and Wainwright), __Chapter 11: Theoretical Results for the Lasso.__ (PDF available online)

- [_Statistics for High-Dimensional Data - Methods, Theory and Applications_](https://www.springer.com/gp/book/9783642201912) (Bühlmann and van de Geer), __Chapter 7: Variable Selection with the Lasso.__

- [_High Dimensional Statistics - A Non-Asymptotic Viewpoint_](https://www.cambridge.org/core/books/highdimensional-statistics/8A91ECEEC38F46DAB53E9FF8757C7A4E) (Wainwright), __Chapter 7: Sparse Linear Models in High Dimensions__

]
.pull-right[
<img src="figs/sparsebooks3.png" width="100%" style="display: block; margin: auto;" />
]

---

# Guidance vs. Guarantees: Fundamental Differences

- We've primarily relied on *guidance* for our work:
  - Selection of folds in CV
  - Size determination of the holdout set
  - Tuning parameter(s) adjustment
  - Loss function selection
  - Function class selection

- But in causal inference, *guarantees* become vital:
  - Selecting variables
  - Deriving confidence intervals and `\(p\)`-values

- To attain these guarantees, we generally need:
  - Assumptions regarding a "true" model
  - Asymptotic principles, such as `\(n\rightarrow\infty\)`, `\(k\rightarrow ?\)`

---

# Key Notations in Lasso Literature

Assume `\(\boldsymbol{\beta}\)` is a `\(k\times 1\)` vector with typical element `\(\beta_j\)`.

- `\(\ell_0\)`-norm is `\(||\boldsymbol{\beta}||_0=\sum_{j=1}^{k}\boldsymbol{1}_{\{\beta_j\neq0\}}\)`, indicating the count of non-zero elements in `\(\boldsymbol{\beta}\)`.

- `\(\ell_1\)`-norm is `\(||\boldsymbol{\beta}||_1=\sum_{j=1}^{k}|\beta_j|\)`.

- `\(\ell_2\)`-norm or Euclidean norm is `\(||\boldsymbol{\beta}||_2=\left(\sum_{j=1}^{k}|\beta_j|^2\right)^{\frac{1}{2}}\)`.

- `\(\ell_\infty\)`-norm is `\(||\boldsymbol{\beta}||_\infty=\sup_j |\beta_j|\)`, signifying the maximum magnitude among `\(\boldsymbol{\beta}\)`'s entries.

- Support of `\(\boldsymbol{\beta}\)` is `\(S\equiv\mathrm{supp}(\boldsymbol{\beta})= \{j\in\{1,\dots,k\}: \beta_j\neq 0\}\)`, the set of indices of the non-zero coefficients.

- Size of the support `\(s=|S|\)` is the count of non-zero elements in `\(\boldsymbol{\beta}\)`, namely `\(s=||\boldsymbol{\beta}||_0\)`.
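---

# Notation Check: Computing the Norms in R

A minimal R sketch (with a made-up coefficient vector) that computes each of the quantities defined on the previous slide:

```r
beta <- c(1.5, 0, -2, 0, 0.3)  # hypothetical coefficient vector

sum(beta != 0)     # ell_0 "norm": number of non-zero entries (here, 3)
sum(abs(beta))     # ell_1 norm
sqrt(sum(beta^2))  # ell_2 (Euclidean) norm
max(abs(beta))     # ell_infinity norm: largest entry in absolute value
which(beta != 0)   # support S; its size |S| is the sparsity index s
```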
---

# Understanding the Basic Setup of Lasso

The linear regression model is given as:

`$$Y_{i}=\alpha + X_{i}^{\prime}\boldsymbol{\beta}^{0}+\varepsilon_{i}, \quad i=1,\dots,n,$$`
`$$\mathbb{E}\left[\varepsilon_{i}{X}_i\right]=0,\quad \alpha\in\mathbb{R},\quad \boldsymbol{\beta}^0\in\mathbb{R}^k.$$`

Under the **_exact sparsity_** assumption, only a small subset of the variables, of size `\(s\ll k\)`, enters the model, where `\(s \equiv\|\boldsymbol{\beta}^0\|_{0}\)` is the sparsity index.

`$$\underbrace{\mathbf{X}_{S}=\left(X_{(1)}, \ldots, X_{\left(s\right)}\right)}_{\text{Sparse Variables}}, \quad \underbrace{\mathbf{X}_{S^c}=\left(X_{\left(s+1\right)}, \ldots, X_{\left(k\right)}\right)}_{\text{Non-Sparse Variables}}$$`

Here, `\(S\)` is the set of active predictors, `\(\mathbf{X}_S \in \mathbb{R}^{n\times s}\)` is the sub-matrix of covariates in the sparse (active) set, and `\(\mathbf{X}_{S^c} \in \mathbb{R}^{n\times (k-s)}\)` is the sub-matrix of "irrelevant" non-sparse variables.

---

# Lasso: The Optimization

The Lasso (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), solves the following optimization problem:

`$$\underset{\beta_{0}, \boldsymbol{\beta}}{\operatorname{min}} \sum_{i=1}^{N}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} x_{i j} \beta_{j}\right)^{2}+\lambda \lVert\boldsymbol{\beta}\rVert_1$$`

In this setup, the Lasso places a "budget constraint" on the sum of *absolute* values of the `\(\beta\)`'s. Unlike ridge, the Lasso penalty is linear: moving a coefficient from 1 to 2 is penalized the same as moving it from 101 to 102.

A major strength of the Lasso is its ability to perform model selection - it zeroes out most of the `\(\beta\)`'s in the model, making the solution *sparse*. Any penalty involving the `\(\ell_1\)` norm will achieve this.

---

# Evaluating the Lasso

Suppose `\(\boldsymbol{\beta}^0\)` is the true vector of coefficients and `\(\widehat{\boldsymbol{\beta}}\)` is the Lasso estimator. We can evaluate the Lasso's effectiveness in several ways:

I. Prediction Quality

`$$\text{Loss}_{\text{pred}}\left(\widehat{\boldsymbol{\beta}} ; \boldsymbol{\beta}^{0}\right)=\frac{1}{N}\left\|\mathbf{X}(\widehat{\boldsymbol{\beta}}- \boldsymbol{\beta}^{0})\right\|_{2}^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[X_{i}^{\prime}(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{0})\right]^2$$`

II. Parameter Consistency

`$$\text{Loss}_{\text{param}}\left(\widehat{\boldsymbol{\beta}} ; \boldsymbol{\beta}^{0}\right)=\left\|\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{0}\right\|_{2}^{2}=\sum_{j=1}^k(\hat{\beta}_j-\beta^0_j)^2$$`

III. Support Recovery (Sparsistency)

For example, score `+1` if `\(\mathrm{sign}(\hat{\beta}_j)=\mathrm{sign}(\beta^0_j)\)` for all `\(j=1,\dots,k\)`, and `0` otherwise.

---

# Leveraging Lasso for Variable Selection

- Variable selection consistency is crucial for causal inference, given the threat of omitted variable bias.

- Lasso frequently serves as a tool for variable selection (see the simulated example on the next slide).

- The successful selection of the "true" support by the Lasso depends heavily on strong assumptions about:

  - Distinguishing between relevant and irrelevant variables.
  - Identifying `\(\boldsymbol{\beta}\)`.
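---

# Variable Selection in Action (Simulated Sketch)

A minimal, illustrative sketch using simulated data (the data-generating process below is made up for this example). With `\(n=100\)`, `\(k=50\)`, and only `\(s=5\)` truly active coefficients, a cross-validated Lasso typically recovers the active set but also lets in some noise variables - a reminder that a prediction-oriented `\(\lambda\)` does not guarantee sparsistency:

```r
library(glmnet)

set.seed(123)
n <- 100; k <- 50; s <- 5
X <- matrix(rnorm(n * k), n, k)
beta0 <- c(rep(1, s), rep(0, k - s))   # exact sparsity: only 5 active coefficients
y <- drop(X %*% beta0) + rnorm(n)

cv_fit <- cv.glmnet(x = X, y = y, alpha = 1)               # lasso with CV-chosen lambda
selected <- which(coef(cv_fit, s = "lambda.min")[-1] != 0)  # indices of non-zero coefficients
selected  # usually includes variables 1-5, often together with a few false positives
```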
---

# Critical Assumption #1: Distinguishable Sparse Betas

**_Lower Eigenvalue:_** The minimum eigenvalue, `\(\lambda_{\min}\)`, of the (scaled) Gram matrix of the active sub-matrix `\(\mathbf{X}_S\)` should be bounded away from zero:

`$$\lambda_{\min }\left(\mathbf{X}_{S}^{\prime} \mathbf{X}_{S} / N\right) \geq C_{\min }>0$$`

Linear dependence between the columns of `\(\mathbf{X}_S\)` makes it impossible to identify the true `\(\boldsymbol{\beta}\)`, even if we *knew* which variables are included in `\(\mathbf{X}_S\)`.

**NOTE:** The high-dimensional lower eigenvalue condition replaces the low-dimensional rank condition (i.e., that `\(\mathbf{X}^\prime\mathbf{X}\)` is invertible).

---

# Critical Assumption #2: Distinguishable Active Predictors

_Irrepresentability condition_ (Zou, 2006; Zhao and Yu, 2006): There must exist some `\(\eta>0\)` such that

`$$\max _{j \in S^{c}}\left\|\left(\mathbf{X}_{S}^{\prime} \mathbf{X}_{S}\right)^{-1} \mathbf{X}_{S}^{\prime} \mathbf{x}_{j}\right\|_{1} \leq 1-\eta$$`

__INTUITION__: The term inside `\(\left\|\cdot\right\|_{1}\)` is the coefficient vector from regressing `\(\mathbf{x}_{j}\)` on the variables in `\(\mathbf{X}_S\)`.

- When `\(\eta=1\)`, the sparse and non-sparse variables are orthogonal to each other.
- When `\(\eta=0\)`, (some of) the non-sparse variables in `\(\mathbf{X}_{S^c}\)` can be reconstructed almost perfectly from the variables in `\(\mathbf{X}_{S}\)`.

Thus, the irrepresentability condition roughly states that we can distinguish the sparse variables from the non-sparse ones.

---

<img src="figs/lassofolds.png" width="55%" style="display: block; margin: auto;" />

Source: [Mullainathan and Spiess (JEP 2017)](https://www.aeaweb.org/articles/pdf/doi/10.1257/jep.31.2.87).

---

# Setting the Optimal Tuning Parameter

- Throughout this course, we have frequently chosen `\(\lambda\)` empirically, often by cross-validation, based on its predictive performance.

- In causal analysis, however, the end goal is inference, not prediction. These two objectives often conflict (bias vs. variance).

- Ideally, the choice of `\(\lambda\)` should come with guarantees about the model's performance.

- Generally, to attain sparsistency, we set `\(\lambda\)` such that the non-zero `\(\beta\)`'s are selected with high probability.

---
class: title-slide-section-blue, center, middle
name: ate

# High Dimensional Confoundedness

---

# "Naive" Implementation of the Lasso

Run `glmnet`:

```r
fit <- glmnet(x = DX, y = Y)
```

where `DX` is the feature matrix that includes both the treatment `\(D\)` and the covariate vector `\(X\)`.

The estimated coefficients are:

`$$\left(\widehat{\alpha},\widehat{\tau}, \widehat{\boldsymbol{\beta}}^{\prime}\right)^{\prime}=\underset{\alpha, \tau \in \mathbb{R}, \boldsymbol{\beta} \in \mathbb{R}^{k}}{\arg \min } \sum_{i=1}^{n}\left(Y_{i}-\alpha-\tau D_{i}-\boldsymbol{\beta}^{\prime} X_{i}\right)^{2}+\lambda\left(|\tau|+\sum_{j=1}^{k}\left|\beta_{j}\right|\right)$$`

**ISSUES:**

1. Both `\(\widehat\tau\)` and `\(\widehat{\boldsymbol{\beta}}\)` are shrunk, and thus biased toward zero.

2. The Lasso might eliminate `\(D_i\)`, i.e., shrink `\(\widehat\tau\)` to zero. The same can happen to relevant confounding factors.

3. The choice of `\(\lambda\)` is a challenge.

---

# Moving Towards a Solution

To avoid eliminating `\(D_i\)`, we can adjust the Lasso regression so that only the `\(\beta\)`'s are penalized:

`$$\left(\widehat{\alpha}, \widehat{\tau}, \widehat{\boldsymbol{\beta}}^{\prime}\right)^{\prime}=\underset{\alpha,\tau \in \mathbb{R}, \boldsymbol{\beta} \in \mathbb{R}^{k}}{\arg \min } \sum_{i=1}^{n}\left(Y_{i}-\alpha-\tau D_{i}-\boldsymbol{\beta}^{\prime} X_{i}\right)^{2}+\lambda\left(\sum_{j=1}^{k}\left|\beta_{j}\right|\right)$$`

We can then _debias_ the results using the "Post-Lasso" method, i.e., use the Lasso for variable selection, and then run OLS with the selected variables (see the sketch on the next slide).

**ISSUES:** The Lasso might eliminate features that are correlated with `\(D_i\)` because they are not good predictors of `\(Y_i\)`, leading to _omitted variable bias_.
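---

# Post-Lasso with a Single Selection Step: A Sketch

A minimal simulated sketch of the mechanics (the data-generating process is made up; `penalty.factor = 0` keeps `\(D\)` unpenalized, and `\(\lambda\)` is chosen here by cross-validation rather than by the "rigorous" rule discussed above):

```r
library(glmnet)

set.seed(1)
n <- 200; k <- 50
X <- matrix(rnorm(n * k), n, k)
colnames(X) <- paste0("x", 1:k)
D <- X[, 1] + rnorm(n)              # treatment, confounded by the first covariate
Y <- 1 * D + 2 * X[, 1] + rnorm(n)  # true treatment effect = 1

# Lasso of Y on (D, X), leaving D unpenalized
DX  <- cbind(D, X)
fit <- cv.glmnet(x = DX, y = Y, penalty.factor = c(0, rep(1, k)))

# Post-Lasso: OLS of Y on the columns of DX retained by the Lasso
keep <- which(coef(fit, s = "lambda.min")[-1] != 0)
post <- lm(Y ~ DX[, keep, drop = FALSE])
coef(post)
```

As the next slide illustrates, selecting controls through a single outcome equation can still leave substantial bias in the estimated treatment effect.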
---

# Problem Solved?

<img src="figs/naive_Lasso.png" width="55%" style="display: block; margin: auto;" />

Source: [https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf](https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf)

---

# Solution: Double-selection Lasso (Belloni, et al., REStud 2013)

**Step 1:** Perform two Lasso regressions: `\(Y_i\)` on `\({X}_i\)` and `\(D_i\)` on `\({X}_i\)`:

`$$\begin{aligned} \widehat{\boldsymbol{\gamma}} &=\underset{\boldsymbol{\gamma} \in \mathbb{R}^{k+1}}{\arg \min } \sum_{i=1}^{n}\left(Y_{i}-\boldsymbol{\gamma}^{\prime} X_{i}\right)^{2}+\lambda_{\gamma}\left(\sum_{j=2}^{k+1}\left|\gamma_{j}\right|\right) \\ \widehat{\boldsymbol{\delta}} &=\underset{\boldsymbol{\delta} \in \mathbb{R}^{k+1}}{\arg \min } \sum_{i=1}^{n}\left(D_{i}-\boldsymbol{\delta}^{\prime} X_{i}\right)^{2}+\lambda_{\delta}\left(\sum_{j=2}^{k+1}\left|\delta_{j}\right|\right) \end{aligned}$$`

**Step 2:** Refit the model using OLS, including the treatment and the *union* of the `\(\mathbf{X}\)`'s selected in either of the two Lasso regressions.

**Step 3:** Proceed with the inference using standard confidence intervals.

> The tuning parameters `\(\lambda_\gamma\)` and `\(\lambda_\delta\)` are set in a way that ensures the relevant (non-zero) coefficients are correctly selected with high probability.

---

# Does it Work?

<img src="figs/double_Lasso.png" width="55%" style="display: block; margin: auto;" />

Source: [https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf](https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf)

---

# Statistical Inference

<img src="figs/lassonormal.png" width="55%" style="display: block; margin: auto;" />

Source: [https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf](https://stuff.mit.edu/~vchern/papers/Chernozhukov-Saloniki.pdf)

---

# Intuition: Partialling-out Regression

Consider two methods for estimating the effect of `\(X_{1i}\)` (a scalar) on `\(Y_i\)`, while adjusting for `\(X_{2i}\)`:

__Alternative 1:__ Run

`$$Y_i = \alpha + \beta X_{1i} + \gamma X_{2i} + \varepsilon_i$$`

__Alternative 2:__ First, run `\(Y_i\)` on `\(X_{2i}\)` and `\(X_{1i}\)` on `\(X_{2i}\)` and keep the residuals, i.e., run

`$$Y_i = \gamma_0 + \gamma_1 X_{2i} + u^Y_i,\quad\text{and}\quad X_{1i} = \delta_0 + \delta_1 X_{2i} + u^{X_{1}}_i,$$`

and keep `\(\widehat{u}^Y_i\)` and `\(\widehat{u}^{X_{1}}_i\)`. Next, run

`$$\widehat{u}^Y_i = \beta^*\widehat{u}^{X_{1}}_i + v_i.$$`

According to the [Frisch-Waugh-Lovell (FWL) Theorem](https://en.wikipedia.org/wiki/Frisch%E2%80%93Waugh%E2%80%93Lovell_theorem), `\(\widehat\beta = \widehat{\beta}^*.\)`

---

# Guarantees of Double-selection Lasso (VERY Wonkish)

__Approximate Sparsity__

Consider the following regression model:

`$$Y_{i}=f\left({W}_{i}\right)+\varepsilon_{i}={X}_{i}^{\prime} \boldsymbol{\beta}^{0}+r_{i}+\varepsilon_{i}, \quad i=1,\dots,n,$$`

where `\(r_i\)` is the approximation error. Under _approximate sparsity_, it is assumed that `\(f\left({W}_{i}\right)\)` can be approximated sufficiently well (up to `\(r_i\)`) by `\({X}_{i}^{\prime} \boldsymbol{\beta}^{0}\)`, while using only a small number of non-zero coefficients.

__Restricted Sparse Eigenvalue Condition (RSEC)__

This condition bounds the number of variables outside the true support that the Lasso can select. It is relevant for the post-Lasso stage.

__Regularization Event__

The tuning parameter `\(\lambda\)` is set to a value such that it selects the correct model with probability of at least `\(p\)`, where `\(p\)` is set by the user. Further assumptions concern the quantile function of the maximal element of the gradient of the objective function at `\(\boldsymbol{\beta}^0\)`, and the error term (homoskedasticity vs. heteroskedasticity). See Belloni et al. (2012) for further details.
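---

# Double Selection by Hand: A Sketch

A minimal simulated sketch of Steps 1-3 (made-up data-generating process; for simplicity `\(\lambda\)` is chosen by cross-validation here, whereas the theory above calls for the "rigorous" choice implemented in `hdm`):

```r
library(glmnet)

set.seed(42)
n <- 200; k <- 50
X <- matrix(rnorm(n * k), n, k)
colnames(X) <- paste0("x", 1:k)
D <- X[, 1] + 0.5 * X[, 2] + rnorm(n)  # treatment driven by a few confounders
Y <- 1 * D + 2 * X[, 1] + rnorm(n)     # true treatment effect = 1

# Step 1: two Lasso regressions, Y on X and D on X
sel_y <- which(coef(cv.glmnet(x = X, y = Y), s = "lambda.min")[-1] != 0)
sel_d <- which(coef(cv.glmnet(x = X, y = D), s = "lambda.min")[-1] != 0)

# Step 2: OLS of Y on D and the union of the selected controls
controls <- union(sel_y, sel_d)
fit <- lm(Y ~ D + X[, controls, drop = FALSE])

# Step 3: conventional inference on the coefficient of D
summary(fit)$coefficients["D", ]
```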
---

# Additional Extensions of Double-selection

1. **Other Function Classes (Double-ML):** Chernozhukov et al. (AER 2017) proposed using other function classes, such as applying random forest for `\(Y \sim X\)` and regularized logit for `\(D \sim X\)`.

2. **Instrumental Variables:** Techniques involving instrumental variables have been developed by Belloni et al. (Ecta 2012) and Chernozhukov et al. (AER 2015). For further understanding, please refer to the problem set.

3. **Heterogeneous Treatment Effects:** Heterogeneous treatment effects have been studied by Belloni et al. (Ecta 2017). We'll explore this topic more thoroughly next week.

4. **Panel Data:** Panel-data settings are considered by Belloni et al. (JBES 2016).

---

# Evidence on the Applicability of Double-Lasso

["Machine Labor"](https://economics.mit.edu/sites/default/files/2022-11/Machine%20Labor.pdf) (Angrist and Frandsen, 2022 JLE):

- ML can be useful for regression-based causal inference using Lasso.

- Post-double-selection (PDS) Lasso offers data-driven sensitivity analysis.

- ML-based instrument selection can improve on 2SLS, but split-sample IV and limited information maximum likelihood (LIML) perform better.

- ML might not be optimal for instrumental variables (IV) applications in labor economics, because it can manufacture artificial exclusion restrictions, potentially producing misleading findings.

---

# More from "Machine Labor"

.pull-left[

<img src="figs/angrist_2.png" width="100%" style="display: block; margin: auto;" />

__Source__: Angrist and Frandsen (2022).

]
.pull-right[

<img src="figs/angrist_1.png" width="100%" style="display: block; margin: auto;" />

]

---

# DML: A Cautionary Tale

.pull-left[

[Hünermund, Louw, and Caspi (2023 JCI)](https://www.degruyter.com/document/doi/10.1515/jci-2022-0078/html):

- DML is highly sensitive to a few "bad controls" in the covariate space, leading to potential bias.

- This bias varies depending on the theoretical causal model, raising questions about the practicality of data-driven control variable selection.

]
.pull-right[

<img src="figs/dmlct.png" width="100%" style="display: block; margin: auto;" />

]

---

# Simulation Results

.pull-left[

<img src="figs/dmlct-figs1.png" width="100%" style="display: block; margin: auto;" />

]
.pull-right[

<img src="figs/dmlct-figs2.png" width="100%" style="display: block; margin: auto;" />

]

---

# Empirical Relevance

<img src="figs/dmlct-app1.png" width="50%" style="display: block; margin: auto;" />

<img src="figs/dmlct-app2.png" width="100%" style="display: block; margin: auto;" />

---
class: title-slide-section-blue, center, middle
name: hdm

# Empirical Illustration using `hdm`

---

# Introducing the `hdm` R Package

["High-Dimensional Metrics"](https://journal.r-project.org/archive/2016/RJ-2016-040/RJ-2016-040.pdf) (`hdm`) by Victor Chernozhukov, Chris Hansen, and Martin Spindler is an R package for estimation and quantification of uncertainty in high-dimensional approximately sparse models.
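The two workhorse functions used below are `rlasso()` (the "rigorous" Lasso) and `rlassoEffect()` (post-double-selection and partialling-out inference). A minimal call pattern, with placeholder objects `y`, `d`, and `x`, looks like:

```r
library(hdm)

fit_rlasso <- rlasso(x = x, y = y)                        # rigorous lasso (data-driven penalty)
fit_effect <- rlassoEffect(x = x, y = y, d = d,
                           method = "double selection")   # inference on the effect of d
```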
.footnote[
[*] A Stata module named [`Lassopack`](https://stataLasso.github.io/docs/Lassopack/) offers a comprehensive set of programs for regularized regression in high-dimensional contexts.
]

---

# Illustration: Testing for Growth Convergence

The standard empirical model for growth convergence is represented by the equation:

`$$Y_{i,T}=\alpha_{0}+\alpha_{1} Y_{i,0}+\sum_{j=1}^{k} \beta_{j} X_{i j}+\varepsilon_{i},\quad i=1,\dots,n,$$`

where

- `\(Y_{i,T}\)` is the national growth rate of GDP per capita for the periods 1965-1975 and 1975-1985.
- `\(Y_{i,0}\)` is the log of the initial level of GDP at the beginning of the specified decade.
- `\(X_{ij}\)` are covariates that might influence growth.

The growth convergence hypothesis implies that `\(\alpha_1<0\)`.

---

# Growth Data

To test the **growth convergence hypothesis**, we will employ the Barro and Lee (1994) dataset:

```r
data("GrowthData")
```

The data features macroeconomic information for a substantial group of countries over various decades. Specifically,

- `\(n\)` equals 90 countries
- `\(k\)` equals 60 country features

While these numbers may not seem large, the number of covariates is substantial compared to the sample size. Hence, **variable selection** is crucial!

---

# Let's Have a Look

```r
GrowthData %>%
  as_tibble %>%
  head(2)
```

```
## # A tibble: 2 x 63
##   Outcome intercept gdpsh465  bmp1l freeop freetar   h65  hm65  hf65   p65  pm65  pf65   s65  sm65  sf65
##     <dbl>     <int>    <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.0243         1     6.59  0.284  0.153  0.0439 0.007 0.013 0.001  0.29  0.37  0.21  0.04  0.06  0.02
## 2  0.100          1     6.83  0.614  0.314  0.0618 0.019 0.032 0.007  0.91  1     0.65  0.16  0.23  0.09
## # i 48 more variables: fert65 <dbl>, mort65 <dbl>, lifee065 <dbl>, gpop1 <dbl>, fert1 <dbl>, mort1 <dbl>,
## #   invsh41 <dbl>, geetot1 <dbl>, geerec1 <dbl>, gde1 <dbl>, govwb1 <dbl>, govsh41 <dbl>, gvxdxe41 <dbl>,
## #   high65 <dbl>, highm65 <dbl>, highf65 <dbl>, highc65 <dbl>, highcm65 <dbl>, highcf65 <dbl>,
## #   human65 <dbl>, humanm65 <dbl>, humanf65 <dbl>, hyr65 <dbl>, hyrm65 <dbl>, hyrf65 <dbl>, no65 <dbl>,
## #   nom65 <dbl>, nof65 <dbl>, pinstab1 <dbl>, pop65 <int>, worker65 <dbl>, pop1565 <dbl>, pop6565 <dbl>,
## #   sec65 <dbl>, secm65 <dbl>, secf65 <dbl>, secc65 <dbl>, seccm65 <dbl>, seccf65 <dbl>, syr65 <dbl>,
## #   syrm65 <dbl>, syrf65 <dbl>, teapri65 <dbl>, teasec65 <dbl>, ex1 <dbl>, im1 <dbl>, xr65 <dbl>, ...
```

---

# Data Processing

Rename the response and "treatment" variables:

```r
df <- 
  GrowthData %>% 
  rename(YT = Outcome, Y0 = gdpsh465)
```

Transform the data to vectors and matrices (to be used in the `rlassoEffect()` function):

```r
YT <- df %>% select(YT) %>% pull()

Y0 <- df %>% select(Y0) %>% pull()

X <- df %>%
  select(-c("Y0", "YT")) %>%
  as.matrix()

Y0_X <- df %>%
  select(-YT) %>%
  as.matrix()
```

---

# Estimation of the Convergence Parameter `\(\alpha_1\)`

__Method 1:__ OLS

```r
ols <- lm(YT ~ ., data = df)
```

__Method 2:__ Naive (rigorous) Lasso

```r
naive_Lasso <- rlasso(x = Y0_X, y = YT)
```

Does the Lasso drop `Y0`?

```r
naive_Lasso$beta[2]
```

```
## Y0 
##  0
```

Unfortunately, yes...
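To see which covariates the rigorous Lasso keeps instead, one could, for example, list the non-zero coefficients:

```r
names(which(naive_Lasso$beta != 0))
```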
---

# Estimation of the Convergence Parameter `\(\alpha_1\)`

__Method 3:__ Partialling-out Lasso

```r
part_Lasso <- 
  rlassoEffect(
    x = X, y = YT, d = Y0,
    method = "partialling out"
  )
```

__Method 4:__ Double-selection Lasso

```r
double_Lasso <- 
  rlassoEffect(
    x = X, y = YT, d = Y0,
    method = "double selection"
  )
```

---

# Tidying the Results

```r
# OLS
ols_tbl <- tidy(ols) %>% 
  filter(term == "Y0") %>% 
  mutate(method = "OLS") %>% 
  select(method, estimate, std.error)

# Naive Lasso
naive_Lasso_tbl <- tibble(method = "Naive Lasso",
                          estimate = NA,
                          std.error = NA)

# Partialling-out Lasso
results_part_Lasso <- summary(part_Lasso)[[1]][1, 1:2]

part_Lasso_tbl <- tibble(method = "Partialling-out Lasso",
                         estimate = results_part_Lasso[1],
                         std.error = results_part_Lasso[2])

# Double-selection Lasso
results_double_Lasso <- summary(double_Lasso)[[1]][1, 1:2]

double_Lasso_tbl <- tibble(method = "Double-selection Lasso",
                           estimate = results_double_Lasso[1],
                           std.error = results_double_Lasso[2])
```

---

# Summary of the Convergence Test

```r
bind_rows(ols_tbl, naive_Lasso_tbl, part_Lasso_tbl, double_Lasso_tbl) %>% 
  kable(digits = 3, format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> method </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS </td>
   <td style="text-align:right;"> -0.009 </td>
   <td style="text-align:right;"> 0.030 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Naive Lasso </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Partialling-out Lasso </td>
   <td style="text-align:right;"> -0.050 </td>
   <td style="text-align:right;"> 0.014 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Double-selection Lasso </td>
   <td style="text-align:right;"> -0.050 </td>
   <td style="text-align:right;"> 0.016 </td>
  </tr>
</tbody>
</table>

The double-selection and partialling-out methods yield significantly **more precise estimates** and lend support to the **conditional convergence hypothesis**.

---

# An Advanced R Package: DoubleML

.pull-left[

- The Python and R packages [{DoubleML}](https://docs.doubleml.org/stable/index.html) offer a modern implementation of the double / debiased machine learning framework.

- For more details, visit the [Getting Started](https://docs.doubleml.org/stable/intro/intro.html) and [Examples](https://docs.doubleml.org/stable/examples/index.html) sections.

- The package is constructed on the [{mlr3}](https://github.com/mlr-org/mlr3) ecosystem.

]
.pull-right[

<img src="figs/dml.png" width="80%" style="display: block; margin: auto;" />

]

---
class: .title-slide-final, center, inverse, middle

# `slides %>% end()`

[<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/lecture-notes-2023/tree/master/09-lasso-ate)

---

# Selected References

Ahrens, A., Hansen, C. B., & Schaffer, M. E. (2019). lassopack: Model selection and prediction with regularized regression in Stata.

Angrist, J. D., & Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? _The Quarterly Journal of Economics_, 106(4), 979–1014.

Angrist, J. D., & Frandsen, B. (2022). Machine labor. _Journal of Labor Economics_, 40(S1), S97–S140.

Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. _Econometrica_, 80(6), 2369–2429.

Belloni, A., & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. _Bernoulli_, 19(2), 521–547.
Belloni, A., Chernozhukov, V., & Hansen, C. (2013). Inference on treatment effects after selection among high-dimensional controls. _Review of Economic Studies_, 81(2), 608–650.

---

# Selected References

Belloni, A., Chernozhukov, V., & Hansen, C. (2014). High-dimensional methods and inference on structural and treatment effects. _Journal of Economic Perspectives_, 28(2), 29–50.

Chernozhukov, V., Hansen, C., & Spindler, M. (2015). Post-selection and post-regularization inference in linear models with many controls and instruments. _American Economic Review_, 105(5), 486–490.

Chernozhukov, V., Hansen, C., & Spindler, M. (2016). hdm: High-Dimensional Metrics. _The R Journal_, 8(2), 185–199.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., & Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects. _American Economic Review_, 107(5), 261–265.

---

# Selected References

Dale, S. B., & Krueger, A. B. (2002). Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables. _The Quarterly Journal of Economics_, 117(4), 1491–1527.

Hünermund, P., Louw, B., & Caspi, I. (2023). Double machine learning and automated confounder selection: A cautionary tale. _Journal of Causal Inference_, 11(1), 20220078.

Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. _Journal of Economic Perspectives_, 31(2), 87–106.

Van de Geer, S. A., & Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. _Electronic Journal of Statistics_, 3, 1360–1392.

Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. _Journal of Machine Learning Research_, 7, 2541–2563.