class: center, middle, inverse, title-slide .title[ # ScPoEconometrics ] .subtitle[ ## Intro To Causality ] .author[ ### Mylène Feuillade, Gustave Kenedi, Florian Oswald and Junnan He ] .date[ ### SciencesPo Paris 2022-09-27 ] --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Today - Introduction to Causal Inference * ***Causality*** versus ***correlation*** * The ***Potential Outcomes Framework*** a.k.a. Rubin's Causal Model * ***Randomized controlled trials*** (RCTs) * Follow up on the empirical application of *class size* and *student performance* --- # Causality and Economics - Making causal inference from data can be seen as economists' *comparative advantage* among the social sciences! - Plenty of fields do statistics. But very few make it standard training for their students to understand causality. - Economists' endeavour to establish causal relationships is also what makes them useful both in the private (e.g. tech companies) and public sector (e.g. policy advisors). -- - Ok, that's enough preaching 😅 --- # The Concept of Causality __Causality__: what are we talking about? - We say that `X` *causes* `Y` -- - if we were to intervene and *change* the value of `X` ***without changing anything else***... -- - then `Y` would also change ***as a result***. -- - The key point here is the ***without changing anything else***, often referred as the **ceteris paribus (*all else equal*) assumption**. (*latin* makes things seem more complicated 🤓) -- - ⚠️ It does **NOT** mean that `X` is the only factor that causes `Y`. --- # Correlation vs Causation ***Correlation does not equal causation*** has become a ubiquitous mantra, but can you tell why it is true? -- Some correlations obviously don't imply causation ([e.g. spurious correlation website](https://www.tylervigen.com/spurious-correlations)). -- <img src="../img/photos/spurious.png" width="800px" style="display: block; margin: auto;" /> --- # Correlation vs Causation: Smoking and Lung Cancer But not all correlations are so easy to rule out -- ***Does smoking cause lung cancer?*** -- .pull-left[ - Today, we know the answer is *YES*! - But let's go back in the 1950's - We are at the start of a big increase in deaths from lung cancer... - ... which is happening after a fast growth in cigarette consumption ] -- .pull-right[ <img src="../img/photos/Smoking_lung_cancer.png" width="400px" style="display: block; margin: auto;" /> ] -- - It's very tempting to claim that smoking causes lung cancer based on this graph. --- # Correlation vs Causation: Smoking and Lung Cancer At the time many people were still skeptical, including some famous statisticians: -- .pull-left[ ***Macro confouding factors***: Other macro factors which can cause cancers also changed between 1900 and 1950: - Tarring of roads, - Inhalation of motor exhausts (leaded gasoline fumes), - General greater air pollution. ] -- .pull-right[ ***Self selection***: Smokers and non-smokers may be different in the first place: - __Selection on observable characteristics__: age, education, income, etc. - __Selection on unobservable characteristics__: genes (the hypothetical confounding genome theory of [Fisher](https://en.wikipedia.org/wiki/Ronald_Aylmer_Fisher)). ] --- # Correlation vs Causation: Other Examples > Why might the observed correlation between ***years of education*** and ***income*** not reflect the causal effect of education? -- *Individuals who choose to get more education likely differ from those who don't: maybe they have higher innate ability, they enjoy schooling and are good at it* `\(\rightarrow\)` ***self-selection*** -- > Why might the observed correlation between the ***employment rate*** and the ***minimum wage*** level not reflect the causal effect of the minimum wage? -- *Policymakers may increase the minimum wage in times when the employment rate is high* `\(\rightarrow\)` ***reverse causality / simultaneity*** -- > Why might the observed correlation between ***economic growth*** and ***financial development*** not reflect the causal effect of the financial sector? -- *Again, maybe economic growth leads to financial development and not the other way around* `\(\rightarrow\)` ***reverse causality / simultaneity*** --- # Link with Economic Theory * Economic theory tells us individuals' behave in order to ***maximise their utility*** * Thus they don't choose to act in ***random ways*** `\(\rightarrow\)` we say that individual's behavior is ***endogenous*** * We should be ***suspicious*** of any correlation found in data -- - How can we make ***causal claims*** then? - The ***Potential Outcomes Framework*** will be our guide. --- layout: false class: title-slide-section-red, middle # Causal Inference --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # The Potential Outcomes Framework Often called the **Rubin Causal Model** in honor of statistician **Donald Rubin** who generalised and formalized this model in the 1970's. -- ***Key idea***: Each individual can be exposed to **multiple alternative treatment states**. - smoking cigarrettes, smoking cigars or not smoking, - growing up in a poor vs a middle class neighborhood vs a rich neighborhood, - being in a small or a big class. -- .pull-left[ For practicality, let this treatment variable `\(D_i\)` be a binary variable: $$ D_i = \begin{cases} 1 \textrm{ if individual `\(i\)` is treated} \\\\ 0 \textrm{ if individual `\(i\)` is not treated} \end{cases} $$ ] -- .pull-right[ ***Treatment group*** all the individuals such that `\(D_i = 1\)`. ***Control group*** all the individuals such that `\(D_i = 0\)`. ] --- # The Potential Outcomes Framework * In this framework, each individual has two ***potential outcomes***, but only one ***observed outcome*** `\(Y_i\)`: - `\(Y_i^1\)`: *potential outcome if individual `\(i\)` receives the treatment* `\((D_i = 1)\)`, - `\(Y_i^0\)`: *potential outcome if individual `\(i\)` does not receive the treatment* `\((D_i = 0)\)`. -- * In real life we only observe `\(Y_i\)` which can be written as: `$$Y_i = D_i \times Y_i^1 + (1- D_i) \times Y_i^0$$` -- * ***Fundamental Problem of Causal Inference***: for any individual `\(i\)`, we only observe one of either potential outcomes [(Holland, 1986)](http://people.umass.edu/~stanek/pdffiles/causal-holland.pdf). --- # The Potential Outcomes Framework * The potential outcome that is not observed exists in principle, it is called the ***counterfactual outcome***. -- Group | `\(Y_i^1\)` | `\(Y_i^0\)` --------|:---------:|:---------: Treatment group `\((D_i = 1)\)` | Observable as `\(Y_i\)` | Counterfactual Control group `\((D_i = 0)\)` | Counterfactual | Observable as `\(Y_i\)` -- * From these we can define the ***individual treatment effect*** `\(\delta_i\)`: $$ \delta_i = Y_i^1 - Y_i^0$$ * `\(\delta_i\)` measures the **causal effect of the treatment `\((D_i)\)`** on outcome `\(Y\)` for individual `\(i\)`. -- * Since the treatment effect *cannot* be observed at the individual level, we estimate population averages. --- # Sidenote: Expectation and Conditional Expectation Let's say you have a fair die and you roll it an infinite number of times. What is the average number rolled? * If `\(X\)` is a random variable containing the number rolled, we write: $$ \mathop{\mathbb{E}}(X) = \frac{1}{6} \times 1 + \frac{1}{6} \times 2 + \frac{1}{6} \times 3 + \frac{1}{6} \times 4 + \frac{1}{6} \times 5 + \frac{1}{6} \times 6 = 3.5 $$ * The `\(\mathop{\mathbb{E}}(.)\)` operator stands for **expectation** or *population mean*. * The `\(\mathop{\mathbb{E}}(.)\)` operator is linear, in other words, `\(\mathop{\mathbb{E}}(X+Y) = \mathop{\mathbb{E}}(X) + \mathop{\mathbb{E}}(Y)\)` with `\(X\)` and `\(Y\)` being two random variables. --- # Sidenote: Expectation and Conditional Expectation Now, let's say you have two fair dice and you roll them an infinite number of times. What is the average sum of numbers rolled, conditional on one of them being always equal to 5? * If `\(X\)` is a random variable containing the number rolled of die 1 and `\(Y\)` a random variable containing the number rolled of die 2, we write: $$ `\begin{align} \mathop{\mathbb{E}}(X+Y|Y = 5) &= \mathop{\mathbb{E}}(X|Y = 5) + \mathop{\mathbb{E}}(Y|Y = 5) \\ &= \mathop{\mathbb{E}}(X) + 5 \\ &= 3.5 + 5 \\ &= 8.5 \end{align}` $$ * The `\(\mathop{\mathbb{E}}(.|D = x)\)` operator stands for **conditional expectation**. It refers to the expectation over a subcategory of the entire population, namely people who satisfy the condition `\(D = x\)`. --- name: ate # Average Treatment Effect (ATE) Broadest possible average effect: `\begin{align} ATE &= \mathop{\mathbb{E}}(\delta_i) \\ &= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0) \\ &= \mathop{\mathbb{E}}(Y_i^1) - \mathop{\mathbb{E}}(Y_i^0) \end{align}` * The ATE simply measures the ***average of individual treatment effects over the whole population***. ([*Appendix:*](#attandatu) Average Treatment on the Treated and Average Treatment on the Untreated) --- # Example: Small vs. Large Class Potential outcomes for students of being in a small `\((Y^1)\)` or large class `\((Y^0)\)` on GPA (0-10): .pull-left[ Student | `\(Y^1\)` | `\(Y^0\)` | `\(\delta\)` -----------|:---------:|:---------:|:---------:| 1 | 5 | 2 | 3 2 | 6 | 4 | 2 3 | 3 | 6 | -3 4 | 5 | 4 | 1 5 | 10 | 8 | 2 6 | 2 | 4 | -2 7 | 5 | 2 | 3 8 | 6 | 4 | 2 9 | 2 | 9 | -7 10 | 8 | 2 | 6 Average | 5.2 | 4.5 | 0.7 ] -- .pull-right[ $$ `\begin{align} \color{#d90502}{\text{ATE}} &= \mathbb{E}(\delta) \\ &=\mathbb{E}(Y^1) - \mathbb{E}(Y^0) \\ &= 5.2 - 4.5 \\ &= 0.7 \end{align}` $$ `\(\rightarrow\)` the ***average*** causal effect of being in small relative to large class on GPA is 0.7 points. ⚠️ not all students benefited equally from the treatment! ] --- # The Problem of Causal Inference * In practice, we have the same **missing data problem** for computing the ATE as we did for `\(\delta_i\)`. Either `\(Y_i^1\)` or `\(Y_i^0\)` is missing for each `\(i\)`. -- * From the data, we can compute the **S**imple **D**ifference in mean **O**utcomes (***SDO***) for both groups: $$ `\begin{align} SDO &= \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) \\ &= \underbrace{\frac{1}{N_T}\sum_{i=1}^{N_T}(Y_i|D_i=1)}_{\text{average outcome of treatment group}} - \underbrace{\frac{1}{N_C}\sum_{i=1}^{N_C}(Y_i|D_i=0)}_{\text{average outcome of control group}} \end{align}` $$ --- # Simple Difference in Mean Outcomes: With An Example Now, imagine a benevolent and omniscient school director assigns students to the treatment that maximizes their GPA -- Student | `\(Y^1\)` | `\(Y^0\)` | `\(\delta\)` | `\(Y\)` | `\(D\)` -----------|:---------:|:---------:|:---------:|:-----------:|:---------:|:---------: 1 | 5 | 2 | 3 | | 2 | 6 | 4 | 2 | | 3 | 3 | 6 | -3 | | 4 | 5 | 4 | 1 | | 5 | 10 | 8 | 2 | | 6 | 2 | 4 | -2 | | 7 | 5 | 2 | 3 | | 8 | 6 | 4 | 2 | | 9 | 2 | 9 | -7 | | 10 | 8 | 2 | 6 | | --- # Simple Difference in Mean Outcomes: With An Example Now, imagine a benevolent and omniscient school director assigns students to the treatment that maximizes their GPA Student | `\(Y^1\)` | `\(Y^0\)` | `\(\delta\)` | `\(Y\)` | `\(D\)` -----------|:---------:|:---------:|:---------:|:-----------:|:---------:|:---------: 1 | 5 | 2 | 3 | 5 | 1 2 | 6 | 4 | 2 | 6 | 1 3 | 3 | 6 | -3 | 6 | 0 4 | 5 | 4 | 1 | 5 | 1 5 | 10 | 8 | 2 | 10 | 1 6 | 2 | 4 | -2 | 4 | 0 7 | 5 | 2 | 3 | 5 | 1 8 | 6 | 4 | 2 | 6 | 1 9 | 2 | 9 | -7 | 9 | 0 10 | 8 | 2 | 6 | 8 | 1 --- # Simple Difference in Mean Outcomes: With An Example .pull-left[ Student | `\(Y\)` | `\(D\)` | `\(\delta\)` -----------|:---------:|:---------:|:---------: 1 | 5 | 1 | 3 2 | 6 | 1 | 2 3 | 6 | 0 | -3 4 | 5 | 1 | 1 5 | 10 | 1 | 2 6 | 4 | 0 | -2 7 | 5 | 1 | 3 8 | 6 | 1 | 2 9 | 9 | 0 | -7 10 | 8 | 1 | 6 Average | | | 0.7 ] .pull-right[ The simple difference in mean outcomes: $$ `\begin{align} SDO &= \frac{5+6+5+10+5+6+8}{7} - \frac{6+4+9}{3} \\ &\approx 6.43 - 6.33 \approx 0.1 \end{align}` $$ * The SDO is much smaller than the ATE! * Such a difference **will (almost always) fail to capture the causal treatment effect** * Notice that this kind "naive" comparison is often done by journalists, politicians, badly trained scientists (but not you now! 😉) ] --- name: naive_comp # Problems with Naive Comparisons Let's rewrite the SDO to make the individual treatment effect `\((\delta_i)\)` appear in the equation. `\begin{align} SDO &= \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) \\ &= \mathop{\mathbb{E}}(Y_i^0 + \delta_i | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0) \end{align}` -- For simplicity, suppose **treatment effect is constant** across people: for all `\(i, \delta_i = \delta\)`. -- Then, $$ SDO = \delta + \mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0) $$ And because `\(ATE = \mathop{\mathbb{E}}(\delta_i) = \mathop{\mathbb{E}}(\delta) = \delta\)` (by assumption), we get: `\begin{equation} SDO = ATE + \underbrace{\mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)}_\text{Selection bias} \end{equation}` ([*Appendix*](#naive_comp_extended): when constant treatment assumption is relaxed another bias term appears.) --- class:inverse # Task 1: SDO, ATE and Randomization
−
+
10
:
00
Let's compute these various quantities and biases with data we generated ourselves. 1. Load the data [here](https://www.dropbox.com/s/oqc3vq1m9eyp1d6/toy_data_2.csv?dl=1) using `read.csv`. The `group` variable corresponds to whether the individual has been treated or not, `Y0` to the potential outcome if the individual does not receive the treatment `\((Y_i^0)\)` while `Y1` to the potential outcome if the individual receives the treatment `\((Y_i^1)\)`. Create a new variable containing the observed outcome `\((Y_i)\)` and the individual treatment effect `\((\delta_i)\)`. Recall `\(Y_i = D_i \times Y_i^1 + (1 - D_i) \times Y_i^0\)`, `\(\delta_i = Y_i^1 - Y_i^0\)`. 1. Compute the ***ATE*** and the ***SDO***. Is there is any *bias*? Is it large in magnitude? 1. In this new [dataset](https://www.dropbox.com/s/0qfsonzz1t9gkxi/toy_data_random.csv?dl=1) we've randomly assigned the same individuals to the treatment and control groups. Compute the ***SDO under randomization***. Remember that you need to recompute `\(Y_i\)` because the assignment is not the same anymore. If you got it right, the bias should be very close to 0. Why is it not exactly 0? 1. *Optional*: Compute the ***selection bias*** and the ***heterogeneous treatment effect bias*** and check that `\(SDO = ATE + \text{selection bias} + \text{heterogenous treatment effect bias}\)` --- # Randomization solves the problem of causal inference! * ***Randomized experiments***: you ***randomly*** assign people to a treatment and a control group. * In this case, the treatment assignment is **independent** of the potential outcomes. -- * In particular, there is no reason for `\(\mathop{\mathbb{E}}(Y_i^0 | D_i = 1)\)` to be different from `\(\mathop{\mathbb{E}}(Y_i^0 | D_i = 0)\)` * Therefore the ***selection bias is equal to 0***. -- * With random assignment we have: $$ SDO = \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) = ATE$$ 👉 We can directly estimate the ATE from the data! --- layout: false class: title-slide-section-red, middle # Randomized Experiments --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- # Randomized Experiments - Often called **R**andomized **C**ontrolled **T**rials (RCT). - The first RCTs were conducted a long time ago (18th and 19th century), mainly in **Medecine**. - In the beginning of the 20th century they were popularized by famous statisticians like **J. Neyman** or **R.A. Fisher**. - Since then they have had a growing influence and have progressively become a reliable [tool for public policy evaluation](https://www.povertyactionlab.org/fr). - As for economics, the **2019 Nobel Price in Economics** was awarded to three exponents of RCTs, [Abhijit Banerjee, Esther Duflo and Michael Kremer](https://www.economist.com/finance-and-economics/2019/10/17/a-nobel-economics-prize-goes-to-pioneers-in-understanding-poverty), "for their experimental approach to alleviating global poverty". --- # Back to class size and students' achievement Last week we regressed average student math or reading scores on class size. `$$\textrm{math score}_i = b_0 + b_1 \textrm{class size}_i + e_i$$` We briefly discussed why `\(b_1^{OLS}\)` could only establish an ***association*** and not a ***causal relationship***. -- * **Student sorting**: There is selection into schools with different sized classes. Suppose parents have a prior that smaller classes are better, they will try to get their kids into those schools. -- * **Teacher sorting**: Teachers may sort in schools with smaller classes because it’s easier to teach a small rather than a large class, and if there is competition for those places then higher quality teachers will have an advantage. -- * **Location effect**: Large classes may be more common in wealthier and bigger cities, while small classes may be more likely in poorer rural areas. -- An RCT would take care of all these biases! --- # The Project STAR Experiment Tennessee **S**tudent/**T**eacher **A**chievement **R**atio Experiment (see [Krueger (1999)](http://piketty.pse.ens.fr/files/Krueger1999.pdf)) * Funded by Tennesse legislature for a total cost of approx. $12 million. * The experiment started in the 1985-1986 school year and lasted four years. -- * 11,600 students and their teachers where **randomly assigned** to one of the following 3 groups from kindergarten through third grade: 1. ***Small class***: 13-17 students per teacher, 2. ***Regular class***: 22-25 students, 3. ***Regular/aide class***: 22-25 students with a full-time teacher's *aide*. -- * Randomization occurred within schools. * Students' math and reading skills were tested around March each year. -- * There was a problem of ***non-random attrition*** but we will ignore it. --- class:inverse # Task 2: STAR data
−
+
10
:
00
1. Load the *STAR* data from [here](https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1) and assign it to an object called `star_df`. Read the help for the data [here](https://rdrr.io/cran/AER/man/STAR.html) to understand what the variables correspond to. (Note: the data has been *reshaped* so don't mind the "k", "1", etc. in the variable names in the help.) 1. What's the unit of observation? Which variable contains: (i) the (random) class assignment, (ii) the student's class grade, (iii) the outcomes of interest? 1. How many observations are there? Why so many if 11,598 students participated? Why are there so many `NA`s? What do they correspond to? 1. Keep only cases with no `NA`s with the following code: `star_df <- star_df[complete.cases(star_df),]` 1. Let's check how well the randomization was done by doing ***balancing checks***. Compute the average percentage of girls, african americans, and free lunch qualifiers by grade *and* treatment class. (*Hint*: The following computes the percentage of girls (without the relevant `dplyr` verbs): `share_female = mean(gender == "female") * 100`.) --- # The Project STAR Experiment We just saw that in an RCT the Average Treatment Effect is obtained by computing the differences in outcomes between the treatment and control groups. Let's only focus on: - One treatment group: **small classes**, - One control group: **regular classes**, - One grade: **kindergarten** (k). -- grade | test | mean regular | mean small | ATE --------|---------|:---------:|:---------:|:---------: k | math | 484.45 | 493.34 | 8.9 k | read | 435.76 | 441.13 | 5.37 What's the interpretation for these ATEs? -- That's nice but can't we put this in regression form? --- # RCT in Regression Form $$ Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0 $$ -- Factoring by `\(D_i\)` and replacing `\(Y_i^1 - Y_i^0\)` by `\(\delta_i\)`, we get: `$$\begin{align} Y_i &=Y_i^0 +D_i (Y_i^1 - Y_i^0) \\ &= Y_i^0 +D_i \delta_i \end{align}$$` -- Assuming `\(\delta_i = \delta\)`, for all `\(i\)`, `$$Y_i = Y_i^0 + D_i \delta$$` -- Adding `\(\mathbb{E}[Y_i^0] - \mathbb{E}[Y_i^0] = 0\)` to the right-hand side: `$$\begin{align} Y_i &= \mathbb{E}[Y_i^0] + D_i \delta + Y_i^0 - \mathbb{E}[Y_i^0] \\ &= b_0 + \delta D_i + e_i \end{align}$$` where `\(b_0 = \mathbb{E}[Y_i^0]\)` and `\(e_i = Y_i^0 - \mathbb{E}[Y_i^0]\)` --- # The Project STAR Experiment: Regression The last equation looks exactly like the simple regression model we saw last week! (with `\(\delta = b_1\)`) Let's therefore estimate the ATE of being assigned to a small class size on math scores. -- We want to estimate the following model: `\(\text{math score}_i = b_0 + \delta \text{small}_i + e_i\)`, with $$ \text{small}_i = \begin{cases} 1 \textrm{ if assigned to a small class} \\\\ 0 \textrm{ if assigned to a regular class} \end{cases} $$ .pull-left[ ```r star_df_k_small <- star_df %>% filter(star %in% c("regular", "small") & grade == "k") %>% mutate(small = (star == "small")) ``` ```r star_df_k_small %>% count(star, grade) star_df_k_small %>% count(small) ``` ] .pull-left[ ``` ## # A tibble: 2 × 3 ## star grade n ## <chr> <chr> <int> ## 1 regular k 1781 ## 2 small k 1578 ``` ``` ## # A tibble: 2 × 2 ## small n ## <lgl> <int> ## 1 FALSE 1781 ## 2 TRUE 1578 ``` ] --- # The Project STAR Experiment: Regression Regression model we want to estimate: `\(\text{math score}_i = b_0 + \delta \text{small}_i + e_i\)` ```r lm(math ~ small, star_df_k_small) ``` ``` ## ## Call: ## lm(formula = math ~ small, data = star_df_k_small) ## ## Coefficients: ## (Intercept) smallTRUE ## 484.446 8.895 ``` -- Recall that: `\(b_0 = \mathbb{E}[Y_i^0]\)` and `\(\delta = \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]\)` .pull-left[ ```r b_0 = mean(star_df_k_small$math[ star_df_k_small$small == FALSE]) b_0 ``` ``` ## [1] 484.4464 ``` ] .pull-right[ ```r delta = mean(star_df_k_small$math[ star_df_k_small$small == TRUE]) - mean(star_df_k_small$math[ star_df_k_small$small == FALSE]) delta ``` ``` ## [1] 8.895193 ``` ] --- # Regression with a Dummy Variable: Graphically Contrary to last week, the regressor in our regression is a ***dummy variable***, i.e. a variable that takes the values TRUE or FALSE (1 or 0). <img src="chapter_causality_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> --- # Regression with a Dummy Variable: Graphically Contrary to last week, the regressor in our regression is a ***dummy variable***, i.e. a variable that takes the values TRUE or FALSE (1 or 0). <img src="chapter_causality_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> --- # Regression with a Dummy Variable: Graphically Contrary to last week, the regressor in our regression is a ***dummy variable***, i.e. a variable that takes the values TRUE or FALSE (1 or 0). <img src="chapter_causality_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" /> --- # Regression with a Dummy Variable: Graphically Contrary to last week, the regressor in our regression is a ***dummy variable***, i.e. a variable that takes the values TRUE or FALSE (1 or 0). <img src="chapter_causality_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- # Regression with a Dummy Variable: Formally Recall the regression model: `\(\text{math score}_i = b_0 + \delta \text{small}_i + e_i\)` `\(\begin{align} \mathbb{E}[\textrm{math score} | \text{small}_i = 0]&= \mathbb{E}[b_0 + \delta \text{small}_i + e_i | \text{small}_i = 0] \\ &= b_0 + \delta \mathbb{E}[\text{small}_i| \text{small}_i = 0] + \mathbb{E}[e_i|\text{small}_i = 0] \\ &= b_0 \end{align}\)` -- `\(\begin{align} \mathbb{E}[\textrm{math score} | \text{small}_i = 1]&= \mathbb{E}[b_0 + \delta \text{small}_i + e_i | \text{small}_i = 1] \\ &= b_0 + \delta \mathbb{E}[\text{small}_i| \text{small}_i = 1] + \mathbb{E}[e_i|\text{small}_i = 1] \\ &= b_0 + \delta \end{align}\)` -- `\(\begin{align} ATE &= \mathbb{E}[\textrm{math score} | \text{small}_i = 1] - \mathbb{E}[\textrm{math score} | \text{small}_i = 0] \\ &= b_0 + \delta - b_0 \\ &= \delta \end{align}\)` -- We knew this already but we now understand why this is true ✌️ --- class:inverse # Task 3: Your Turn!
−
+
10
:
00
Run the following code to filter the dataset to only keep first graders and the small class and regular class groups: ```r star_df_clean <- star_df %>% filter(grade == "1" & star %in% c("small", "regular")) ``` 1. Compute the average math score for both groups, and the difference between the two. (Use base `R`.) 1. Create a dummy variable `treatment` equal to `TRUE` if student is in treatment group (i.e. small class size) and `FALSE` if in control group (i.e. regular class size). *Hint:* you can create the dummy variable with `treatment = (star == "small")`. 1. Regress math score on the treatment dummy variable. Are the results in line with question 2? 1. How do you interpret these coefficients? --- # Shortcomings of RCTs RCTs have very strong ***internal validity***, that is they can convincingly establish causal links. However, they have some shortcomings: * RCT are often **infeasible**: * RCTs are **costly**, * RCTs may face some **ethical issues**: some *treatments* simply cannot be given to people, * RCTs take time and we may be **time constrained**. -- * **Interpretation** of the results: * ***External validity***: To what extent can the results from a given RCT be generalized to other contexts (countries, populations,...)? * Uncovering the mechanisms that are at stake may be difficult, * Imperfect randomization, attrition, ... --- # What comes next? * So if we cannot rely on RCTs to make our life easy, it means we have to find a way to make causal inference from ***observational data*** (as opposed to ***experimental data***). -- 2 broad cases: * ***selection occurs on observable characteristics***: *multiple regression* (next week!) * ***selection occurs on unobservable characteristics***: *regression discontinuity design* (lecture 10) or *difference-in-differences* (not covered but set of slides online!) --- # On the way to causality ✅ How to manage data? Read it, tidy it, visualise it... 🚧 How to summarise relationships between variables? Simple linear regression... to be continued ✅ **What is causality?** ❌ What if we don't observe an entire population? ❌ Are our findings just due to randomness? ❌ How to find exogeneity in practice? --- <br> <br> .center[ <img src="../img/photos/correlation_funny.png" width="1000px" style="display: block; margin: auto;" /> ] --- class: title-slide-final, middle background-image: url(../img/logo/ScPo-econ.png) background-size: 250px background-position: 9% 19% # SEE YOU NEXT WEEK! | | | | :--------------------------------------------------------------------------------------------------------- | :-------------------------------- | | <a href="https://github.com/ScPoEcon/ScPoEconometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides | | <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>] | Book | | <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>] | @ScPoEcon | | <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>] | @ScPoEcon | --- layout: false class: title-slide-section-red, middle # Appendix --- layout: true <div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div> --- name: attandatu # Average Treatment on the Treated and on the Untreated Other ***conditional*** average treatment effects may be of interest: .pull-left[ **A**verage **T**reatment on the **T**reated (***ATT***) `\begin{align} ATT &= \mathop{\mathbb{E}}(\delta_i | D_i = 1) \\ &= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0 | D_i = 1) \\ &= \mathop{\mathbb{E}}(Y_i^1 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 1) \end{align}` The ATT measures the ***average treatment effect conditional on being in the treatment group***. *Example:* the effect of participating in a training program (*treatment*) for those who participated (*treatment group*). ] .pull-right[ **A**verage **T**reatment on the **U**ntreated (***ATU***) `\begin{align} ATU &= \mathop{\mathbb{E}}(\delta_i | D_i = 0) \\ &= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0 | D_i = 0) \\ &= \mathop{\mathbb{E}}(Y_i^1 | D_i = 0) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0) \end{align}` The ATU measures the ***average treatment effect conditional on being in the control group***. *Example:* the effect of attending a private school (*treatment*) for students from a public school (*control group*). ] *Note:* In the majority of cases, ATE `\(\neq\)` ATT `\(\neq\)` ATU! <span> </span> [*back*](#ate) --- name: naive_comp_extended # Problems with Naive Comparisons Let's now relax the assumption that the treatment effect is constant among all individuals. After [some tedious calculations](https://mixtape.scunning.com/potential-outcomes.html#simple-difference-in-means-decomposition) that we skip, the SDO can now be decomposed as: `\begin{align} SDO &= ATE + \underbrace{\mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)}_\text{Selection bias} \\ & \quad \quad \quad \quad + \underbrace{(1-\pi)(ATT - ATU)}_\text{Heterogenous treatment effect bias} \end{align}` where `\(1 - \pi\)` denotes the share of people in the control group. So there is a novel source of bias that comes from the potential ***heterogeneity in the individual treatment effect*** `\(\delta_i\)`. * ***Selection bias***: those who attend university are likely to have higher baseline cognitive skills (regardless of whether they actually attend college). * ***Heterogeneous treatment effect bias***: those who attend university may improve their cognitive skills more at university because they are more motivated. <span> </span> [back](#naive_comp)