class: center, middle, inverse, title-slide

# Applied Data Analysis for Public Policy Studies

## Intro To Causality

### Michele Fioretti

### SciencesPo Paris
### 2022-08-29

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Recap from last week

* *Simple Linear Regression Model*: `\(y_i = b_0 + b_1 x_i + e_i\)`

* *Ordinary Least Squares* (OLS) estimation: minimize the sum of squared errors

* `R` command to estimate a linear model: `lm(dependent variable ~ independent variable, data)`

## Today - Introduction to causal inference

* Causality versus correlation

* The Potential Outcomes Framework, a.k.a. Rubin's Causal Model

* Randomized controlled trials (RCTs)

* Follow-up on the empirical application to *class size* and *student performance*

---

# Why do we care about causality?

Many of the *interesting questions* we might want to answer with data are causal.

--

- ***Understanding*** the world

  - *Social sciences*: Why do people behave the way they do?

  - *Health sciences*: Why do people get sick? Which medicine can cure them?

--

- Causal understanding is also of first-order interest to **policymakers**

  - How to lower unemployment?

  - How to improve student learning?

  - Should governments care about the level of public debt?

--

- Note that some questions we might want to answer are not causal

  - Most *Artificial Intelligence* tasks only care about **prediction**

  - *Example*: predicting whether a photo is of a dog or a cat is vital to how Google Images works, but it doesn't care what *caused* the photo to be of a dog or a cat.

---

# Causality and Economics

- Making causal inferences from data can be seen as economists' *comparative advantage* among the social sciences!

  - Plenty of fields do statistics. But very few make it standard training for their students to understand causality.

  - Economists' endeavour to establish causal relationships is also what makes them useful in both the private sector (e.g. tech companies) and the public sector (e.g. policy advisors).

--

- Ok, that's enough preaching!

---

# The Concept of Causality

__Causality__: what are we talking about?

- We say that `X` *causes* `Y`...

--

  - ...if we were to intervene and *change* the value of `X` ***without changing anything else***...

--

  - ...then `Y` would also change ***as a result***.

--

- The key point here is the ***without changing anything else***, often referred to as the **ceteris paribus (*all else equal*) assumption**. (*Latin* makes things seem more complicated!)

--

- ⚠️ It does **NOT** mean that `X` is the only factor that causes `Y`.

---

# Correlation vs Causation

***Correlation does not equal causation*** has become a ubiquitous mantra, but can you tell why it is true?

--

Some correlations obviously don't imply causation ([e.g. the spurious correlations website](https://www.tylervigen.com/spurious-correlations)).

--

<img src="../img/photos/spurious.png" width="800px" style="display: block; margin: auto;" />

---

# Correlation vs Causation: Smoking and Lung Cancer

But not all correlations are so easy to rule out.

--

***Does smoking cause lung cancer?***

--

.pull-left[
- Today, we know the answer is *YES*!

- But let's go back to the 1950s.

- We are at the start of a big increase in deaths from lung cancer...
- ...which is happening after a fast growth of cigarette consumption.
]

--

.pull-right[
<img src="../img/photos/Smoking_lung_cancer.png" width="400px" style="display: block; margin: auto;" />
]

--

- It's very tempting to claim that smoking causes lung cancer based on this graph.

---

# Correlation vs Causation: Smoking and Lung Cancer

At the time, many people were still skeptical, including some famous statisticians:

--

.pull-left[
***Macro confounding factors***: Other macro factors which can cause cancer also changed between 1900 and 1950:

- tarring of roads,

- inhalation of motor exhausts (leaded gasoline fumes),

- generally greater air pollution.
]

--

.pull-right[
***Self-selection***: Smokers and non-smokers may be different in the first place:

- __Selection on observable characteristics__: age, education, income, etc.

- __Selection on unobservable characteristics__: genes (the hypothetical confounding-genome theory of [Fisher](https://fr.wikipedia.org/wiki/Ronald_Aylmer_Fisher)).
]

---

# Correlation vs Causation: Smoking and Lung Cancer

- Let's focus on one of these potential confounding characteristics: **age**.

--

- Based on [Cochran (1968)](https://www.jstor.org/stable/2528036?origin=crossref&seq=11#metadata_info_tab_contents), we will use death rates in Canada, the U.K. and the U.S.

  - In each country, 3 groups of men were studied, for 20 months (U.S.) to five years (U.K.).

--

- The following table gives the death rates per 1,000 person-years among these men.

<img src="../img/photos/raw_death_rates.png" width="500px" style="display: block; margin: auto;" />

- Cigar/pipe smokers are 30% (U.S.) to 75% (Canada) more likely to die than non-smokers.

---

# Correlation vs Causation: Smoking and Lung Cancer

* But the age *distribution* is also very different depending on smoking status.

* The following table gives the mean age by smoking status.

<img src="../img/photos/age_by_smoking_status.png" width="600px" style="display: block; margin: auto;" />

* Because health is likely to deteriorate with age, the previous table could be far from giving causal estimates.

---

class: inverse

# Task 1 (5 minutes)

Let's adjust our Canadian statistics, **taking the age distribution into account**.

| | Death rate of <br> pipe smokers | Nb. of <br> pipe smokers | Nb. of <br> non-smokers |
|-----------|:----:|:----:|:----:|
| Age 20-50 | 15 | 11 | 29 |
| Age 50-70 | 35 | 13 | 9 |
| Age 70+   | 50 | 16 | 2 |
| Total     |    | 40 | 40 |

1. Compute the average death rate for pipe smokers in Canada.

1. What would the average death rate be for pipe smokers if they had the same age distribution as non-smokers?

---

# Correlation vs Causation: Smoking and Lung Cancer

Here is the age-adjusted death rates table found by [Cochran (1968)](https://www.jstor.org/stable/2528036?origin=crossref&seq=11#metadata_info_tab_contents).

<img src="../img/photos/adjusted_death_rates.png" width="600px" style="display: block; margin: auto;" />

--

- Cigars/pipes seem much less dangerous than in the previous table...

- ...but this table still does not give us the causal effect of smoking.

- There are still a lot of ***confounding factors*** that are not accounted for.

---

# How Can We Tell?

- Sometimes correlations are pure *artefacts*: there is no causal relationship between the variables of interest.

- In other cases, there is both correlation and causation, but not of the same **magnitude**, or even of the same **direction**.

- So how can we make causal inference then?
--

- The ***Potential Outcomes Framework*** will be our guide.

---

layout: false
class: title-slide-section-red, middle

# Causal Inference

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# The Potential Outcomes Framework

Often called the **Rubin Causal Model**, after the statistician **Donald Rubin**, who generalized and formalized this model in the 1970s.

--

***Key idea***: Each individual can be exposed to **multiple alternative treatment states**:

- smoking cigarettes, smoking cigars, or not smoking,

- growing up in a poor vs. a middle-class vs. a rich neighborhood,

- being in a small or a big class.

--

.pull-left[
For practicality, let this treatment variable `\(D_i\)` be a binary variable:

$$
D_i = \begin{cases}
1 \textrm{ if individual } i \textrm{ is treated} \\\\
0 \textrm{ if individual } i \textrm{ is not treated}
\end{cases}
$$
]

--

.pull-right[
***Treatment group***: all the individuals such that `\(D_i = 1\)`.

***Control group***: all the individuals such that `\(D_i = 0\)`.
]

---

# The Potential Outcomes Framework

* In this framework, each individual has two ***potential outcomes***:

  - `\(Y_i^1\)`: *potential outcome in the treatment state* `\((D_i = 1)\)` for individual `\(i\)`,

  - `\(Y_i^0\)`: *potential outcome in the control state* `\((D_i = 0)\)` for individual `\(i\)`.

--

* From these we can define the ***individual treatment effect***:

$$ \delta_i = Y_i^1 - Y_i^0 $$

* `\(\delta_i\)` measures the **causal effect of `\(D_i\)` (treatment)** on outcome `\(Y\)` for individual `\(i\)`.

--

* In real life we only observe `\(Y_i\)`, which can be written as:

`$$Y_i = D_i * Y_i^1 + (1- D_i)*Y_i^0$$`

--

* ***Fundamental Problem of Causal Inference***: for any individual `\(i\)`, we only observe one of the two potential outcomes, and thus we cannot compute `\(\delta_i\)` [(Holland, 1986)](http://people.umass.edu/~stanek/pdffiles/causal-holland.pdf).

---

# The Potential Outcomes Framework

<img src="../img/photos/fundamental_pb_inference.png" width="700px" style="display: block; margin: auto;" />

* The potential outcome that is not observed still exists in principle; it is called the ***counterfactual outcome***.

  * For example: what your test score would have been had you been in a big class, knowing that you were in a small one.

--

* Since the treatment effect *cannot* be observed at the individual level, we estimate population averages.

.footnote[
This table is from Morgan and Winship (2015, p. 46).
]

---

# Average Treatment Effect (ATE)

Broadest possible average effect: the **A**verage **T**reatment **E**ffect (***ATE***)

`\begin{align}
ATE &= \mathop{\mathbb{E}}(\delta_i) \\
&= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0) \\
&= \mathop{\mathbb{E}}(Y_i^1) - \mathop{\mathbb{E}}(Y_i^0)
\end{align}`

* The ATE simply measures the ***average of individual treatment effects over the whole population***.

* The `\(\mathop{\mathbb{E}}(.)\)` operator stands for **expectation**, or *population mean*.

* The `\(\mathop{\mathbb{E}}(.)\)` operator is linear; in other words, `\(\mathop{\mathbb{E}}(X+Y) = \mathop{\mathbb{E}}(X) + \mathop{\mathbb{E}}(Y)\)`, with `\(X\)` and `\(Y\)` being two random variables.
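---

# The Potential Outcomes Framework: a Toy Example

To make these objects concrete, here is a minimal `R` sketch in which we *simulate* both potential outcomes. (All names and numbers below are made up for illustration; computing the ATE this way is only possible because we generate the data ourselves.)

```r
# toy data: simulate BOTH potential outcomes for n individuals
set.seed(1)
n     <- 1000
y0    <- rnorm(n, mean = 50, sd = 10)  # Y_i^0: outcome if untreated
delta <- rnorm(n, mean = 2, sd = 1)    # delta_i: individual treatment effect
y1    <- y0 + delta                    # Y_i^1: outcome if treated

mean(y1 - y0)  # the ATE, computable only because this is a simulation
```

With real data we would observe only one of `y0` or `y1` for each individual, so `mean(y1 - y0)` could never be computed: this is the fundamental problem of causal inference at work.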
---

# Average Treatment Effect on the Treated (ATT)

Other ***conditional*** average treatment effects may be of interest:

* The **A**verage **T**reatment Effect on the **T**reated (***ATT***)

`\begin{align}
ATT &= \mathop{\mathbb{E}}(\delta_i | D_i = 1) \\
&= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0 | D_i = 1) \\
&= \mathop{\mathbb{E}}(Y_i^1 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 1)
\end{align}`

* The ATT measures the ***average treatment effect conditional on being in the treatment group***.

* The `\(\mathop{\mathbb{E}}(.|D = x)\)` operator stands for **conditional expectation**. It refers to the expectation over a subcategory of the population, namely the people who satisfy the condition `\(D = x\)`.

* The `\(\mathop{\mathbb{E}}(.|D = x)\)` operator is also linear.

---

# Average Treatment Effect on the Untreated (ATU)

Other ***conditional*** average treatment effects may be of interest:

* The **A**verage **T**reatment Effect on the **U**ntreated (***ATU***)

`\begin{align}
ATU &= \mathop{\mathbb{E}}(\delta_i | D_i = 0) \\
&= \mathop{\mathbb{E}}(Y_i^1 - Y_i^0 | D_i = 0) \\
&= \mathop{\mathbb{E}}(Y_i^1 | D_i = 0) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)
\end{align}`

* The ATU measures the ***average treatment effect conditional on being in the control group***.

--

* In the majority of cases, ATE `\(\neq\)` ATT `\(\neq\)` ATU!

---

# The Problem of Causal Inference

* We face the same **missing data problem** when computing the ATE, ATT or ATU as we did for `\(\delta_i\)`: either `\(Y_i^1\)` or `\(Y_i^0\)` is missing for each `\(i\)`.

--

* From the data, we can compute the **S**imple **D**ifference in mean **O**utcomes (***SDO***) between the two groups:

$$ SDO = \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) $$

--

* For example, it would consist of taking the difference between:

  * death rates for smokers and non-smokers,

  * GDP growth rates for democratic and non-democratic countries,

  * unemployment rates for countries with and without a minimum wage.

--

* Most of the time, such a difference **will fail to capture the causal effect** of the treatment.

--

* Notice that this kind of raw comparison is often made by journalists, politicians, and badly trained scientists (but not by you, from now on!).

---

# The Problem of Causal Inference

Let's rewrite the SDO to make the individual treatment effect `\((\delta_i)\)` appear in the equation.

--

`\begin{align}
\mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) &= \mathop{\mathbb{E}}(Y_i^0 + \delta_i | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)
\end{align}`

--

For now, suppose the **treatment effect is constant** across people: `\(\forall i, \delta_i = \delta\)`. Then,

`\begin{align}
\mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) &= \delta + \mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)
\end{align}`

--

And because `\(ATE = \mathop{\mathbb{E}}(\delta_i) = \mathop{\mathbb{E}}(\delta) = \delta\)` (by assumption), we get:

`\begin{align}
SDO &= \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) \\
&= ATE + \underbrace{\mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)}_\text{Selection bias}
\end{align}`

---

# The Problem of Causal Inference

Let's now relax the assumption that the treatment effect is constant across individuals.
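Here is a sketch of the algebra that we would otherwise skip. Add and subtract `\(\mathop{\mathbb{E}}(Y_i^0 | D_i = 1)\)` in the SDO:

`\begin{align}
SDO &= \underbrace{\mathop{\mathbb{E}}(Y_i^1 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 1)}_{=ATT} + \mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)
\end{align}`

Then use the fact that the ATE is a weighted average of the ATT and the ATU, `\(ATE = \pi ATT + (1-\pi) ATU\)`, with `\(\pi\)` the share of treated individuals; rearranging gives `\(ATT = ATE + (1-\pi)(ATT - ATU)\)`.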
Substituting the second expression into the first, the SDO now decomposes as:

`\begin{align}
SDO &= ATE + \underbrace{\mathop{\mathbb{E}}(Y_i^0 | D_i = 1) - \mathop{\mathbb{E}}(Y_i^0 | D_i = 0)}_\text{Selection bias} \\
& \quad \quad \quad \quad + \underbrace{(1-\pi)(ATT - ATU)}_\text{Heterogeneous treatment effect bias}
\end{align}`

where `\(1 - \pi\)` denotes the share of people in the control group.

--

So there is a new source of bias that comes from the potential ***heterogeneity in the individual treatment effect*** `\(\delta_i\)`.

--

* ***Selection bias***: those who attend university are likely to have higher baseline cognitive skills (regardless of whether they actually attend).

* ***Heterogeneous treatment effect bias***: those who attend university may improve their cognitive skills more at university because they are more motivated.

---

class: inverse

# Task 2 (10 minutes)

Let's compute these various quantities and biases with some toy data (i.e. data we generated ourselves).

1. Load the data [here](https://www.dropbox.com/s/a4u67nmu9gnl94y/toy_data.csv?dl=1). Notice that `Di_random` is the treatment status we would have under random assignment.

1. Create the following variables: `\(Y_i\)` and `\(\delta_i\)`. Recall that `\(Y_i = D_i * Y_i^1 + (1 - D_i) Y_i^0\)` and `\(\delta_i = Y_i^1 - Y_i^0\)`.

1. Compute the ***ATE*** and the ***SDO***. (Use base `R`.) Is there any *bias*? Is it large in magnitude?

1. Using `Di_random`, compute the ***SDO under randomization***. Remember that you need to recompute `\(Y_i\)` because the assignment is not the same anymore. If you got it right, the bias should be very close to 0. Why is it not exactly 0?

1. *To do at home*: compute the value of the ***selection bias*** and the ***heterogeneous treatment effect bias*** and check that we have
`$$SDO = ATE + \textrm{selection bias} + \textrm{heterogeneous treatment effect bias}$$`

---

# Randomization solves the problem of causal inference!

* ***Randomized experiments***: you ***randomly*** assign people to a treatment and a control group.

* In this case, the treatment status is **independent** of the potential outcomes.

--

* In particular, there is no reason for `\(\mathop{\mathbb{E}}(Y_i^0 | D_i = 1)\)` to differ from `\(\mathop{\mathbb{E}}(Y_i^0 | D_i = 0)\)`:

  * therefore, the ***selection bias is equal to 0***.

--

* In the same way, there is no reason for `\(\mathop{\mathbb{E}}(\delta_i)\)` to differ between the treatment and control groups:

  * therefore `\(ATT = ATU\)`, implying that the ***heterogeneous treatment effect bias is also 0***.

---

# Randomization solves the problem of causal inference!

With random assignment we have:

$$ \mathop{\mathbb{E}}(Y_i^1|D_i=1) - \mathop{\mathbb{E}}(Y_i^0|D_i=0) = ATE$$

We can directly estimate the ATE from the data!

---

layout: false
class: title-slide-section-red, middle

# Randomized Experiments

---

layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Randomized experiments

- Often called **R**andomized **C**ontrolled **T**rials (RCTs).

- The first RCTs were conducted a long time ago (in the 18th and 19th centuries), mainly in **medicine**.

- At the beginning of the 20th century, they were popularized by famous statisticians such as **J. Neyman** and **R.A. Fisher**.

- Since then they have had a growing influence and have progressively become a reliable [tool for public policy evaluation](https://www.povertyactionlab.org/fr).
- As for economics, the **2019 Nobel Prize in Economics** was awarded to three pioneers of RCTs, [Abhijit Banerjee, Esther Duflo and Michael Kremer](https://www.economist.com/finance-and-economics/2019/10/17/a-nobel-economics-prize-goes-to-pioneers-in-understanding-poverty), "for their experimental approach to alleviating global poverty".

---

# Back to class size and students' achievement

Last week we regressed average student math and reading scores on class size:

`$$\textrm{math score}_i = b_0 + b_1 \textrm{class size}_i + e_i$$`

We briefly discussed why `\(b_1^{OLS}\)` could only establish an ***association*** and not a ***causal relationship***.

--

* **Student sorting**: There is selection into schools with different-sized classes. If parents have a prior that smaller classes are better, they will try to get their kids into those schools.

--

* **Teacher sorting**: Teachers may sort into schools with smaller classes because it's easier to teach a small class than a large one, and if there is competition for those places then higher-quality teachers will have an advantage.

--

* **Location effects**: Large classes may be more common in wealthier and bigger cities, while small classes may be more likely in poorer rural areas.

--

An RCT would take care of all these biases!

---

# The Project STAR Experiment

Tennessee **S**tudent/**T**eacher **A**chievement **R**atio Experiment (see [Krueger (1999)](http://piketty.pse.ens.fr/files/Krueger1999.pdf))

* Funded by the Tennessee legislature, for a total cost of approx. $12 million.

* The experiment started in the 1985-1986 school year and lasted four years.

--

* 11,600 students and their teachers were **randomly assigned** to one of the following 3 groups, from kindergarten through third grade:

  1. ***Small class***: 13-17 students per teacher,

  1. ***Regular class***: 22-25 students,

  1. ***Regular/aide class***: 22-25 students with a full-time teacher's *aide*.

--

* Randomization occurred within schools.

* Students' math and reading skills were tested around March of each year.

--

* There was a problem of ***non-random attrition***, but we will ignore it here.

---

class: inverse

# Task 3 (10 minutes)

1. Load the *STAR* data from [here](https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1) and assign it to an object called `star_df`.

1. Read the help for `AER::STAR` to understand what the variables correspond to. (Note: the data has been *reshaped*, so don't mind the "k", "1", etc. in the variable names in the help.)

1. What's the unit of observation? Which variable contains: (i) the (random) class assignment, (ii) the student's class grade, (iii) the outcomes of interest?

1. How many observations are there? Why so many?

1. Why are there so many `NA`s? What do they correspond to?

1. Keep only cases with no `NA`s with the following code: `star_df <- star_df[complete.cases(star_df),]`

1. Let's check how well the randomization was done by doing ***balancing checks***: compute the average percentage of girls, African Americans, and free-lunch qualifiers *by* grade *and* treatment class. *Hint*: the following computes the percentage of girls (the relevant `dplyr` verbs are left for you to add): `share_female = mean(gender == "female") * 100`. A sketch of one possible solution is on the next slide.
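---

# Task 3: One Possible Solution for the Balancing Checks

A minimal sketch, assuming the variable names from `AER::STAR` (`star`, `grade`, `gender`, `ethnicity`, `lunch`); the shares are left as proportions here (multiply by 100 for percentages). Its output is the balancing table shown after the data overview below.

```r
library(dplyr)

star_df %>%
  group_by(star, grade) %>%                    # one row per class type x grade
  summarise(
    share_female = mean(gender == "female"),   # share of girls
    share_afam   = mean(ethnicity == "afam"),  # share of African Americans
    share_lunch  = mean(lunch == "free")       # share of free-lunch qualifiers
  )
```

If randomization worked well, these shares should be very similar across class types within each grade.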
---

# Task 3: A First Look at the Data

The first rows of the data (before dropping the `NA`s):

```
##   gender ethnicity  birth grade    star read math    lunch   school   degree
## 1 female      afam 1979.5     k    <NA>   NA   NA     <NA>     <NA>     <NA>
## 2 female      afam 1979.5     1    <NA>   NA   NA     <NA>     <NA>     <NA>
## 3 female      afam 1979.5     2    <NA>   NA   NA     <NA>     <NA>     <NA>
## 4 female      afam 1979.5     3 regular  580  564     free suburban bachelor
## 5 female      cauc 1980.0     k   small  447  473 non-free    rural bachelor
## 6 female      cauc 1980.0     1   small  507  538     free    rural bachelor
##   ladder experience tethnicity system schoolid
## 1   <NA>         NA       <NA>     NA       NA
## 2   <NA>         NA       <NA>     NA       NA
## 3   <NA>         NA       <NA>     NA       NA
## 4 level1         30       cauc     22       54
## 5 level1          7       cauc     30       63
## 6 level1          7       cauc     30       63
```

The number of observations and of variables:

```
## [1] 46392
```

```
## [1] 15
```

---

# Task 3: Balancing Checks

The balancing checks yield:

```
## # A tibble: 12 x 5
## # Groups:   star [3]
##    star         grade share_female share_afam share_lunch
##    <fct>        <fct>        <dbl>      <dbl>       <dbl>
##  1 regular      1            0.487      0.373       0.515
##  2 regular      2            0.483      0.367       0.506
##  3 regular      3            0.489      0.352       0.497
##  4 regular      k            0.485      0.285       0.460
##  5 regular+aide 1            0.478      0.298       0.508
##  6 regular+aide 2            0.477      0.338       0.482
##  7 regular+aide 3            0.472      0.343       0.491
##  8 regular+aide k            0.488      0.322       0.493
##  9 small        1            0.486      0.319       0.479
## 10 small        2            0.491      0.333       0.466
## 11 small        3            0.501      0.316       0.468
## 12 small        k            0.480      0.302       0.467
```

---

# The Project STAR Experiment

We just saw that in an RCT the ATE is obtained by computing the difference in outcomes between the treatment and control groups. Let's focus only on:

- one treatment group: **small classes**,

- one control group: **regular classes**,

- one grade: **kindergarten** (k).

--

grade | test | mean regular | mean small | ATE
--------|---------|---------|---------|---------
k | math | 484.45 | 493.34 | 8.89
k | read | 435.76 | 441.13 | 5.37

What's the interpretation of these ATEs?

--

That's nice, but can't we put this in regression form?

---

# RCT in Regression Form

$$ Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0 $$

--

Rewriting this equation, we get:

`$$\begin{align} Y_i &= Y_i^0 + D_i (Y_i^1 - Y_i^0) \\ &= Y_i^0 + D_i \delta_i \end{align}$$`

--

Assuming `\(\delta_i = \delta, \forall i\)`:

`$$Y_i = Y_i^0 + D_i \delta$$`

--

Adding `\(\mathbb{E}[Y_i^0] - \mathbb{E}[Y_i^0] = 0\)` to the right-hand side:

`$$\begin{align} Y_i &= \mathbb{E}[Y_i^0] + D_i \delta + Y_i^0 - \mathbb{E}[Y_i^0] \\ &= b_0 + \delta D_i + e_i \end{align}$$`

where `\(b_0 = \mathbb{E}[Y_i^0]\)` and `\(e_i = Y_i^0 - \mathbb{E}[Y_i^0]\)`.

---

# The Project STAR Experiment: Regression

This last equation looks exactly like the simple regression model we saw last week! Let's therefore estimate the ATE of small class size on math scores using a regression.
--

.pull-left[
We want to estimate the following model: `\(\textrm{math score}_i = b_0 + \delta D_i + e_i\)`, with

$$
D_i = \begin{cases}
1 \textrm{ if student } i \textrm{ is in a small class} \\\\
0 \textrm{ if student } i \textrm{ is in a regular class}
\end{cases}
$$

```r
star_df_k_small <- star_df %>%
  filter(star %in% c("regular", "small") &
           grade == "k") %>%
  mutate(small = (star == "small"))
```
]

--

.pull-right[

```r
lm(math ~ small, star_df_k_small)
```

```
## 
## Call:
## lm(formula = math ~ small, data = star_df_k_small)
## 
## Coefficients:
## (Intercept)    smallTRUE  
##     484.446        8.895
```

```r
mean(star_df_k_small[
  star_df_k_small$small == 1, "math"]) -
  mean(star_df_k_small[
    star_df_k_small$small == 0, "math"])
```

```
## [1] 8.895193
```

```r
mean(star_df_k_small[
  star_df_k_small$small == 0, "math"])
```

```
## [1] 484.4464
```
]

---

# The Project STAR Experiment: Regression

From the estimation output, and using the fact that random assignment implies `\(\mathbb{E}[e_i|D_i] = 0\)`, we get the following:

`\(\begin{align} \mathbb{E}[\textrm{math score} | D_i = 0] &= \mathbb{E}[b_0 + \delta D_i + e_i | D_i = 0] \\ &= b_0 + \delta \mathbb{E}[D_i| D_i = 0] + \mathbb{E}[e_i|D_i = 0] \\ &= b_0 \end{align}\)`

--

`\(\begin{align} \mathbb{E}[\textrm{math score} | D_i = 1] &= \mathbb{E}[b_0 + \delta D_i + e_i | D_i = 1] \\ &= b_0 + \delta \mathbb{E}[D_i| D_i = 1] + \mathbb{E}[e_i|D_i = 1] \\ &= b_0 + \delta \end{align}\)`

--

`\(\begin{align} ATE &= \mathbb{E}[\textrm{math score} | D_i = 1] - \mathbb{E}[\textrm{math score} | D_i = 0] \\ &= b_0 + \delta - b_0 \\ &= \delta \end{align}\)`

--

We knew this already, but now we understand why it is true ✔️

---

class: inverse

# Task 4 (10 minutes)

1. Filter the dataset to keep only first graders and the small-class and regular-class groups.

1. Compute the average math score for both groups, and the difference between the two. (Use base `R`.)

1. Create a dummy variable `treatment` equal to `TRUE` if the student is in the treatment group (i.e. small class) and `FALSE` if in the control group (i.e. regular class).

1. Regress math score on the `treatment` dummy variable. Are the results in line with the previous question?

1. How do you interpret these coefficients? (A sketch of one possible solution is near the end of the deck.)

---

# Shortcomings of RCTs

RCTs have very strong ***internal validity***: they can convincingly establish causal links. However, they have some shortcomings:

* RCTs are often **infeasible**:

  * RCTs are **costly**,

  * RCTs may face **ethical issues**: some *treatments* simply cannot be administered to people,

  * RCTs take time, and we may be **time-constrained**.

--

* **Interpretation** of the results:

  * ***External validity***: to what extent can the results from a given RCT be generalized to other contexts (countries, populations, ...)?

  * Uncovering the mechanisms at play may be difficult,

  * Imperfect randomization, attrition, ...

---

# What comes next?

* If we cannot rely on RCTs to make our life easy, we have to find a way to make causal inference from ***observational data*** (as opposed to ***experimental data***).

--

* This brings us back to models:

  - In causal inference, the *model* is our idea of the process that *generated the data*.

  - We have to make some assumptions about what this process is!

--

2 broad cases:

* ***Selection occurs on observable characteristics***: *multiple regression* (next week!)

* ***Selection occurs on unobservable characteristics***: *regression discontinuity design* or *difference-in-differences* (lectures 10 and 11!)
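---

# Task 4: One Possible Solution

A minimal sketch of the steps in Task 4 (the object and variable names are one choice among many):

```r
# 1. keep first graders in small or regular classes
star_df_1 <- star_df[star_df$grade == "1" &
                       star_df$star %in% c("regular", "small"), ]

# 2. group means and their difference (the SDO, which equals
#    the ATE under random assignment)
mean_small   <- mean(star_df_1$math[star_df_1$star == "small"])
mean_regular <- mean(star_df_1$math[star_df_1$star == "regular"])
mean_small - mean_regular

# 3. treatment dummy: TRUE if small class, FALSE if regular class
star_df_1$treatment <- star_df_1$star == "small"

# 4. the coefficient on `treatmentTRUE` should equal the
#    difference in means computed in step 2
lm(math ~ treatment, data = star_df_1)
```

The intercept estimates the average math score in regular first-grade classes, and the coefficient on `treatment` estimates the ATE of being in a small class.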
---

<br>
<br>

.center[
<img src="../img/photos/correlation_funny.png" width="1000px" style="display: block; margin: auto;" />
]

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

| | |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]</a> | michele.fioretti@sciencespo.fr |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Slides |
| <a href="https://scpoecon.github.io/ScPoEconometrics">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]</a> | @ScPoEcon |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]</a> | @ScPoEcon |