class: center, middle, inverse, title-slide .title[ # ECON 4050: Introduction to Econometrics ] .subtitle[ ## Panel Data ] .author[ ### Adam Soliman, PhD ] .date[ ### Clemson University ] --- # Overview and Today's Lecture Thus far, for causality, we've learned: Randomized Control Trials (RCTs) → Gold standard, but often impractical Regression Discontinuity Designs → Leverages cutoffs for causal inference Difference-in-Differences → Exploits policy changes for causal effects We now turn to panel data techniques: Panel Data = Repeated observations of the same units over time Fixed Effects = Controlling for time-invariant differences across units Helps eliminate omitted variable bias by comparing each unit to itself over time --- # Type of Data #1: Cross-Sectional .pull-left[ So far, we have generally dealt with data that looks like this: <table> <thead> <tr> <th style="text-align:center;"> County </th> <th style="text-align:center;"> CrimeRate </th> <th style="text-align:center;"> ProbofArrest </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 0.0398849 </td> <td style="text-align:center;"> 0.289696 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 0.0163921 </td> <td style="text-align:center;"> 0.202899 </td> </tr> <tr> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 0.0093372 </td> <td style="text-align:center;"> 0.406593 </td> </tr> <tr> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 0.0219159 </td> <td style="text-align:center;"> 0.431095 </td> </tr> <tr> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 0.0075178 </td> <td style="text-align:center;"> 0.631579 </td> </tr> </tbody> </table> ] .pull-right[ * We have a unit identifier (like `County` here), * Observables on each unit. 
* Usually called a **cross-sectional** dataset * Provides single snapshot view * Each row, in other words, is one *observation*. ] --- # Type of Data #2: Panel .pull-left[ Now, let's add a `time` index: `Year`. <table> <thead> <tr> <th style="text-align:center;"> County </th> <th style="text-align:center;"> Year </th> <th style="text-align:center;"> CrimeRate </th> <th style="text-align:center;"> ProbofArrest </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 81 </td> <td style="text-align:center;"> 0.0398849 </td> <td style="text-align:center;"> 0.289696 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 82 </td> <td style="text-align:center;"> 0.0383449 </td> <td style="text-align:center;"> 0.338111 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 83 </td> <td style="text-align:center;"> 0.0303048 </td> <td style="text-align:center;"> 0.330449 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 84 </td> <td style="text-align:center;"> 0.0347259 </td> <td style="text-align:center;"> 0.362525 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 85 </td> <td style="text-align:center;"> 0.0365730 </td> <td style="text-align:center;"> 0.325395 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 86 </td> <td style="text-align:center;"> 0.0347524 </td> <td style="text-align:center;"> 0.326062 </td> </tr> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 87 </td> <td style="text-align:center;"> 0.0356036 </td> <td style="text-align:center;"> 0.298270 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 81 </td> <td style="text-align:center;"> 0.0163921 </td> <td style="text-align:center;"> 0.202899 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td 
style="text-align:center;"> 82 </td> <td style="text-align:center;"> 0.0190651 </td> <td style="text-align:center;"> 0.162218 </td> </tr> </tbody> </table> ] .pull-right[ * Next to the unit identifier (`County`) we now have `Year` * Now a pair (`County`,`Year`) indexes one observation. * We call this a **panel** or **longitudinal** dataset * We can track units *over time*. ] --- # Crime Rates and Probability of Arrest * The above data can be loaded with ``` r data(crime4,package = "wooldridge") ``` * They are from [C. Cornwell and W. Trumbull (1994), “Estimating the Economic Model of Crime with Panel Data”](https://www.amherst.edu/media/view/121570/original/CornwellTrumbullCrime%2BElasticities.pdf). * One question here: *how big is the deterrent effect of law enforcement*? If you know you are more likely to get arrested, will you be less likely to commit a crime? -- * This is tricky: Does high crime *cause* stronger police response, which acts as a deterrent, or is crime low because deterrence is strong to start with? * This is sometimes called a *simultaneous equation model* situation: police response impacts crime, but crime impacts police response `\begin{align} police &= \alpha_0 + \alpha_1 crime \\ crime &= \beta_0 + \beta_1 police \end{align}` --- # Crime Rates and Probability of Arrest .pull-left[ * Most literature prior to that paper estimated simultaneous equations from cross-sectional data * Cornwell and Trumbull are worried about **unobserved heterogeneity** between jurisdictions. * Why? What could possibly go wrong? ] -- .pull-right[ * Let's pick out 4 counties from their dataset * Let's look at the crime rate vs probability of arrest relationship * First for all of them together as a single cross section * Then taking advantage of the panel structure (i.e. each county over time). 
] --- # Crime vs Arrest in Cross Section .left-thin[ *What to do*: subset data to 4 counties, then plot probability of arrest vs crime rate ``` r css = crime4 %>% filter(county %in% c(1,3,145, 23)) ggplot(css, aes(x = prbarr, y = crmrte)) + geom_point() + geom_smooth(method="lm", se=FALSE) + theme_bw() + labs(x = 'Probability of Arrest', y = 'Crime Rate') ``` ] .right-wide[ <img src="06-panel_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> ] --- # Crime vs Arrest in Cross Section: Positive Relationship! .pull-left[ * We see an upward sloping line! * A higher probability of arrest is associated with higher crime rates. * How strong is the effect? ] -- .pull-right[ ``` r xsection = lm(crmrte ~ prbarr, css) coef(xsection)[2] # gets slope coef ``` ``` ## prbarr ## 0.06480104 ``` * Increasing the probability of arrest by 1 unit (i.e. by 100 percentage points) is associated with an increase in the crime rate of 0.0648 crimes per person. * An increase of 10 percentage points in the probability of arrest (e.g. `prbarr` goes from 0.2 to 0.3) is associated with an increase in the crime rate from 0.021 to 0.028, roughly a 33 percent increase. ] --- # Ok, but what does that *mean*? * Literally: counties with a higher probability of being arrested also have a higher crime rate. * So, does it mean that as there is more crime in certain areas, the police become more efficient at arresting criminals, and so the probability of getting arrested on any committed crime goes up? * What does police efficiency depend on? * Does the poverty level in a county matter for this? * The local laws? * 🤯 wow, there seem to be too many things left out of this simple picture. 
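---

# Too Many Things Left Out: A Toy Simulation

A purely illustrative simulation (all numbers are invented) of exactly this worry: suppose the *true* within-county effect of arrest probability on crime is negative, but an unobserved county factor raises both policing and crime. Pooled OLS then finds a *positive* slope:

``` r
set.seed(1)
# hypothetical panel: 50 counties observed for 7 years
n_county <- 50; n_year <- 7
county <- rep(1:n_county, each = n_year)
c_i    <- rnorm(n_county)[county]   # unobserved county factor, one draw per county
# arrest probability and crime rate both increase in c_i; true slope is -0.05
prbarr <- 0.30 + 0.10 * c_i + rnorm(n_county * n_year, sd = 0.05)
crmrte <- 0.03 - 0.05 * prbarr + 0.02 * c_i + rnorm(n_county * n_year, sd = 0.005)

pooled <- coef(lm(crmrte ~ prbarr))["prbarr"]                  # positive: biased by c_i
fe     <- coef(lm(crmrte ~ prbarr + factor(county)))["prbarr"] # negative, near -0.05
```

The sign flip from `pooled` to `fe` previews exactly what fixed effects will do for us below.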
--- # Crime in a DAG <img src="06-panel_files/figure-html/cri-dag-1.svg" style="display: block; margin: auto;" /> <style type="text/css"> .reduced_opacity { opacity: 0.2; } </style> --- background-image: url(../img/crime-dag.png) background-size: 400px background-position: 50% 100% # Crime in a DAG .left-wide[ **Fixed Characteristics**: vary by county * `LocalStuff` are things that describe the County, like geography, and other persistent features. * `LawAndOrder`: commitment to *law and order politics* of local politicians * `CivilRights`: how many civil rights you have **Time-varying Characteristics**: vary by county and by year * `Police` budget: an elected politician has some discretion over police spending * `Poverty` varies with economy ] .right-thin[ ] --- # Within and Between Variation You will often hear the terms *within* and *between* variation in panel data contexts. .pull-left[ ## Within Variation * things that change *within each group* over time: * here we said police budgets * and poverty levels would change within each group and over time. ] .pull-right[ ## Between Variation * Things that are **fixed** for each group over time: * `LocalStuff` * `LawAndOrder` and * `CivilRights` * differ only across or **between** groups ] --- background-image: url("https://media.giphy.com/media/3oKIPlLZEbEbacWqOc/giphy.gif") background-position: 95% 50% background-size: 300px # Within and Between Variation: Give us a Plot. .left-wide[ <img src="06-panel_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> ] --- # Pooled OLS recovers *between* variation * Let's add the mean of `prbarr` and `crmrte` for each of those counties to the scatter plot! * And then a regression through those 4 points! 
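The mechanics of that "between" regression can be sketched in base `R` (with a hypothetical mini-panel standing in for the `css` subset, so the snippet is self-contained):

``` r
# hypothetical mini-panel: 4 counties, 2 years each (made-up numbers)
css_toy <- data.frame(
  county = rep(c(1, 3, 23, 145), each = 2),
  crmrte = c(0.040, 0.038, 0.016, 0.019, 0.010, 0.012, 0.022, 0.020),
  prbarr = c(0.29, 0.34, 0.20, 0.16, 0.41, 0.43, 0.36, 0.33)
)
# collapse to one mean per county (the between variation)...
means <- aggregate(cbind(crmrte, prbarr) ~ county, data = css_toy, FUN = mean)
# ...and fit a line through those 4 county means
between <- lm(crmrte ~ prbarr, data = means)
```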
-- <img src="06-panel_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> --- # Accounting for Grouped Data: Introducing the *Fixed Effect* .left-wide[ <img src="06-panel_files/figure-html/cri-dag2-1.svg" style="display: block; margin: auto;" /> ] .right-thin[ * Collect all group-specific time-invariant features in the factor `County`. * Takes care of all factors which do *not* vary over time within each unit. * We can **net out** the group effect! * We call `County` a **fixed effect**. ] --- class: separator, middle # Fixed Effects Estimation in `R` --- # OVB, IV and Panel Data We've seen *omitted variable bias* (OVB). For example, suppose the true model reads: $$ y_i = \beta_0 + \beta_1 x_i + c_i + u_i $$ If `\(c_i\)` is unobservable and `\(Cov(x_i,c_i)\neq 0\)`, then `\(E[u_i+c_i|x_i]\neq 0\)`, where `\(u_i + c_i\)` is the total unobserved component. -- .pull-left[ ## Cross-Sectional Solution * where `\(c_i=A_i\)` was ability and `\(x_i=s_i\)` was schooling. * *ability bias*. * Find an IV correlated with schooling but not with ability ] .pull-right[ ## Panel Data `$$y_{it} = \beta_1 x_{it} + c_i + u_{it},\quad t=1,2,...,T$$` * `\(c_i\)`: *individual fixed effect*, *unobserved effect* or *unobserved heterogeneity*. * `\(c_i\)` is fixed over time (ability `\(A_i\)`, for example), but can be correlated with `\(x_{it}\)`! ] --- # Dummy Variable Regression .pull-left[ <br> <br> * Simplest approach: include a dummy variable for each group `\(i\)`. 
* This is literally *controlling for county* `\(i\)` * Each `\(i\)` basically has its own intercept `\(c_i\)` * In `R` you achieve this like so: ] .pull-right[ <br> <br> `$$y_{it} = \beta_1 x_{it} + c_i + u_{it},\quad t=1,2,...T$$` ``` r mod = list() mod$dummy <- lm(crmrte ~ prbarr + factor(county), css) # i is the unit ID broom::tidy(mod$dummy) ``` ``` ## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 0.0449 0.00456 9.87 9.85e-10 ## 2 prbarr -0.0284 0.0136 -2.08 4.86e- 2 ## 3 factor(county)3 -0.0250 0.00254 -9.82 1.07e- 9 ## 4 factor(county)23 -0.00850 0.00166 -5.13 3.41e- 5 ## 5 factor(county)145 -0.00650 0.00160 -4.07 4.70e- 4 ``` ] --- # Dummy Variable Regression .left-wide[ <img src="06-panel_files/figure-html/dummy-1.svg" style="display: block; margin: auto;" /> ] .right-thin[ * *Within* each county, there is now a **negative** relationship! * Different intercepts (county 1 is the reference group), * a unique slope coefficient `\(\beta\)` (you can see that the lines are parallel). * We are shifting lines down relative to the reference group 1. 
] --- # First Differencing Solution If we only had `\(T=2\)` periods, we could just difference both periods, basically leaving us with `$$\begin{align}y_{i1} &= \beta_1 x_{i1} + c_i + u_{i1} \\y_{i2} &= \beta_1 x_{i2} + c_i + u_{i2} \\& \Rightarrow \\ y_{i1}-y_{i2} &= \beta_1 (x_{i1} - x_{i2}) + c_i-c_i + u_{i1}-u_{i2} \\\Delta y_{i} &= \beta_1 \Delta x_{i} + \Delta u_{i}\end{align}$$` where `\(\Delta\)` means *difference over time of*, and to recover the parameter of interest `\(\beta_1\)` we would run ``` r lm(deltay ~ deltax, diff_data) ``` --- background-image: url(../img/crime-dag2.png) background-size: 300px background-position: 20% 99% # The Within Transformation .pull-left[ * With `\(T>2\)` we need a different approach * One important concept is called the *within* transformation * So, *controlling for group identity and only looking at time variation* * Remember the DAG! ] .pull-right[ * Let `\(\bar{x}_i\)` be the average *over time* of `\(i\)`'s `\(x\)` values: `$$\bar{x}_i = \frac{1}{T} \sum_{t=1}^T x_{it}$$` 1. for all variables, compute their time mean for each unit `\(i\)`: `\(\bar{x}_i,\bar{y}_i\)`, etc. 1. for each observation, subtract that time mean from the actual value and define `\((x_{it} - \bar{x}_i),(y_{it}-\bar{y}_i)\)` 1. finally, regress `\((y_{it}-\bar{y}_i)\)` on `\((x_{it} - \bar{x}_i)\)` ] --- # The Within Transformation in `R`: Manual Solution This *works* for our problem with fixed effect `\(c_i\)` because `\(c_i\)` is not time-varying by assumption! Hence it drops out: `$$y_{it}-\bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + c_i - c_i + u_{it}-\bar{u}_i$$` It's easy to do yourself! 
First let's compute the demeaned values: ``` r cdata <- css %>% group_by(county) %>% mutate(mean_crime = mean(crmrte), mean_prob = mean(prbarr)) %>% mutate(demeaned_crime = crmrte - mean_crime, demeaned_prob = prbarr - mean_prob) ``` Then, run both models with simple OLS: ``` r mod$xsect <- lm(crmrte ~ prbarr, data = cdata) mod$demeaned <- lm(demeaned_crime ~ demeaned_prob, data = cdata) ``` --- # The Within Transformation in `R`: Manual Solution .left-wide[
|                 | xsect   | dummy   | demeaned |
|:----------------|:-------:|:-------:|:--------:|
| (Intercept)     | 0.009   | 0.045   | 0.000    |
|                 | (0.005) | (0.005) | (0.001)  |
| prbarr          | 0.065   | -0.028  |          |
|                 | (0.016) | (0.014) |          |
| demeaned_prob   |         |         | -0.028   |
|                 |         |         | (0.013)  |
| R2              | 0.390   | 0.893   | 0.159    |
] .right-thin[ * Estimate for `prbarr` is positive in the cross-section * Taking care of the unobserved heterogeneity `\(c_i\)`... * ...either by including an intercept for each `\(i\)` or by time-demeaning the data * we obtain: -0.028 ] --- # Interpreting the Within Estimates * How to interpret those negative slopes? * We look at a single unit `\(i\)` and ask: > if the arrest probability in `\(i\)` increases by 10 percentage points (i.e. from 0.2 to 0.3) from year `\(t\)` to `\(t+1\)`, we expect crimes per person to fall from 0.039 to 0.036, or by -7.69 percent (in the reference county number 1). <img src="06-panel_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- # Within Transformation Animated .left-wide[ <img src="../img/gifs/panel.gif" style="display: block; margin: auto;" /> ] --- # Within Transformation Animated .right-thin[ * The within transformation **centers** the data! * By time-demeaning `\(y\)` and `\(x\)`, we *project out* the fixed factors related to *county* * Only *within*-county variation is left. ] --- # Fixed Effects Estimation in `R`: use a Package! * In real life you will hardly ever perform the within transformation yourself and will use a package instead! There are several options, but [`fixest` is fastest](https://github.com/lrberge/fixest). * Also, we can have *more than one fixed effect*! 
For a cool example with *three* fixed effects see the package [vignette](https://cran.r-project.org/web/packages/fixest/vignettes/fixest_walkthrough.html) -- * Let's go over an example with real data: ``` r # load package and bring in data library(fixest) countyopioids <- read.csv("~/Library/CloudStorage/Dropbox/Clemson/Econometrics Course/data for tasks/countyleveldataonopioids.csv") # what we did before (pooled OLS) lpm <- lm(overdosedeaths ~ percapitapills, data = countyopioids) lpm <- feols(overdosedeaths ~ percapitapills, data = countyopioids) # this is the same as above, but cleaner for output # the new stuff, various fixed effects fe_year <- feols(overdosedeaths ~ percapitapills | year, data = countyopioids, cluster = ~county) fe_yearstate <- feols(overdosedeaths ~ percapitapills | state + year, data = countyopioids, cluster = ~county) fe_yearcounty <- feols(overdosedeaths ~ percapitapills | county + year, data = countyopioids) ## the below is the best way to generate tables with multiple fixest objects, output is on the next slide # etable(lpm, fe_year, fe_yearstate, fe_yearcounty) ``` ---
```
                              lpm           fe_year      fe_yearstate      fe_yearcounty
Dependent Var.:    overdosedeaths    overdosedeaths    overdosedeaths     overdosedeaths
Constant        24.39*** (0.8332)
percapitapills  0.0505** (0.0180)   0.0182 (0.0357)  -0.0117 (0.0387) -0.1184*** (0.0340)
Fixed-Effects:  ----------------- ----------------- ----------------- ------------------
year                           No               Yes               Yes                Yes
state                          No                No               Yes                 No
county                         No                No                No                Yes
_______________ _________________ _________________ _________________ __________________
S.E. type                     IID        by: county        by: county         by: county
Observations               26,970            26,970            26,970             26,970
R2                        0.00029           0.00477           0.12528            0.92966
Within R2                      --           3.61e-5            1.3e-5            0.00139
```
--- # What is happening?! 1. Leftmost column (OLS, no FE) → Captures both true effects + confounding factors 2. Adding Year FE → Controls for national trends; effect weakens 3. Adding State FE → Controls for state-level policies / time-invariant state characteristics; effect weakens further 4. Adding County FE → Controls for more local, time-invariant characteristics; effect reverses, suggesting omitted variable bias in earlier models Potential explanation for the findings: after controlling for state policies and local conditions, we might realize that counties with more overdose deaths actually received fewer pills over time (a policy response) --- # Another Way to Think of Fixed Effects 1. Their Role - Fixed effects work by introducing dummy variables for each unit (e.g. each county or firm) and each year - The model estimates separate intercepts for each unit and each year but does not report them explicitly - Since every observation belongs to exactly one unit and one year, these fixed effects fully absorb the intercept 2. Why Is No Intercept Reported? - In a typical OLS regression, the intercept represents the expected value of the dependent variable when all explanatory variables are zero - With fixed effects, each unit has its own intercept, meaning the model does not need a single, global intercept - The mean differences across units and years are captured by the fixed effects instead of a single intercept --- # Fixed Effects: Key Idea (Conceptual Review) * Goal: estimate causal effect when there are **unobserved, time-invariant differences** across units * Model: `\(y_{it} = \beta x_{it} + c_i + u_{it}\)` * Problem: - `\(c_i\)` (e.g. county characteristics) is **unobserved** - and may be **correlated with `\(x_{it}\)`** → OVB -- ## Solution: Fixed Effects * Compare each unit **to itself over time** * Remove `\(c_i\)` by focusing on **within-unit variation** * Interpretation: > How do changes in `\(x\)` within the same unit affect `\(y\)`? --- # What Variation Are We Using? 
Between vs Within * **Between variation**: - Differences *across* units (e.g. high-crime vs low-crime counties) * **Within variation**: - Changes *within* a unit over time -- ## Key Insight * Pooled OLS mixes both → biased if `\(c_i\)` matters * Fixed effects uses **only within variation** * So we are estimating: > If a county’s arrest probability increases, what happens to crime in that same county? --- # Interpreting a Fixed Effects Coefficient If the estimated coefficient is `\(\hat{\beta} = -0.25\)`, then we say a one-unit increase in `\(x\)`, within the same unit over time, is associated with a 0.25-unit decrease in `\(y\)` Important: - This is a within-unit interpretation - We are comparing a unit to itself over time - All time-invariant differences across units are already accounted for So we are not asking: Do units with higher `\(x\)` have higher or lower `\(y\)`? Instead, we are asking: When this same unit’s `\(x\)` changes, how does `\(y\)` change?
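---

# Checking the Equivalence in `R`

As a final sanity check, here is a small base-`R` sketch (a hypothetical toy panel, with -0.25 as an assumed true within slope) showing that the dummy-variable regression and the within transformation recover the *same* slope:

``` r
set.seed(42)
# hypothetical toy panel: 3 units, 4 periods each; true within slope is -0.25
id  <- rep(1:3, each = 4)
c_i <- c(1, 3, 5)[id]                 # unit fixed effects, correlated with x
x   <- c_i + rnorm(12)
y   <- -0.25 * x + c_i + rnorm(12, sd = 0.1)

dummy  <- lm(y ~ x + factor(id))                      # dummy-variable regression
within <- lm(I(y - ave(y, id)) ~ I(x - ave(x, id)))   # time-demeaned regression

b_dummy  <- unname(coef(dummy)["x"])
b_within <- unname(coef(within)[2])
```

The two estimates are numerically identical (the Frisch–Waugh–Lovell theorem at work): time-demeaning the data is the same as partialling out the unit dummies.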