class: center, middle, inverse, title-slide

# Statistics Review II
## EC 320: Introduction to Econometrics
### Winter 2022

---
class: inverse, middle

# Prologue

---
# Housekeeping

.small[
Lectures & my office hours held remote this week; Lab in-person unless announced otherwise

**Problem Set 1** available on Canvas.

- Submission type: .html or .pdf
- Due this Friday

**Exercise 2** due this Wednesday

Extra office hours?

- Wednesday 19:00-20:00?

Knit issues / R issues?

- Meet me after class
- Office hours
]

---
class: inverse, middle

# Statistics Review

---
# Overview

__Goal:__ Learn about a population.

- In particular, learn about an unknown population _parameter_.

__Challenge:__ Usually cannot access information about the entire population.

__Solution:__ Sample from the population and estimate the parameter.

- Draw `\(n\)` observations from the population, then use an estimator.

---
# Sampling

There are myriad ways to produce a sample,<sup>*</sup> but we will restrict our attention to __simple random sampling__, where

1. Each observation is a random variable.
2. The `\(n\)` random variables are independent.
3. Life becomes much simpler for the econometrician.

.footnote[
<sup>*</sup> Only a subset of these can help produce reliable statistics.
]

---
# Estimators

An __estimator__ is a rule (or formula) for estimating an unknown population parameter given a sample of data.

--

- Each observation in the sample is a random variable.

--

- An estimator is a combination of random variables `\(\implies\)` it is a random variable.

__Example:__ Sample mean

$$ \bar{X} = \dfrac{1}{n} \sum_{i=1}^n X_i $$

- `\(\bar{X}\)` is an estimator for the population mean `\(\mu\)`.
- Given a sample, `\(\bar{X}\)` yields an __estimate__ `\(\bar{x}\)` or `\(\hat{\mu}\)`, a specific number.

---
class: clear-slide, middle

A physicist, a chemist, and an econometrician go to an archery range...
<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---
class: clear-slide, middle
count: false

A physicist, a chemist, and an econometrician go to an archery range...

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
class: clear-slide, middle
count: false

A physicist, a chemist, and an econometrician go to an archery range...

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
class: clear-slide, middle

.pull-left[
.center[**Archer 1**]
<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
.center[**Archer 3**]
<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
.center[**Archer 2**]
<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
.center[**Archer 4**]
<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />
]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

.pull-left[
<img src="03-Statistics_Review_files/figure-html/pop1-1.svg" style="display: block; margin: auto;" />
.center[**Population**]
]

--

.pull-right[
<img src="03-Statistics_Review_files/figure-html/mean1-1.svg" style="display: block; margin: auto;" />
.center[**Population relationship**]
<br>
`\(\mu = 3.75\)`
]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?
.pull-left[
<img src="03-Statistics_Review_files/figure-html/sample1-1.svg" style="display: block; margin: auto;" />
.center[**Sample 1:** 10 random individuals]
]

--

.pull-right[
<img src="03-Statistics_Review_files/figure-html/sample1 mean-1.svg" style="display: block; margin: auto;" />
.center[
**Population relationship**
<br>
`\(\mu = 3.75\)`

**Sample relationship**
<br>
`\(\hat{\mu} = 8.34\)`
]
]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

.pull-left[
<img src="03-Statistics_Review_files/figure-html/sample2-1.svg" style="display: block; margin: auto;" />
.center[**Sample 2:** 10 random individuals]
]

--

.pull-right[
<img src="03-Statistics_Review_files/figure-html/sample2 mean-1.svg" style="display: block; margin: auto;" />
.center[
**Population relationship**
<br>
`\(\mu = 3.75\)`

**Sample relationship**
<br>
`\(\hat{\mu} = -8.54\)`
]
]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

.pull-left[
<img src="03-Statistics_Review_files/figure-html/sample3-1.svg" style="display: block; margin: auto;" />
.center[**Sample 3:** 10 random individuals]
]

--

.pull-right[
<img src="03-Statistics_Review_files/figure-html/sample3 mean-1.svg" style="display: block; margin: auto;" />
.center[
**Population relationship**
<br>
`\(\mu = 3.75\)`

**Sample relationship**
<br>
`\(\hat{\mu} = 4.62\)`
]
]

---
class: clear-slide, middle

Let's repeat this **10,000 times** and then plot the estimates.

(This exercise is called a Monte Carlo simulation.)

---
class: clear-slide, middle

<img src="03-Statistics_Review_files/figure-html/simulation-1.svg" style="display: block; margin: auto;" />

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

.pull-left[
<img src="03-Statistics_Review_files/figure-html/simulation2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
- On average, the sample means are close to the population mean.
- But...some individual samples can miss the mark.
- The difference between individual samples and the population creates __uncertainty__.
]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

**Answer:** Uncertainty matters.

- `\(\hat{\mu}\)` is a random variable that depends on the sample.
- In practice, we don't know whether our sample is similar to the population or not.
- Individual samples may have means that differ greatly from the population.
- We will have to keep track of this uncertainty.

---
# Properties of Estimators

Imagine that we want to estimate an unknown parameter `\(\mu\)`, and we know the distributions of three competing estimators. __Which one should we use?__

<img src="03-Statistics_Review_files/figure-html/competing pdfs-1.svg" style="display: block; margin: auto;" />

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

--

**Answer 1: Unbiasedness.**

On average (after *many* samples), does the estimator tend toward the correct value?

**More formally:** Does the mean of the estimator's distribution equal the parameter it estimates?

$$ \mathop{\text{Bias}_\mu} \left( \hat{\mu} \right) = \mathop{\mathbb{E}}\left[ \hat{\mu} \right] - \mu $$

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 1: Unbiasedness.**

.pull-left[
**Unbiased estimator:** `\(\mathop{\mathbb{E}}\left[ \hat{\mu} \right] = \mu\)`
<img src="03-Statistics_Review_files/figure-html/unbiased pdf-1.svg" style="display: block; margin: auto;" />
]

--

.pull-right[
**Biased estimator:** `\(\mathop{\mathbb{E}}\left[ \hat{\mu} \right] \neq \mu\)`
<img src="03-Statistics_Review_files/figure-html/biased pdf-1.svg" style="display: block; margin: auto;" />
]

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 2: Low Variance (a.k.a.
Efficiency).**

The central tendencies (means) of competing distributions are not the only things that matter. We also care about the **variance** of an estimator.

$$ \mathop{\text{Var}} \left( \hat{\mu} \right) = \mathop{\mathbb{E}}\left[ \left( \hat{\mu} - \mathop{\mathbb{E}}\left[ \hat{\mu} \right] \right)^2 \right] $$

Lower variance estimators produce estimates closer to the mean in each sample.

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 2: Low Variance (a.k.a. Efficiency).**

<img src="03-Statistics_Review_files/figure-html/variance pdf-1.svg" style="display: block; margin: auto;" />

---
# The Bias-Variance Tradeoff

Should we be willing to take a bit of bias to reduce the variance?

In econometrics, we generally prefer unbiased estimators. Some other disciplines think more about this tradeoff.

<img src="03-Statistics_Review_files/figure-html/variance bias-1.svg" style="display: block; margin: auto;" />

---
# Unbiased Estimators

In addition to the sample mean, there are several other unbiased estimators we will use often.

- __Sample variance__ to estimate variance `\(\sigma^2\)`.
- __Sample covariance__ to estimate covariance `\(\sigma_{XY}\)`.
- __Sample correlation__ to estimate the population correlation coefficient `\(\rho_{XY}\)`.
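
As a rough illustration, the three estimators listed above can be computed by hand in .mono[R] and checked against the built-in `var()`, `cov()`, and `cor()` functions, which use the same `\(n-1\)` divisor. The vectors `x` and `y` are hypothetical data made up for this sketch.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 5, 7)
y <- c(1, 3, 2, 5, 6)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2_x <- sum((x - mean(x))^2) / (n - 1)
s2_y <- sum((y - mean(y))^2) / (n - 1)

# Sample covariance: sum of cross-products of deviations, divided by n - 1
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample correlation: covariance scaled by both sample standard deviations
r_xy <- s_xy / (sqrt(s2_x) * sqrt(s2_y))

# Each comparison returns TRUE when the hand calculation matches the built-in
all.equal(s2_x, var(x))
all.equal(s_xy, cov(x, y))
all.equal(r_xy, cor(x, y))
```
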
---
# Unbiased Estimators

The sample variance `\(S_X^2\)` is an unbiased estimator of the population variance `\(\sigma^2\)`:

`$$S_{X}^2 = \dfrac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2.$$`

---
# Unbiased Estimators

The sample covariance `\(S_{XY}\)` is an unbiased estimator of the population covariance `\(\sigma_{XY}\)`:

`$$S_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}).$$`

---
# Unbiased Estimators

The sample correlation `\(r_{XY}\)` is an unbiased estimator of the population correlation coefficient `\(\rho_{XY}\)`:

`$$r_{XY} = \dfrac{S_{XY}}{\sqrt{S_X^2} \sqrt{S_Y^2}}.$$`

---
# Hypothesis Testing

What do we make of an estimate of the population mean?

- Is it meaningfully different from existing evidence on the population mean?
- Is it _statistically distinguishable_ from previously hypothesized values of the population mean?
- Is the estimate extreme enough to update our prior beliefs about the population mean?

We can conduct statistical tests to address these questions.

---
# Hypothesis Testing

__Null hypothesis (H.sub[0]):__ `\(\mu = \mu_0\)`

__Alternative hypothesis (H.sub[1]):__ `\(\mu \neq \mu_0\)`

--

There are four possible outcomes of our test:

1. We __fail to reject__ the null hypothesis and the null is true.
2. We __reject__ the null hypothesis and the null is false.
3. We __reject__ the null hypothesis, but the null is actually true (**Type I error**).
4. We __fail to reject__ the null hypothesis, but the null is actually false (**Type II error**).

---
# Hypothesis Testing

We __fail to reject__ the null hypothesis and the null is true.

- The defendant was acquitted and he didn't do the crime.

--

We __reject__ the null hypothesis and the null is false.

- The defendant was convicted and he did the crime.

---
# Hypothesis Testing

We __reject__ the null hypothesis, but the null is actually true.

- The defendant was convicted, but he didn't do the crime!
- **Type I error** (a.k.a.
_false positive_)

--

We __fail to reject__ the null hypothesis, but the null is actually false.

- The defendant was acquitted, but he did the crime!
- **Type II error** (a.k.a. _false negative_)

---
# Hypothesis Testing

`\(\hat{\mu}\)` is random: it could be anything, even if `\(\mu = \mu_0\)` is true.

- But if `\(\mu = \mu_0\)` is true, then `\(\hat{\mu}\)` is unlikely to take values far from `\(\mu_0\)`.
- As the variance of `\(\hat{\mu}\)` shrinks, we are even less likely to observe "extreme" values of `\(\hat{\mu}\)` (assuming `\(\mu = \mu_0\)`).

--

Our test should take extreme values of `\(\hat{\mu}\)` as evidence against the null hypothesis, but it should also weight them by what we know about the variance of `\(\hat{\mu}\)`.

- For now, we'll assume that the variable of interest `\(X\)` is normally distributed with mean `\(\mu\)` and variance `\(\sigma^2\)`.

---
# Hypothesis Testing

Reject H.sub[0] if `\(\hat{\mu}\)` lies in the .hi[rejection region].

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

- The area of the rejection region is defined by the **significance level** of the test.
- In a 5% test, the area is 0.05.
- Significance level .mono[=] tolerance for Type I error.

---
# Hypothesis Testing

Reject H.sub[0] if `\(\left| z \right| =\left| \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{sd}}(\hat{\mu})} \right| > 1.96\)`.

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

What happens to `\(z\)` as `\(\left| \hat{\mu} - \mu_0 \right|\)` increases?

What happens to `\(z\)` as `\(\mathop{\text{sd}}(\hat{\mu})\)` increases?

---
# Hypothesis Testing

The formula for the `\(z\)` statistic assumes that we know `\(\mathop{\text{sd}}(\hat{\mu})\)`.

- In practice, we don't know `\(\mathop{\text{sd}}(\hat{\mu})\)`, so we have to estimate it.
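
To make the rejection rule `\(\left| z \right| > 1.96\)` concrete, here is a minimal sketch in .mono[R] that pretends `\(\mathop{\text{sd}}(\hat{\mu})\)` is known. The values of `mu_hat`, `mu_0`, and `sd_mu_hat` are hypothetical, chosen only for illustration.

```r
# Hypothetical values, chosen only for illustration
mu_hat    <- 2.5  # estimate from the sample
mu_0      <- 0    # value of the mean under the null hypothesis
sd_mu_hat <- 1    # standard deviation of the estimator, assumed known here

# z statistic: distance of the estimate from the null value,
# measured in standard deviations of the estimator
z <- (mu_hat - mu_0) / sd_mu_hat

# Two-sided critical value for a 5% test (approximately 1.96)
z_crit <- qnorm(1 - 0.05 / 2)

# TRUE means the estimate lies in the rejection region
reject <- abs(z) > z_crit
reject
```

With these made-up numbers, `\(z = 2.5\)` exceeds the critical value, so the test rejects H.sub[0] at the 5% level.
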
--

If the variance of `\(X\)` is `\(\sigma^2\)`, then

`$$\sigma^2_{\hat{\mu}} = \dfrac{\sigma^2}{n}.$$`

- We can estimate `\(\sigma^2\)` with the sample variance `\(S_{X}^2\)`.

--

The sample variance of the sample mean is

`$$S_{\hat{\mu}}^2 = \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2.$$`

---
# Hypothesis Testing

The .hi[standard error] of `\(\hat{\mu}\)` is the square root of `\(S_{\hat{\mu}}^2\)`:

`$$\mathop{\text{SE}}(\hat{\mu}) = \sqrt{ \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2}.$$`

- Standard error = sample standard deviation of an estimator.

--

When we use `\(\mathop{\text{SE}}(\hat{\mu})\)` in place of `\(\mathop{\text{sd}}(\hat{\mu})\)`, the `\(z\)` statistic becomes a `\(t\)` statistic:

`$$t = \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{SE}}(\hat{\mu})}.$$`

- Unlike the standard deviation of `\(\hat{\mu}\)`, `\(\mathop{\text{SE}}(\hat{\mu})\)` varies from sample to sample.
- **Consequence:** `\(t\)` statistics do not necessarily have a normal distribution.

---
# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

- Degrees of freedom .mono[=] 5.

---
count: false

# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

- Degrees of freedom .mono[=] 50.

---
count: false

# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

<img src="03-Statistics_Review_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

- Degrees of freedom .mono[=] 500.

---
# Hypothesis Testing

## **t Tests** (two-sided)

To conduct a t test, compare the `\(t\)` statistic to the appropriate .hi[critical value] of the t distribution.

- To find the critical value in a t table, we need the degrees of freedom and the significance level `\(\alpha\)`.

Reject H.sub[0] at the `\(\alpha \cdot 100\)`-percent level if

`$$\left| t \right| = \left| \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{SE}}(\hat{\mu})} \right| > t_\text{crit}.$$`

---
# Hypothesis Testing

## On Your Own

As the term progresses, we will encounter additional flavors of hypothesis testing and other related concepts.

You may find it helpful to review the following topics from Math 243:

- Confidence intervals
- One-sided `\(t\)` tests
- `\(p\)` values

---
class: inverse, middle

# Data and the .mono[tidyverse]

---
# Data

## Experimental data

Data generated in controlled, laboratory settings.

--

Ideal for __causal identification__, but difficult to obtain in the social sciences.

- Intractable logistical problems
- Too expensive
- Morally repugnant

--

Experiments outside the lab: __randomized control trials__ and __A/B testing__.

---
# Data

## Observational data

Data generated in non-experimental settings.

--

- Surveys
- Censuses
- Administrative records
- Environmental data
- Financial and sales transactions
- Social media

--

Mainstay of economic research, but __poses challenges__ to causal identification.

---
# Tidy Data

.more-left[
]

.less-right[
.hi-orange[Rows] represent .hi-orange[observations].

.hi-green[Columns] represent .hi-green[variables].

Each .hi-purple[value] is associated with an .hi-orange[observation] and a .hi-green[variable].
]

---
# Cross Sectional Data

.hi-purple[Sample of individuals from a population at a point in time.]

Ideally, collected using __random sampling__.

- Random sampling .mono[+] sufficient sample size .mono[=] representative sample.
- Random sampling simplifies data analysis, but non-random samples are common (and difficult to work with).

Used extensively in applied microeconomics.<sup>*</sup>

__Main focus of this course.__

.footnote[
<sup>*</sup> Applied microeconomics .mono[=] Labor, health, education, public finance, development, industrial organization, and urban economics.
]

---
# Cross Sectional Data

---
# Time Series Data

.hi-purple[Observations of variables over time.]

- Quarterly US GDP
- Annual US infant mortality rates
- Daily Amazon stock prices

Complication: Observations are not independent draws.

- GDP this quarter highly related to GDP last quarter.

Used extensively in empirical macroeconomics.

Requires more-advanced methods (EC 421 and EC 422).

---
# Time Series Data

---
# Pooled Cross Sectional Data

.hi-purple[Cross sections from different points in time.]

Useful for studying policy changes and relationships that change over time.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Pooled Cross Sectional Data

---
# Panel or Longitudinal Data

.hi-purple[Time series for each cross-sectional unit.]

- Example: daily attendance data for a sample of students.

Difficult to collect, but useful for causal identification.

- Can control for _unobserved_ characteristics.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Panel or Longitudinal Data

---
# Tidy Data?

---
# Messy Data

**Analysis-ready datasets are rare.** Most data are "messy."

The focus of this class is data analysis, but .hi[data wrangling] is a non-trivial part of a data scientist/analyst's job.

.mono[R] has a suite of packages that facilitate data wrangling.

- `readr`, `tidyr`, `dplyr`, `ggplot2` .mono[+] others.
- Known collectively as the `tidyverse`.

---
# .mono[tidyverse]

## The [`tidyverse`](https://www.tidyverse.org): A package of packages

`readr`: Functions to import data.

`tidyr`: Functions to reshape messy data.

`dplyr`: Functions to work with data.

`ggplot2`: Functions to visualize data.
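
As a small sketch of how two of these packages fit together, the example below reshapes a "messy" wide table into tidy form with `tidyr` and then summarizes it with `dplyr`. The data frame `wide` and its column names are hypothetical, made up for this illustration, and the sketch assumes both packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical messy data: one column per year instead of one row per observation
wide <- tibble(
  state  = c("OR", "WA"),
  `2020` = c(10, 20),
  `2021` = c(12, 18)
)

# tidyr: reshape so that each row is a single state-year observation
long <- pivot_longer(
  wide,
  cols      = c(`2020`, `2021`),
  names_to  = "year",
  values_to = "value"
)

# dplyr: compute the mean of `value` within each state
means <- long %>%
  group_by(state) %>%
  summarize(mean_value = mean(value))

means
```

After reshaping, `long` has one row per state-year pair, which matches the tidy-data layout: rows are observations and columns are variables.
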