# Statistics Review II
## EC 320: Introduction to Econometrics
### Philip Economides
### Winter 2022

---

# Statistics Review

---
# Overview

A few key terms:

- .hi-pink[Population:] a (usually large) group of items or events we would like to know about.

- .hi-pink[Parameter:] a value that describes that population. The parameter of interest is the parameter that the researcher seeks to learn about.

- .hi-pink[Sample:] a survey of a subset of the population.

Usually we aim to draw observations .hi-blue[randomly] from the population, such that it becomes a .hi-blue[representative sample] of the population. More details on this later!

---
# Overview

__Focus:__ Populations vs Samples

- How can we make inferences about a .hi-pink[population] based on a small .hi-blue[sample] of the population?

- In particular, how do we learn about an unknown population _parameter_ of interest?

__Challenge:__ Usually cannot access information about the entire population.

__Solution:__ Sample from the population and estimate the parameter.

- Draw `$n$` observations from the population, then use an estimator.

---
# Sampling

There are myriad ways to produce a sample, but we will restrict our attention to __simple random sampling__, where

1. Each observation is a random variable.

2. The `$n$` random variables are independent.

3. Life becomes much simpler for the econometrician.

---
# Estimators

An __estimator__ is a rule (or formula) for estimating an unknown population parameter given a sample of data.

- Each observation in the sample is a random variable.

- An estimator is a combination of random variables `$\implies$` it is a random variable.

__Example:__ Sample mean

$$
\bar{X} = \dfrac{1}{n} \sum_{i=1}^n X_i
$$

- `$\bar{X}$` is an estimator for the population mean `$\mu$`.

- Given a sample, `$\bar{X}$` yields an __estimate__ `$\bar{x}$` or `$\hat{\mu}$`, a specific number.
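The distinction between an estimator (a rule) and an estimate (a number) can be sketched in a few lines of Python. The population used here (normal with mean 3.75 and standard deviation 10) and the helper names `sample_mean` and `draw_sample` are assumptions made purely for illustration.

```python
import random

def sample_mean(xs):
    """The estimator: a rule that maps any sample to a number."""
    return sum(xs) / len(xs)

def draw_sample(n, rng):
    """Draw n independent observations from a hypothetical population
    (normal, mean 3.75, standard deviation 10; assumed values)."""
    return [rng.gauss(3.75, 10) for _ in range(n)]

rng = random.Random(320)   # fixed seed so the run is repeatable
sample_a = draw_sample(30, rng)
sample_b = draw_sample(30, rng)

# Same estimator, two different samples, two different estimates.
estimate_a = sample_mean(sample_a)
estimate_b = sample_mean(sample_b)
print(estimate_a, estimate_b)
```

The function is fixed, but its output changes with the sample, which is why the estimator itself is treated as a random variable.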

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?


---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

**Population relationship**
<br>
`$\mu = 3.75$`

**Sample relationship**
<br>
`$\hat{\mu} = 8.34$`

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

**Population relationship**
<br>
`$\mu = 3.75$`

**Sample relationship**
<br>
`$\hat{\mu} = -8.54$`

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

**Population relationship**
<br>
`$\mu = 3.75$`

**Sample relationship**
<br>
`$\hat{\mu} = 4.62$`

---
class: clear-slide, middle

Let's repeat this **10,000 times** and then plot the estimates.

(This exercise is called a Monte Carlo simulation.)
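A minimal sketch of such a Monte Carlo simulation in Python. The population (normal with mean 3.75 and standard deviation 10) and the sample size of 30 are assumptions chosen for illustration; they are not necessarily the settings behind the figures in these slides.

```python
import random
import statistics

MU, SIGMA = 3.75, 10.0   # assumed population mean and standard deviation
N, REPS = 30, 10_000     # assumed sample size; 10,000 repetitions

rng = random.Random(2022)

def one_estimate():
    """Draw one fresh sample and return its sample mean."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    return statistics.fmean(sample)

estimates = [one_estimate() for _ in range(REPS)]

print("mean of estimates:", statistics.fmean(estimates))
print("sd of estimates:  ", statistics.stdev(estimates))
print("smallest, largest:", min(estimates), max(estimates))
```

The average of the 10,000 estimates should sit close to the population mean, while the smallest and largest estimates show how far a single sample can stray.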

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

.pull-left[
<img src="03-Review_files/figure-html/simulation2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
- On average, the sample means are close to the population mean.

- But...some individual samples can miss the mark.

- The difference between individual samples and the population creates __uncertainty__.

]

---
# Population *vs.* Sample

**Question:** Why do we care about *population vs. sample*?

**Answer:** Uncertainty matters.

- `$\hat{\mu}$` is a random variable that depends on the sample.

- In practice, we don't know whether our sample is similar to the population or not.

- Individual samples may have means that differ greatly from the population.

- We will have to keep track of this uncertainty.

---
# Properties of Estimators

Imagine that we want to estimate an unknown parameter `$\mu$`, and we know the distributions of three competing estimators. __Which one should we use?__

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 1: Unbiasedness.**

On average (after *many* samples), does the estimator tend toward the correct value?

**More formally:** Does the mean of the estimator's distribution equal the parameter it estimates?

$$ \mathop{\text{Bias}_\mu} \left( \hat{\mu} \right) = \mathop{\mathbb{E}}\left[ \hat{\mu} \right] - \mu $$

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 1: Unbiasedness.**

**Unbiased estimator:** `$\mathop{\mathbb{E}}\left[ \hat{\mu} \right] = \mu$`

**Biased estimator:** `$\mathop{\mathbb{E}}\left[ \hat{\mu} \right] \neq \mu$`

---

# Unbiasedness Example

Is the sample mean `$\frac{1}{n} \sum_{i=1}^n x_i = \hat{\mu}$` an unbiased estimator of the population mean `$E(x_i) = \mu$`?

$$
`\begin{aligned}
\mathop{\mathbb{E}}\left[ \hat{\mu} \right] &= \mathop{\mathbb{E}}\left[ \frac{1}{n} \sum_{i=1}^n x_i \right] \\
&=\frac{1}{n} \sum_{i=1}^n\mathop{\mathbb{E}}\left[ x_i \right]\\
&=\frac{1}{n} \sum_{i=1}^n \mu\\
&= \mu
\end{aligned}`
$$

---

# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 2: Efficiency (low variance).**

The central tendencies (means) of competing distributions are not the only things that matter. We also care about the **variance** of an estimator.

$$ \mathop{\text{Var}} \left( \hat{\mu} \right) = \mathop{\mathbb{E}}\left[ \left( \hat{\mu} - \mathop{\mathbb{E}}\left[ \hat{\mu} \right] \right)^2 \right] $$

Lower-variance estimators produce estimates that tend to fall closer to the mean of the estimator's distribution in any given sample.
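To see why variance matters even among unbiased estimators, the sketch below compares two unbiased estimators of a population mean: the sample mean and "just use the first observation." The population (standard normal), the sample size, and the number of repetitions are assumptions for illustration.

```python
import random
import statistics

MU, SIGMA, N, REPS = 0.0, 1.0, 25, 5_000   # assumed values
rng = random.Random(1)

mean_estimates, first_obs_estimates = [], []
for _ in range(REPS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean_estimates.append(statistics.fmean(sample))  # uses all n observations
    first_obs_estimates.append(sample[0])            # ignores all but one

# Both estimators are centered on MU, but their spreads differ.
var_mean = statistics.variance(mean_estimates)        # theory: sigma^2 / n = 0.04
var_first = statistics.variance(first_obs_estimates)  # theory: sigma^2 = 1
print(var_mean, var_first)
```

Both estimators are right on average, but the sample mean is far more tightly concentrated around the parameter, so it is the more efficient of the two.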

---
# Properties of Estimators

**Question:** What properties make an estimator reliable?

**Answer 2: Low Variance (a.k.a. Efficiency).**

---
# The Bias-Variance Tradeoff

Should we be willing to take a bit of bias to reduce the variance?

In econometrics, we generally prefer unbiased estimators. Some other disciplines think more about this tradeoff.

---
# Unbiased Estimators

<br>

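In addition to the sample mean, there are several other estimators we will use often. The sample variance and sample covariance are unbiased; the sample correlation is consistent, but not exactly unbiased in small samples.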

- __Sample variance__ to estimate variance `$\sigma^2$`.

- __Sample covariance__ to estimate covariance `$\sigma_{XY}$`.

- __Sample correlation__ to estimate the population correlation coefficient `$\rho_{XY}$`.

---
# Unbiased Estimators

Sample variance `$S_X^2$` is an unbiased estimator of the population variance `$\sigma^2$`

`$$S_{X}^2 = \dfrac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2.$$`

Sample covariance `$S_{XY}$` is an unbiased estimator of the population covariance `$\sigma_{XY}$`

`$$S_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}).$$`

Sample correlation `$r_{XY}$` is a consistent (though, in small samples, slightly biased) estimator of the population correlation coefficient `$\rho_{XY}$`

`$$r_{XY} = \dfrac{S_{XY}}{\sqrt{S_X^2} \sqrt{S_Y^2}}.$$`
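The three formulas above translate directly into code. This is a sketch with made-up data; the function names `sample_cov`, `sample_var`, and `sample_corr` are illustrative, not part of any library.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_cov(xs, ys):
    """S_XY with the n - 1 divisor."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_var(xs):
    """S_X^2 is the sample covariance of X with itself."""
    return sample_cov(xs, xs)

def sample_corr(xs, ys):
    """r_XY = S_XY / (S_X * S_Y)."""
    return sample_cov(xs, ys) / math.sqrt(sample_var(xs) * sample_var(ys))

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(sample_var(x))      # 2.5
print(sample_cov(x, y))   # 2.0
print(sample_corr(x, y))  # 0.8, up to floating-point rounding
```

Writing the variance as the covariance of a variable with itself makes the shared `n - 1` divisor explicit.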

---

# Population Distributions

Suppose we have some estimator `$\hat{\theta}$` for a parameter `$\theta$`:

- We don't know `$\theta$`, but say we assume that `$\hat{\theta}$` follows some probability distribution `$p(\hat{\theta})$`.
- Next, suppose we hypothesize some value for `$\theta$`, say `$\theta = 2.5$`.
- Now we use our estimator `$\hat{\theta}$` to calculate an estimate for `$\theta$`. Say that we get `$\hat{\theta} = 45$`.
- We "know" the distribution of `$\hat{\theta}$`, so we know how likely an estimate as extreme as `$\hat{\theta} = 45$` would be if really `$\theta = 2.5$`. So we can say: "If `$\theta$` really were 2.5, then the probability of getting an estimate as far away as `$\hat{\theta} = 45$` would be super super low. Thus `$\theta = 2.5$` is very hard to square with the data."
- We are able to make a statement about the true value of `$\theta$` just by knowing the distribution of our preferred estimator `$\hat{\theta}$`.

Sweet, but what distribution should we be assuming?

---

# The Central Limit Theorem

.hi-blue[Theorem]<br>
*Let* `$X_1, X_2, \dots, X_n$` *be a random sample from a population with mean* `$\mathop{\mathbb{E}}\left[ X \right] = \mu$` *and variance* `$\text{Var}\left( X \right) = \sigma^2 < \infty$`*, and let* `$\bar{X}$` *be the sample mean and* `$S_X$` *the sample standard deviation.*
*Then, as* `$n\rightarrow \infty$`*, the distribution of* `$\frac{\sqrt{n}\left(\bar{X}-\mu\right)}{S_X}$` *converges to a* .hi-pink[Normal Distribution] *with mean 0 and variance 1.*

- The CLT implies that for large `$n$`, the sample mean is approximately normally distributed, whatever the shape of the population distribution.

- The Law of Large Numbers (LLN) states that as `$n \rightarrow \infty$`, the sample mean converges to the population mean.

- The only unknown parameter is `$\mu$`, and we can get around that by making probabilistic statements about it.
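The theorem can be checked informally by simulation. The sketch below draws samples from a deliberately non-normal population (exponential with rate 1, so the mean is 1; an assumption for illustration) and looks at how often the standardized sample mean lands inside `[-1.96, 1.96]`.

```python
import math
import random
import statistics

rng = random.Random(42)
MU = 1.0              # mean of an exponential population with rate 1 (assumed)
N, REPS = 200, 2_000  # assumed sample size and number of repetitions

def standardized_mean():
    """Compute sqrt(n) * (x_bar - mu) / s_x for one fresh sample."""
    sample = [rng.expovariate(1.0) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    s_x = statistics.stdev(sample)   # sample standard deviation
    return math.sqrt(N) * (x_bar - MU) / s_x

z_values = [standardized_mean() for _ in range(REPS)]

# If the normal approximation is good, roughly 95% of the values
# should fall inside [-1.96, 1.96].
share_inside = sum(abs(z) <= 1.96 for z in z_values) / REPS
print(share_inside)
```

Even though the population is strongly skewed, the share should come out close to 0.95, and it tends to get closer as `N` grows.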

---

# Normal Distribution

.pull-left[
- Domain spans the entire real line

- Centered on the distribution mean `$\mu$`

- The width of the distribution (fatness of its tails) is governed by `$\sigma^2$`
]

.pull-right[
<img src="03-Review_files/figure-html/normdist-1.svg" style="display: block; margin: auto;" />

The greater the variance, the wider the range of values that commonly appear, and hence the more probability mass in the tails.
]

---

# Normal Distribution

.pull-left[
__Rule 1:__ The probability that the random variable takes a value `$x_i$` is 0 for any `$x_i\in {\mathbb{R}}$`<br>
__Rule 2:__ The probability that the random variable takes a value within the range `$[x_i,x_j]$`, where `$x_i \neq x_j$`, is the area under `$p(x)$` between those two values
]

.pull-right[
<img src="03-Review_files/figure-html/normdist_II-1.svg" width="95%" style="display: block; margin: auto;" />
]

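The area above represents a probability of 0.95: for a standard normal variable `$Z$`, `$P(-1.96 \leq Z \leq 1.96) \approx 0.95$`. The values `$\{-1.96, 1.96\}$` are the critical values used to build 95% confidence intervals for `$\mu$`.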
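The 0.95 area and the 1.96 cutoff can be reproduced with Python's standard library, which ships a `NormalDist` class in the `statistics` module. A small sketch:

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)   # standard normal

# Rule 2: the probability of an interval is the area under the density,
# which equals the difference of the CDF at the two endpoints.
area = z.cdf(1.96) - z.cdf(-1.96)
print(round(area, 4))               # 0.95

# Going the other way: the cutoff that leaves 2.5% in the upper tail.
cutoff = z.inv_cdf(0.975)
print(round(cutoff, 2))             # 1.96
```

By symmetry, leaving 2.5% in each tail is what produces the central 95% region.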

---

# Hypothesis Testing

<br>

How do we assess an estimate of the population mean?

- Is it meaningfully different from existing evidence on the population mean?
- Is it _statistically distinguishable_ from previously hypothesized values of the population mean?
- Is the estimate extreme enough to update our prior beliefs about the population mean?

We can conduct statistical tests to address these questions.

---

# Hypothesis Testing

__Null hypothesis (H.sub[0]):__ `$\mu = \mu_0$`

__Alternative hypothesis (H.sub[1]):__ `$\mu \neq \mu_0$`

There are four possible outcomes of our test:

1. We __fail to reject__ the null hypothesis and the null is true.

2. We __reject__ the null hypothesis and the null is false.

3. We __reject__ the null hypothesis, but the null is actually true (**Type I error**).

4. We __fail to reject__ the null hypothesis, but the null is actually false (**Type II error**).

---
# Hypothesis Testing

We __fail to reject__ the null hypothesis and the null is true.

- The defendant was acquitted and he didn't do the crime.

We __reject__ the null hypothesis and the null is false.

- The defendant was convicted and he did the crime.

We __reject__ the null hypothesis, but the null is actually true.

- The defendant was convicted, but he didn't do the crime!
- **Type I error** (a.k.a. _false positive_)

We __fail to reject__ the null hypothesis, but the null is actually false.

- The defendant was acquitted, but he did the crime!
- **Type II error** (a.k.a. _false negative_)

---
# Hypothesis Testing

`$\hat{\mu}$` is random: it could be anything, even if `$\mu = \mu_0$` is true.

- But if `$\mu = \mu_0$` is true, then `$\hat{\mu}$` is unlikely to take values far from `$\mu_0$`.

- As the variance of `$\hat{\mu}$` shrinks, we are even less likely to observe "extreme" values of `$\hat{\mu}$` (assuming `$\mu = \mu_0$`).

Our test should take extreme values of `$\hat{\mu}$` as evidence against the null hypothesis, but it should also weight them by what we know about the variance of `$\hat{\mu}$`.

- For now, we'll assume that the variable of interest `$X$` is normally distributed with mean `$\mu$` and standard deviation `$\sigma$`.

---
# Hypothesis Testing

Reject H.sub[0] if `$\hat{\mu}$` lies in the .hi[rejection region].

- The area of the rejection region is defined by the **significance level** of the test.
- In a 5% test, the area is 0.05. 
- Significance level .mono[=] tolerance for Type I error.

---
# Hypothesis Testing

Reject H.sub[0] if `$\left| z \right| =\left| \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{sd}}(\hat{\mu})} \right| > 1.96$`.

What happens to `$z$` as `$\left| \hat{\mu} - \mu_0 \right|$` increases?

What happens to `$z$` as `$\mathop{\text{sd}}(\hat{\mu})$` increases?
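The rejection rule can be tried out in a simulation sketch. It assumes that `$\mathop{\text{sd}}(\hat{\mu}) = \sigma / \sqrt{n}$` is known and uses made-up values for `$\mu_0$`, `$\sigma$`, and `$n$`; the function name `z_stat` is illustrative. Because the data are generated with the null true, every rejection is a Type I error.

```python
import math
import random
import statistics

MU_0, SIGMA, N, REPS = 0.0, 2.0, 50, 10_000   # assumed values
rng = random.Random(7)

def z_stat(sample, mu_0, sigma):
    """z = (mu_hat - mu_0) / sd(mu_hat), with sd(mu_hat) = sigma / sqrt(n) known."""
    mu_hat = statistics.fmean(sample)
    return (mu_hat - mu_0) / (sigma / math.sqrt(len(sample)))

# Generate data with the null true (mu = mu_0) and count rejections.
rejections = 0
for _ in range(REPS):
    sample = [rng.gauss(MU_0, SIGMA) for _ in range(N)]
    if abs(z_stat(sample, MU_0, SIGMA)) > 1.96:
        rejections += 1

type_1_rate = rejections / REPS
print(type_1_rate)   # should land near 0.05
```

The share of false rejections should be close to 5%, which is what a 5% significance level promises.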

---
# Hypothesis Testing

The formula for the `$z$` statistic assumes that we know `$\mathop{\text{sd}}(\hat{\mu})$`.

- In practice, we don't know `$\mathop{\text{sd}}(\hat{\mu})$`, so we have to estimate it.

If the variance of `$X$` is `$\sigma^2$`, then

`$$\sigma^2_{\hat{\mu}} = \dfrac{\sigma^2}{n}.$$`

- We can estimate `$\sigma^2$` with the sample variance `$S_{X}^2$`.

The estimated variance of the sample mean is
 
`$$S_{\hat{\mu}}^2 = \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2.$$`
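For reference, the result `$\sigma^2_{\hat{\mu}} = \sigma^2 / n$` follows in a few steps, assuming the `$X_i$` are independent draws with common variance `$\sigma^2$` (as under simple random sampling):

`$$\mathop{\text{Var}}\left( \hat{\mu} \right) = \mathop{\text{Var}}\left( \dfrac{1}{n} \sum_{i=1}^n X_i \right) = \dfrac{1}{n^2} \sum_{i=1}^n \mathop{\text{Var}}\left( X_i \right) = \dfrac{n \sigma^2}{n^2} = \dfrac{\sigma^2}{n}.$$`

The second equality is where independence is used: without it, covariance terms between observations would also appear. Replacing `$\sigma^2$` with `$S_X^2$` gives the expression for `$S_{\hat{\mu}}^2$` above.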

---
# Hypothesis Testing

The .hi[standard error] of `$\hat{\mu}$` is the square root of `$S_{\hat{\mu}}^2$`:

`$$\mathop{\text{SE}}(\hat{\mu}) = \sqrt{ \dfrac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2}.$$`

- Standard error = estimated standard deviation of an estimator.

When we use `$\mathop{\text{SE}}(\hat{\mu})$` in place of `$\mathop{\text{sd}}(\hat{\mu})$`, the `$z$` statistic becomes a `$t$` statistic:

`$$t = \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{SE}}(\hat{\mu})}.$$`

- Unlike the standard deviation of `$\hat{\mu}$`, `$\mathop{\text{SE}}(\hat{\mu})$` varies from sample to sample.
- **Consequence:** `$t$` statistics do not necessarily have a normal distribution.
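A small sketch of the standard error and `$t$` statistic formulas, using a made-up sample. The function names `standard_error` and `t_stat` are illustrative.

```python
import math
import statistics

def standard_error(sample):
    """SE(mu_hat) = sqrt( sum (X_i - X_bar)^2 / (n (n - 1)) )."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    return math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n * (n - 1)))

def t_stat(sample, mu_0):
    """t = (mu_hat - mu_0) / SE(mu_hat)."""
    return (statistics.fmean(sample) - mu_0) / standard_error(sample)

# Made-up sample for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standard_error(data))
print(t_stat(data, mu_0=3.0))
```

The standard error here is the same number as the sample standard deviation divided by `$\sqrt{n}$`; a different sample would give a different standard error, which is the extra source of randomness in `$t$`.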

---
# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

- Degrees of freedom .mono[=] 5.

---
count: false

# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

- Degrees of freedom .mono[=] 50.

---
count: false

# Hypothesis Testing

## .hi-green[Normal distribution] vs. .hi-purple[t distribution]

- A normal distribution has the same shape for any sample size.
- The shape of the t distribution depends on the **degrees of freedom**.

- Degrees of freedom .mono[=] 500.

---
# Hypothesis Testing

## **t Tests** (two-sided)

To conduct a t test, compare the `$t$` statistic to the appropriate .hi[critical value] of the t distribution.

- To find the critical value in a t table, we need the degrees of freedom and the significance level `$\alpha$`.

Reject H.sub[0] at the `$\alpha \cdot 100$`-percent level if

`$$\left| t \right| = \left| \dfrac{\hat{\mu} - \mu_0}{\mathop{\text{SE}}(\hat{\mu})} \right| > t_\text{crit}.$$`
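The decision rule can be sketched as follows. Python's standard library has no t distribution, so the critical value is passed in after being looked up in a t table; the data are made up, and the function name `two_sided_t_test` is illustrative.

```python
import math
import statistics

def two_sided_t_test(sample, mu_0, t_crit):
    """Return (t, reject) for H0: mu = mu_0 against H1: mu != mu_0.

    t_crit must be looked up for n - 1 degrees of freedom and the chosen
    significance level; it is passed in rather than computed here.
    """
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    t = (statistics.fmean(sample) - mu_0) / se
    return t, abs(t) > t_crit

# Made-up sample with n = 10, so 9 degrees of freedom.
data = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1, 5.7, 5.4]
T_CRIT_5PCT_DF9 = 2.262    # two-sided 5% critical value from a t table

t, reject = two_sided_t_test(data, mu_0=5.0, t_crit=T_CRIT_5PCT_DF9)
print(t, reject)
```

With only 9 degrees of freedom the critical value is noticeably larger than the normal cutoff of 1.96, so the test demands stronger evidence before rejecting.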

---
# Hypothesis Testing

## On Your Own

As the term progresses, we will encounter additional flavors of hypothesis testing and other related concepts.

You may find it helpful to review the following topics from Math 243:

- Confidence intervals
- One-sided `$t$` tests
- `$p$` values

---
class: inverse, middle

# Working with Data

---
# Data

## Experimental data

Data generated in controlled, laboratory settings.

Ideal for __causal identification__, but difficult to obtain in the social sciences.

- Intractable logistical problems
- Too expensive
- Morally repugnant

Experiments outside the lab: __randomized controlled trials__ and __A/B testing__.

---
# Data

## Observational data

Data generated in non-experimental settings.

- Surveys
- Censuses
- Administrative records
- Environmental data
- Financial and sales transactions
- Social media

Mainstay of economic research, but __poses challenges__ to causal identification.

---
# Tidy Data

<div id="htmlwidget-19da6879aa9ec76511c8" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-19da6879aa9ec76511c8">{"x":{"filter":"none","vertical":false,"fillContainer":false,"data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51"],["Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware","District of Columbia","Florida","Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"],[4779736,710231,6392017,2915918,37253956,5029196,3574097,897934,601723,19687653,9920000,1360301,1567582,12830632,6483802,3046355,2853118,4339367,4533372,1328361,5773552,6547629,9883640,5303925,2967297,5988927,989415,1826341,2700551,1316470,8791894,2059179,19378102,9535483,672591,11536504,3751351,3831074,12702379,1052567,4625364,814180,6346105,25145561,2763885,625741,8001024,6724540,1852994,5686986,563626],[135,19,232,93,1257,65,97,38,99,669,376,7,12,364,142,21,63,116,351,11,293,118,413,53,120,321,12,32,84,5,246,67,517,286,4,310,111,36,457,16,207,8,219,805,22,2,250,93,27,97,5]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th><span style=\"color: #007935 !important\">State<\/span><\/th>\n      <th><span style=\"color: #007935 !important\">Population<\/span><\/th>\n      <th><span style=\"color: #007935 !important\">Murders<\/span><\/th>\n    <\/tr>\n  
<\/thead>\n<\/table>","options":{"pageLength":6,"lengthChange":false,"pagingType":"simple","columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false,"rowCallback":"function(row, data, displayNum, displayIndex, dataIndex) {\nvar value=data[1]; $(this.api().cell(row, 1).node()).css({'color':'#9370DB'});\nvar value=data[2]; $(this.api().cell(row, 2).node()).css({'color':'#9370DB'});\nvar value=data[3]; $(this.api().cell(row, 3).node()).css({'color':'#9370DB'});\nvar value=data[0]; $(this.api().cell(row, 0).node()).css({'color':'#FD5F00'});\n}"}},"evals":["options.rowCallback"],"jsHooks":[]}</script>

Each .hi-purple[value] is associated with an .hi-orange[observation] and a .hi-green[variable].
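One way to picture the tidy layout in code is a list of records, with one record per observation and one key per variable. The three rows below are copied from the table above; the derived variable `murders_per_100k` is an illustrative addition, not part of the original data.

```python
# One dict per observation (a state), one key per variable.
rows = [
    {"state": "Alabama", "population": 4779736, "murders": 135},
    {"state": "Alaska",  "population": 710231,  "murders": 19},
    {"state": "Arizona", "population": 6392017, "murders": 232},
]

# Because every row carries the same variables, adding a derived variable
# is a single pass over the observations.
for row in rows:
    row["murders_per_100k"] = 100_000 * row["murders"] / row["population"]

for row in rows:
    print(row["state"], round(row["murders_per_100k"], 2))
```

Each value can be located by naming its observation (row) and its variable (key), which is the property the tidy format is built around.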


---
# Cross Sectional Data

Ideally, collected using __random sampling__.

- Random sampling .mono[+] sufficient sample size .mono[=] representative sample.

- Random sampling simplifies data analysis, but non-random samples are common (and difficult to work with).

Used extensively in applied microeconomics.<sup>*</sup>

__Main focus of this course.__

.footnote[
<sup>*</sup> Applied microeconomics .mono[=] Labor, health, education, public finance, development, industrial organization, and urban economics.
]

---
# Time Series Data

- Quarterly US GDP
- Annual US infant mortality rates
- Daily Amazon stock prices

Complication: Observations are not independent draws.

- GDP this quarter highly related to GDP last quarter.

Used extensively in empirical macroeconomics.

Requires more-advanced methods (EC 421 and EC 422).

---
# Pooled Cross Sectional Data

Useful for studying policy changes and relationships that change over time.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Panel or Longitudinal Data

- Example: daily attendance data for a sample of students.

Difficult to collect, but useful for causal identification.

- Can control for _unobserved_ characteristics.

Requires more-advanced methods (EC 421 and many 400-level applied micro classes).

---
# Messy Data

**Analysis-ready datasets are rare.** Most data are "messy."

The focus of this class is data analysis, but .hi[data wrangling] is a non-trivial part of a data scientist/analyst's job.

- `readr`, `tidyr`, `dplyr`, `ggplot2` .mono[+] others.

- Known collectively as the `tidyverse`.

---
# .mono[tidyverse]

## The [`tidyverse`](https://www.tidyverse.org): A package of packages

`readr`: Functions to import data.

`tidyr`: Functions to reshape messy data.

`dplyr`: Functions to work with data.

`ggplot2`: Functions to visualize data.

---
# Workflow

- Step 1: Load packages with `pacman`

- Step 2: Import data with `readr`

- Step 3: Reshape data with `tidyr`

- Step 4: Manipulate data with `dplyr`

- Step 5: Visualize and analyze data with `ggplot2`

---
# Why Bother?

**Q:** Why not just use .mono[.hi-green[MS Excel]] for data wrangling?

**A:** .hi[Reproducibility]

- Easier to retrace your steps with .mono[R].

**A:** .hi[Portability]

- Easy to re-purpose .mono[R] code for new projects.

**A:** .hi[Scalability]

- .mono[Excel] chokes on big datasets.

**A:** .hi[.mono[R] Saves time] (eventually)

- Lower marginal costs in exchange for higher fixed costs.

---