class: center, middle, inverse, title-slide .title[ # Topic 8: Exploratory Analysis ] .subtitle[ ## Part 2: See how variables relate to each other ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] --- name: toc <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .medsmall { font-size: 90%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of contents 1. [Describing relationships](#xy) 1. [Conditional expectations](#cef) 1. [Adjusting for other variables](#adjusting) 1. [Smoothing](#smooth) --- class: inverse, middle name: challenge # A brief challenge --- # Challenge Load this data: ```r puzzle = read_csv("https://bit.ly/3B4BraF") ``` </br> **Using your exploratory analysis skills, find the most important thing about this dataset.** --- # Challenge -- --- class: inverse, middle name: xy # Describing relationships --- # Anscombe's Quartet All 4 of these datasets have the same means, standard deviations, correlation coefficient, linear regression line, and regression R-squared. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-3-1.png" width="70%" style="display: block; margin: auto;" /> **Always plot your data!** Don't rely on summary statistics alone. --- # Scatterplots Long before you run a regression, you should be looking at the scatterplots. ```r vienna = read_csv("https://osf.io/y6jvb/download") ggplot(vienna, aes(x = rating_count, y = price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Transformations are often extremely important to learning what's really going on. ```r vienna = vienna |> mutate(ln_price = log(price), ln_rating_count = log(rating_count)) ggplot(vienna, aes(x = ln_rating_count, y = ln_price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r library(gapminder) gap07 = gapminder |> filter(year == 2007) |> mutate(gdp = pop * gdpPercap) ggplot(gap07, aes(x = pop, y = gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r gap07 = gap07 |> mutate(ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(x = pop, y = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" /> --- # Scatterplots Another example of how helpful transformations can be. ```r gap07 = gap07 |> mutate(ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(x = ln_pop, y = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter When you have a lot of data, scatterplots get too crowded to be useful! Download this data scraped from Airbnb for listings in London on the night of 4 March 2017. 
```r london = read_csv("https://osf.io/ey5p7/download") london2 = london |> mutate(ln_price = log(price)) ggplot(london2, aes(x = longitude, y = ln_price)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter An extremely useful alternative is the **binscatter** plot. Binscatter divides your X variable into "bins" and plots the mean of Y within each bin. --- # Binscatter First, create the bins. They can be equally spaced or equally sized (quantiles). - Quantiles are preferred, so each bin represents the same amount of underlying data. ```r london2 = london2 |> mutate(longitude_bins = factor(ntile(longitude, 40)), bin_color = factor(ntile(longitude, 40) %% 8)) ggplot(london2, aes(x = longitude, y = ln_price, color = bin_color)) + geom_point() + scale_color_brewer(palette = "Set2") ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Then, calculate the means of both X and Y within each bin. ```r london3 = london2 |> group_by(longitude_bins) |> mutate(ln_price_binned = mean(ln_price, na.rm = TRUE), longitude_binned = mean(longitude, na.rm = TRUE)) ggplot(london3) + geom_point(aes(x = longitude, y = ln_price, color = bin_color)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + scale_color_brewer(palette = "Set2") ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Let's show just the binned points: ```r london_binned = london2 |> group_by(longitude_bins) |> summarize(across(c("ln_price", "longitude"), mean, na.rm = TRUE)) ggplot(london_binned, aes(x = longitude, y = ln_price)) + geom_point(size = 3) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" /> --- # Binscatter Easier: Use `binsreg` ([Cattaneo, Crump, Farrell, and Feng 2021](https://arxiv.org/abs/1902.09608)) - Correctly handles many of the finer statistical details. ```r install.packages("binsreg") ``` ```r library(binsreg) binsreg(london2$ln_price, london2$longitude) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" /> ``` ## Call: binsreg ## ## Binscatter Plot ## Bin/Degree selection method (binsmethod) = IMSE direct plug-in (select # of bins) ## Placement (binspos) = Quantile-spaced ## Derivative (deriv) = 0 ## ## Group (by) = Full Sample ## Sample size (n) = 53817 ## # of distinct values (Ndist) = 53817 ## # of clusters (Nclust) = NA ## dots, degree (p) = 0 ## dots, smoothness (s) = 0 ## # of bins (nbins) = 69 ``` --- # Local regression A useful alternative to binscatter is **local regression.** Local regression gives a smooth line that closely resembles the binscatter plot. What does it do exactly? We'll come back to this. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_smooth() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" /> --- # Local regression A useful alternative to binscatter is **local regression.** Local regression gives a smooth line that closely resembles the binscatter plot. What does it do exactly? We'll come back to this. 
```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + geom_smooth() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) + geom_smooth(method = "lm", formula = y ~ x) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. ```r ggplot(london3, aes(x = longitude, y = ln_price)) + geom_smooth(data = filter(london3, longitude < -0.17), method = 'lm', formula = y~x) + geom_smooth(data = filter(london3, longitude > -0.17), method = 'lm', formula = y~x) + geom_point(aes(x = longitude_binned, y = ln_price_binned), size = 2) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" /> --- # Linear regression (OLS) You can also add the **line of best fit**, aka the **least-squares regression line.** Clearly this is more useful in some situations than others. **So if you just ran a regression without plotting your data, you would get a woefully misleading answer!** --- class: inverse, middle name: cef # Conditional Expectations Images and code in this section are from ["Why Regression?"](https://github.com/edrubin/EC607S21/tree/master/notes-lecture/03-why-regression) by Ed Rubin, used with permission, and are not included under this resource's overall CC license. --- # The Conditional Expectation Function (CEF) The CEF tells us the **expected value** (population mean) of an outcome variable `\(Y_i\)` at each value of `\(X_i\)`. $$ \mathbb{E}[Y_i|X_i] $$ Examples: - `\(\mathbb{E}[\text{Price}_i|\text{Longitude}_i]\)` - `\(\mathbb{E}[\text{Income}_i|\text{Education}_i]\)` - `\(\mathbb{E}[\text{Birth weight}_i|\text{Air pollution}_i]\)` </br> Binscatter and local regression give **nonparametric estimates** of the CEF. - Nonparametric: functional form is not pre-specified but rather driven by the data. --- class: clear, middle, center The conditional distributions of `\(\text{Y}_{i}\)` for `\(\text{X}_{i} = x\)` in 8, ..., 22. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- class: clear, middle, center The CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`, connects these conditional distributions' means. <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- class: clear, middle, center Focusing in on the CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`... <img src="08-Exploratory-part2_files/figure-html/fig_cef_only-1.png" style="display: block; margin: auto;" /> --- # The Conditional Expectation Function A well-chosen CEF is (almost) everything you need for empirical work. - For **prediction:** the CEF is the best predictor of `\(Y_i\)` given `\(X_i\)`. - In the sense of minimizing mean squared error (bias squared plus variance). 
- For **causal inference:** when the variation in `\(X\)` is as-good-as-random, then the CEF shows how `\(X\)` causally affects `\(Y\)`. - Specifically, the slope of the CEF is the **average causal response** of `\(Y\)` to a 1-unit change in `\(X\)`, for each value of `\(X\)`. -- </br> So... **why are we always running regressions??** --- # Why Use Regression? **Regression gives the best linear approximation of the CEF.** - Again, "best" in the sense of lowest possible mean squared error. **Why?** The intuition: - Regression minimizes the *vertical* sum of squared errors. - The mean *also* minimizes the sum of squared errors for a given value of `\(X_i\)`. - So the regression line approximates the mean over all values of `\(X_i\)`, or the CEF. - Try out [this game](https://old.mathematik.tu-clausthal.de/en/mathematics-interactive/statistics/guessing-the-regression-line/). --- # Why Use Regression? **Regression provides the best linear approximation of the CEF.** - Again, "best" in the sense of lowest possible mean squared error. Regression is a **good way to summarize the CEF,** even when the CEF is not linear. - Your regression can be useful even if it is not a literally correct model of the true data generating process. - But, different regressions may be more or less useful (better or worse approximations of the CEF). How useful is *your* regression in *your* situation? - Look at a nonparametric estimate of your CEF to see if your regression is a good approximation. - That is: binscatter or local regression. --- # Why Use Regression? My view: **By the time you actually run a regression, the results should never be a surprise.** - Because you have already learned what the underlying CEF looks like. -- So what's the point of running regressions? 1. **Quantification:** To measure the average slope of the CEF. 2. **Statistical inference:** To compute standard errors. </br> -- What if linear regression is not a good approximation to the CEF? - You can always add nonlinear *terms* to a regression to improve the approximation. - Doesn't change the fact that regression provides a linear *approximation*, in the sense that all the terms enter the model in a linear combination. --- class: inverse, middle name: adjusting # Adjusting for other variables --- # Simpson's Paradox A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/Simpsons_paradox_-_animation.gif" width="67%" style="display: block; margin: auto;" /> Which trend is the "right" one? .small[Image from [Wikipedia](https://commons.wikimedia.org/wiki/File:Simpsons_paradox_-_animation.gif) and used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).] --- # The Simpsons Paradox A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/the-simpsons-paradox.jfif" width="67%" style="display: block; margin: auto;" /> .small[Image from [RJ Andrews @infowetrust](https://twitter.com/infowetrust/status/984536880199876608) and not included in the CC license.] --- # The Simpsons Pair O' Docs A trend in a full dataset can disappear or even reverse when looking at the constituent groups individually. <img src="img/the-simpsons-pair-o-docs.jfif" width="73%" style="display: block; margin: auto;" /> .small[Image from [RJ Andrews @infowetrust](https://twitter.com/infowetrust/status/984536880199876608) and not included in the CC license.] 
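---

# Simpson's Paradox

Here is a minimal simulated sketch of the reversal (made-up data, not from any dataset in this lecture). Within each group, `y` *falls* with `x`, but the pooled slope is positive because the high-`x` group also has a much higher intercept:

```r
set.seed(460)  # arbitrary seed, for reproducibility
sim = tibble(group = rep(c("A", "B"), each = 200)) |> 
  mutate(x = rnorm(n(), mean = if_else(group == "A", 2, 6)),
         y = if_else(group == "A", 1, 8) - 0.5 * x + rnorm(n()))

coef(lm(y ~ x, data = sim))["x"]                          # pooled slope: positive
coef(lm(y ~ x, data = filter(sim, group == "A")))["x"]    # within-group slope: negative

ggplot(sim, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +                                 # one fit per group
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, color = "black")  # pooled fit
```

Which slope is the "right" one depends on whether the grouping variable is something you should adjust for -- which is what the rest of this section is about.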
--- # Adjusting for a binary variable Let's go back to the `gapminder` data and plot life expectancy by GDP: ```r library(gapminder) gap07 = gapminder |> filter(year == 2007) |> mutate(gdp = pop * gdpPercap, ln_gdp = log(gdp), ln_pop = log(pop)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Suppose you think there might be something different about Africa. Create a binary variable and color the dots by it: ```r gap07 = gap07 |> mutate(africa = if_else(continent == "Africa", 1, 0)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Suppose you think there might be something different about Africa. Create a binary variable and color the dots by it: ```r gap07 = gap07 |> mutate(africa = factor(if_else(continent == "Africa", 1, 0))) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-26-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable We can **adjust** for the Africa indicator, in order to remove its influence on the relationship we're seeing between GDP and life expectancy. --- # Adjusting for a binary variable First, find the mean differences in the `\(X\)` variable that are explained by `Africa`. ```r gap07 = gap07 |> group_by(africa) |> mutate(ln_gdp_by_africa = mean(ln_gdp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp, color = africa)) + geom_point() + geom_vline(aes(xintercept = ln_gdp_by_africa, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-27-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Remove the differences explained by `Africa`, by subtracting these group means from `\(X\)`. ```r gap07 = gap07 |> group_by(africa) |> mutate(ln_gdp_adj_africa = ln_gdp - mean(ln_gdp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_vline(aes(xintercept = 0, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-28-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Second, find the mean differences in the `\(Y\)` variable that are explained by `Africa`. ```r gap07 = gap07 |> group_by(africa) |> mutate(lifeExp_by_africa = mean(lifeExp)) ggplot(gap07, aes(y = lifeExp, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_hline(aes(yintercept = lifeExp_by_africa, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable Remove the differences explained by `Africa`, by subtracting these group means from `\(Y\)`. 
```r gap07 = gap07 |> group_by(africa) |> mutate(lifeExp_adj_africa = lifeExp - mean(lifeExp)) ggplot(gap07, aes(y = lifeExp_adj_africa, x = ln_gdp_adj_africa, color = africa)) + geom_point() + geom_hline(aes(yintercept = 0, color = africa)) ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-30-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable For a more easily interpretable graph, add the "grand" (overall sample) means back in: ```r gap07 = gap07 |> ungroup() |> mutate(ln_gdp_adj_africa = ln_gdp_adj_africa + mean(ln_gdp), lifeExp_adj_africa = lifeExp_adj_africa + mean(lifeExp)) ggplot(gap07, aes(y = lifeExp_adj_africa, x = ln_gdp_adj_africa, color = africa)) + geom_point() ``` <img src="08-Exploratory-part2_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" /> --- # Adjusting for a binary variable **This is exactly what happens when you "control for" a variable in a regression.** Note: You do have to adjust both your X and Y variables. Adjusting only Y is not enough -- it'll give the wrong answer. (Look up the Frisch-Waugh Theorem.) - One common place this trips people up is in seasonal adjustment. -- Did we gain anything useful by adjusting for whether a country is in Africa? <img src="img/you-were-so-preoccupied-with-whether-or-not-you-could-you-didnt-stop-to-think-if-you-should.jpg" width="70%" style="display: block; margin: auto;" /> --- # Within-unit variation What if, instead of a binary indicator variable, we want to adjust for many different categories? - For example, we might observe multiple observations over time for the same person, household, firm, or county. We can follow the same process of subtracting out group means as we did for the binary variable. The only difference is that we might have thousands of different groups, instead of just two. Now the variation remaining is all **within-unit.** -- This is exactly what **fixed effects** regression does. --- # Within-unit variation <img src="img/Animation of Fixed Effects.gif" style="display: block; margin: auto;" /> .small[Image by [Nick Huntington-Klein](https://nickchk.com/causalgraphs.html) and not included in this resource's overall CC license.] --- class: inverse, middle name: smooth # Smoothing Parts of this section are adapted from [“Introduction to Data Science”](http://rafalab.dfci.harvard.edu/dsbook/smoothing.html) by Rafael A. Irizarry, used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0). --- # Smoothing **Basic idea:** Separate true signal (a predictable data generating process) from the noise. - Problem: Hard to tell the difference. - This is a general problem in prediction. <img src="img/signal-plus-noise-example-1.png" width="50%" style="display: block; margin: auto;" /> --- # Bin smoothing / moving averages How to best estimate a CEF? We can think of this as a problem of predicting `$$Y_i = f(x_i) + \epsilon_i$$` where `$$f(x) = \mathbb{E}[Y|X = x_i].$$` One option that holds intuitive appeal is **bin smoothing.** This is like binscatter, but with overlapping bins. Take moving averages of `\(Y\)` across your observations `\(i\)` at each point `\(x_0\)` in the support of `\(x\)`: $$ \hat{f}(x_0) = \frac{ \sum_i { Y_i \space 1 \\big\\{ |x_i - x_0| \leq h \\big\\} } } { \sum_i {1 \\big\\{ |x_i - x_0| \leq h \\big\\}} } $$ where `\(h\)` is the bandwidth. E.g., for a weekly moving average, `\(h = 3.5\)`. 
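---

# Bin smoothing / moving averages

A minimal by-hand sketch of this estimator, using the `london2` Airbnb data from earlier. The bandwidth of 0.01 degrees of longitude and the 200-point grid are arbitrary choices for illustration:

```r
h = 0.01  # bandwidth: half-width of each window (arbitrary, for illustration)
grid = tibble(x0 = seq(min(london2$longitude, na.rm = TRUE),
                       max(london2$longitude, na.rm = TRUE),
                       length.out = 200))

# At each evaluation point x0, average ln_price over all listings whose
# longitude falls within h of x0 -- a moving average with overlapping windows
smoothed = grid |> 
  rowwise() |> 
  mutate(fhat = mean(london2$ln_price[abs(london2$longitude - x0) <= h],
                     na.rm = TRUE)) |> 
  ungroup() |> 
  filter(is.finite(fhat))  # drop grid points whose window contains no listings

ggplot(london2, aes(x = longitude, y = ln_price)) +
  geom_point(alpha = 0.1) +
  geom_line(data = smoothed, aes(x = x0, y = fhat), color = "blue")
```

Same idea as binscatter, except that the windows overlap instead of partitioning the data.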
--- # Bin smoothing / moving averages Do this at enough values of `\(x_0\)` that it appears continuous: <img src="img/binsmoother-animation.gif" style="display: block; margin: auto;" /> --- # Kernels We can reduce the variance of this estimate by choosing different weights `\(w_0\)` in the moving average: $$ \hat{f}(x_0) = \sum_i { Y_i \space w_0(x_i) } $$ In the moving average example, the weights were either `\(0\)` or `\(1/N_0\)`, where `\(N_0\)` is the number of observations that fall within the window. - Turn on and off discretely as data points enter and exit the window. </br> Instead, we can choose weights that vary more smoothly. These weights are described with a **kernel** function. - Nonnegative, symmetric, integrates to 1. --- # Some kernels Uniform kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{1}{2} \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_uniform.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Triangular kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = (1-|u|) \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_triangle.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Epanechnikov kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{3}{4} (1-u^2) \space 1\\{|u| \leq 1 \\}. $$ <img src="img/Kernel_epanechnikov.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Some kernels Gaussian (normal) kernel: $$ w_0(x_i) = W(\frac{x_i - x_0}{h}), \space W(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}. $$ <img src="img/Kernel_exponential.svg" width="60%" style="display: block; margin: auto;" /> .small[Image from <a href = "https://en.wikipedia.org/wiki/Kernel_(statistics)">Wikipedia</a> and used under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en).] --- # Local regression Instead of treating the mean as constant within each window... - **Fit a regression line** within each window. Regression is estimated via weighted least squares, using the kernel weights. <img src="img/loess-animation.gif" width="50%" style="display: block; margin: auto;" /> --- # Local regression Instead of treating the mean as constant within each window... - **Fit a regression line** within each window. As usual, larger bandwidths lead to more smoothing. <img src="img/loess-multi-span-animation.gif" width="50%" style="display: block; margin: auto;" /> --- # Local regression >> bin smoothing Local regression is preferred over bin smoothing / moving averages. Moving averages can be badly biased at the edges of the data. --- # Beware of default arguments Be careful with `geom_smooth`. The default options may not be what you intend. - With `\(<1000\)` observations, uses **local polynomial** smoothing (`loess`) of **degree 2**. - With `\(\geq 1000\)` observations, uses **penalized regression splines** in a generalized additive model (`gam`), an entirely different method.
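For example, the same call can fit two entirely different models depending on the sample size (a quick check using the `london2` data from earlier; the 500-row subsample is an arbitrary choice):

```r
# With a subsample of < 1000 rows, geom_smooth() defaults to loess (degree 2)
ggplot(slice_sample(london2, n = 500), aes(x = longitude, y = ln_price)) +
  geom_point() + geom_smooth()

# With the full ~54,000 rows, the same call silently switches to gam
ggplot(london2, aes(x = longitude, y = ln_price)) +
  geom_point() + geom_smooth()
```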
To get *local linear* regression in `ggplot`, specify: `geom_smooth(method = "loess", method.args = list(degree = 1))`. --- # Beware of default arguments `loess` does not allow you to choose the bandwidth or kernel. - Instead, you can set the share of *all* data used at each point (the "span"). - Default: `span = 0.75`. - Default kernel: tricube. Alternatives: - Package `locpol` allows you to choose the kernel and bandwidth - Or choose it "optimally" based on cross-validation or a "rule of thumb" formula. - `gam` estimates the smoothing parameter for you, but it's uncommon in economics. Practical advice: - Use `locpol` for important results, but `geom_smooth` is fine for quick-and-dirty answers. - Robustness to method and bandwidth is more important than the particular choices themselves. --- # Summary of Part 2 ### Describing relationships * **ALWAYS.** PLOT. YOUR. DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. **PLOT.** YOUR. DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. PLOT. **YOUR.** DATA. * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. * To use only **within-group variation**, subtract group means from both X and Y variables. --- # Summary of Part 2 ### Describing relationships * ALWAYS. PLOT. YOUR. **DATA.** * Try log transformations to make more sense of scatterplots in right-skewed data. * Use **binscatter** or **local regression** to find patterns in large datasets. * Both are nonparametric ways to estimate the **conditional expectation function**. * Regression provides the **best linear approximation** to the CEF. * When using local regression, pay close attention to the choice of **bandwidth.** ### Adjusting for other variables * Patterns within groups can look completely different when the groups are combined. 
* To use only **within-group variation**, subtract group means from both X and Y variables.