---
title: "Lecture .mono[000]"
subtitle: "Why are we here?"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "09 January 2020"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{R, setup, include = F}
library(pacman)
p_load(
  broom, tidyverse,
  ggplot2, ggthemes, ggforce, ggridges, cowplot,
  latex2exp, viridis, extrafont, gridExtra, kableExtra, snakecase, janitor,
  data.table, dplyr, lubridate,
  knitr, future, furrr,
  estimatr, FNN, parsnip,
  huxtable, here, magrittr
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
  comment = "#>",
  fig.align = "center",
  fig.height = 7,
  fig.width = 10.5,
  warning = F,
  message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
  svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```

---
class: inverse, middle

# Admin

---
name: admin

# Admin

.hi-slate[In-class today]

- .b[Course website:] [https://github.com/edrubin/EC524W20/](https://github.com/edrubin/EC524W20/)
- [Syllabus](https://raw.githack.com/edrubin/EC524W20/master/syllabus/syllabus.pdf) (on website)

.hi-slate[.mono[TODO] list]

- [Assignment](https://github.com/edrubin/EC524W20/tree/master/projects/kaggle-house-prices) (from Tuesday) due Thursday
- Readings for next time:
  - ISL Ch1–Ch2
  - [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)

---
class: inverse, middle

# What's the goal?

---
layout: true

# What's the goal?

---
name: different

## What's different?

We've got a whole class on .hi-purple[prediction]. Why?

--

Up to this point, we've focused on causal .hi[identification/inference] of $\beta$, _i.e._,

$$\color{#6A5ACD}{\text{Y}_{i}} = \text{X}_{i} \color{#e64173}{\beta} + u_i$$

meaning we want an unbiased (consistent) and precise estimate $\color{#e64173}{\hat\beta}$.

--

With .hi-purple[prediction], we shift our focus to accurately estimating outcomes.

In other words, how can we best construct $\color{#6A5ACD}{\hat{\text{Y}}_{i}}$?

---

## ... so?

So we want "nice"-performing estimates $\hat y$ instead of $\hat\beta$.

.qa[Q] Can't we just use the same methods (_i.e._, OLS)?

--

.qa[A] It depends.

--

How well does your .hi[linear]-regression model approximate the underlying data? (And how do you plan to select your model?)

--

.note[Recall] Least-squares regression is a great .hi[linear] estimator.
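---

## One fit, two questions

To make this distinction concrete before we formalize it, here is a minimal sketch on made-up data (the names `toy_df` and `fit_ols` are just illustrative): a single least-squares fit serves both goals, but we read different objects off of it.

```{R, ols two uses}
# Simulate a toy dataset where y is linear in x (purely illustrative)
set.seed(101)
toy_df = tibble(x = runif(1e3), y = 1 + 2 * x + rnorm(1e3))
# One least-squares fit...
fit_ols = lm(y ~ x, data = toy_df)
# ...two uses
coef(fit_ols)          # inference: estimated intercept and slope (true values: 1 and 2)
head(fitted(fit_ols))  # prediction: the fitted values, i.e., the y-hats
```

For .hi-purple[prediction], the question is no longer whether $\hat\beta$ is unbiased and precise; it is how well those $\hat y$ line up with outcomes we have not yet seen.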
---
layout: false
class: clear, middle

Data can be tricky.super[.pink[†]]—as can understanding many relationships.

.footnote[
.pink[†] "Tricky" might mean nonlinear... or many other things...
]

---
layout: true
class: clear

---
exclude: true

```{R, data prediction types, include = F, cache = T}
# Generate data
n = 1e3
set.seed(123)
tmp_df = tibble(
  x = runif(n = n, min = -10, max = 10),
  er = rnorm(n = n, sd = 70),
  y = 0.3 * x^2 + sqrt(abs(er)) - 17
) %>%
  mutate(y = ifelse(y > 0, y, 5 - x + 0.03 * er)) %>%
  mutate(y = abs(y)^0.7)
# Estimate
knn_df = tibble(
  x = seq(-10, 10, by = 0.1),
  y = knn.reg(
    train = tmp_df[,"x"] %>% as.matrix(),
    test = tibble(x = seq(-10, 10, by = 0.1)),
    y = tmp_df[,"y"] %>% as.matrix(),
    k = 100
  )$pred
)
knn_c_df = tibble(
  x = seq(-10, 10, by = 0.1),
  y = knn.reg(
    train = tmp_df[,"x"] %>% as.matrix(),
    test = tibble(x = seq(-10, 10, by = 0.1)),
    y = tmp_df[,"y"] %>% as.matrix(),
    k = 10
  )$pred
)
# Random forest
rf_model = rand_forest(mode = "regression", mtry = 1, trees = 10000) %>%
  set_engine("ranger", seed = 12345, num.threads = 10) %>%
  fit_xy(y = tmp_df[,"y"], x = tmp_df[,"x"])
# Predict onto testing data
rf_df = tibble(
  y = predict(rf_model, new_data = tibble(x = seq(-10, 10, by = 0.1)))$.pred,
  x = seq(-10, 10, by = 0.1)
)
# The plot
gg_basic = ggplot(data = tmp_df, aes(x, y)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_point(size = 3.5, shape = 19, alpha = 0.05) +
  geom_point(size = 3.5, shape = 1, alpha = 0.2) +
  theme_void(base_family = "Fira Sans Book") +
  xlab("X") + ylab("Y") +
  theme(
    axis.title.y.left = element_text(size = 18, hjust = 1, vjust = 0),
    axis.title.x.bottom = element_text(size = 18, hjust = 0.5, vjust = 0.5, angle = 0)
  )
```

---
name: graph-example

.white[blah]

```{R, plot points, echo = F}
gg_basic
```

---

.pink[Linear regression]

```{R, plot ols, echo = F}
gg_basic +
  geom_smooth(method = "lm", se = F, color = red_pink, size = 1.25)
```

---

.pink[Linear regression], .turquoise[linear regression] $\color{#20B2AA}{\left( x^4 \right)}$

```{R, plot ols poly, echo = F}
gg_basic +
  geom_smooth(method = "lm", se = F, color = red_pink, size = 1.25) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F, color = turquoise, size = 1.25)
```

---

.pink[Linear regression], .turquoise[linear regression] $\color{#20B2AA}{\left( x^4 \right)}$, .purple[KNN (100)]

```{R, plot knn, echo = F}
gg_basic +
  geom_smooth(method = "lm", se = F, color = red_pink, size = 1.25) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F, color = turquoise, size = 1.25) +
  geom_path(data = knn_df, color = purple, size = 1.25)
```

---

.pink[Linear regression], .turquoise[linear regression] $\color{#20B2AA}{\left( x^4 \right)}$, .purple[KNN (100)], .orange[KNN (10)]

```{R, plot knn more, echo = F}
gg_basic +
  geom_smooth(method = "lm", se = F, color = red_pink, size = 1.25) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F, color = turquoise, size = 1.25) +
  geom_path(data = knn_df, color = purple, size = 1.25) +
  geom_path(data = knn_c_df, color = orange, size = 1.25)
```

---

.pink[Linear regression], .turquoise[linear regression] $\color{#20B2AA}{\left( x^4 \right)}$, .purple[KNN (100)], .orange[KNN (10)], .slate[random forest]

```{R, plot rf, echo = F}
gg_basic +
  geom_smooth(method = "lm", se = F, color = red_pink, size = 1.25) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = F, color = turquoise, size = 1.25) +
  geom_path(data = knn_df, color = purple, size = 1.25) +
  geom_path(data = knn_c_df, color = orange, size = 1.25) +
  geom_path(data = rf_df, color = slate, size = 1.25)
```

---
class: clear, middle

.note[Note] That example only had one predictor...
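---

Before formalizing anything, here is one rough way to compare those fits: a sketch that reuses the simulated data (`tmp_df`) behind the figures, holds out part of it as a test sample, and compares mean squared errors. (The split, the seed, and the object names are illustrative choices, not a recipe.)

```{R, train test sketch}
# Hold out 200 of the 1,000 simulated observations as a test sample
set.seed(456)
test_i = sample(1:nrow(tmp_df), size = 200)
train_df = tmp_df[-test_i, ]
test_df = tmp_df[test_i, ]
# Fit a line and a fairly flexible KNN (k = 10) using only the training sample
fit_line = lm(y ~ x, data = train_df)
knn_10 = knn.reg(
  train = train_df[, "x"] %>% as.matrix(),
  test = test_df[, "x"] %>% as.matrix(),
  y = train_df[, "y"] %>% as.matrix(),
  k = 10
)
# Test-sample MSE for each
mean((test_df$y - predict(fit_line, newdata = test_df))^2)
mean((test_df$y - knn_10$pred)^2)
```

"Fits the training data well" and "predicts new data well" are different criteria; we will define both carefully next time.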
---
layout: false
name: tradeoffs

# What's the goal?
## Tradeoffs

In prediction, we constantly face many tradeoffs, _e.g._,

- .hi[flexibility] and .hi-slate[parametric structure] (and interpretability)
- performance in .hi[training] and .hi-slate[test] samples
- .hi[variance] and .hi-slate[bias]

--

As your economic training should have predicted, in each setting, we need to .b[balance the additional benefits and costs] of adjusting these tradeoffs.

--

Many machine-learning (ML) techniques/algorithms are crafted to navigate these tradeoffs, but the practitioner (you) still needs to be careful.

---
name: more-goals

# What's the goal?

There are many reasons to step outside the world of linear regression...

--

.hi-slate[Multi-class] classification problems

- Rather than {0,1}, we need to classify $y_i$ into 1 of K classes
- _E.g._, ER patients: {heart attack, drug overdose, stroke, nothing}
- A short .mono[R] sketch appears in the [appendix](#appendix-multiclass) at the end of these slides

--

.hi-slate[Text analysis] and .hi-slate[image recognition]

- Comb through sentences (pixels) to glean insights from relationships
- _E.g._, detect sentiments in tweets or roof-top solar in satellite imagery

--

.hi-slate[Unsupervised learning]

- You don't know groupings, but you think there are relevant groups
- _E.g._, classify spatial data into groups

---
layout: true
class: clear, middle

---
name: example-articles

```{R, xray image, echo = F, out.width = '90%'}
knitr::include_graphics("images/ml-xray.png")
```

---

```{R, cars image, echo = F, out.width = '90%'}
knitr::include_graphics("images/ml-cars.png")
```

---

```{R, ny image, echo = F, out.width = '90%'}
knitr::include_graphics("images/ml-writing.png")
```

---

```{R, gender-race image, echo = F, out.width = '90%'}
knitr::include_graphics("images/ml-issues.jpeg")
```

---

# Takeaways?

What are your main takeaways from these examples?

--

.note[Mine]

- Interactions and nonlinearities likely matter
- .it[Engineering] features/variables can be important
- Flexibility is huge—but we still want to avoid overfitting

---
class: clear, middle

.qa[Q] What have you learned/noticed in your first project?

---
class: clear, middle

.note[Next time] Start formal building blocks of prediction.

---
name: sources
layout: false

# Sources

Sources (articles) of images

- [Deep learning and radiology](https://www.smart2zero.com/news/algorithm-beats-radiologists-diagnosing-x-rays)
- [Parking lot detection](https://www.smart2zero.com/news/algorithm-beats-radiologists-diagnosing-x-rays)
- [.it[New Yorker] writing](https://www.newyorker.com/magazine/2019/10/14/can-a-machine-learn-to-write-for-the-new-yorker)
- [Gender Shades](http://gendershades.org/overview.html)

---
# Table of contents

.col-left[
.small[
#### Admin
- [Today and upcoming](#admin)

#### What's the goal?
- [What's different?](#different)
- [Graphical example](#graph-example)
- [Tradeoffs](#tradeoffs)
- [More goals](#more-goals)
- [Examples](#example-articles)

#### Other
- [Image sources](#sources)
]
]

---
exclude: true

```{R, save pdfs, include = F}
# PDF with pauses
xaringan::decktape("000-slides.html", "000-slides.pdf")
# PDF without pauses
pagedown::chrome_print("000-slides.html", "000-slides-no-pause.pdf")
```
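---
name: appendix-multiclass

# Appendix: Multi-class classification

A minimal sketch of the [multi-class idea](#more-goals): instead of our hypothetical ER patients, it uses .mono[R]'s built-in `iris` data (three classes, four numeric predictors) and multinomial logistic regression via `nnet::multinom()`. The hold-out split, seed, and object names are illustrative choices.

```{R, appendix multiclass}
# Hold out 50 of the 150 observations as a test sample
set.seed(789)
test_i2 = sample(1:nrow(iris), size = 50)
# Fit a multinomial logit on the training sample (nnet ships with R)
fit_multi = nnet::multinom(Species ~ ., data = iris[-test_i2, ], trace = FALSE)
# Predict a class for each test observation and tabulate predictions vs. truth
pred_class = predict(fit_multi, newdata = iris[test_i2, ], type = "class")
table(predicted = pred_class, actual = iris$Species[test_i2])
```

The same logic extends to the ER example: swap the three species for {heart attack, drug overdose, stroke, nothing} and the four flower measurements for whatever patient data the ER observes.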