--- title: "Lecture .mono[006]" subtitle: "Classification" author: "Edward Rubin" #date: "`r format(Sys.time(), '%d %B %Y')`" date: "13 February 2020" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{R, setup, include = F} library(pacman) p_load( ISLR, broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot, scales, latex2exp, viridis, extrafont, gridExtra, plotly, ggformula, kableExtra, DT, data.table, dplyr, snakecase, janitor, lubridate, knitr, future, furrr, MASS, estimatr, caret, glmnet, huxtable, here, magrittr, parallel ) # Define colors red_pink = "#e64173" turquoise = "#20B2AA" orange = "#FFA500" red = "#fb6107" blue = "#181485" navy = "#150E37FF" green = "#8bb174" yellow = "#D8BD44" grey_light = "grey70" grey_mid = "grey50" grey_dark = "grey20" purple = "#6A5ACD" slate = "#314f4f" # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(knitr.table.format = "html") ``` --- layout: true # Admin --- class: inverse, middle --- name: admin-today ## Material .b[Last time] Shrinkage methods - Ridge regression - (The) lasso - Elasticnet .b[Today] Classification methods - Introduction to classification - Linear probability models - Logistic regression .note[Also:] Class will end today at 11:30am..super[.pink[†]] .footnote[ .pink[†] 🎉? ] --- name: admin-soon ## Upcoming .b[Readings] .note[Today] .it[ISL] Ch. 4 .b[Problem sets] - .it[Shrinkage methods] Due today - .it[Classification] Due next week --- layout: true # Classification --- class: inverse, middle --- name: intro ## Intro .attn[Regression problems] seek to predict the number an outcome will take—integers (_e.g._, number of cats), reals (_e.g._, home/cat value), _etc._ .super[.pink[†]] .footnote[ .pink[†] Maybe: Binary indicators... ] -- .attn[Classification problems] instead seek to predict the category of an outcome - .b[Binary outcomes]
success/failure; true/false; A or B; cat or .it[not cat]; _etc._ - .b[Multi-class outcomes]
yes, no, .it[or maybe]; colors; letters; type of cat;.super[.pink[††]] _etc._ .footnote[ .tran[† Maybe: Binary indicators...] .pink[††] It turns out, all of machine learning is about cats. ] This type of outcome is often called a .it[qualitative] or .it[categorical] response. --- name: examples ## Examples For the past few weeks, we've been immersed in regression problems. It's probably helpful to mention a few .hi[examples of classification problems]. -- - Using life/criminal history (and demographics?):
Can we predict whether a defendant is .b[granted bail]? -- - Based upon a set of symptoms and observations:
Can we predict a patient's .b[medical condition](s)? -- - From the pixels in an image:
Can we classify images as .b[bagel, puppy, or other]? --- ## Approach One can imagine two.super[.pink[†]] related .hi[approaches to classification] .footnote[ .pink[†] At least. ] 1. Predict .b[which category] the outcome will take. 1. Estimate the .b[probability of each category] for the outcome. -- That said, the general approach will - Take a set of training observations $(x_1,y_1),\, (x_2,y_2),\,\ldots,\,(x_n,y_n)$ - Build a classifier $\hat{y}_o=\mathop{f}(x_o)$ all while balancing bias and variance..super[.pink[††]] .footnote[ .tran[† At least.] .pink[††] Sound familiar? ] --- layout: false class: clear, middle .qa[Q] If everything is so similar, can't we use regression methods? .white[No] --- class: clear, middle .qa[Q] If everything is so similar, can't we use regression methods? .qa[A] .it[Sometimes.] -- .it[Other times:] No. -- Plus you still need new tools. --- layout: true # Classification ## Why not regression? --- name: no-regress Regression methods are not made to deal with .b[multiple categories]. .ex[Ex.] Consider three medical diagnoses: .pink[stroke], .purple[overdose], and .orange[seizure]. Regression needs a numeric outcome—how should we code our categories? -- .left-third[ .center.note[Option 1] $$Y=\begin{cases} \displaystyle 1 & \text{if }\color{#e64173}{\text{ stroke}} \\ \displaystyle 2 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \displaystyle 3 & \text{if }\color{#FFA500}{\text{ seizure}} \\ \end{cases}$$ ] -- .left-third[ .center.note[Option 2] $$Y=\begin{cases} \displaystyle 1 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \displaystyle 2 & \text{if }\color{#e64173}{\text{ stroke}} \\ \displaystyle 3 & \text{if }\color{#FFA500}{\text{ seizure}} \\ \end{cases}$$ ] -- .left-third[ .center.note[Option 3] $$Y=\begin{cases} \displaystyle 1 & \text{if }\color{#FFA500}{\text{ seizure}} \\ \displaystyle 2 & \text{if }\color{#e64173}{\text{ stroke}} \\ \displaystyle 3 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \end{cases}$$ ] -- The categories' ordering is unclear—let alone the actual valuation.
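---

.ex[A sketch:] Refitting OLS under two of the codings with .it[hypothetical] data (everything below is invented purely for illustration):

```{R, ex-coding-sketch}
# Hypothetical data: one predictor and a three-category outcome
set.seed(123)
coding_df = tibble(
  x = rnorm(1e3),
  diagnosis = sample(c("stroke", "overdose", "seizure"), 1e3, replace = T),
  # Option 1: stroke = 1, overdose = 2, seizure = 3
  y1 = match(diagnosis, c("stroke", "overdose", "seizure")),
  # Option 2: overdose = 1, stroke = 2, seizure = 3
  y2 = match(diagnosis, c("overdose", "stroke", "seizure"))
)
# The two codings yield two different fitted lines
lm(y1 ~ x, data = coding_df) %>% coef()
lm(y2 ~ x, data = coding_df) %>% coef()
```

--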
The choice of ordering and valuation can affect predictions. 😿

---

As we've seen, .b[binary outcomes] are simpler.

--

.ex[Ex.] If we are only choosing between .pink[stroke] and .purple[overdose],

.left-wide[
.center.note[Option 1]
$$Y=\begin{cases} \displaystyle 0 & \text{if }\color{#e64173}{\text{ stroke}} \\ \displaystyle 1 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \end{cases}$$
]

.left-thin.center[

.center[and]] .left-wide[ .center.note[Option 2] $$Y=\begin{cases} \displaystyle 0 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \displaystyle 1 & \text{if }\color{#e64173}{\text{ stroke}} \\ \end{cases}$$ ] .clear-up[ will provide the same results. ] --- name: lpm In these .b[binary outcome] cases, we .it[can] apply linear regression. These models are called .attn[linear probability models] (LPMs). The .b[predictions] from an LPM 1. estimate the conditional probability $y_i = 1$, _i.e._, $\mathop{\text{Pr}}\left(y_o = 1 \mid x_o\right)$ 1. are not restricted to being between 0 and 1.super[.pink[†]] 1. provide an ordering—and a reasonable estimate of probability .footnote[ .pink[†] Some people get very worked up about this point. ] -- .note[Other benefits:] Coefficients are easily interpreted + we know how OLS works. --- layout: true class: clear, middle --- Let's consider an example: the `Default` dataset from `ISLR` ```{R, datatable-default, echo = F, cache = T} set.seed(1) ISLR::Default %>% sample_n(100) %>% datatable( rownames = F, options = list(dom = 't') ) %>% formatRound(columns = 3:4, digits = c(2, 0)) ``` --- exclude: true ```{R, clean-default-data, include = F} # Clean data default_df = ISLR::Default %>% dplyr::mutate(i_default = 1 * (default == "Yes")) ``` --- .hi-purple[The data:] The outcome, default, only takes two values (only `r default_df$i_default %>% mean() %>% scales::percent(accuracy = 0.1)` default). ```{R, boxplot-default-balance, echo = F, cache = T} ggplot(data = default_df, aes(x = default, y = balance)) + geom_boxplot(outlier.shape = NA, fill = "grey90") + geom_jitter(width = 0.2, alpha = 0.1, color = purple) + xlab("Default") + scale_y_continuous("Balance", labels = scales::comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + coord_flip() ``` --- .hi-purple[The data:] The outcome, default, only takes two values (only `r default_df$i_default %>% mean() %>% scales::percent(accuracy = 0.1)` default). ```{R, plot-default-points, echo = F, cache = T} # Plot points ggplot(data = default_df, aes(x = balance, y = i_default)) + geom_point(alpha = 0.05, size = 3.5, color = purple) + geom_line(stat = "smooth", color = NA, method = lm, size = 1.5) + scale_y_continuous("Default") + scale_x_continuous("Balance", labels = scales::comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- .hi-pink[The linear probability model] struggles with prediction in this setting. ```{R, plot-default-lpm, echo = F, cache = T} ggplot(data = default_df, aes(x = balance, y = i_default)) + geom_point(alpha = 0.05, size = 3.5, color = purple) + geom_line(stat = "smooth", color = red_pink, method = lm, size = 1.5) + scale_y_continuous("Default") + scale_x_continuous("Balance", labels = scales::comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- .hi-orange[Logistic regression] .it[appears] to offer an improvement. ```{R, plot-default-logistic, echo = F, cache = T} ggplot(data = default_df, aes(x = balance, y = i_default)) + geom_point(alpha = 0.05, size = 3.5, color = purple) + geom_line(stat = "smooth", color = red_pink, method = lm, size = 1.5, alpha = 0.2) + geom_line(stat = "smooth", color = orange, method = "glm", method.args = list(family = "binomial"), size = 1.5) + scale_y_continuous("Default") + scale_x_continuous("Balance", labels = scales::comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- So... what's logistic regression? 
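---

.note[Aside:] The .hi-pink[pink line] above is just OLS with the binary default indicator as its outcome. A minimal sketch of that fit, reusing `default_df` from above:

```{R, ex-lpm-sketch}
# The LPM: regress the 0/1 default indicator on balance
est_lpm = lm(i_default ~ balance, data = default_df)
# Fitted values double as probability estimates...
# ...but nothing restricts them to [0, 1]
range(predict(est_lpm))
```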
---
layout: true
# Logistic regression
---
class: inverse, middle
---
name: logistic-intro
## Intro

.attn[Logistic regression] .b[models the probability] that our outcome $Y$ belongs to a .b[specific category] (often whichever category we think of as `TRUE`).

--

For example, we just saw a graph where
$$
\begin{align}
  \mathop{\text{Pr}}\left(\text{Default} = \text{Yes} \mid \text{Balance}\right) = p(\text{Balance})
\end{align}
$$
_i.e._, we are modeling the probability of `default` as a function of `balance`.

--

We use the .b[estimated probabilities] to .b[make predictions], _e.g._,
- if $p(\text{Balance})\geq 0.5$, we could predict "Yes" for Default
- to be conservative, we could predict "Yes" if $p(\text{Balance})\geq 0.1$

---
name: logistic-logistic
## What's .it[logistic]?

We want to model probability as a function of the predictors $\left(\beta_0 + \beta_1 X\right)$.

.col-centered[
.hi-pink[Linear probability model]
.pink[linear] transform. of predictors $$ \begin{align} p(X) = \beta_0 + \beta_1 X \end{align} $$ ] .col-centered[ .hi-orange[Logistic model]
.orange[logistic] transform. of predictors
$$
\begin{align}
  p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\end{align}
$$
]

.clear-up[
What does this .it[logistic function] $\left(\frac{e^x}{1+e^x}\right)$ do?
]

1. ensures predictions are between 0 $(x\rightarrow-\infty)$ and 1 $(x\rightarrow\infty)$
1. forces an S-shaped curve through the data (not linear)

---
## What's .it[logistic]?

With a little math, you can show

$$
\begin{align}
  p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \implies \color{#e64173}{\log \left( \dfrac{p(X)}{1-p(X)}\right)} = \color{#6A5ACD}{\beta_0 + \beta_1 X}
\end{align}
$$

.note[New definition:] .hi-pink[log odds].super[.pink[†]] on the LHS and .hi-purple[linear predictors] on the RHS.

.footnote[
.pink[†] The "log odds" is sometimes called "logit".
]

--

1. .b[interpretation] of $\beta_j$ is about .pink[log odds]—not probability

--

1. .b[changes in probability] due to $X$ depend on level of $X$.super[.pink[††]]

.footnote[
.tran[† The "log odds" is sometimes called "logit".]
.pink[††] It's nonlinear!
]

---
name: logistic-estimation
## Estimation

Before we can start predicting, we need to estimate the $\beta_j$s.

$$
\begin{align}
  p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \implies \color{#e64173}{\log \left( \dfrac{p(X)}{1-p(X)}\right)} = \color{#6A5ACD}{\beta_0 + \beta_1 X}
\end{align}
$$

We estimate logistic regression using .attn[maximum likelihood estimation].

--

.attn[Maximum likelihood estimation] (MLE) searches for the $\beta_j$s that make our data "most likely" given the model we've written.

---
name: logistic-mle
## Maximum likelihood

.attn[MLE] searches for the $\beta_j$s that make our data "most likely" using our model.

$$
\begin{align}
  \color{#e64173}{\log \left( \dfrac{p(X)}{1-p(X)}\right)} = \color{#6A5ACD}{\beta_0 + \beta_1 X}
\end{align}
$$

--

1. $\color{#6A5ACD}{\beta_j}$ tells us how $x_j$ affects the .pink[log odds]

--

1. odds $= \dfrac{p(X)}{1-p(X)}$.

--

 If $p(X) > 0.5$, then odds $>1$ and .pink[log odds] $> 0$.

--

So we want to choose $\color{#6A5ACD}{\beta_j}$ such that
- .pink[log odds] are above zero for observations where $y_i=1$
- .pink[log odds] are even larger for areas of $x_j$ where most $i$s have $y_i=1$

---
## Formally: The likelihood function

We estimate logistic regression by maximizing .attn[the likelihood function].super[.pink[†]]

.footnote[
.pink[†] Generally, we will actually maximize the .it[log] of the likelihood function.
]

$$
\begin{align}
  \mathop{\ell}(\beta_0,\beta_1) = \prod_{i:y_i=1} \mathop{p}(x_i) \prod_{i:y_i=0} (1-\mathop{p}(x_i))
\end{align}
$$

The likelihood function is maximized by
- making $p(x_i)$ large for individuals with $y_i = 1$
- making $p(x_i)$ small for individuals with $y_i = 0$

.it[Put simply:] Maximum likelihood maximizes predictive performance, conditional on the model we have written down.

---
name: logistic-r
## In R

In R, you can run logistic regression using the `glm()` function.

--

.note[Aside:] Related to `lm`, `glm` stands for .it[generalized] (linear model).

--

"Generalized" essentially means that we're applying some transformation to $\beta_0 + \beta_1 X$, like logistic regression applies the logistic function.

---
## In R

In R, you can run logistic regression using the `glm()` function. 
.b[Key arguments] (very similar to `lm()`) - specify a `formula`,.super[.pink[†]] _e.g._, `y ~ .` or `y ~ x + I(x^2)` - define `family = "binomial"` (so R knows to run logistic regression) - give the function some `data` .footnote[ .pink[†] Notice that we're back in the world of needing to select a model... ] -- ```{R, ex-glm} est_logistic = glm( i_default ~ balance, family = "binomial", #<< data = default_df ) ``` --- layout: false class: clear ```{R, summary-glm, highlight.output = 10:12} est_logistic %>% summary() ``` --- layout: true # Logistic regression --- name: logistic-prediction ## Estimates and predictions ```{R, beta-hats, include = F} # Unrounded b0 = est_logistic$coefficients[1] b1 = est_logistic$coefficients[2] # Rounded br0 = est_logistic$coefficients[1] %>% round(2) br1 = est_logistic$coefficients[2] %>% round(4) ``` Thus, our estimates are $\hat{\beta}_0 \approx `r br0`$ and $\hat{\beta}_1 \approx `r br1`$. .note[Remember:] These coefficients are for the .b[log odds]. -- If we want .hi[to make predictions] for $y_i$ (whether or not $i$ defaults),
then we first must .hi[estimate the probability] $\mathop{p}(\text{Balance})$ $$ \begin{align} \hat{p}(\text{Balance}) = \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 \text{Balance}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \text{Balance}}} \approx \dfrac{e^{`r br0` + `r br1` \cdot \text{Balance}}}{1 + e^{`r br0` + `r br1` \cdot \text{Balance}}} \end{align} $$ -- - If $\text{Balance} = 0$, we then estimate $\mathop{\hat{p}} \approx `r (exp(b0)/(1+exp(b0))) %>% round(6) %>% format(scientific = F)`$ - If $\text{Balance} = 2,000$, we then estimate $\mathop{\hat{p}} \approx `r (exp(b0 + b1 * 2e3)/(1+exp(b0 + b1 * 2e3))) %>% round(3)`$ - If $\text{Balance} = 3,000$, we then estimate $\mathop{\hat{p}} \approx `r (exp(b0 + b1 * 3e3)/(1+exp(b0 + b1 * 3e3))) %>% round(3)`$ .super[.pink[†]] .footnote[ .pink[†] You get a sense of the nonlinearity of the predictors' effects. ] --- layout: false class: clear, middle .hi-orange[Logistic regression]'s predictions of $\mathop{p}(\text{Balance})$ ```{R, plot-default-logistic-2, echo = F, cache = T} ggplot(data = default_df, aes(x = balance, y = i_default)) + geom_point(alpha = 0.05, size = 3.5, color = purple) + geom_line(stat = "smooth", color = red_pink, method = lm, size = 1.5, alpha = 0.2) + geom_line(stat = "smooth", color = orange, method = "glm", method.args = list(family = "binomial"), size = 1.5) + scale_y_continuous("Default") + scale_x_continuous("Balance", labels = scales::comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- class: clear, middle .note[Note:] Everything we've done so far extends to models with many predictors. --- layout: true # Logistic regression ## Prediction -- .note[Old news:] You can use `predict()` to get predictions out of `glm` objects. .b[New and important:] `predict()` produces multiple `type`.small[s] of predictions 1. `type = "response"` predicts .it[on the scale of the response variable]
for logistic regression, this means .b[predicted probabilities] (0 to 1) 1. `type = "link"` predicts .it[on the scale of the linear predictors]
for logistic regression, this means .b[predicted log odds] (-∞ to ∞) .attn[Beware:] The default is `type = "link"`, which you may not want. --- Putting it all together, we can get (estimated) probabilities $\hat{p}(X)$ ```{R, ex-p-hat} # Predictions on scale of response (outcome) variable p_hat = predict(est_logistic, type = "response") ``` which we can use to make predictions on $y$ ```{R, ex-y-hat} # Predict '1' if p_hat is greater or equal to 0.5 y_hat = as.numeric(p_hat >= 0.5) ``` --- layout: false class: clear, middle So how did we do? --- layout: true # Assessment --- class: inverse, middle --- name: how ## How did we do? We guessed `r mean(y_hat == default_df$i_default) %>% scales::percent(accuracy = 0.01)` of the observations correctly. -- .qa[Q] `r mean(y_hat == default_df$i_default) %>% scales::percent(accuracy = 0.01)` is pretty good, right? -- .qa[A] It depends... -- Remember that `r mean(default_df$i_default) %>% scales::percent(accuracy = 0.01)` of the observations actually defaulted. --
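A quick check of those shares, reusing `y_hat` and `default_df` from above:

```{R, ex-accuracy-check}
# Share of observations our classifier got right
mean(y_hat == default_df$i_default)
# Share we would get right by predicting "No" (0) for everyone
mean(default_df$i_default == 0)
```

--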
So we would get `r (1 - mean(default_df$i_default)) %>% scales::percent(accuracy = 0.01)` right by guessing "No" for everyone..super[.pink[†]]

.footnote[
.pink[†] This idea is called the .it[null classifier].
]

--

```{R, calc-sensitivity, include = F}
tmp_df = tibble(
  y = default_df$i_default,
  y_hat = y_hat,
  y_hat_base = 0
)
ex_sensitivity = tmp_df %>% filter(y == 1) %>% transmute(y == y_hat) %>% unlist() %>% mean()
ex_sensitivity_base = tmp_df %>% filter(y == 1) %>% transmute(y == y_hat_base) %>% unlist() %>% mean()
```

We .it[did] guess `r ex_sensitivity %>% scales::percent(accuracy = 0.01)` of the defaults

--

, which is clearly better than 0%.

--

.qa[Q] How can we more formally assess our model's performance?

--

.qa[A] All roads lead to the .attn[confusion matrix].

---
name: confusion
## The confusion matrix

The .attn[confusion matrix] gives us a convenient way to display
.hi-orange[correct] and .hi-purple[incorrect] predictions for each class of our outcome. ```{R, default-confusion-matrix, include = F, cache = T} # Create data frame conf_df = data.frame( No = c("True Negative (TN)", "False Positive (FP)"), Yes = c("False Negative (FN)", "True Positive (TP)") ) rownames(conf_df) = c("No", "Yes") # Create the matrix conf_mat = conf_df %>% hux() %>% add_rownames("") %>% add_colnames() %>% insert_row(c("", "Truth", "Truth"), after = 0) %>% merge_cells(1, 2:3) %>% insert_column(c("", "", "Prediction", "Prediction"), after = 0) %>% merge_cells(3:4, 1) %>% set_bold(1:2, everywhere, T) %>% set_bold(everywhere, 1:2, T) %>% set_align(1:2, 1:4, "center") %>% set_valign(3:4, 1, "middle") %>% set_top_border(3, 3:4, 1) %>% set_left_border(3:4, 3, 1) ``` ```{R, cm-right-wrong, echo = F} conf_mat %>% set_text_color(3,3,orange) %>% set_text_color(3,4,purple) %>% set_text_color(4,3,purple) %>% set_text_color(4,4,orange) ``` -- The .attn[accuracy] of a method is the share of .orange[correct] predictions, _i.e._, .center[ .b[Accuracy] = (.hi-orange[TN] + .hi-orange[TP]) / (.hi-orange[TN] + .hi-orange[TP] + .hi-purple[FN] + .hi-purple[FP]) ] -- This matrix also helps display many other measures of assessment. --- ## The confusion matrix .attn[Sensitivity:] the share of positive outcomes $Y=1$ that we correctly predict. .center[ .b[Sensitivity] = .hi-orange[TP] / (.hi-orange[TP] + .hi-purple[FN]) ] ```{R, cm-sensitivity, echo = F} conf_mat %>% set_text_color(2:4,4,purple) %>% set_text_color(4,4,orange) ``` Sensitivity is also called .attn[recall] and the .attn[true-positive rate]. One minus sensitivity is the .attn[type-II error rate]. --- ## The confusion matrix .attn[Specificity:] the share of neg. outcomes $(Y=0)$ that we correctly predict. .center[ .b[Specificity] = .hi-orange[TN] / (.hi-orange[TN] + .hi-purple[FP]) ] ```{R, cm-specificity, echo = F} conf_mat %>% set_text_color(2:4,3,purple) %>% set_text_color(3,3,orange) ``` One minus specificity is the .attn[false-positive rate] or .attn[type-I error rate]. --- ## The confusion matrix .attn[Precision:] the share of predicted positives $(\hat{Y}=1)$ that are correct. .center[ .b[Precision] = .hi-orange[TP] / (.hi-orange[TP] + .hi-purple[FP]) ] ```{R, cm-precision, echo = F} conf_mat %>% set_text_color(4,2:4,purple) %>% set_text_color(4,4,orange) ``` --- ## Which assessment? .qa[Q] So .it[which] criterion should we use? -- .qa[A] You should use the .it[right] criterion for your context. - Are true positives more valuable than true negatives? --
.note[Sensitivity] will be key. -- - Do you want to have high confidence in predicted positives? --
.note[Precision] is your friend.

--

- Are all errors equal? --
.note[Accuracy] is perfect. -- There's a lot more, _e.g._, the .attn[F.sub[1] score] combines precision and sensitivity. --- name: cm-r ## Confusion in R `confusionMatrix()` from `caret` calculates the confusion matrix—and many other statistics. - `data`: a `factor` vector of predictions (use `as.factor()` if needed) - `reference`: a `factor` vector of true outcomes -- ```{R, est-cm} cm_logistic = confusionMatrix( # Our predictions data = y_hat %>% as.factor(), # Truth reference = default_df$i_default %>% as.factor() ) ``` --- layout: false class: clear ```{R, est-cm-out, echo = F} cm_logistic %>% capture.output() %>% extract(1:24) %>% paste(., collapse = " \n ") %>% cat() ``` --- layout: true # Assessment --- ## Thresholds Your setting also dictates the "optimal" threshold that moves a prediction from one class (_e.g._, Default = No) to another class (Default = Yes). The Bayes classifier suggests a probability threshold of 0.5. The Bayes classifier can't be beat in terms of .note[accuracy], but if you have goals other than accuracy, you should consider other thresholds. --- name: thresholds layout: false class: clear As we vary the threshold, our error rates (types .hi-purple[I], .hi-orange[II], and .hi-slate[overall]) change. ```{R, calc-threshold, include = F, cache = T} threshold_df = mclapply( X = seq(0, 1, by = 0.00001), FUN = function(cutoff) { # The predictions y_df = tibble( y = default_df$i_default, y_hat = (1 * (p_hat > cutoff)) ) # Results data.frame( cutoff = cutoff, full = mean(y_df$y != y_df$y_hat), type1 = y_df %>% filter(y == 0) %>% transmute(y != y_hat) %>% unlist() %>% mean(), type2 = y_df %>% filter(y == 1) %>% transmute(y != y_hat) %>% unlist() %>% mean() ) }, mc.cores = 12 ) %>% bind_rows() ``` ```{R, plot-threshold, echo = F} ggplot(data = threshold_df, aes(x = cutoff)) + geom_hline(yintercept = 0) + geom_line(aes(y = type1, color = "1"), size = 1) + geom_line(aes(y = type2, color = "2"), size = 1) + geom_line(aes(y = full, color = "3"), size = 0.3, linetype = "longdash") + scale_y_continuous("Error rate", labels = scales::percent) + scale_x_continuous("Threshold for positive prediction") + scale_color_manual( "Error rate:", labels = c("Type I (FP/N)", "Type II (FN/P)", "All"), values = c(purple, orange, slate) ) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- name: roc class: clear The .attn[ROC curve] plots the true- (TP/P) and the false-positive rates (FP/N). ```{R, calc-roc, include = F, cache = T} roc_df = threshold_df %>% transmute( fpr = type1, tpr = 1 - type2 ) %>% as_tibble() ``` ```{R, plot-roc, echo = F, fig.height = 6, cache = T} ggplot(data = roc_df, aes(x = fpr, y = tpr)) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` -- "Best performance" means the .pink[ROC curve] hugs the top-left corner. --- class: clear The .hi-orange[AUC] gives the .orange[area under the (ROC) curve]. 
```{R, plot-auc, echo = F, fig.height = 6, cache = T} ggplot(data = roc_df, aes(x = fpr, y = tpr)) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` -- "Best performance" means the .orange[AUC] is near 1. Random chance: 0.5 --- class: clear, middle .qa[Q] So what information is AUC telling us? .tran[.b[A] Nothing] --- class: clear, middle .qa[Q] So what information is AUC telling us? .qa[A] AUC tells us how much we've .b[separated] the .it[positive] and .it[negative] labels. --- layout: true class: clear --- exclude: true ```{R, roc-setup, include = F} # Distributions of positive and negative lables d_pos = function(x, sep) dbeta(x, 15, 15-sep) d_neg = function(x, sep) dbeta(x, 15-sep, 15) # The implied TPR and FPR given a threshold tpr = function(threshold, sep) pbeta(threshold, 15, 15-sep, lower.tail = F) fpr = function(threshold, sep) pbeta(threshold, 15-sep, 15, lower.tail = F) ``` --- .ex[Example:] Distributions of probabilities for .hi-orange[negative] and .hi-purple[positive] outcomes. ```{R, roc-ex1-d, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = 4), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 4), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- For any given .hi-pink[threshold] ```{R, roc-ex1-threshold, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = 4), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 4), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0.4375, color = red_pink, size = 1.2) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- For any given .hi-pink[threshold], we get .hi-yellow[false positives] ```{R, roc-ex1-threshold2, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = 4), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 4), color = orange, fill = orange, alpha = 0.2) + geom_area( stat = "function", fun = . %>% d_neg(., sep = 4), color = yellow, fill = yellow, alpha = 0.8, xlim = c(0.4375, 1) ) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0.4375, color = red_pink, size = 1.2) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- For any given .hi-pink[threshold], we get false positives and .hi-yellow[true positives]. 
```{R, roc-ex1-threshold3, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area( stat = "function", fun = . %>% d_pos(., sep = 4), color = purple, fill = purple, xlim = c(0, 0.4375) ) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 4), color = orange, fill = orange, alpha = 0.2) + geom_area( stat = "function", fun = . %>% d_pos(., sep = 4), color = yellow, fill = yellow, alpha = 0.8, xlim = c(0.4375, 1) ) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0.4375, color = red_pink, size = 1.2) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- ```{R, roc-ex1-data, include = F} # Generate data roc1 = tibble( tpr = seq(0, 1, 0.001) %>% tpr(., sep = 4), fpr = seq(0, 1, 0.001) %>% fpr(., sep = 4) ) ``` ... moving through all possible thresholds generates the .hi-pink[ROC] (.hi-orange[AUC] ≈ `r roc1 %>% arrange(fpr) %>% mutate(dist = fpr - lag(fpr)) %>% transmute(new = tpr * dist) %$% new %>% sum(na.rm = T) %>% round(3)`). ```{R, roc-ex1-roc, echo = F} # Plot ggplot( data = roc1, aes(x = fpr, y = tpr) ) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- Increasing separation between .hi-orange[negative] and .hi-purple[positive] outcomes... ```{R, roc-ex2-d, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = 8), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 8), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- ```{R, roc-ex2-data, include = F} # Generate data roc2 = tibble( tpr = seq(0, 1, 0.001) %>% tpr(., sep = 8), fpr = seq(0, 1, 0.001) %>% fpr(., sep = 8) ) ``` ... reduces error (shifts .hi-pink[ROC]) and increases .hi-orange[AUC] (≈ `r roc2 %>% arrange(fpr) %>% mutate(dist = fpr - lag(fpr)) %>% transmute(new = tpr * dist) %$% new %>% sum(na.rm = T) %>% round(3)`). ```{R, roc-ex2-roc, echo = F} # Plot ggplot( data = roc2, aes(x = fpr, y = tpr) ) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- Further increasing separation between .hi-orange[negative] and .hi-purple[positive] outcomes... ```{R, roc-ex3-d, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . 
%>% d_pos(., sep = 10), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 10), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- ```{R, roc-ex3-data, include = F} # Generate data roc3 = tibble( tpr = seq(0, 1, 0.001) %>% tpr(., sep = 10), fpr = seq(0, 1, 0.001) %>% fpr(., sep = 10) ) ``` ... reduces error (shifts .hi-pink[ROC]) and increases .hi-orange[AUC] (≈ `r roc3 %>% arrange(fpr) %>% mutate(dist = fpr - lag(fpr)) %>% transmute(new = tpr * dist) %$% new %>% sum(na.rm = T) %>% round(3)`). ```{R, roc-ex3-roc, echo = F} # Plot ggplot( data = roc3, aes(x = fpr, y = tpr) ) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- Tiny separation ("guessing") between .hi-orange[negative] and .hi-purple[positive] outcomes... ```{R, roc-ex4-d, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = 0.2), color = purple, fill = purple) + geom_area(stat = "function", fun = . %>% d_neg(., sep = 0.2), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- ```{R, roc-ex4-data, include = F} # Generate data roc4 = tibble( tpr = seq(0, 1, 0.001) %>% tpr(., sep = 0.2), fpr = seq(0, 1, 0.001) %>% fpr(., sep = 0.2) ) ``` ... increases error (shifts .hi-pink[ROC]) and pushes .hi-orange[AUC] toward 0.5 (here ≈ `r roc4 %>% arrange(fpr) %>% mutate(dist = fpr - lag(fpr)) %>% transmute(new = tpr * dist) %$% new %>% sum(na.rm = T) %>% round(3)`). ```{R, roc-ex4-roc, echo = F} # Plot ggplot( data = roc4, aes(x = fpr, y = tpr) ) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- Getting .hi-orange[negative] and .hi-purple[positive] outcomes backwards... ```{R, roc-ex5-d, echo = F} ggplot( data = tibble(x = seq(0, 1, 0.001)), aes(x = x) ) + geom_area(stat = "function", fun = . %>% d_pos(., sep = -10), color = purple, fill = purple) + geom_area(stat = "function", fun = . 
%>% d_neg(., sep = -10), color = orange, fill = orange, alpha = 0.2) + geom_hline(yintercept = 0) + scale_x_continuous("Threshold for predicting 'Positive'", labels = scales::percent) + ylab("Density") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- ```{R, roc-ex5-data, include = F} # Generate data roc5 = tibble( tpr = seq(0, 1, 0.001) %>% tpr(., sep = -10), fpr = seq(0, 1, 0.001) %>% fpr(., sep = -10) ) ``` ... increases error (shifts .hi-pink[ROC]) and pushes .hi-orange[AUC] toward 0 (here ≈ `r roc5 %>% arrange(fpr) %>% mutate(dist = fpr - lag(fpr)) %>% transmute(new = tpr * dist) %$% new %>% sum(na.rm = T) %>% round(3)`). ```{R, roc-ex5-roc, echo = F} # Plot ggplot( data = roc5, aes(x = fpr, y = tpr) ) + geom_ribbon( aes(ymin = 0, ymax = tpr), fill = orange, alpha = 0.75 ) + geom_path(color = red_pink, size = 1) + geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + geom_abline(intercept = 0, slope = 1, color = slate, linetype = "dotted", size = 0.5) + scale_x_continuous("False positive rate (FP/N)", labels = scales::percent) + scale_y_continuous("True positive rate (TP/P)", labels = scales::percent) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") + theme(legend.position = "bottom") ``` --- name: extras layout: false # R extras .b[AUC] You can calculate AUC in R using the `prSummary()` function from `caret`.
See [here](https://topepo.github.io/caret/measuring-performance.html#measures-for-predicted-classes) for an example.

.b[Logistic elasticnet] `glmnet()` (for ridge, lasso, and elasticnet) extends to logistic regression.super[.pink[†]] by specifying the `family` argument of `glmnet`, _e.g._,

```{R, eval = F}
# Example of logistic regression with lasso
logistic_lasso = glmnet(
  y = y, x = x,
  family = "binomial",
  alpha = 1, lambda = best_lambda
)
```

.footnote[
.pink[†] Or many other generalized linear models.
]

---
name: sources
layout: false
# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani - *[Receiver Operating Characteristic Curves Demystified (in Python)](https://towardsdatascience.com/receiver-operating-characteristic-curves-demystified-in-python-bd531a4364d0)* --- # Table of contents .col-left[ .smallest[ #### Admin - [Today](#admin-today) - [Upcoming](#admin-soon) #### Classification - [Introduction](#intro) - [Introductory examples](#examples) - [Why not linear regression](#no-regress) - [Linear probability models](#lpm) #### Logistic regression - [Intro](#logistic-intro) - [The logistic function](#logistic-logistic) - [Estimation](#logistic-estimation) - [Maximum likelihood](#logistic-mle) - [In R](#logistic-r) - [Prediction](#logistic-prediction) ] ] .col-right[ .smallest[ #### Assessment - [How did we do?](#how) - [The confusion matrix](#confusion) - [In R](#cm-r) - [Thresholds](#thresholds) - [ROC curves and AUC](#roc) #### Other - [Extras](#extras) - [Sources/references](#sources) ] ]