class: center, middle, inverse, title-slide .title[ # Topic 14: Classification ] .subtitle[ ## Part 1: Methods ] .author[ ### Nick Hagerty
ECNS 460/560 Fall 2023
Montana State University ] .date[ ###
.smallest[*Adapted from
“Prediction and machine-learning in econometrics”
by Ed Rubin, used with permission. These slides are excluded from this resource’s overall CC license.] ] --- exclude: true --- # Table of contents 1. [Classification](#intro) 1. [Logistic regression](#logistic) 1. [*k*-nearest neighbors](#knn) 1. [Model assessment](#assessment) 1. [Decision trees](#trees) 1. [Wrap-up](#end) --- layout: true # Classification --- class: inverse, middle --- name: intro .attn[Regression problems] seek to predict the number an outcome will take. .attn[Classification problems] instead seek to predict the category of an outcome. **Examples:** - Using life/criminal history (and demographics?):<br>Can we predict whether a defendant is .b[granted bail]? -- - Based upon a set of symptoms and observations:<br>Can we predict a patient's .b[medical condition](s)? -- - From the pixels in an image:<br>Can we classify images as .b[chihuahua or blueberry muffin]? --- layout: false class: clear <img src="images/chihuahua-muffin.jpg" width="72%" style="display: block; margin: auto;" /> --- layout: true # Classification ## Why not regression? --- name: no-regress Regression methods are not made to deal with .b[multiple categories]. .ex[Ex.] Consider three medical diagnoses: .pink[stroke], .purple[overdose], and .orange[seizure]. Regression needs a numeric outcome—how should we code our categories? `$$Y=\begin{cases} \displaystyle 1 & \text{if }\color{#e64173}{\text{ stroke}} \\ \displaystyle 2 & \text{if }\color{#6A5ACD}{\text{ overdose}} \\ \displaystyle 3 & \text{if }\color{#FFA500}{\text{ seizure}} \\ \end{cases}$$` The categories' ordering is unclear—let alone the actual valuation. --- name: lpm In .b[binary outcome] cases, we .it[can] apply linear regression. These models are called .attn[linear probability models] (LPMs). The .b[predictions] from an LPM 1. estimate the conditional probability `\(y_i = 1\)`, _i.e._, `\(\mathop{\text{Pr}}\left(y_o = 1 \mid x_o\right)\)` 1. are not restricted to being between 0 and 1 1. 
provide an ordering—and a reasonable estimate of probability --- layout: true class: clear, middle --- Let's consider an example: the `Default` dataset from `ISLR`
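---

Before turning to the data, the second property of the LPM (predictions are not restricted to lie between 0 and 1) can be checked with a minimal sketch. Everything here is an assumption for illustration: the data are simulated rather than taken from `Default`, and the names `fit_ols`, `balance`, and `default` are hypothetical.

```python
import math
import random


def fit_ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


random.seed(460)

# Simulated (hypothetical) balances in $1,000s and a binary outcome that is
# rare at low balances and common at high balances.
balance = [random.uniform(0, 2.5) for _ in range(2000)]
default = [
    1 if random.random() < 1 / (1 + math.exp(-(6 * b - 12))) else 0
    for b in balance
]

# Linear probability model: ordinary least squares with a 0/1 outcome.
b0, b1 = fit_ols(balance, default)
fitted = [b0 + b1 * b for b in balance]

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"smallest fitted 'probability' = {min(fitted):.3f}")
print(f"largest fitted 'probability'  = {max(fitted):.3f}")
```

In this simulation the fitted line dips below zero at low balances, so some predicted "probabilities" are negative even though the ordering of the predictions is sensible.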
--- exclude: true --- .hi-purple[The data:] The outcome, default, only takes two values (only 3.3% default). <img src="14a-Classification_files/figure-html/boxplot-default-balance-1.svg" style="display: block; margin: auto;" /> --- .hi-purple[The data:] The outcome, default, only takes two values (only 3.3% default). <img src="14a-Classification_files/figure-html/plot-default-points-1.svg" style="display: block; margin: auto;" /> --- .hi-pink[The linear probability model] struggles with prediction in this setting. <img src="14a-Classification_files/figure-html/plot-default-lpm-1.svg" style="display: block; margin: auto;" /> --- layout: true # Logistic regression --- class: inverse, middle --- name: logistic ## Intro .attn[Logistic regression] .b[models the probability] that our outcome `\(Y\)` belongs to a .b[specific category] (often whichever category we think of as `TRUE`). -- For example, we just saw a graph where $$ `\begin{align} \mathop{\text{Pr}}\left(\text{Default} = \text{Yes} | \text{Balance}\right) = p(\text{Balance}) \end{align}` $$ we are modeling the probability of `default` as a function of `balance`. -- We use the .b[estimated probabilities] to .b[make predictions], _e.g._, - if `\(p(\text{Balance})\geq 0.5\)`, we could predict "Yes" for Default --- name: logistic-logistic ## What's .it[logistic]? We want to model probability as a function of the predictors `\(\left(\beta_0 + \beta_1 X\right)\)`. .col-centered[ .hi-pink[Linear probability model] <br> .pink[linear] transform. of predictors $$ `\begin{align} p(X) = \beta_0 + \beta_1 X \end{align}` $$ ] .col-centered[ .hi-orange[Logistic model] <br> .orange[logistic] transform. of predictors $$ `\begin{align} p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \end{align}` $$ ] .clear-up[ What does this .it[logistic function] `\(\left(\frac{e^x}{1+e^x}\right)\)` do? ] 1. ensures predictions are between 0 `\((x\rightarrow-\infty)\)` and 1 `\((x\rightarrow\infty)\)` 1. 
forces an S-shaped curve through the data --- layout: false class: clear, middle .hi-orange[Logistic regression]'s predictions of `\(\mathop{p}(\text{Balance})\)` <img src="14a-Classification_files/figure-html/plot-default-logistic-2-1.svg" style="display: block; margin: auto;" /> --- layout: true # *k*-nearest neighbors --- class: inverse, middle --- name: knn ## Setup *k*-nearest neighbors (KNN) simply assigns a category based upon the nearest `\(K\)` neighbors' votes (their observed classes). -- .note[More formally:] Using the `\(K\)` closest neighbors.super[.pink[†]] to test observation `\(\color{#6A5ACD}{\mathbf{x_0}}\)`, we calculate the share of the observations whose class equals `\(j\)`, $$ `\begin{align} \hat{\mathop{\text{Pr}}}\left(\mathbf{y} = j | \mathbf{X} = \color{#6A5ACD}{\mathbf{x_0}}\right) = \dfrac{1}{K} \sum_{i \in \mathcal{N}_0} \mathop{\mathbb{I}}\left( \color{#FFA500}{\mathbf{y}}_i = j \right) \end{align}` $$ where `\(\mathcal{N}_0\)` is the set of the `\(K\)` training observations nearest to `\(\color{#6A5ACD}{\mathbf{x_0}}\)`. These shares are our estimates for the unknown conditional probabilities. We then assign observation `\(\color{#6A5ACD}{\mathbf{x_0}}\)` to the class with the highest estimated probability. .footnote[ .pink[†] Closest in `\(\color{#6A5ACD}{\mathbf{X}}\)` space, typically measured by Euclidean distance. ] --- name: knn-fig layout: false class: clear, middle .b[KNN in action] <br>.note[Left:] K=3 estimation for "x". .note[Right:] KNN decision boundaries. <img src="images/isl-knn.png" width="2867" style="display: block; margin: auto;" /> .smaller.it[Source: ISL] --- class: clear, middle The choice of K is very important: a small K gives a very flexible fit, and a large K gives an inflexible one.
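---

The KNN rule described in the setup can be sketched in a few lines. This is a minimal illustration, not the course's R workflow: the function `knn_predict`, the toy training points, and the class labels are all hypothetical, and distance is assumed to be Euclidean.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x0, k):
    """Return (predicted class, class shares) among the k nearest points to x0."""
    # Euclidean distance from x0 to every training observation
    dists = [(math.dist(xi, x0), yi) for xi, yi in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    neighbors = [yi for _, yi in dists[:k]]
    # Share of the k neighbors in each class: the estimated probabilities
    shares = {label: count / k for label, count in Counter(neighbors).items()}
    return max(shares, key=shares.get), shares


# Hypothetical training data: two predictors and two classes.
# The point (2.2, 2.4) is a lone "blue" sitting inside the "orange" cluster.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.2, 2.4),
           (4.0, 4.0), (4.5, 5.0), (5.0, 4.0)]
train_y = ["orange", "orange", "orange", "blue", "blue", "blue", "blue"]

x0 = (2.0, 2.5)
label_k1, shares_k1 = knn_predict(train_x, train_y, x0, k=1)
label_k3, shares_k3 = knn_predict(train_x, train_y, x0, k=3)

print("K = 1:", label_k1, shares_k1)
print("K = 3:", label_k3, shares_k3)
```

With K = 1 the prediction follows the single closest point, while K = 3 lets the two nearby "orange" points outvote it, which is one way to see why small K is the more flexible choice.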
--- class: clear, middle .b[KNN error rates], as K increases <img src="images/isl-knn-error.png" width="85%" style="display: block; margin: auto;" /> .smaller.it[Source: ISL] --- layout: true # Model assessment --- class: inverse, middle --- name: assessment ## Can't use MSE anymore With categorical variables, MSE doesn't work, _e.g._, .center[ `\(\color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} =\)` .orange[(Chihuahua)] - .orange[(Blueberry muffin)] `\(=\)` ...? ] Clearly we need a different way to define model performance. --- The most common approach is exactly what you'd guess... .hi-slate[Training error rate] The share of training predictions that we get wrong. $$ `\begin{align} \dfrac{1}{n} \sum_{i=1}^{n} \mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right) \end{align}` $$ where `\(\mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right)\)` is an indicator function that equals 1 whenever our prediction is wrong. -- .hi-pink[Test error rate] The share of test predictions that we get wrong. .center[ Average `\(\mathbb{I}\!\left( \color{#FFA500}{y}_0 \neq \hat{\color{#FFA500}{y}}_0 \right)\)` in our .hi-pink[test data] ] --- name: how ## Example: the Default data Logistic regression (with only one predictor: balance) guesses 97.25% of the observations correctly. -- - Is this good? -- - Remember that 3.33% of the observations actually defaulted. So we would get 96.67% right by guessing "No" for everyone..super[.pink[†]] .footnote[ .pink[†] This idea is called the .it[null classifier]. ] -- - We .it[did] correctly guess 30.03% of the defaults, which is clearly better than 0%. -- .qa[Q] How can we more formally assess our model's performance? .qa[A] All roads lead to the .attn[confusion matrix]. --- name: confusion ## The confusion matrix The .attn[confusion matrix] gives us a convenient way to display <br>.hi-orange[correct] and .hi-purple[incorrect] predictions for each class of our outcome.

| | Truth: No | Truth: Yes |
|---|---|---|
| Prediction: No | True Negative (TN) | False Negative (FN) |
| Prediction: Yes | False Positive (FP) | True Positive (TP) |

-- The .attn[accuracy] of a method is the share of .orange[correct] predictions, _i.e._, .center[ .b[Accuracy] = (.hi-orange[TN] + .hi-orange[TP]) / (.hi-orange[TN] + .hi-orange[TP] + .hi-purple[FN] + .hi-purple[FP]) ] -- This matrix also helps display many other measures of assessment. --- ## The confusion matrix .attn[Sensitivity:] the share of positive outcomes `\((Y=1)\)` that we correctly predict. .center[ .b[Sensitivity] = .hi-orange[TP] / (.hi-orange[TP] + .hi-purple[FN]) ]

| | Truth: No | Truth: Yes |
|---|---|---|
| Prediction: No | True Negative (TN) | False Negative (FN) |
| Prediction: Yes | False Positive (FP) | True Positive (TP) |

Sensitivity is also called .attn[recall] and the .attn[true-positive rate]. One minus sensitivity is the .attn[type-II error rate]. --- ## The confusion matrix .attn[Specificity:] the share of negative outcomes `\((Y=0)\)` that we correctly predict. .center[ .b[Specificity] = .hi-orange[TN] / (.hi-orange[TN] + .hi-purple[FP]) ]

| | Truth: No | Truth: Yes |
|---|---|---|
| Prediction: No | True Negative (TN) | False Negative (FN) |
| Prediction: Yes | False Positive (FP) | True Positive (TP) |

One minus specificity is the .attn[false-positive rate] or .attn[type-I error rate]. --- ## The confusion matrix .attn[Precision:] the share of predicted positives `\((\hat{Y}=1)\)` that are correct. .center[ .b[Precision] = .hi-orange[TP] / (.hi-orange[TP] + .hi-purple[FP]) ]

| | Truth: No | Truth: Yes |
|---|---|---|
| Prediction: No | True Negative (TN) | False Negative (FN) |
| Prediction: Yes | False Positive (FP) | True Positive (TP) |

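---

The four measures above are simple ratios of confusion-matrix cells, so they can be checked by hand. The sketch below is illustrative: the function name `classification_metrics` is hypothetical, and the counts are backed out from the percentages quoted in the Default example under the assumption that the data contain 10,000 observations (the slides report only shares). The F.sub[1] score is computed as the harmonic mean of precision and sensitivity.

```python
def classification_metrics(tn, fn, fp, tp):
    """Compute accuracy, sensitivity, specificity, precision, and F1 from counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        # Harmonic mean of precision and sensitivity
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Assumed counts: consistent with 97.25% accuracy, 3.33% defaults, and 30.03%
# of defaults caught, if there are 10,000 observations in total.
metrics = classification_metrics(tn=9625, fn=233, fp=42, tp=100)

# Null classifier: predict "No" for everyone, so every true "No" is correct.
null_accuracy = (9625 + 42) / 10_000

for name, value in metrics.items():
    print(f"{name:13s} {value:.4f}")
print(f"{'null accuracy':13s} {null_accuracy:.4f}")
```

Under these assumed counts, accuracy barely beats the null classifier, while sensitivity and precision show how differently the model treats the two classes.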
--- ## Which criterion? .qa[Q] So .it[which] criterion should we use? -- .qa[A] You should use the .it[right] criterion for your context. - Are true positives more valuable than true negatives? <br>.note[Sensitivity] will be key. - Do you want to have high confidence in predicted positives? <br>.note[Precision] is your friend. - Are all errors equal? <br> .note[Accuracy] is perfect. [There's a lot more](https://yardstick.tidymodels.org/reference/index.html), _e.g._, the .attn[F.sub[1] score] is the harmonic mean of precision and sensitivity. --- layout: true # Decision trees --- class: inverse, middle --- name: trees ## Fundamentals .attn[Decision trees] - split the .it[predictor space] (our `\(\mathbf{X}\)`) into regions - then predict the most-common value within a region -- .attn[Decision trees] 1. work for .hi[both classification and regression] 1. are inherently .hi[nonlinear] 1. are relatively .hi[simple] and .hi[interpretable] 1. often .hi[underperform] relative to competing methods, but 1. easily extend to .hi[very competitive ensemble methods] (*many* trees).super[🌲] .footnote[ 🌲 Though the ensembles will be much less interpretable. ] --- layout: true class: clear --- exclude: true --- .ex[Example:] .b[A simple decision tree] classifying credit-card default
--- name: ex-partition Let's see how the tree works -- , starting with our data (default: .orange[Yes] .it[vs.] .purple[No]). <img src="14a-Classification_files/figure-html/plot-raw-1.svg" style="display: block; margin: auto;" /> --- The .hi-pink[first partition] splits balance at $1,800. <img src="14a-Classification_files/figure-html/plot-split1-1.svg" style="display: block; margin: auto;" /> --- The .hi-pink[second partition] splits balance at $1,972 (.it[conditional on balance > $1,800]). <img src="14a-Classification_files/figure-html/plot-split2-1.svg" style="display: block; margin: auto;" /> --- The .hi-pink[third partition] splits income at $27K .b[for] balances between $1,800 and $1,972. <img src="14a-Classification_files/figure-html/plot-split3-1.svg" style="display: block; margin: auto;" /> --- These three partitions give us four .b[regions]... <img src="14a-Classification_files/figure-html/plot-split3b-1.svg" style="display: block; margin: auto;" /> --- .b[Predictions] cover each region (_e.g._, using the region's most common class). <img src="14a-Classification_files/figure-html/plot-split3c-1.svg" style="display: block; margin: auto;" /> --- .b[Predictions] cover each region (_e.g._, using the region's most common class). <img src="14a-Classification_files/figure-html/plot-split3d-1.svg" style="display: block; margin: auto;" /> --- .b[Predictions] cover each region (_e.g._, using the region's most common class). <img src="14a-Classification_files/figure-html/plot-split3e-1.svg" style="display: block; margin: auto;" /> --- .b[Predictions] cover each region (_e.g._, using the region's most common class). <img src="14a-Classification_files/figure-html/plot-split3f-1.svg" style="display: block; margin: auto;" /> --- layout: true class: clear, middle --- name: linearity .qa[Q] How do trees compare to linear models? .tran[.b[A] It depends how linear truth is.] --- .qa[Q] How do trees compare to linear models? .qa[A] It depends how linear the true boundary is.
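---

The three partitions from the credit-card example amount to a set of nested rules, and each region's prediction is its most common class. The sketch below makes that concrete under several assumptions: the split thresholds come from the slides, but the handling of observations exactly at a threshold, the toy observations, and the names `region` and `predict` are hypothetical. The leaf predictions are computed from the toy data, not taken from the fitted tree.

```python
from collections import Counter, defaultdict


def region(balance, income):
    """Map an observation to one of the four regions defined by the three splits."""
    if balance < 1800:            # first partition
        return "balance < 1800"
    if balance >= 1972:           # second partition (given balance >= 1800)
        return "balance >= 1972"
    if income < 27_000:           # third partition (1800 <= balance < 1972)
        return "mid balance, income < 27K"
    return "mid balance, income >= 27K"


# Hypothetical observations: (balance, income, default)
toy = [
    (500, 40_000, "No"), (1200, 20_000, "No"), (1700, 35_000, "Yes"),
    (1850, 20_000, "Yes"), (1900, 25_000, "Yes"), (1950, 15_000, "No"),
    (1820, 45_000, "No"), (1960, 50_000, "No"), (1900, 30_000, "Yes"),
    (2000, 30_000, "Yes"), (2300, 20_000, "Yes"), (2100, 60_000, "No"),
]

# Group the outcomes by region, then predict each region's most common class.
by_region = defaultdict(list)
for bal, inc, outcome in toy:
    by_region[region(bal, inc)].append(outcome)

leaf_prediction = {
    name: Counter(outcomes).most_common(1)[0][0]
    for name, outcomes in by_region.items()
}


def predict(balance, income):
    """Predict default for a new observation using its region's majority class."""
    return leaf_prediction[region(balance, income)]


for name, label in leaf_prediction.items():
    print(f"{name:28s} -> {label}")
```

Every prediction is constant within a rectangle of the predictor space, which is why a tree needs many splits to approximate a diagonal boundary.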
--- .b[Linear boundary:] trees struggle to recreate a line. <img src="images/compare-linear.png" width="5784" style="display: block; margin: auto;" /> .ex.small[Source: ISL, p. 315] --- .b[Nonlinear boundary:] trees easily replicate the nonlinear boundary. <img src="images/compare-nonlinear.png" width="5784" style="display: block; margin: auto;" /> .ex.small[Source: ISL, p. 315] --- layout: false class: inverse, middle name: end # Wrap-up --- # For the project Let's recap the models you now have to work with: For **regression** (continuous outcome variables): - Linear regression with shrinkage/regularization - Ridge or lasso (tune `\(\lambda\)`) - Elasticnet (tune `\(\lambda\)` and `\(\alpha\)`) - *k*-nearest neighbors (`mode="regression"`; tune `\(k\)`) For **classification** (categorical outcome variables): - Logistic regression with shrinkage/regularization - Ridge, lasso, elasticnet (as above) - *k*-nearest neighbors (`mode="classification"`; tune `\(k\)`) - (We haven't covered how to tune decision trees.) --- # Ensemble methods If we try multiple *types* of models, how do we choose among them? - You could just choose the one that performs best in cross-validation. - But different methods have different strengths. The 2nd-best model may still have useful information. - Often, the *best* prediction will **combine predictions from multiple models.** Simple ensemble model: - For **regression:** Take the average prediction across multiple models. - For **classification:** Take the majority vote across multiple models. --- # More advanced topics If you go further in machine learning, you can learn about: **Methods** - Random forests (generalization of decision trees) - Boosting and bagging - Support vector machines - Neural nets / deep learning - Unsupervised learning **Applications** - Text analysis (natural language processing) - Image (and video) processing
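---

# Ensemble methods: a sketch

Returning to the simple ensemble rule described under Ensemble methods, the two combinations can be written in a few lines. This is a minimal sketch with hypothetical function names and made-up predictions for a single observation; it is not tied to any particular modeling package.

```python
from collections import Counter
from statistics import mean


def ensemble_regression(predictions):
    """Average the numeric predictions from several models for one observation."""
    return mean(predictions)


def ensemble_classification(predictions):
    """Return the class predicted by the most models (majority vote)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions for one observation from three different models
avg = ensemble_regression([10.0, 12.0, 17.0])
vote = ensemble_classification(["Yes", "No", "Yes"])

print("average prediction:", avg)
print("majority vote:", vote)
```

With an even number of models a vote can tie, and this sketch then returns whichever tied class appears first in the list, so an odd number of models (or an explicit tie-breaking rule) is the safer choice.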