class: center, middle, inverse, title-slide .title[ # Lecture .mono[009] ] .subtitle[ ## Support vector machines ] .author[ ### Edward Rubin ] --- exclude: true --- name: admin # Admin ## Today .b[Topic] Support vector machines ## Upcoming .b[Readings] - .note[Today] .it[ISL] Ch. 9 - .note[Next] .it[100ML] Ch. [6](https://www.dropbox.com/s/uh48e6wjs4w13t5/Chapter6.pdf?dl=0) .b[Project] Project updates/questions? --- layout: true # Support vector machines --- class: inverse, middle --- name: intro ## Intro .attn[Support vector machines] (SVMs) are a .it[general class] of classifiers that essentially attempt to separate two classes of observations. -- > SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers. .grey-light.it[ISL, p. 337] -- The .attn[support vector machine] generalizes a much simpler classifier—the .attn[maximal margin classifier]. The .attn[maximal margin classifier] attempts to separate the .b[two classes] in our prediction space using .b[a single hyperplane]. --- name: hyperplanes ## What's a hyperplane? Consider a space with `\(p\)` dimensions. A .attn[hyperplane] is a `\(p-1\)` dimensional .b[subspace] that is 1. .b[flat] (no curvature) 1. .b[affine] (may or may not pass through the origin) -- .ex[Example] In `\(p=1\)` dimension, a .it.pink[hyperplane] is a <img src="slides_files/figure-html/hp-1a-1.svg" style="display: block; margin: auto;" /> --- ## What's a hyperplane? Consider a space with `\(p\)` dimensions. A .attn[hyperplane] is a `\(p-1\)` dimensional .b[subspace] that is 1. .b[flat] (no curvature) 1. .b[affine] (may or may not pass through the origin) .ex[Example] In `\(p=1\)` dimension, a .it.pink[hyperplane] is a point. <img src="slides_files/figure-html/hp-1b-1.svg" style="display: block; margin: auto;" /> --- ## What's a hyperplane? Consider a space with `\(p\)` dimensions. A .attn[hyperplane] is a `\(p-1\)` dimensional .b[subspace] that is 1. .b[flat] (no curvature) 1. .b[affine] (may or may not pass through the origin) .ex[Example] In `\(p=2\)` dimensions, a .it.pink[hyperplane] is a <img src="slides_files/figure-html/hp-2a-1.svg" style="display: block; margin: auto;" /> --- ## What's a hyperplane? Consider a space with `\(p\)` dimensions. A .attn[hyperplane] is a `\(p-1\)` dimensional .b[subspace] that is 1. .b[flat] (no curvature) 1. .b[affine] (may or may not pass through the origin) .ex[Example] In `\(p=2\)` dimensions, a .it.pink[hyperplane] is a line. <img src="slides_files/figure-html/hp-2b-1.svg" style="display: block; margin: auto;" /> --- ## What's a hyperplane? Consider a space with `\(p\)` dimensions. A .attn[hyperplane] is a `\(p-1\)` dimensional .b[subspace] that is 1. .b[flat] (no curvature) 1. .b[affine] (may or may not pass through the origin) .ex[Example] In `\(p=3\)` dimensions, a .it.pink[hyperplane] is a -- (2D) plane. --- ## Hyperplanes We can define a .attn[hyperplane] in `\(p\)` dimensions by constraining the linear combination of the `\(p\)` dimensions..super[.pink[†]] .footnote[ .pink[†] Plus some offset ("intercept") <br>.tran[.pink[††] Alternatively: The hyperplane is composed of such points.] ] -- For example, in two dimensions a hyperplane is defined by $$ `\begin{align} \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} = 0 \end{align}` $$ which is just the equation for a line. 
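For a quick numerical check, here is a minimal sketch in R with made-up coefficients (not tied to any figure in these slides):

```r
# A hypothetical hyperplane (line) in p = 2 dimensions: 1 + 2*X1 - 1*X2 = 0
b0 = 1; b1 = 2; b2 = -1
# The point (X1, X2) = (1, 3) satisfies the equality, so it sits on the hyperplane
b0 + b1 * 1 + b2 * 3
#> [1] 0
```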
-- Points `\(X=\left( X_1,\,X_2 \right)\)` that satisfy the equality .it[live] on the hyperplane..super[.pink[††]] .footnote[ .tran[.pink[†] Plus some offset ("intercept")] <br>.pink[††] Alternatively: The hyperplane is composed of such points. ] --- ## Separating hyperplanes More generally, in `\(p\)` dimensions, we define a hyperplane by $$ `\begin{align} \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \cdots + \beta_p X_{p} = 0 \tag{A} \end{align}` $$ If `\(X=\left( X_1,\,X_2,\, \ldots,\, X_p \right)\)` satisfies the equality, it is on the hyperplane. -- Of course, not every point in the `\(p\)`-dimensional space will satisfy `\(\text{A}\)`. - If `\(\beta_0 + \beta_1 X_{1} + \cdots + \beta_p X_{p} > 0\)`, then `\(X\)` is .hi-purple.it[above] the hyperplane. - If `\(\beta_0 + \beta_1 X_{1} + \cdots + \beta_p X_{p} < 0\)`, then `\(X\)` sits .hi-orange.it[below] the hyperplane. -- The hyperplane .it[separates] the `\(p\)`-dimensional space into two "halves". --- layout: true class: clear --- exclude: true --- .ex[Ex:] A .hi-pink[separating hyperplane] in two dimensions: `\(\color{#e64173}{3 + 2 X_1 - 4 X_2 = 0}\)` <img src="slides_files/figure-html/fig-sep-2d-1.svg" style="display: block; margin: auto;" /> --- exclude: true --- .ex[Ex:] A .hi-pink[separating hyperplane] in 3 dimensions: `\(\color{#e64173}{3 + 2 X_1 - 4 X_2 + 2 X_3 = 0}\)` <iframe src="fig-sep-3d.html" width="100%" height="600" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- layout: true # Support vector machines --- name: sep-class ## Separating hyperplanes and classification .note[Idea:] .it[Separate] two classes of outcomes in the `\(p\)` dimensions of our predictor space using a separating hyperplane. -- To make a prediction for observation `\((x^o,\,y^o)=(x_1^o,\,x_2^o,\,\ldots,\,x_p^o,\,y^o):\)` We classify points that live .purple["above" the plane] as one class, *i.e.*, .center[ If `\(\beta_0 + \beta_1 x^o_{1} + \cdots + \beta_p x^o_{p} \color{#6A5ACD}{> 0}\)`, then `\(\color{#6A5ACD}{\hat{y}^o =}\)` .purple[Class 1] ] We classify points .orange["below" the plane] as the other class, *i.e.*, .center[ If `\(\beta_0 + \beta_1 x^o_{1} + \cdots + \beta_p x^o_{p} \color{#FFA500}{< 0}\)`, then `\(\color{#FFA500}{\hat{y}^o =}\)` .orange[Class 2] ] -- .note[Note] This strategy assumes a separating hyperplane exists. --- exclude: true --- .it[If] .hi-pink[a separating hyperplane] exists, <img src="slides_files/figure-html/fig-plane-class-1-1.svg" style="display: block; margin: auto;" /> --- count: false .it[If] .hi-pink[a separating hyperplane] exists, then it defines a binary classifier. <img src="slides_files/figure-html/fig-plane-class-2-1.svg" style="display: block; margin: auto;" /> --- .it[If] .hi-pink[a separating hyperplane] exists, <img src="slides_files/figure-html/fig-many-planes-1-1.svg" style="display: block; margin: auto;" /> --- count: false .it[If] .hi-pink[a separating hyperplane] exists, then .hi-slate[many separating hyperplanes] exist. <img src="slides_files/figure-html/fig-many-planes-2-1.svg" style="display: block; margin: auto;" /> --- A .b[separating hyperplane] may not exist. <img src="slides_files/figure-html/fig-non-existant-1.svg" style="display: block; margin: auto;" /> --- ## Decisions .note[Summary] A given hyperplane $$ `\begin{align} \color{#e64173}{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p} = 0 \end{align}` $$ produces a decision boundary. -- We can determine any point's `\(\left( x^o \right)\)` .it[side] of the boundary.
$$ `\begin{align} \mathop{f}\left( x^o \right) = \color{#e64173}{\beta_0 + \beta_1 x^o_1 + \beta_2 x^o_2 + \cdots + \beta_p x^o_p} \end{align}` $$ We classify observation `\(x^o\)` based upon whether `\(\mathop{f}\left( x^o \right)\)` is .purple[positive]/.orange[negative]. The magnitude of `\(\mathop{f}\left( x^o \right)\)` tells us about our .it[confidence] in the classification..super[.pink[†]] .footnote[ .pink[†] Larger magnitudes are farther from the boundary. ] --- name: which-hyperplane ## Which separating hyperplane? .qa[Q] How do we choose between the possible hyperplanes? -- .qa[A] .it[One solution:] Choose the separating hyperplane that is "farthest" from the training data points—maximizing "separation." -- The .attn[maximal margin hyperplane].super[.pink[†]] is the hyperplane that .footnote[ .pink[†] AKA the .it[optimal separating hyperplane] <br>.tran[.pink[††] The distance to the nearest observation.] ] 1. .hi[separates] the two classes of observations 1. .hi[maximizes] the .attn[margin]—the distance to the nearest observation.super[.pink[††]] .footnote[ .tran[.pink[†] AKA the .it[optimal separating hyperplane]] <br>.pink[††] Put differently: The smallest distance to a training observation. ] where .it[distance] is a point's perpendicular distance to the hyperplane. --- exclude: true --- layout: true class: clear --- The .hi-pink[maximal margin hyperplane]... <img src="slides_files/figure-html/fig-mmh-1-1.svg" style="display: block; margin: auto;" /> --- ...maximizes the .b.grey[margin] between the hyperplane and training data... <img src="slides_files/figure-html/fig-mmh-2-1.svg" style="display: block; margin: auto;" /> --- ...and is supported by three equidistant observations—the .b[support vectors]. <img src="slides_files/figure-html/fig-mmh-3-1.svg" style="display: block; margin: auto;" /> --- layout: true # Support vector machines --- ## The maximal margin hyperplane Formally, the .attn[maximal margin hyperplane] solves the problem: Maximize the margin `\(M\)` over the set of `\(\left\{ \beta_0,\, \beta_1,\, \ldots,\, \beta_p,\, M \right\}\)` such that -- $$ `\begin{align} \sum_{j=1}^{p} \beta_j^2 = 1 \tag{1} \end{align}` $$ $$ `\begin{align} y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) \ge M \tag{2} \end{align}` $$ for all observations `\(i\)`. -- `\((2)\)` ensures we separate (classify) observations correctly. -- `\((1)\)` allows us to interpret `\((2)\)` as "distance from the hyperplane". --- ## Fake constraints Note that our first "constraint" $$ `\begin{align} \sum_{j=1}^{p} \beta_j^2 = 1 \tag{1} \end{align}` $$ does not actually constrain the hyperplane (it only normalizes the `\(\beta_j\)`). -- If we can define a hyperplane by $$ `\begin{align} \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} = 0 \end{align}` $$ then we can also rescale the same hyperplane with some constant `\(k\)` $$ `\begin{align} k \left( \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} \right) = 0 \end{align}` $$ --- ## The maximal margin classifier The maximal margin hyperplane produces the .attn[maximal margin classifier]. -- .note[Notes] 1. We are doing .b[binary classification]. 1. The decision boundary only uses the .b[support vectors]—very sensitive. 1. This classifier can struggle in .b[large dimensions] (big `\(p\)`). 1. A separating hyperplane does not always exist (.b[non-separable]). 1. Decision boundaries can be .b[nonlinear].
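To make the sign-based decision rule for `\(\mathop{f}\left( x^o \right)\)` concrete, here is a minimal sketch in R. The coefficients and observations are made up (they are not the ones behind the figures):

```r
# Hypothetical (already-estimated) hyperplane: f(x) = 2 - 1*x1 + 3*x2
beta = c(2, -1, 3)
# Three hypothetical observations (rows), each with a leading 1 for the intercept
x_new = rbind(c(1, 4, 1), c(1, 4, -1), c(1, 1, 0.4))
# f(x) for each observation: the sign gives the class; the magnitude, our confidence
f_x = as.vector(x_new %*% beta)
f_x
#> [1]  1.0 -5.0  2.2
# Observation 2 sits farthest from the boundary, so it is the most confident call
ifelse(f_x > 0, "Class 1", "Class 2")
#> [1] "Class 1" "Class 2" "Class 1"
```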
--- layout: false class: clear, middle Let's start by addressing non-separability... --- class: clear, middle Surely there's still a decent hyperplane-based classifier here, right? <img src="slides_files/figure-html/fig-non-existant-2-1.svg" style="display: block; margin: auto;" /> --- layout: true # Support vector machines --- name: soft-margins ## Soft margins When we cannot .it[perfectly] separate our classes, we can use .attn[soft margins], which are margins that "accept" some number of observations. -- .note[The idea:] We will allow observations to be 1. in the margin 2. on the wrong side of the hyperplane but each will come with a price. -- Using these .it[soft margins], we create a hyperplane-based classifier called the .attn[support vector classifier]..super[.pink[†]] .footnote[ .pink[†] Also called the .it[soft margin classifier]. ] --- exclude: true --- layout: false class: clear Our underlying population clearly does not have a separating hyperplane. <img src="slides_files/figure-html/fig-nosep-1-1.svg" style="display: block; margin: auto;" /> --- class: clear Our sample also does not have a separating hyperplane. <img src="slides_files/figure-html/fig-nosep-2-1.svg" style="display: block; margin: auto;" /> --- class: clear Our .hi[hyperplane] <img src="slides_files/figure-html/fig-nosep-3a-1.svg" style="display: block; margin: auto;" /> --- class: clear Our .hi[hyperplane] with .hi-grey[soft margins]... <img src="slides_files/figure-html/fig-nosep-3b-1.svg" style="display: block; margin: auto;" /> --- class: clear Our .hi[hyperplane] with .hi-grey[soft margins] and .b[support vectors]. <img src="slides_files/figure-html/fig-nosep-3c-1.svg" style="display: block; margin: auto;" /> --- class: clear .b[Support vectors:] on (i) the margin or (ii) on the wrong side of the margin. <img src="slides_files/figure-html/fig-nosep-3d-1.svg" style="display: block; margin: auto;" /> --- layout: true # Support vector machines ## Support vector classifier --- name: svc The .attn[support vector classifier] selects a hyperplane by solving the problem: Maximize the margin `\(M\)` over the set `\(\left\{ \beta_0, \beta_1, \ldots,\beta_p, \epsilon_1, \ldots, \epsilon_n, M \right\}\)` s.t. -- $$ `\begin{align} \sum_{j=1}^{p} \beta_j^2 = 1 \tag{3} \end{align}` $$ -- $$ `\begin{align} y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) \ge M \left( 1 - \epsilon_i \right) \tag{4} \end{align}` $$ -- $$ `\begin{align} \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C \tag{5} \end{align}` $$ The `\(\epsilon_i\)` are .attn[slack variables] that allow `\(i\)` to .it[violate] the margin or hyperplane. -- <br> `\(C\)` gives us our budget for these violations. --- layout: true class: clear --- class: middle Let's consider how constraints `\((4)\)` and `\((5)\)` work together...
$$ `\begin{align} y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) \ge M \left( 1 - \epsilon_i \right) \tag{4} \end{align}` $$ $$ `\begin{align} \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C \tag{5} \end{align}` $$ --- $$ `\begin{align} \color{#6A5ACD}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-1a-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(\color{#e64173}{\epsilon_i} = 0:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) > 0\)` - Correct side of hyperplane - Correct side of margin <br>(or on margin) - No cost `\((C)\)` - Distance `\(\ge M\)` - .ex[Examples?] `\(\color{#ffffff}{\left( \times \right)}\)` ] --- $$ `\begin{align} \color{#6A5ACD}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-1b-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(\color{#e64173}{\epsilon_i} = 0:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) > 0\)` - Correct side of hyperplane - Correct side of margin <br>(or on margin) - No cost `\((C)\)` - Distance `\(\ge M\)` - Correct side of margin: `\(\left( \color{#314f4f}{\times} \right)\)` - .it[On margin:] .small[.purple[1], .orange[6], .orange[9]] ] --- $$ `\begin{align} \color{#6A5ACD}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-2a-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(0 < \color{#e64173}{\epsilon_i} < 1:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) > 0\)` - Correct side of hyperplane - Wrong side of the margin <br>(.it[violates margin]) - Pays cost `\(\epsilon_i\)` - Distance `\(< M\)` - .ex[Examples?]
] --- $$ `\begin{align} \color{#6A5ACD}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-2b-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(0 < \color{#e64173}{\epsilon_i} < 1:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) > 0\)` - Correct side of hyperplane - Wrong side of the margin <br>(.it[violates margin]) - Pays cost `\(\epsilon_i\)` - Distance `\(< M\)` - .ex[Ex:] .small[.purple[2], .orange[3]] ] --- $$ `\begin{align} \color{#FFA500}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-3a-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(\color{#e64173}{\epsilon_i} > 1:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) < 0\)` - Wrong side of hyperplane - Pays cost `\(\epsilon_i\)` - Distance `\(\lesseqqgtr M\)` - .ex[Examples?] ] --- $$ `\begin{align} \color{#FFA500}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-3b-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ For `\(\color{#e64173}{\epsilon_i} > 1:\)` - `\(M \left( 1 - \color{#e64173}{\epsilon_i} \right) < 0\)` - Wrong side of hyperplane - Pays cost `\(\epsilon_i\)` - Distance `\(\lesseqqgtr M\)` - .ex[Ex:] .small[.purple[4], .orange[5], .orange[7], .orange[8]] ] --- $$ `\begin{align} \color{#FFA500}{y_i} \left( \color{#6A5ACD}{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}} \right) \ge M \left( 1 - \color{#e64173}{\epsilon_i} \right), \quad \color{#e64173}{\epsilon_i} \ge 0, \quad \sum_{i=1}^{n} \color{#e64173}{\epsilon_i} \le C \end{align}` $$ .more-left.move-up-more[ <img src="slides_files/figure-html/plot-slack-4-1.svg" width="100%" style="display: block; margin: auto;" /> ] .less-right.move-up[ .b[Support vectors] - On margin - Violate margin - Wrong side of hyperplane determine the classifier. ] --- layout: true # Support vector machines ## Support vector classifier --- The tuning parameter `\(C\)` determines how much *slack* we allow. `\(C\)` is our budget for violating the margin—including observations on the wrong side of the hyperplane. -- .col-left[ .note[Case 1:] `\(C=0\)` - We allow no violations. - Maximal margin hyperplane. - Trains on few obs. ] -- .col-right[ .note[Case 2:] `\(C>0\)` - `\(\le C\)` violations of hyperplane. - *Softens* margins - Larger `\(C\)` uses more obs. ] -- .clear-up[ We tune `\(C\)` via CV to balance low bias (low `\(C\)`) and low variance (high `\(C\)`). ] --- layout: true class: clear --- exclude: true --- Starting with a low budget `\((C)\)`.
<img src="slides_files/figure-html/fig-nosep-l-1.svg" style="display: block; margin: auto;" /> --- Now for a high budget `\((C)\)`. <img src="slides_files/figure-html/fig-nosep-h-1.svg" style="display: block; margin: auto;" /> --- class: middle The .hi[support-vector classifier] extends the .hi[maximal-margin classifier]: 1. Allowing for .b[misclassification] - Observations on the *wrong side of the hyperplane*. - Situations where there is no separating hyperplane. 1. Permitting .b[violations of the margin]. 1. Typically using .b[more observations] to determine the decision boundary. --- class: middle However, we still are using a (single) linear boundary between our classes. --- exclude: true --- .ex[Ex:] Some data <img src="slides_files/figure-html/plot-nonlin-data-1.svg" style="display: block; margin: auto;" /> --- .ex[Ex:] Some data don't really work with .pink[*linear* decision boundaries]. <img src="slides_files/figure-html/plot-nonlin-lin-svm-1.svg" style="display: block; margin: auto;" /> --- .ex[Ex:] Some data may have .pink[*non-linear* decision boundaries]. <img src="slides_files/figure-html/plot-nonlin-quad-svm-1.svg" style="display: block; margin: auto;" /> --- .ex[Ex:] We could probably do even better with more flexibility. <img src="slides_files/figure-html/plot-nonlin-quad-svm2-1.svg" style="display: block; margin: auto;" /> --- layout: true # Support vector machines --- ## Flexibility In the regression setting, we increase our model's flexibility by adding polynomials in our predictors, _e.g._, `\(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\beta}_2 x_i^2 + \hat{\beta}_3 x_i^3\)`. -- We can apply a very similar idea to our support vector classifier. .note[Previously:] Train the classifier on `\(X_1, X_2, \ldots, X_p\)`. .note[Idea:] Train the classifier on `\(X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2\)` (and so on). -- The new classifier has a .b[linear decision boundary] in the expanded space. -- The boundary is going to be .b[nonlinear] within the original space. --- name: the-svm ## Introducing The .attn[support vector machine] runs with this idea of expanded flexibility. (Why stop at quadratic functions—or polynomials?) Support vector machines .b[train a support vector classifier] on .b[expanded feature].super[.pink[†]] .b[spaces] that result from applying .attn[kernels] to the original features. .footnote[ .pink[†] feature = predictor ] --- name: inner ## Dot products It turns out that solving the support vector classifier only involves the <br>.attn[dot product] of our observations. The .attn[dot product] of two vectors is defined as $$ `\begin{align} \langle a,b \rangle = a_1 b_1 + a_2 b_2 + \cdots + a_p b_p = \sum_{i=1}^p a_i b_i \end{align}` $$ -- .ex[Ex:] The dot product of `\(a\)` = (1,2) and `\(b\)` = (3,4) is `\(\langle a,b \rangle\)` = 1×3 + 2×4 = 11. -- Dot products are often pitched as a measure of two vectors' similarity. --- ## Dot products and the SVC We can write the linear support vector classifier as $$ `\begin{align} f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x,x_i \rangle \end{align}` $$ and we fit the `\((n)\)` `\(\alpha_i\)` and `\(\beta_0\)` with the training observations' dot products..super[.pink[†]] .footnote[ .pink[†] The actual fitting is beyond what we're doing today. ] -- As you might guess, `\(\alpha_i\neq 0\)` only for support-vector observations.
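Here is a minimal sketch of this representation with made-up numbers (three training observations, only two of which act as support vectors):

```r
# Hypothetical training observations (rows) and "fitted" alpha_i and beta_0
x_train = rbind(c(1, 2), c(3, 4), c(0, -1))
alpha = c(0.5, 0, -0.5)  # alpha_i = 0 for the non-support vector (observation 2)
beta0 = 0.25
# Evaluate f(x) = beta_0 + sum_i alpha_i * <x, x_i> at a new point
x_new = c(2, 1)
f_x = beta0 + sum(alpha * (x_train %*% x_new))
f_x
#> [1] 2.75
```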
--- name: svm-gen ## Generalizing .note[Recall:] Linear support vector classifier `\(f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x,x_i \rangle\)` -- .attn[Support vector machines] generalize this linear classifier by simply replacing `\(\langle x,x_i \rangle\)` with (non-linear) .attn[kernel functions].super[.pink[†]] `\(K(x_i,x_{i'})\)`. .footnote[ .pink[†] Or just .it[kernels]. ] -- These magical .attn[kernel functions] are various ways to measure similarity between observations. -- - .note[Linear kernel:] `\(K(x_i,x_{i'}) = \sum_{j=1}^p x_{i,j} x_{i',j}\)` .grey-light[(back to SVC)] -- - .note[Polynomial kernel:] `\(K(x_i,x_{i'}) = \left( 1 + \sum_{j=1}^p x_{i,j} x_{i',j} \right)^2\)` -- - .note[Radial kernel:] `\(K(x_i,x_{i'}) = \mathop{\text{exp}}\left( -\gamma \sum_{j=1}^p \left( x_{i,j} - x_{i',j} \right)^2 \right)\)` --- layout: true class: clear --- exclude: true --- Our example data. <img src="slides_files/figure-html/plot-data-nonlin-1.svg" style="display: block; margin: auto;" /> --- With a linear kernel *plus* an interaction between `\(X_1\)` and `\(X_2\)`..super[.pink[†]] .footnote[ .pink[†] Exciting!! ] <img src="slides_files/figure-html/plot-svm-lin-kern-1.svg" style="display: block; margin: auto;" /> --- Polynomial kernel (of degree 2). <img src="slides_files/figure-html/plot-svm-poly-kern-1.svg" style="display: block; margin: auto;" /> --- And now for the radial kernel! <img src="slides_files/figure-html/plot-svm-radial-kern-1.svg" style="display: block; margin: auto;" /> --- Very small `\(\gamma\)` forces the radial kernel to be *even more* local. <img src="slides_files/figure-html/plot-svm-radial-kern-sigma-1.svg" style="display: block; margin: auto;" /> --- layout: true # Support vector machines --- ## More generalizing So why make a big deal of kernels? Anyone can transform variables. -- While anyone can transform variables, you cannot transform variables to cover all spaces that our kernels cover. -- For example, the feature space of the radial kernel is infinite dimensional..super[.pink[†]] .footnote[ .pink[†] And implicit ] -- <img src="images/bill.gif" style="display: block; margin: auto;" /> --- name: svm-r ## In R As you probably guessed, `parsnip` offers several SVM options..super[†] - .b.pink[Linear].super[††] <br> .pink[`svm_linear(cost)`] with `"LiblineaR"` engine - .b.orange[Polynomial] <br> .orange[`svm_poly(cost, degree, scale_factor)`] with `"kernlab"` engine - .b.purple[Radial] <br> .purple[`svm_rbf(cost, rbf_sigma)`] with `"kernlab"` engine You can also find more kernels in the actual packages (or other packages). .footnote[ .pink[†] `e1071` is another popular R package for SVMs, but it does not work with `tidymodels`. <br> .pink[††] Remember that you can still add interactions with the linear kernel. ] --- layout: false class: clear, middle .note[Note:] Costs have units. You often want to center/scale your variables. --- # Support vector machines ## In R For a real example let's go back to the heart disease dataset. - Load the data. - Add a recipe .b[with normalization]. - Define the CV splits. --- # Support vector machines ## In R For a real example let's go back to the heart disease dataset.
```r
# Load the packages we use below (tidymodels bundles readr, dplyr, recipes, rsample, tune, etc.)
library(tidymodels)
library(janitor)
# Load the dataset
heart_df = read_csv("Heart.csv") %>%
  dplyr::select(-1) %>%
  rename(HeartDisease = AHD) %>%
  clean_names()
# Define the recipe w/ imputation, dummies, and normalization
heart_recipe = recipe(heart_disease ~ ., data = heart_df) %>%
  step_impute_median(all_predictors() & all_numeric()) %>%
  step_impute_mode(all_predictors() & all_nominal()) %>%
  step_dummy(all_predictors() & all_nominal()) %>%
  step_normalize(all_predictors())
# Define CV
set.seed(12345)
heart_splits = heart_df %>% vfold_cv(v = 5)
```
--- layout: false class: clear, middle .note[Note] The actual linear SVM function `svm_linear()` is under development and isn't working quite right. First-degree polynomial *is* linear. --- layout: false class: clear .b.pink[SVM with linear kernel] Define the .pink[model] and workflow—tuning `\(C\)`. .col-left[
```r
# The linear-SVM model (a degree-1 polynomial kernel)
heart_linear = svm_poly(
  mode = "classification",
  cost = tune(),
  degree = 1
) %>% set_engine("kernlab")
# The linear-SVM workflow
wf_linear_svm = workflow() %>%
  add_model(heart_linear) %>%
  add_recipe(heart_recipe)
```
] .col-right[
```r
# Tune the linear SVM
tuning_linear_svm = tune_grid(
  wf_linear_svm,
  heart_splits,
  grid = data.frame(
    cost = 10^seq(-4, 2, length = 10)
  ),
  metrics = metric_set(
    f_meas, accuracy
  )
)
```
] --- class: clear .b[Performance of .pink[linear-kernel SVMs]] <img src="slides_files/figure-html/plot-svm-linear-1.svg" style="display: block; margin: auto;" /> --- layout: false class: clear .b.orange[SVM with polynomial kernel] Define the .orange[model] and workflow—tuning `\(C\)` *and* `degree`. .col-left[
```r
# The polynomial-SVM model
heart_poly = svm_poly(
  mode = "classification",
  cost = tune(),
  degree = tune()
) %>% set_engine("kernlab")
# The polynomial-SVM workflow
wf_poly_svm = workflow() %>%
  add_model(heart_poly) %>%
  add_recipe(heart_recipe)
```
] .col-right[
```r
# Tune the polynomial SVM
tuning_poly_svm = tune_grid(
  wf_poly_svm,
  heart_splits,
  grid = expand_grid(
    cost = 10^seq(-4, 2, length = 10),
    degree = 1:3
  ),
  metrics = metric_set(
    f_meas, accuracy
  )
)
```
] --- class: clear .b[*Accuracy* of .orange[polynomial-kernel SVMs]] <img src="slides_files/figure-html/plot-rbf-poly-1.svg" style="display: block; margin: auto;" /> --- layout: false class: clear .b.purple[SVM with RBF kernel] Define the .purple[model] and workflow—tuning `\(C\)` *and* `rbf_sigma`. .col-left[
```r
# The RBF-SVM model
heart_rbf = svm_rbf(
  mode = "classification",
  cost = tune(),
  rbf_sigma = tune()
) %>% set_engine("kernlab")
# The RBF-SVM workflow
wf_rbf_svm = workflow() %>%
  add_model(heart_rbf) %>%
  add_recipe(heart_recipe)
```
] .col-right[
```r
# Tune the RBF SVM
tuning_rbf_svm = tune_grid(
  wf_rbf_svm,
  heart_splits,
  grid = expand_grid(
    cost = 10^seq(-4, 2, length = 10),
    rbf_sigma = 10^c(-3:0)
  ),
  metrics = metric_set(
    f_meas, accuracy
  )
)
```
] --- class: clear .b[*Accuracy* of .purple[RBF-kernel SVMs]] <img src="slides_files/figure-html/plot-svm-rbf-1.svg" style="display: block; margin: auto;" /> --- layout: false name: multiclass # Support vector machines ## Multi-class classification You will commonly see SVMs applied in settings with `\(K>2\)` classes. -- What can we do? -- We have options! -- .attn[One-versus-one classification] - Compares each *pair* of classes, one pair at a time. - Final prediction comes from the most-common pairwise prediction. -- .attn[One-versus-all classification] - Fits `\(K\)` unique SVMs—one for each class: `\(k\)` *vs.* not `\(k\)`. - Predicts the class for which `\(\mathop{f}_k(x)\)` is largest.
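In practice you rarely code the multi-class strategy yourself. For example, the `kernlab` engine handles `\(K>2\)` classes automatically (its default is the one-versus-one approach). Here is a minimal sketch using the built-in `iris` data (not the lecture's dataset) with arbitrary, untuned parameter values:

```r
library(tidymodels)
# A three-class outcome: kernlab fits the pairwise (one-versus-one) SVMs for us
multi_svm = svm_rbf(mode = "classification", cost = 1, rbf_sigma = 0.1) %>%
  set_engine("kernlab")
multi_fit = multi_svm %>% fit(Species ~ ., data = iris)
# Predicted classes for the first few rows
predict(multi_fit, new_data = head(iris))
```

The recipe/workflow/tuning machinery from the heart-disease example carries over to this setting as well.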
--- name: more layout: false # SVM ## More material Visualizing decision boundaries - [From `scikit-learn`](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) - [From sub-subroutine](http://www.subsubroutine.com/sub-subroutine/2016/2/15/understanding-machine-learning-techniques-by-the-decision-boundaries-they-are-capable-of) [The `kernlab` paper](https://cran.r-project.org/web/packages/kernlab/vignettes/kernlab.pdf) (describes kernel parameters) [*A Practical Guide to Support Vector Classification*](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) .note[Next:] - Prep: [A nice neural-net video](https://www.youtube.com/watch?v=aircAruvnKk) - Fun: [A neural-net playground](https://playground.tensorflow.org) --- name: svr layout: false # SV Regression ## Extending SVMs to regression problems You can extend this SVM idea to regression settings. .note[Recall] OLS regression chooses `\(\beta\)`s to minimize SSE-based loss. SV regression chooses `\(\beta\)`s to minimize loss based upon residuals .it[larger than some defined magnitude] (think: the margin). --- name: sources layout: false # Sources These notes draw upon - [An Introduction to Statistical Learning](https://www.statlearning.com) (*ISL*)<br>James, Witten, Hastie, and Tibshirani --- # Table of contents .col-left[ .smallest[ #### Admin - [Today and upcoming](#admin) #### Other - [More materials](#more) - [Sources/references](#sources) ] ] .col-right[ .smallest[ #### SVM 1. [Intro](#intro) 1. [Hyperplanes](#hyperplanes) 1. [Hyperplanes and classification](#sep-class) 1. [Which hyperplane? (The maximal margin)](#which-hyperplane) 1. [Soft margins](#soft-margins) 1. [The support vector classifier](#svc) 1. [Support vector machines](#the-svm) - [Intro](#the-svm) - [Dot products](#inner) - [Generalization](#svm-gen) - [In R](#svm-r) 1. [Multi-class classification](#multiclass) 1. [SV Regression](#svr) ] ]