### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For additional piping operators
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives
As with any concept, machine learning may have a slightly different definition, depending on whom you ask. Here is a little compilation of definitions by academics and practitioners alike:
Sometimes the lines are blurry, given the vast and constantly expanding number of algorithms and techniques.
Unsupervised machine learning covers tasks related to pattern recognition and data exploration, where no right answer or problem structure exists yet. Its main applications are dimensionality reduction and clustering, which we cover below.
Dimensionality reduction techniques are foremost useful to (you might see it coming) reduce the dimensionality of our data. So, what does that mean? And why should we want to do that?
Dimensions here are a synonym for variables, so what we really want is fewer variables. To do that, we have to find ways to express the same amount of information with fewer, but more information-rich, variables. This is particularly useful to:
The type of analysis to be performed depends on the data set formats and structures. The most commonly used DR techniques are:
The mathematics underlying it are somewhat complex, so I won’t go into too much detail, but the basics of PCA are as follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller number of “Principal Components”.
But what are these exactly? Principal Components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This means that we try to find the straight line that best spreads the data out when it is projected along it. This is the first principal component, the straight line that shows the most substantial variance in the data.
Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each principal component sums up a certain percentage of the total variation in the dataset. Where your initial variables are strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few principal components. Usually, the first principal component captures the main similarity in your data, the second the main difference.
These principal components can be computed via eigenvalues and eigenvectors. Just like many things in life, eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue. Simply put, an eigenvector is a direction, such as “vertical” or “45 degrees”, while an eigenvalue is a number telling you how much variance there is in the data in that direction. The eigenvector with the highest eigenvalue is, therefore, the first principal component. The number of eigenvalues and eigenvectors that exist is equal to the number of dimensions the data set has. Consequently, we can reframe a dataset in terms of these eigenvectors and eigenvalues without changing the underlying information.
Note that reframing a dataset in terms of a set of eigenvalues and eigenvectors does not entail changing the data itself; you are just looking at it from a different angle, which should represent the data better.
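To make this a bit more concrete, here is a small illustrative sketch on toy data (base R only, not part of the nomad analysis that follows): the eigen-decomposition of the correlation matrix yields the same directions that prcomp() reports as principal components.
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)     # toy data: 100 observations, 3 variables
X[, 2] <- X[, 1] + rnorm(100, sd = 0.3)   # make two variables strongly correlated

eig <- eigen(cor(X))  # eigen-decomposition of the correlation matrix
eig$values            # eigenvalues: variance captured along each direction
eig$vectors           # eigenvectors: the directions themselves

prcomp(X, scale. = TRUE)$rotation   # the same directions (up to sign) as principal components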
Alright, let's load some data. Here we will draw from some of our own work, where we explore the life of digital nomads. The paper is not written yet, but the preliminary work is summarized in this presentation. You probably already know the data from NomadList. Here we look at the 2017 crawl of city data, which compiles the digital nomads' ranking of cities along a couple of dimensions. Let's take a look.
Roman's web-crawl data is always a bit messy, so we do some little cosmetics upfront.
data <- fread("data/nomad_cities.csv",
sep="\t",
dec=".",
na.strings=c("NA", "DotMap(__next__=DotMap())", "DotMap()") )
data %<>%
select(-V1, -nomadScore) %>%
select(place, nomad_score, longitude, latitude, everything()) %>%
mutate(racism = if_else(racism > 1, 1, racism))
Let's take a look:
head(data)
glimpse(data)
## Observations: 781
## Variables: 25
## $ place <chr> "Budapest", "Chiang Mai", "Prague", "Ta...
## $ nomad_score <dbl> 1.00, 0.95, 0.94, 0.94, 0.94, 0.92, 0.9...
## $ longitude <dbl> 19.040, 98.993, 14.438, 121.560, -97.74...
## $ latitude <dbl> 47.498, 18.788, 50.076, 25.091, 30.267,...
## $ coffee_in_cafe <dbl> 1.73, 0.85, 1.99, 1.88, 5.00, 4.00, 5.3...
## $ cost_beer <dbl> 1.73, 0.85, 1.99, 1.88, 5.00, 4.00, 5.3...
## $ cost_coworking <dbl> 152.41, 98.88, 159.13, 47.01, 200.00, 2...
## $ cost_expat <int> 1273, 780, 1653, 1640, 3309, 4325, 2197...
## $ cost_nomad <int> 1364, 777, 1639, 1545, 3028, 3238, 2554...
## $ female_friendly <dbl> 1.00, 0.80, 1.00, 1.00, 0.80, 0.80, 0.8...
## $ fragile_states_index <dbl> 52.7, 78.8, 40.8, NA, 34.0, 34.0, 39.8,...
## $ free_wifi_available <dbl> 0.40, 0.60, 0.60, 1.00, 0.60, 1.00, 0.6...
## $ freedom_score <dbl> 0.6, 0.2, 0.8, 0.6, 0.6, 0.6, 0.8, 0.6,...
## $ friendly_to_foreigners <dbl> 0.60, 0.60, 0.80, 0.80, 0.80, 0.80, 0.8...
## $ internet_speed <int> 31, 14, 15, 16, 118, 81, 18, 23, 55, 24...
## $ leisure <dbl> 0.80, 0.62, 1.00, 1.00, 1.00, 1.00, 0.6...
## $ lgbt_friendly <dbl> 0.27, 0.60, 0.60, 0.80, 0.60, 1.00, 1.0...
## $ life_score <dbl> 0.86, 0.75, 0.83, 0.93, 0.95, 1.00, 0.8...
## $ nightlife <dbl> 1.00, 0.40, 1.00, 0.60, 1.00, 1.00, 0.8...
## $ peace_score <dbl> 0.8, 0.4, 0.8, NA, 0.8, 0.8, 0.8, 0.8, ...
## $ places_to_work <dbl> 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
## $ press_freedom_index <dbl> 28.17, 44.53, 16.66, 24.37, 22.49, 22.4...
## $ racism <dbl> 0.40, 0.40, 0.42, 0.00, 0.80, 0.80, 0.6...
## $ safety <dbl> 0.60, 0.80, 0.80, 1.00, 0.73, 0.73, 0.8...
## $ weed <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, ...
Quite a set of interesting features, all numerically coded. Let's select the ones we want to analyze and organize them a bit. Since there are a lot of variables, I afterwards also define a smaller subset on which we do some graphical exploration.
# Variables for analysis
vars <- c("cost_nomad", "cost_coworking", "cost_expat", "coffee_in_cafe", "cost_beer", # costs
"places_to_work", "free_wifi_available", "internet_speed", # work
"freedom_score", "peace_score", "safety", "fragile_states_index", "press_freedom_index", # safety & freedom
"female_friendly", "lgbt_friendly", "friendly_to_foreigners", "racism", # friendly
"leisure","life_score","nightlife","weed" # fun
)
# Variables for descriptives
vars.desc <- c("nomad_score", "cost_nomad", "places_to_work", "freedom_score", "friendly_to_foreigners", "life_score")
Ok, time for some exploration. Here I will introduce the `GGally` package, a wrapper around `ggplot2` which has some functions for very nice visual summaries in matrix form.
First, let's look at a classical correlation matrix.
library(GGally) # provides ggcorr() and ggpairs()

ggcorr(data[vars], label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE)
Even cooler, the `ggpairs` function creates a scatterplot matrix plus all variable distributions and pairwise correlations. Before, I used the `PerformanceAnalytics` package for that, but I like the ggplot style more.
ggpairs(data[,vars.desc],
aes(alpha = 0.3),
ggtheme = theme_gray())
To execute the PCA, we will here use the `FactoMineR` package to compute the PCA, and `factoextra` for extracting and visualizing the results. `FactoMineR` is a great package, and my favorite, for computing principal component methods in R. It is very easy to use and very well documented. There are other alternatives around, but for quite some time I have found it to be the most powerful and convenient one. `factoextra` is just a convenient `ggplot2` wrapper that easily produces nice and informative diagnostic plots for a variety of DR and clustering techniques.
We realized already that we have some missing values in some observations. Normally, we could now do some missing-value imputation (there are good R packages such as `mice` available), but for this tutorial we skip that step and do the lazy thing: just delete all observations with missing values.
data %<>%
drop_na()
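For reference only: had we wanted to impute instead of dropping rows, a rough sketch with the `mice` package could look like the following (not run here; the number of imputations, method, and seed are arbitrary choices, not taken from this tutorial).
library(mice) # multivariate imputation by chained equations

imp <- mice(data[, vars], m = 5, method = "pmm", seed = 123, printFlag = FALSE) # illustrative settings
data.imputed <- complete(imp, 1) # take the first of the five imputed datasets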
Now let's run the PCA. Notice the `scale.unit = TRUE` argument, which you should ALWAYS use: it standardizes the variables to unit variance, so that differently scaled variables do not dominate the components. Afterwards, we take a look at the resulting list object.
library(FactoMineR) # provides PCA()
library(factoextra) # provides the fviz_*() helpers used below

res.pca <- PCA(data[,vars], scale.unit = TRUE, graph = FALSE)
glimpse(res.pca)
## List of 5
## $ eig : num [1:21, 1:3] 9.19 1.907 1.319 1.124 0.935 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:21] "comp 1" "comp 2" "comp 3" "comp 4" ...
## .. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
## $ var :List of 4
## ..$ coord : num [1:21, 1:5] 0.668 0.419 0.593 0.76 0.76 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cor : num [1:21, 1:5] 0.668 0.419 0.593 0.76 0.76 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:21, 1:5] 0.446 0.175 0.352 0.578 0.578 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:21, 1:5] 4.86 1.91 3.83 6.29 6.29 ...
## .. ..- attr(*, "dimnames")=List of 2
## $ ind :List of 4
## ..$ coord : num [1:754, 1:5] 0.433 -2.089 2.242 4.248 4.727 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:754, 1:5] 0.0108 0.3115 0.1967 0.3479 0.5237 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:754, 1:5] 0.0027 0.063 0.0725 0.2605 0.3224 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ dist : Named num [1:754] 4.17 3.74 5.05 7.2 6.53 ...
## .. ..- attr(*, "names")= chr [1:754] "1" "2" "3" "5" ...
## $ svd :List of 3
## ..$ vs: num [1:21] 3.032 1.381 1.148 1.06 0.967 ...
## ..$ U : num [1:754, 1:5] 0.143 -0.689 0.739 1.401 1.559 ...
## ..$ V : num [1:21, 1:5] 0.22 0.138 0.196 0.251 0.251 ...
## $ call:List of 9
## ..$ row.w : num [1:754] 0.00133 0.00133 0.00133 0.00133 0.00133 ...
## ..$ col.w : num [1:21] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ scale.unit: logi TRUE
## ..$ ncp : num 5
## ..$ centre : num [1:21] 2304.65 209.08 1874.17 3.29 3.29 ...
## ..$ ecart.type: num [1:21] 1081.09 173.8 1255.83 1.92 1.92 ...
## ..$ X :'data.frame': 754 obs. of 21 variables:
## .. ..$ cost_nomad : int [1:754] 1364 777 1639 3028 3238 2554 3503 3427 2245 2956 ...
## .. ..$ cost_coworking : num [1:754] 152.4 98.9 159.1 200 250 ...
## .. ..$ cost_expat : int [1:754] 1273 780 1653 3309 4325 2197 2691 3764 1859 2760 ...
## .. ..$ coffee_in_cafe : num [1:754] 1.73 0.85 1.99 5 4 5.38 5 5 4.03 3.5 ...
## .. ..$ cost_beer : num [1:754] 1.73 0.85 1.99 5 4 5.38 5 5 4.03 3.5 ...
## .. ..$ places_to_work : num [1:754] 1 0.8 1 1 1 1 1 1 0.8 1 ...
## .. ..$ free_wifi_available : num [1:754] 0.4 0.6 0.6 0.6 1 0.6 0.4 1 0.24 0.6 ...
## .. ..$ internet_speed : int [1:754] 31 14 15 118 81 18 23 55 24 99 ...
## .. ..$ freedom_score : num [1:754] 0.6 0.2 0.8 0.6 0.6 0.8 0.6 0.6 0.8 0.6 ...
## .. ..$ peace_score : num [1:754] 0.8 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
## .. ..$ safety : num [1:754] 0.6 0.8 0.8 0.73 0.73 0.8 0.8 0.6 0.8 0.4 ...
## .. ..$ fragile_states_index : num [1:754] 52.7 78.8 40.8 34 34 39.8 34 34 39.8 34 ...
## .. ..$ press_freedom_index : num [1:754] 28.2 44.5 16.7 22.5 22.5 ...
## .. ..$ female_friendly : num [1:754] 1 0.8 1 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
## .. ..$ lgbt_friendly : num [1:754] 0.27 0.6 0.6 0.6 1 1 0.8 0.8 1 1 ...
## .. ..$ friendly_to_foreigners: num [1:754] 0.6 0.6 0.8 0.8 0.8 0.8 1 1 0.8 0.8 ...
## .. ..$ racism : num [1:754] 0.4 0.4 0.42 0.8 0.8 0.6 0.8 0.8 1 0.8 ...
## .. ..$ leisure : num [1:754] 0.8 0.62 1 1 1 0.6 1 0.6 0.78 0.8 ...
## .. ..$ life_score : num [1:754] 0.86 0.75 0.83 0.95 1 0.88 0.95 0.92 0.85 0.87 ...
## .. ..$ nightlife : num [1:754] 1 0.4 1 1 1 0.8 1 1 0.8 0.6 ...
## .. ..$ weed : int [1:754] 0 0 1 0 0 1 1 1 1 0 ...
## ..$ row.w.init: num [1:754] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ call : language PCA(X = data[, vars], scale.unit = TRUE, graph = FALSE)
## - attr(*, "class")= chr [1:2] "PCA" "list "
Ok, let's look at the “screeplot”, a diagnostic visualization that displays the variance explained by every component. Here we use the `factoextra` package, as for all following visualizations with the `fviz_` prefix. Notice that the output in every case is a `ggplot2` object, which could be complemented with further layers.
fviz_screeplot(res.pca,
addlabels = TRUE,
ncp = 10,
ggtheme = theme_gray())
As expected, we see that the first component already captures a large share of the variance. Let's look at the corresponding eigenvalues.
as_tibble(res.pca$eig)
For feature selection, our rule of thumb is to only include components with an eigenvalue > 1, meaning that in this case we would reduce our data to 4 dimensions (the short check below confirms this). Afterwards, let's project the variables onto 2-dimensional space and take a look at their loading vectors.
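A minimal check of that rule of thumb, only reading the `res.pca` object created above:
sum(res.pca$eig[, "eigenvalue"] > 1) # how many components have an eigenvalue above 1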
fviz_pca_var(res.pca,
alpha.var = "cos2",
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
ggtheme = theme_gray())
We see that they tend to cluster in 3 groups:
Let's look at the numeric values.
get_pca_var(res.pca)
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
as_tibble(res.pca$var$coord)
The results object also contains the loadings (coordinates) of the individual observations on the components.
as_tibble(head(res.pca$ind$coord))
Let's visualize our observations and the variable loadings together in the space of the first 2 components.
fviz_pca_biplot(res.pca,
alpha.ind = "cos2",
col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
geom = "point",
ggtheme = theme_gray())
We can also briefly check whether our dimensionality reduction helps to differentiate between more and less popular places according to their `nomad_score`.
data %<>%
mutate(popular = ifelse(nomad_score > mean(nomad_score), TRUE, FALSE))
fviz_pca_biplot(res.pca,
alpha.ind = "cos2",
geom = "point",
habillage = factor(data$popular),
addEllipses = TRUE,
ggtheme = theme_gray())
Clustering can be broadly divided into two subgroups:
Clustering algorithms can also be categorized based on their cluster model, that is based on how they form clusters or groups. This tutorial only highlights some of the prominent clustering algorithms.
Connectivity-based clustering: the main idea behind this type of clustering is that data points closer to each other in the data space are more related (similar) than data points farther away. The clusters are formed by connecting data points according to their distance. At different distances, different clusters will form; this can be represented using a dendrogram, which explains why these methods are also commonly called hierarchical clustering. These methods do not produce a unique partitioning of the dataset, but rather a hierarchy from which the user still needs to choose appropriate clusters by choosing the level at which to cut. Note: they are also not very robust towards outliers, which might show up as additional clusters or even cause other clusters to merge.
Centroid-based clustering: in this type of clustering, clusters are represented by a central vector or centroid. This centroid might not necessarily be a member of the dataset. These are iterative clustering algorithms in which the notion of similarity is derived from how close a data point is to the centroid of the cluster. k-means is a centroid-based clustering algorithm, and you will see this topic in more detail later on in the tutorial.
Distribution-based clustering: this type of clustering is very closely related to statistics, namely distributional modeling. Clustering is based on the notion of how probable it is that a data point belongs to a certain distribution, such as the Gaussian distribution. Data points in a cluster are assumed to come from the same distribution. These models have a strong theoretical foundation; however, they often suffer from overfitting. Gaussian mixture models, fitted with the expectation-maximization algorithm, are a famous distribution-based clustering method.
Density-based methods: these search the data space for areas of varying density of data points. Clusters are defined as areas of higher density within the data space compared to other regions. Data points in the sparse areas are usually considered to be noise and/or border points. The drawback of these methods is that they expect some kind of density guide or parameters to detect cluster borders. DBSCAN and OPTICS are prominent density-based clustering algorithms.
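We will not use density-based clustering on the nomad data later on, but just to illustrate, a minimal sketch with the `dbscan` package could look like this (the package is not loaded elsewhere in this tutorial, and eps / minPts are rough guesses you would normally tune):
library(dbscan) # density-based clustering (DBSCAN)

db <- dbscan(scale(data[,vars]), eps = 2.5, minPts = 5) # eps and minPts are illustrative guesses
table(db$cluster) # cluster 0 collects the points labelled as noise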
So, which one is best to use? Hard to say. Clustering is a subjective task, and there can be more than one correct clustering algorithm. Every algorithm follows a different set of rules for defining the “similarity” among data points. The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one over another. An algorithm might work well on a particular dataset but fail for a different kind of dataset. Since most of the time there is no right or wrong, the clustering that delivers the most useful results is the way to go.
K-means clustering is the most commonly used unsupervised machine learning algorithm for dividing a given dataset into k clusters, where k must be provided by the user. The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available; however, the standard algorithm defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid. It is an iterative process consisting of the following steps:
1. Specify k, the number of clusters to be created.
2. Randomly select k objects from the dataset as the initial cluster centers.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.
4. For each of the k clusters, recompute the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Iteratively repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached (R uses 10 as the default value for the maximum number of iterations). A minimal from-scratch sketch of this loop follows below.

So, let's do that. As already mentioned, we have to choose our k upfront. However, there exists some guidance, for example the highest gain in “total within sum of squares” (fast to calculate), the “silhouette”, as well as the “gap statistic” (hard to calculate, takes time).
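As promised, here is a minimal from-scratch sketch of the k-means loop described in the steps above, on toy data (illustration only: base R, no handling of empty clusters; for the real analysis we use the built-in kmeans() below). Right after it, we return to choosing k.
set.seed(123)
X <- scale(matrix(rnorm(200 * 2), ncol = 2))      # toy data: 200 observations, 2 variables
k <- 3
centers <- X[sample(nrow(X), k), , drop = FALSE]  # step 2: k random observations as initial centers

for (i in 1:10) {                                 # step 5: iterate (10 = R's default maximum)
  # step 3: assign every observation to its closest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # step 4: recompute each centroid as the mean of its assigned observations
  new.centers <- centers
  for (j in 1:k) new.centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  if (identical(new.centers, centers)) break      # stop once the centroids no longer move
  centers <- new.centers
}
table(assignment)                                 # cluster sizes from the hand-rolled loop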
fviz_nbclust(scale(data[,vars]),
kmeans,
method = "wss")
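The “wss” (elbow) plot above is the fast option. As a cross-check (an extra step, not part of the original flow), the same function also supports the silhouette mentioned above:
fviz_nbclust(scale(data[,vars]),
             kmeans,
             method = "silhouette") # average silhouette width per candidate k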
Ok, we here settle for 3 (executive decision). Before we start, something weird upfront: the function takes the observation names from the rownames (which nobody uses anymore, and which are deprecated by `dplyr`). So, remember to define them right before you cluster, otherwise the next `dplyr` pipe will delete them again.
rownames(data) <- data %>% pull(place)
Let's run the algorithm.
km <- kmeans(scale(data[,vars]), centers = 3, nstart = 20)
glimpse(km)
## List of 9
## $ cluster : Named int [1:754] 1 1 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "names")= chr [1:754] "Budapest" "Chiang Mai" "Prague" "Austin" ...
## $ centers : num [1:3, 1:21] -0.366 0.697 -0.69 -0.209 0.414 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:21] "cost_nomad" "cost_coworking" "cost_expat" "coffee_in_cafe" ...
## $ totss : num 15813
## $ withinss : num [1:3] 2407 4560 2458
## $ tot.withinss: num 9425
## $ betweenss : num 6388
## $ size : int [1:3] 189 331 234
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
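As a small aside (not in the original flow): the components printed above already let us compute how much of the total variance this 3-cluster partition captures.
km$betweenss / km$totss # share of variance between clusters; roughly 0.40 given the values above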
Again, let's visualize it. To get a meaningful 2D visualization, we again project the observations onto the space of the first two components.
fviz_cluster(km, data = data[,vars],
ggtheme = theme_gray())
Ok, we got 3 clusters. Let's look at what's in them.
data %>%
bind_cols(cluster = km$cluster) %>%
select(vars.desc, cluster) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
summarise_all(mean)
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. There are three key questions that need to be answered first:
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:
The choice of linkage method depends entirely on you, and there is no hard-and-fast rule that will always give you good results. Different linkage methods lead to different clusters.
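To make that concrete, here is a small sketch using base hclust() and cutree() (which the hcut() call below wraps), comparing the cluster sizes we get from three common linkage methods on our scaled data (this comparison is an addition, not part of the original flow):
d <- dist(scale(data[,vars])) # Euclidean distances between cities
sapply(c("ward.D2", "complete", "average"),
       function(m) table(cutree(hclust(d, method = m), k = 3))) # cluster sizes per linkage method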
Some further practical issues:
However, let's get started and perform a clustering. We here use the `hcut` function, which includes most of the above-mentioned approaches as options.
hc <- hcut(data[,vars], hc_func = "hclust", k = 3, stand = TRUE)
In hierarchical clustering, you categorize the objects into a hierarchy similar to a tree-like diagram which is called a dendrogram. The distance of split or merge (called height) is shown on the y-axis of the dendrogram below.
fviz_dend(hc,
rect = TRUE,
cex = 0.5)
Notice how the dendrogram is built and how every data point finally merges into a single cluster, with the height (distance) shown on the y-axis.
Let’s inspect what’s in the clusters.
data %>%
bind_cols(cluster = hc$cluster) %>%
select(vars.desc, cluster) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
summarise_all(mean)
And again visualize them:
fviz_cluster(hc, data = data[,vars],
ggtheme = theme_gray())
Looks very similar, even though the middle cluster is a bit more squeezed in between now. We can also use our scatterplot diagnostics again and color the observations by their cluster assignment.
ggpairs(data[,vars.desc],
lower = list(continuous = "smooth"),
aes(colour = as.factor(hc$cluster), alpha = 0.4),
progress = FALSE,
ggtheme = theme_gray() )
You might already have wondered: “Could one combine a PCA with clustering techniques?” The answer is: “Yes!” In practice, that actually works very well and often delivers more robust clusters. So, let's give it a shot. We could do it by hand, but the `HCPC` function already does that for us, and it also offers a nice diagnostic visualization.
hcpc <- HCPC(res.pca,
nb.clust = -1, # let HCPC choose the number of clusters automatically (highest relative loss of inertia)
graph = FALSE)
plot(hcpc, choice = "3D.map")
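Before mapping, a quick look at how many cities land in each of the automatically chosen clusters, simply reading the clust column that HCPC attaches to the data:
table(hcpc$data.clust$clust) # number of cities per HCPC cluster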
To finish up, let's plot it on a map, in the simplest way possible.
library(ggmap)
mapWorld <- borders("world", colour = "gray50", fill = "gray50")
mp <- ggplot() + mapWorld
mp + geom_point(aes(x = data$longitude, y = data$latitude) , color = hcpc$data.clust$clust)
You might have heard about the k-means clustering algorithm before; if not, look back at the k-means part above. There are many fundamental differences between the two algorithms (k-means and hierarchical clustering), although either one can perform better than the other in different cases. Some of the differences are:
Perhaps the most important part of any unsupervised learning task is the analysis of the results. After you have performed the clustering, using whatever algorithm and set of parameters, you need to make sure that you did it right. But how do you determine that?
Well, there are many measures to do this; perhaps the most popular one is Dunn's index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthermost points. In order to have well-separated and compact clusters, you should aim for a higher Dunn's index.
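Computing it requires an extra package that is not used elsewhere in this tutorial; as an assumed example, the dunn() function from the clValid package applied to the k-means solution from above would look like this:
library(clValid) # provides dunn()

dunn(distance = dist(scale(data[,vars])), clusters = km$cluster) # higher = better separated, more compact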
Furthermore, graphical inspection often helps when comparing the results of different algorithms and parameters. Here you find some advanced diagnostic visualizations for hierarchical clustering.
Lastly, a clustering's quality is to a large extent determined by its usefulness.