### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For additional piping operators
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives
As with any concept, machine learning may have a slightly different definition, depending on whom you ask. Here is a little compilation of definitions by academics and practitioners alike:
Sometimes the lines are blurry, given the vast and constantly expanding number of algorithms and techniques.
Unsupervised machine learning covers tasks related to pattern recognition and data exploration, where no right answer or problem structure exists yet. Its main applications are dimensionality reduction and clustering, which we cover below.
Dimensionality reduction techniques are foremost useful to (you might see it coming) reduce the dimensionality of our data. So, what does that mean? And why should we want to do that?
Dimensions here are a synonym for variables, so what we really want is fewer variables. To do that, we have to find ways to express the same amount of information with fewer, but more information-rich, variables. This is particularly useful to:
The type of analysis to be performed depends on the data set formats and structures. The most commonly used DR techniques are:
The mathematics underlying it are somewhat complex, so I won’t go into too much detail, but the basics of PCA are as follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller number of “Principal Components”.
But what are these exactly? Principal Components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This means that we try to find the straight line that best spreads the data out when it is projected along it. This is the first principal component, the straight line that shows the most substantial variance in the data.
Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each principal component sums up a certain percentage of the total variation in the dataset. Where your initial variables are strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few principal components. Usually, the first principal component captures the main similarity in your data, the second the main difference.
These principal components can be computed via eigenvalues and eigenvectors. Just like many things in life, eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue. Simply put, an eigenvector is a direction, such as “vertical” or “45 degrees”, while an eigenvalue is a number telling you how much variance there is in the data in that direction. The eigenvector with the highest eigenvalue is, therefore, the first principal component. The number of eigenvalues and eigenvectors that exist is equal to the number of dimensions the data set has. Consequently, we can reframe a dataset in terms of these eigenvectors and eigenvalues without changing the underlying information.
Note that reframing a dataset in terms of a set of eigenvalues and eigenvectors does not entail changing the data itself; you are just looking at it from a different angle, which should represent the data better.
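To make this a bit more concrete, here is a small illustrative sketch on toy data (base R only, not part of the nomad analysis that follows): the eigen-decomposition of the correlation matrix yields the same directions that prcomp() reports as principal components.
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)     # toy data: 100 observations, 3 variables
X[, 2] <- X[, 1] + rnorm(100, sd = 0.3)   # make two variables strongly correlated

eig <- eigen(cor(X))  # eigen-decomposition of the correlation matrix
eig$values            # eigenvalues: variance captured along each direction
eig$vectors           # eigenvectors: the directions themselves

prcomp(X, scale. = TRUE)$rotation   # the same directions (up to sign) as principal components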
Alright, let's load some data. Here we will draw from some of our own work, where we explore the life of digital nomads. The paper is not written yet, but the preliminary work is summarized in this presentation. You probably already know the data from NomadList. Here we look at the 2017 crawl of city data, which compiles the digital nomads' ranking of cities along a couple of dimensions. Let's take a look.
Roman's web-crawl data is always a bit messy, so we do some little cosmetics upfront.
data <- fread("data/nomad_cities.csv",
sep="\t",
dec=".",
na.strings=c("NA", "DotMap(__next__=DotMap())", "DotMap()") )
data %<>%
select(-V1, -nomadScore) %>%
select(place, nomad_score, longitude, latitude, everything()) %>%
mutate(racism = if_else(racism > 1, 1, racism))
Let's take a look:
head(data)
glimpse(data)
## Observations: 781
## Variables: 25
## $ place <chr> "Budapest", "Chiang Mai", "Prague", "Ta...
## $ nomad_score <dbl> 1.00, 0.95, 0.94, 0.94, 0.94, 0.92, 0.9...
## $ longitude <dbl> 19.040, 98.993, 14.438, 121.560, -97.74...
## $ latitude <dbl> 47.498, 18.788, 50.076, 25.091, 30.267,...
## $ coffee_in_cafe <dbl> 1.73, 0.85, 1.99, 1.88, 5.00, 4.00, 5.3...
## $ cost_beer <dbl> 1.73, 0.85, 1.99, 1.88, 5.00, 4.00, 5.3...
## $ cost_coworking <dbl> 152.41, 98.88, 159.13, 47.01, 200.00, 2...
## $ cost_expat <int> 1273, 780, 1653, 1640, 3309, 4325, 2197...
## $ cost_nomad <int> 1364, 777, 1639, 1545, 3028, 3238, 2554...
## $ female_friendly <dbl> 1.00, 0.80, 1.00, 1.00, 0.80, 0.80, 0.8...
## $ fragile_states_index <dbl> 52.7, 78.8, 40.8, NA, 34.0, 34.0, 39.8,...
## $ free_wifi_available <dbl> 0.40, 0.60, 0.60, 1.00, 0.60, 1.00, 0.6...
## $ freedom_score <dbl> 0.6, 0.2, 0.8, 0.6, 0.6, 0.6, 0.8, 0.6,...
## $ friendly_to_foreigners <dbl> 0.60, 0.60, 0.80, 0.80, 0.80, 0.80, 0.8...
## $ internet_speed <int> 31, 14, 15, 16, 118, 81, 18, 23, 55, 24...
## $ leisure <dbl> 0.80, 0.62, 1.00, 1.00, 1.00, 1.00, 0.6...
## $ lgbt_friendly <dbl> 0.27, 0.60, 0.60, 0.80, 0.60, 1.00, 1.0...
## $ life_score <dbl> 0.86, 0.75, 0.83, 0.93, 0.95, 1.00, 0.8...
## $ nightlife <dbl> 1.00, 0.40, 1.00, 0.60, 1.00, 1.00, 0.8...
## $ peace_score <dbl> 0.8, 0.4, 0.8, NA, 0.8, 0.8, 0.8, 0.8, ...
## $ places_to_work <dbl> 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,...
## $ press_freedom_index <dbl> 28.17, 44.53, 16.66, 24.37, 22.49, 22.4...
## $ racism <dbl> 0.40, 0.40, 0.42, 0.00, 0.80, 0.80, 0.6...
## $ safety <dbl> 0.60, 0.80, 0.80, 1.00, 0.73, 0.73, 0.8...
## $ weed <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, ...
Quite a set of interesting features, all numerically coded. Let's select the ones we want to analyze and organize them a bit. Since there are a lot of variables, I afterwards also define a smaller subset on which we do some graphical exploration.
# Variables for analysis
vars <- c("cost_nomad", "cost_coworking", "cost_expat", "coffee_in_cafe", "cost_beer", # costs
"places_to_work", "free_wifi_available", "internet_speed", # work
"freedom_score", "peace_score", "safety", "fragile_states_index", "press_freedom_index", # safety & freedom
"female_friendly", "lgbt_friendly", "friendly_to_foreigners", "racism", # friendly
"leisure","life_score","nightlife","weed" # fun
)
# Variables for descriptives
vars.desc <- c("nomad_score", "cost_nomad", "places_to_work", "freedom_score", "friendly_to_foreigners", "life_score")
Ok, time for some exploration. Here I will introduce the `GGally` package, a wrapper around `ggplot2` which has some functions for very nice visual summaries in matrix form.
First, let's look at a classical correlation matrix.
library(GGally) # provides ggcorr() and ggpairs()

ggcorr(data[vars], label = TRUE, label_size = 3, label_round = 2, label_alpha = TRUE)
Even cooler, the `ggpairs` function creates a scatterplot matrix plus all variable distributions and pairwise correlations. Before, I used the `PerformanceAnalytics` package for that, but I like the ggplot style more.
ggpairs(data[,vars.desc],
aes(alpha = 0.3),
ggtheme = theme_gray())
To execute the PCA, we will here use the `FactoMineR` package to compute the PCA, and `factoextra` for extracting and visualizing the results. `FactoMineR` is a great package, and my favorite, for computing principal component methods in R. It is very easy to use and very well documented. There are other alternatives around, but for quite some time I have found it to be the most powerful and convenient one. `factoextra` is just a convenient `ggplot2` wrapper that easily produces nice and informative diagnostic plots for a variety of DR and clustering techniques.
We realized already that we have some missing values in some observations. Normally, we could now do some missing-value imputation (there are good R packages such as `mice` available), but for this tutorial we skip that step and do the lazy thing: just delete all observations with missing values.
data %<>%
drop_na()
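For reference only: had we wanted to impute instead of dropping rows, a rough sketch with the `mice` package could look like the following (not run here; the number of imputations, method, and seed are arbitrary choices, not taken from this tutorial).
library(mice) # multivariate imputation by chained equations

imp <- mice(data[, vars], m = 5, method = "pmm", seed = 123, printFlag = FALSE) # illustrative settings
data.imputed <- complete(imp, 1) # take the first of the five imputed datasets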
Now let's run the PCA. Notice the `scale.unit = TRUE` argument, which you should ALWAYS use: it standardizes the variables to unit variance, so that differently scaled variables do not dominate the components. Afterwards, we take a look at the resulting list object.
library(FactoMineR) # provides PCA()
library(factoextra) # provides the fviz_*() helpers used below

res.pca <- PCA(data[,vars], scale.unit = TRUE, graph = FALSE)
glimpse(res.pca)
## List of 5
## $ eig : num [1:21, 1:3] 9.19 1.907 1.319 1.124 0.935 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:21] "comp 1" "comp 2" "comp 3" "comp 4" ...
## .. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
## $ var :List of 4
## ..$ coord : num [1:21, 1:5] 0.668 0.419 0.593 0.76 0.76 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cor : num [1:21, 1:5] 0.668 0.419 0.593 0.76 0.76 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:21, 1:5] 0.446 0.175 0.352 0.578 0.578 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:21, 1:5] 4.86 1.91 3.83 6.29 6.29 ...
## .. ..- attr(*, "dimnames")=List of 2
## $ ind :List of 4
## ..$ coord : num [1:754, 1:5] 0.433 -2.089 2.242 4.248 4.727 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:754, 1:5] 0.0108 0.3115 0.1967 0.3479 0.5237 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:754, 1:5] 0.0027 0.063 0.0725 0.2605 0.3224 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ dist : Named num [1:754] 4.17 3.74 5.05 7.2 6.53 ...
## .. ..- attr(*, "names")= chr [1:754] "1" "2" "3" "5" ...
## $ svd :List of 3
## ..$ vs: num [1:21] 3.032 1.381 1.148 1.06 0.967 ...
## ..$ U : num [1:754, 1:5] 0.143 -0.689 0.739 1.401 1.559 ...
## ..$ V : num [1:21, 1:5] 0.22 0.138 0.196 0.251 0.251 ...
## $ call:List of 9
## ..$ row.w : num [1:754] 0.00133 0.00133 0.00133 0.00133 0.00133 ...
## ..$ col.w : num [1:21] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ scale.unit: logi TRUE
## ..$ ncp : num 5
## ..$ centre : num [1:21] 2304.65 209.08 1874.17 3.29 3.29 ...
## ..$ ecart.type: num [1:21] 1081.09 173.8 1255.83 1.92 1.92 ...
## ..$ X :'data.frame': 754 obs. of 21 variables:
## .. ..$ cost_nomad : int [1:754] 1364 777 1639 3028 3238 2554 3503 3427 2245 2956 ...
## .. ..$ cost_coworking : num [1:754] 152.4 98.9 159.1 200 250 ...
## .. ..$ cost_expat : int [1:754] 1273 780 1653 3309 4325 2197 2691 3764 1859 2760 ...
## .. ..$ coffee_in_cafe : num [1:754] 1.73 0.85 1.99 5 4 5.38 5 5 4.03 3.5 ...
## .. ..$ cost_beer : num [1:754] 1.73 0.85 1.99 5 4 5.38 5 5 4.03 3.5 ...
## .. ..$ places_to_work : num [1:754] 1 0.8 1 1 1 1 1 1 0.8 1 ...
## .. ..$ free_wifi_available : num [1:754] 0.4 0.6 0.6 0.6 1 0.6 0.4 1 0.24 0.6 ...
## .. ..$ internet_speed : int [1:754] 31 14 15 118 81 18 23 55 24 99 ...
## .. ..$ freedom_score : num [1:754] 0.6 0.2 0.8 0.6 0.6 0.8 0.6 0.6 0.8 0.6 ...
## .. ..$ peace_score : num [1:754] 0.8 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
## .. ..$ safety : num [1:754] 0.6 0.8 0.8 0.73 0.73 0.8 0.8 0.6 0.8 0.4 ...
## .. ..$ fragile_states_index : num [1:754] 52.7 78.8 40.8 34 34 39.8 34 34 39.8 34 ...
## .. ..$ press_freedom_index : num [1:754] 28.2 44.5 16.7 22.5 22.5 ...
## .. ..$ female_friendly : num [1:754] 1 0.8 1 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
## .. ..$ lgbt_friendly : num [1:754] 0.27 0.6 0.6 0.6 1 1 0.8 0.8 1 1 ...
## .. ..$ friendly_to_foreigners: num [1:754] 0.6 0.6 0.8 0.8 0.8 0.8 1 1 0.8 0.8 ...
## .. ..$ racism : num [1:754] 0.4 0.4 0.42 0.8 0.8 0.6 0.8 0.8 1 0.8 ...
## .. ..$ leisure : num [1:754] 0.8 0.62 1 1 1 0.6 1 0.6 0.78 0.8 ...
## .. ..$ life_score : num [1:754] 0.86 0.75 0.83 0.95 1 0.88 0.95 0.92 0.85 0.87 ...
## .. ..$ nightlife : num [1:754] 1 0.4 1 1 1 0.8 1 1 0.8 0.6 ...
## .. ..$ weed : int [1:754] 0 0 1 0 0 1 1 1 1 0 ...
## ..$ row.w.init: num [1:754] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ call : language PCA(X = data[, vars], scale.unit = TRUE, graph = FALSE)
## - attr(*, "class")= chr [1:2] "PCA" "list "
Ok, let's look at the “screeplot”, a diagnostic visualization that displays the variance explained by every component. Here we use the `factoextra` package, as for all following visualizations with the `fviz_` prefix. Notice that the output in every case is a `ggplot2` object, which could be complemented with further layers.
fviz_screeplot(res.pca,
addlabels = TRUE,
ncp = 10,
ggtheme = theme_gray())
As expected, we see that the first component already captures a large share of the variance. Let's look at the corresponding eigenvalues.
as_tibble(res.pca$eig)
For feature selection, our rule of thumb is to only include components with an eigenvalue > 1, meaning that in this case we would reduce our data to 4 dimensions (the short check below confirms this). Afterwards, let's project the variables onto 2-dimensional space and take a look at their loading vectors.
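A minimal check of that rule of thumb, only reading the `res.pca` object created above:
sum(res.pca$eig[, "eigenvalue"] > 1) # how many components have an eigenvalue above 1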
fviz_pca_var(res.pca,
alpha.var = "cos2",
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
ggtheme = theme_gray())
We see that they tend to cluster in 3 groups:
Let's look at the numeric values.
get_pca_var(res.pca)
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
as_tibble(res.pca$var$coord)
The results object also contains the loadings (coordinates) of the individual observations on the components.
as_tibble(head(res.pca$ind$coord))
Let's visualize our observations and the variable loadings together in the space of the first 2 components.
fviz_pca_biplot(res.pca,
alpha.ind = "cos2",
col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
geom = "point",
ggtheme = theme_gray())
We can also briefly check whether our dimensionality reduction helps to differentiate between more and less popular places according to their `nomad_score`.
data %<>%
mutate(popular = ifelse(nomad_score > mean(nomad_score), TRUE, FALSE))
fviz_pca_biplot(res.pca,
alpha.ind = "cos2",
geom = "point",
habillage = factor(data$popular),
addEllipses = TRUE,
ggtheme = theme_gray())
Clustering can be broadly divided into two subgroups:
Clustering algorithms can also be categorized based on their cluster model, that is based on how they form clusters or groups. This tutorial only highlights some of the prominent clustering algorithms.
Connectivity-based clustering: the main idea behind this type of clustering is that data points closer to each other in the data space are more related (similar) than data points farther away. The clusters are formed by connecting data points according to their distance. At different distances, different clusters will form; this can be represented using a dendrogram, which explains why these methods are also commonly called hierarchical clustering. These methods do not produce a unique partitioning of the dataset, but rather a hierarchy from which the user still needs to choose appropriate clusters by choosing the level at which to cut. Note: they are also not very robust towards outliers, which might show up as additional clusters or even cause other clusters to merge.
Centroid-based clustering: in this type of clustering, clusters are represented by a central vector or centroid. This centroid might not necessarily be a member of the dataset. These are iterative clustering algorithms in which the notion of similarity is derived from how close a data point is to the centroid of the cluster. k-means is a centroid-based clustering algorithm, and you will see this topic in more detail later on in the tutorial.
Distribution-based clustering: this type of clustering is very closely related to statistics, namely distributional modeling. Clustering is based on the notion of how probable it is that a data point belongs to a certain distribution, such as the Gaussian distribution. Data points in a cluster are assumed to come from the same distribution. These models have a strong theoretical foundation; however, they often suffer from overfitting. Gaussian mixture models, fitted with the expectation-maximization algorithm, are a famous distribution-based clustering method.
Density-based methods: these search the data space for areas of varying density of data points. Clusters are defined as areas of higher density within the data space compared to other regions. Data points in the sparse areas are usually considered to be noise and/or border points. The drawback of these methods is that they expect some kind of density guide or parameters to detect cluster borders. DBSCAN and OPTICS are prominent density-based clustering algorithms.
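We will not use density-based clustering on the nomad data later on, but just to illustrate, a minimal sketch with the `dbscan` package could look like this (the package is not loaded elsewhere in this tutorial, and eps / minPts are rough guesses you would normally tune):
library(dbscan) # density-based clustering (DBSCAN)

db <- dbscan(scale(data[,vars]), eps = 2.5, minPts = 5) # eps and minPts are illustrative guesses
table(db$cluster) # cluster 0 collects the points labelled as noise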
So, which one is best to use? Hard to say. Clustering is a subjective task, and there can be more than one correct clustering algorithm. Every algorithm follows a different set of rules for defining the “similarity” among data points. The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one over another. An algorithm might work well on a particular dataset but fail for a different kind of dataset. Since most of the time there is no right or wrong, the clustering that delivers the most useful results is the way to go.
K-means clustering is the most commonly used unsupervised machine learning algorithm for dividing a given dataset into k clusters, where k must be provided by the user. The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available; however, the standard algorithm defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid. It is an iterative process consisting of the following steps:
1. Specify k, the number of clusters to be created.
2. Randomly select k objects from the dataset as the initial cluster centers.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid.
4. For each of the k clusters, recompute the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Iteratively repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached (R uses 10 as the default value for the maximum number of iterations). A minimal from-scratch sketch of this loop follows below.

So, let's do that. As already mentioned, we have to choose our k upfront. However, there exists some guidance, for example the highest gain in “total within sum of squares” (fast to calculate), the “silhouette”, as well as the “gap statistic” (hard to calculate, takes time).
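As promised, here is a minimal from-scratch sketch of the k-means loop described in the steps above, on toy data (illustration only: base R, no handling of empty clusters; for the real analysis we use the built-in kmeans() below). Right after it, we return to choosing k.
set.seed(123)
X <- scale(matrix(rnorm(200 * 2), ncol = 2))      # toy data: 200 observations, 2 variables
k <- 3
centers <- X[sample(nrow(X), k), , drop = FALSE]  # step 2: k random observations as initial centers

for (i in 1:10) {                                 # step 5: iterate (10 = R's default maximum)
  # step 3: assign every observation to its closest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # step 4: recompute each centroid as the mean of its assigned observations
  new.centers <- centers
  for (j in 1:k) new.centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  if (identical(new.centers, centers)) break      # stop once the centroids no longer move
  centers <- new.centers
}
table(assignment)                                 # cluster sizes from the hand-rolled loop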
fviz_nbclust(scale(data[,vars]),
kmeans,
method = "wss")
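The “wss” (elbow) plot above is the fast option. As a cross-check (an extra step, not part of the original flow), the same function also supports the silhouette mentioned above:
fviz_nbclust(scale(data[,vars]),
             kmeans,
             method = "silhouette") # average silhouette width per candidate k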
Ok, we here settle for 3 (executive decision). Before we start, something weird upfront: the function takes the observation names from the rownames (which nobody uses anymore, and which are deprecated by `dplyr`). So, remember to define them right before you cluster, otherwise the next `dplyr` pipe will delete them again.
rownames(data) <- data %>% pull(place)
Let's run the algorithm.
km <- kmeans(scale(data[,vars]), centers = 3, nstart = 20)
glimpse(km)
## List of 9
## $ cluster : Named int [1:754] 1 1 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "names")= chr [1:754] "Budapest" "Chiang Mai" "Prague" "Austin" ...
## $ centers : num [1:3, 1:21] -0.366 0.697 -0.69 -0.209 0.414 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:21] "cost_nomad" "cost_coworking" "cost_expat" "coffee_in_cafe" ...
## $ totss : num 15813
## $ withinss : num [1:3] 2407 4560 2458
## $ tot.withinss: num 9425
## $ betweenss : num 6388
## $ size : int [1:3] 189 331 234
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
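As a small aside (not in the original flow): the components printed above already let us compute how much of the total variance this 3-cluster partition captures.
km$betweenss / km$totss # share of variance between clusters; roughly 0.40 given the values above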
Again, let's visualize it. To get a meaningful 2D visualization, we again project the observations onto the space of the first two components.
fviz_cluster(km, data = data[,vars],
ggtheme = theme_gray())
Ok, we got 3 clusters. Let's look at what's in them.
data %>%
bind_cols(cluster = km$cluster) %>%
select(vars.desc, cluster) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
summarise_all(mean)
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. There are three key questions that need to be answered first:
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:
The choice of linkage method depends entirely on you, and there is no hard-and-fast rule that will always give you good results. Different linkage methods lead to different clusters.
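To make that concrete, here is a small sketch using base hclust() and cutree() (which the hcut() call below wraps), comparing the cluster sizes we get from three common linkage methods on our scaled data (this comparison is an addition, not part of the original flow):
d <- dist(scale(data[,vars])) # Euclidean distances between cities
sapply(c("ward.D2", "complete", "average"),
       function(m) table(cutree(hclust(d, method = m), k = 3))) # cluster sizes per linkage method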
Some further practical issues:
However, let's get started and perform a clustering. We here use the `hcut` function, which includes most of the above-mentioned approaches as options.
hc <- hcut(data[,vars], hc_func = "hclust", k = 3, stand = TRUE)
In hierarchical clustering, you categorize the objects into a hierarchy similar to a tree-like diagram which is called a dendrogram. The distance of split or merge (called height) is shown on the y-axis of the dendrogram below.
fviz_dend(hc,
rect = TRUE,
cex = 0.5)
Notice how the dendrogram is built and how every data point finally merges into a single cluster, with the height (distance) shown on the y-axis.
Let’s inspect what’s in the clusters.
data %>%
bind_cols(cluster = hc$cluster) %>%
select(vars.desc, cluster) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
summarise_all(mean)
And again visualize them:
fviz_cluster(hc, data = data[,vars],
ggtheme = theme_gray())
Looks very similar, even though the middle cluster is a bit more squeezed in between now. We can also use our scatterplot diagnostics again and color the observations by their cluster assignment.
ggpairs(data[,vars.desc],
lower = list(continuous = "smooth"),
aes(colour = as.factor(hc$cluster), alpha = 0.4),
progress = FALSE,
ggtheme = theme_gray() )
You might already have wondered: “Could one combine a PCA with clustering techniques?” The answer is: “Yes!” In practice, that actually works very well and often delivers more robust clusters. So, let's give it a shot. We could do it by hand, but the `HCPC` function already does that for us, and it also offers a nice diagnostic visualization.
hcpc <- HCPC(res.pca,
nb.clust = -1, # let HCPC choose the number of clusters automatically (highest relative loss of inertia)
graph = FALSE)
plot(hcpc, choice = "3D.map")
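Before mapping, a quick look at how many cities land in each of the automatically chosen clusters, simply reading the clust column that HCPC attaches to the data:
table(hcpc$data.clust$clust) # number of cities per HCPC cluster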
To finish up, let's plot it on a map, in the simplest way possible.
library(ggmap)
mapWorld <- borders("world", colour = "gray50", fill = "gray50")
mp <- ggplot() + mapWorld
mp + geom_point(aes(x = data$longitude, y = data$latitude) , color = hcpc$data.clust$clust)
You might have heard about the k-means clustering algorithm before; if not, look back at the k-means part above. There are many fundamental differences between the two algorithms (k-means and hierarchical clustering), although either one can perform better than the other in different cases. Some of the differences are:
Perhaps the most important part of any unsupervised learning task is the analysis of the results. After you have performed the clustering, using whatever algorithm and set of parameters, you need to make sure that you did it right. But how do you determine that?
Well, there are many measures to do this; perhaps the most popular one is Dunn's index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthermost points. In order to have well-separated and compact clusters, you should aim for a higher Dunn's index.
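Computing it requires an extra package that is not used elsewhere in this tutorial; as an assumed example, the dunn() function from the clValid package applied to the k-means solution from above would look like this:
library(clValid) # provides dunn()

dunn(distance = dist(scale(data[,vars])), clusters = km$cluster) # higher = better separated, more compact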
Furthermore, graphical inspection often helps when comparing the results of different algorithms and parameters. Here you find some advanced diagnostic visualizations for hierarchical clustering.
Lastly, a clustering's quality is to a large extent determined by its usefulness.