class: center, middle, inverse, title-slide

# MULTIVARIATE STATISTICAL METHODS
## Lecture 6: Cluster analysis
### dr.sc. Luka Šikić
### Fakultet hrvatskih studija | Github MV
---

<style type="text/css">
@media print {
  .has-continuation {
    display: block !important;
  }
}
.remark-slide-content {
  font-size: 22px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 16px;
}
.mid .remark-code { /*Change made here*/
  font-size: 60% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 40% !important;
}
</style>

# Lecture overview

<br>
<br>
<br>

1. [Characteristics of cluster analysis](#kars)
2. [Distance between observations](#slic)
3. [Hierarchical clustering](#hier)
4. [K-means clustering](#kmen)

---
class: inverse, center, middle
name: kars

# CLUSTER ANALYSIS

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(In general)

---

# Resources

<br>
<br>
<br>

- *cluster analysis* is well documented, both in general and in R
<br>
- a comprehensive guide: [Practical Guide To Cluster Analysis in R](https://xsliulab.github.io/Workshop/week10/r-cluster-book.pdf)
<br>
- many tutorials are also available: [link 1](https://uc-r.github.io/kmeans_clustering) , [link 2](https://statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/), [link 3](https://towardsdatascience.com/how-to-use-and-visualize-k-means-clustering-in-r-19264374a53c), [link 4](https://www.datanovia.com/en/blog/cluster-analysis-in-r-practical-guide/)
<br>
- there is also a [CRAN Task View](https://cran.r-project.org/web/views/Cluster.html) on clustering
<br>
- an interactive tutorial on [DataCamp](https://www.datacamp.com/courses/cluster-analysis-in-r)
<br>
- a [worked example](https://zir.nsk.hr/islandora/object/pmf%3A9063/datastream/PDF/view) of cluster analysis on Croatian data

---

# What is cluster analysis?

<img src="../Foto/klast2.png" width="500px" style="display: block; margin: auto;" />

---

# What is cluster analysis?

<img src="../Foto/klast3.png" width="500px" style="display: block; margin: auto;" />
<br>

---

# What is cluster analysis?
<img src="../Foto/klast4.png" width="500px" style="display: block; margin: auto;" />
<br>

---

# What is cluster analysis?

<img src="../Foto/klast5.png" width="500px" style="display: block; margin: auto;" />
<br>

---

# Cluster analysis workflow

<br>
<br>
<br>
<br>
<img src="../Foto/klasterTok.png" width="650px" style="display: block; margin: auto;" />

---
class: inverse, center, middle
name: slic

# DISTANCE BETWEEN OBSERVATIONS

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(The basis for clustering)

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist1.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist2.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist3.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist4.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist5.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist6.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist7.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist8.png" width="500px" style="display: block; margin: auto;" />

---

# Distance vs. similarity

<br>
<br>
<br>
<img src="../Foto/klastDist9.png" width="500px" style="display: block; margin: auto;" />

---

# More than two observations

```r
# Create the data
triIgraca <- data.frame(X = c(0,9,-2), Y = c(0,12,19),
                        row.names = c("BLUE", "RED", "GREEN"))
print(triIgraca) # Show the data
```

```
##        X  Y
## BLUE   0  0
## RED    9 12
## GREEN -2 19
```

```r
dist(triIgraca) # Compute pairwise (Euclidean) distances
```

```
##           BLUE      RED
## RED   15.00000
## GREEN 19.10497 13.03840
```

---

# Distance between observations

```r
# Create the data (Visina = height, Tezina = weight)
visinaTezina <- data.frame(Visina = c(6,6,8), Tezina = c(200,202,200),
                           row.names = c(1,2,3))
# Show the data
print(visinaTezina)
```

```
##   Visina Tezina
## 1      6    200
## 2      6    202
## 3      8    200
```

---

# Distance between observations

<img src="../Foto/klastVisina.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<img src="../Foto/klastVisina1.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<img src="../Foto/klastVisina3.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<img src="../Foto/klastVisina4.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<img src="../Foto/klastVisina5.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<br>
<br>
<br>
<br>

$$
\text{height}_{\text{scaled}} = \frac{\text{height} - \text{mean}(\text{height})}{\text{sd}(\text{height})}
$$

---

# Distance between observations

<br>
<br>
<img src="../Foto/klastDist10.png" width="500px" style="display: block; margin: auto;" />

---

# Distance between observations

<br>
<br>
<img src="../Foto/klastDist11.png" width="500px" style="display: block; margin: auto;" />

---

# Scaling

<br>
<br>

.pull-left[

```r
# Show the data
print(visinaTezina)
```

```
##   Visina Tezina
## 1      6    200
## 2      6    202
## 3      8    200
```
]

.pull-right[

```r
# Scaled data (mean 0, standard deviation 1 per column)
scale(visinaTezina)
```

```
##       Visina     Tezina
## 1 -0.5773503 -0.5773503
## 2 -0.5773503  1.1547005
## 3  1.1547005 -0.5773503
## attr(,"scaled:center")
##     Visina     Tezina
##   6.666667 200.666667
## attr(,"scaled:scale")
##   Visina   Tezina
## 1.154701 1.154701
```
]

---

# Categorical data

```r
library(knitr) # Provides kable()
# Create the data (zadovoljstvo = satisfaction, sreća = happiness;
# Nisko/Srednje/Visoko = low/medium/high, Da/Ne = yes/no)
zadovoljstvo <- data.frame(
  zadovoljstvo = c("Nisko","Nisko","Visoko","Nisko","Srednje"),
  sreća = c("Ne","Ne","Da","Ne","Ne"))
# Show the data as a table
kable(zadovoljstvo)
```

|zadovoljstvo |sreća |
|:------------|:-----|
|Nisko        |Ne    |
|Nisko        |Ne    |
|Visoko       |Da    |
|Nisko        |Ne    |
|Srednje      |Ne    |

---

# Jaccard distance in R

```r
# Show the data
print(zadovoljstvo)
```

```
##   zadovoljstvo sreća
## 1        Nisko    Ne
## 2        Nisko    Ne
## 3       Visoko    Da
## 4        Nisko    Ne
## 5      Srednje    Ne
```

```r
# Distance on the raw categorical data: returns NA,
# because dist() needs numeric input
dist(zadovoljstvo, method = "binary")
```

```
##    1  2  3  4
## 2 NA
## 3 NA NA
## 4 NA NA NA
## 5 NA NA NA NA
```

---

# Jaccard distance in R

```r
library(dummies)
dummieZadovoljstvo <- dummy.data.frame(zadovoljstvo) # Convert to dummy variables
print(dummieZadovoljstvo)
```

```
##   zadovoljstvoNisko zadovoljstvoSrednje zadovoljstvoVisoko srećaDa srećaNe
## 1                 1                   0                  0       0       1
## 2                 1                   0                  0       0       1
## 3                 0                   0                  1       1       0
## 4                 1                   0                  0       0       1
## 5                 0                   1                  0       0       1
```

---

# Jaccard distance in R

```r
# Show the data
print(zadovoljstvo)
```

```
##   zadovoljstvo sreća
## 1        Nisko    Ne
## 2        Nisko    Ne
## 3       Visoko    Da
## 4        Nisko    Ne
## 5      Srednje    Ne
```

```r
# Distance between categories, computed on the dummy variables
dist(dummieZadovoljstvo, method = "binary")
```

```
##           1         2         3         4
## 2 0.0000000
## 3 1.0000000 1.0000000
## 4 0.0000000 0.0000000 1.0000000
## 5 0.6666667 0.6666667 1.0000000 0.6666667
```

---
class: inverse, center, middle
name: hier

# HIERARCHICAL CLUSTERING

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Grouping

<br>
<br>
<img src="../Foto/klastH1.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH2.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping
<br>
<br>
<img src="../Foto/klastH3.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH4.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH5.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH6.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH7.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH8.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH9.png" width="500px" style="display: block; margin: auto;" />

---

# Grouping

<br>
<br>
<img src="../Foto/klastH10.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ1.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ1.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ2.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ3.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ4.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ5.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ6.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ7.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ8.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ9.png" width="500px" style="display: block; margin: auto;" />

---

# Cluster appearance

<br>
<br>
<img src="../Foto/klastIZ10.png" width="500px" style="display: block; margin: auto;" />

---

# Clustering procedure

```r
# Create the data (igraci = players)
igraci <- data.frame(X = c(-1,-2,8,7,-12,-15), Y = c(1,-3,6,-8,8,0))
# Show the data
print(igraci)
```

```
##     X  Y
## 1  -1  1
## 2  -2 -3
## 3   8  6
## 4   7 -8
## 5 -12  8
## 6 -15  0
```

```r
# Compute the distance matrix
distIgraci <- dist(igraci, method = "euclidean")
# Run hierarchical clustering with complete linkage
hcIgraci <- hclust(distIgraci, method = 'complete')
```

---

# Clustering procedure

```r
klasteri <- cutree(hcIgraci, k = 2) # Cut the tree into k = 2 clusters
print(klasteri)                     # Show the cluster assignments
```

```
## [1] 1 1 1 1 2 2
```

```r
library(dplyr) # Provides mutate()
igraciKlasteri <- mutate(igraci, cluster = klasteri) # Add the cluster column
# Data + clusters
print(igraciKlasteri)
```

```
##     X  Y cluster
## 1  -1  1       1
## 2  -2 -3       1
## 3   8  6       1
## 4   7 -8       1
## 5 -12  8       2
## 6 -15  0       2
```

---

# Clustering procedure

```r
library(ggplot2)
# Plot the data; color by the `cluster` column of the data frame
# (title: "Players and clusters in the coordinate system")
ggplot(igraciKlasteri, aes(x = X, y = Y, color = factor(cluster))) +
  geom_point() +
  ggtitle("Prikaz igrača i klastera u koordinatnom sustavu")
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend1.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend2.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend3.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend4.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend5.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend6.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend7.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

<br>
<br>
<img src="../Foto/klastDend8.png" width="600px" style="display: block; margin: auto;" />

---

# Build a dendrogram

```r
plot(hcIgraci)
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-64-1.png" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<img src="../Foto/klastDend9.png" width="500px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<img src="../Foto/klastDend10.png" width="500px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

```r
library(dendextend)                            # Load the package
dendIgraci <- as.dendrogram(hcIgraci)          # Build the dendrogram
dendBoje <- color_branches(dendIgraci, h = 2)  # Color branches for a cut at height h = 2
plot(dendBoje)                                 # Plot the dendrogram
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-67-1.png" style="display: block; margin: auto;" />

---

# Number of clusters

```r
dendIgraci <- as.dendrogram(hcIgraci)          # Build the dendrogram
dendBoje <- color_branches(dendIgraci, k = 5)  # Color branches for k = 5 clusters
plot(dendBoje)                                 # Plot the dendrogram
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-68-1.png" style="display: block; margin: auto;" />

---

# Number of clusters

```r
clusterPripisano <- cutree(hcIgraci, h = 15) # "Cut" the tree at height 15
print(clusterPripisano)                      # Show the cluster assignments
```

```
## [1] 1 1 1 1 2 2
```

```r
igraciKlaster <- mutate(igraci, cluster = clusterPripisano) # Add the cluster column
print(igraciKlaster) # Data + clusters
```

```
##     X  Y cluster
## 1  -1  1       1
## 2  -2 -3       1
## 3   8  6       1
## 4   7 -8       1
## 5 -12  8       2
## 6 -15  0       2
```

---
class: inverse, center, middle
name: kmen

# K-MEANS CLUSTERING

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(The most common clustering method)

---

# Intuition

<br>
<br>
<img src="../Foto/klastK.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK1.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK2.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK3.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK4.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK5.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK6.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK7.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK8.png" width="500px" style="display: block; margin: auto;" />

---

# Intuition

<br>
<br>
<img src="../Foto/klastK9.png" width="500px" style="display: block; margin: auto;" />

---

# Run kmeans()

```r
# Create the data
lineup <- data.frame(X = c(-1,-2,8,7,-12,-15,-13,15,21,12,-25,26),
                     Y = c(1,-3,6,-8,8,0,-10,16,2,-15,1,0))
# Inspect the data
print(lineup)
```

```
##      X   Y
## 1   -1   1
## 2   -2  -3
## 3    8   6
## 4    7  -8
## 5  -12   8
## 6  -15   0
## 7  -13 -10
## 8   15  16
## 9   21   2
## 10  12 -15
## 11 -25   1
## 12  26   0
```

---

# Run kmeans()

```r
# Run k-means (starting centers are random, so labels can differ between runs)
model <- kmeans(lineup, centers = 2)
print(model$cluster) # Show the cluster assignments
```

```
##  [1] 1 1 2 2 1 1 1 2 2 2 1 2
```

```r
library(dplyr) # Provides mutate()
lineupKlasteri <- mutate(lineup, cluster = model$cluster) # Add the cluster column
print(lineupKlasteri) # Show data + clusters
```

```
##      X   Y cluster
## 1   -1   1       1
## 2   -2  -3       1
## 3    8   6       2
## 4    7  -8       2
## 5  -12   8       1
## 6  -15   0       1
## 7  -13 -10       1
## 8   15  16       2
## 9   21   2       2
## 10  12 -15       2
## 11 -25   1       1
## 12  26   0       2
```

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK10.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK11.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK12.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK13.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK14.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>
<br>
<img src="../Foto/klastK15.png" width="600px" style="display: block; margin: auto;" />

---

# Choose the number of clusters

<br>
<br>

```r
model <- kmeans(x = lineup, centers = 2) # Fit the model with two clusters
model$tot.withinss # Model quality: total within-cluster sum of squares
```

```
## [1] 1434.5
```

---

# Choose the number of clusters

```r
library(purrr)
# Total within-cluster sum of squares for k = 1..10
# (random starting centers, so values can vary between runs)
tot_withinss <- map_dbl(1:10, function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss})
elbow_df <- data.frame(k = 1:10, tot_withinss = tot_withinss)
print(elbow_df)
```

```
##     k tot_withinss
## 1   1    3489.9167
## 2   2    1434.5000
## 3   3     881.2500
## 4   4     622.5000
## 5   5     496.4167
## 6   6     387.5000
## 7   7     317.0000
## 8   8      96.5000
## 9   9     145.1667
## 10 10      51.5000
```

---

# Choose the number of clusters

```r
library(ggplot2)
# Elbow plot: look for the k after which the curve flattens
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-91-1.png" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>
<br>
<br>
<br>
<img src="../Foto/klastS1.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>
<br>
<br>
<br>
<img src="../Foto/klastS2.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>

###### Within-cluster distance

<br>
<img src="../Foto/klastS3.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>
<br>
<br>
<br>
<img src="../Foto/klastS4.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>

###### Distance to the neighboring cluster

<br>
<img src="../Foto/klastS5.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>

###### Distance to the neighboring cluster

<br>
<img src="../Foto/klastS6.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>

###### Distance to the neighboring cluster

<br>
<img src="../Foto/klastS7.png" width="500px" style="display: block; margin: auto;" />

---

# Silhouette analysis

```r
library(cluster)
pam_k3 <- pam(lineup, k = 3) # Partitioning around medoids with 3 clusters
pam_k3$silinfo$widths        # Silhouette width of each observation
```

```
##    cluster neighbor    sil_width
## 4        1        2  0.465320054
## 2        1        3  0.321729341
## 10       1        2  0.311385893
## 1        1        3  0.271890169
## 9        2        1  0.443606497
## 8        2        1  0.398547473
## 12       2        1  0.393982685
## 3        2        1 -0.009151755
## 11       3        1  0.546797052
## 6        3        1  0.529967901
## 5        3        1  0.359014657
## 7        3        1  0.207878188
```

---

# Silhouette analysis

```r
sil_plot <- silhouette(pam_k3)
plot(sil_plot)
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-100-1.png" style="display: block; margin: auto;" />

---

# Silhouette analysis

<br>
<br>
<br>

```r
pam_k3$silinfo$avg.width # Average silhouette width
```

```
## [1] 0.353414
```

```r
#  1: observation is well matched to its cluster
#  0: observation lies on the border between two clusters
# -1: observation is poorly matched (likely in the wrong cluster)
```

---

# Silhouette analysis

```r
# Average silhouette width for k = 2..10 (uses purrr::map_dbl)
sil_width <- map_dbl(2:10, function(k){
  model <- pam(x = lineup, k = k)
  model$silinfo$avg.width})
sil_df <- data.frame(k = 2:10, sil_width = sil_width)
print(sil_df)
```

```
##    k sil_width
## 1  2 0.4164141
## 2  3 0.3534140
## 3  4 0.3535534
## 4  5 0.3724115
## 5  6 0.3436130
## 6  7 0.3236397
## 7  8 0.3275222
## 8  9 0.2547311
## 9 10 0.2099424
```

---

# Silhouette analysis

```r
# Visualization: the k with the highest average width is preferred
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
```

<img src="06_KLASTER_files/figure-html/unnamed-chunk-103-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Thank you for your attention

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(Next: Multiple linear regression)
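
---

# Appendix: a more stable elbow curve

The elbow table in the k-means section is not monotone (k = 8 shows a lower `tot.withinss` than k = 9). A likely cause is that `kmeans()` starts from random centers and can stop in a local optimum. The sketch below is one possible way to reduce that effect, not the lecture's own code: the seed `42` and `nstart = 25` are assumptions chosen for illustration, and base R `vapply()` stands in for `purrr::map_dbl()` so the block needs no extra packages.

```r
# Same data as the `lineup` example in the k-means section
lineup <- data.frame(
  X = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  Y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

set.seed(42)  # assumption: any fixed seed makes the run reproducible

# Total within-cluster sum of squares for k = 1..10,
# keeping the best of 25 random starts for each k
tot_withinss <- vapply(1:10, function(k) {
  kmeans(lineup, centers = k, nstart = 25)$tot.withinss
}, numeric(1))

elbow_df <- data.frame(k = 1:10, tot_withinss = tot_withinss)
print(elbow_df)
```

`nstart` is an argument of `stats::kmeans()`: larger values cost more time but lower the chance of reporting a poor local optimum, so the resulting curve is usually smoother and easier to read for an elbow.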