class: center, middle, inverse, title-slide

# MULTIVARIATE STATISTICAL METHODS
## Lecture 8: Content Analysis
### dr.sc. Luka Šikić
### Fakultet hrvatskih studija | Github MV
---

<style type="text/css">
@media print {
  .has-continuation {
    display: block !important;
  }
}
remark-slide-content {
  font-size: 22px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 16px;
}
.mid .remark-code { /*Change made here*/
  font-size: 60% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 40% !important;
}
/* custom.css */
.left-code {
  color: #777;
  width: 38%;
  height: 92%;
  float: left;
}
.right-plot {
  width: 60%;
  float: right;
  padding-left: 1%;
}
.plot-callout {
  height: 225px;
  width: 450px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}
.plot-callout img {
  width: 100%;
  border: 4px solid #23373B;
}
</style>

# Lecture overview

<br>
<br>
<br>

1. [Manipulating text data](#mani)

2. [Visualizing text](#viz)

3. [Sentiment analysis](#sent)

4. [Topic modeling](#topi)

---
class: inverse, center, middle
name: mani

# MANIPULATING TEXT DATA

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(*Tooling* for text analysis)

---

# Software support (*packages*)

<br>
<br>
<br>

<img src="../Foto/txt1.png" width="500px" style="display: block; margin: auto;" />

---

# Data

```r
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2, forcats
recenzije_Dta <- read_csv("../Podatci/Roomba Reviews.csv")
head(recenzije_Dta, 10)
```

```
## # A tibble: 10 x 5
##    Date    Product        Stars Title              Review
##    <chr>   <chr>          <dbl> <chr>              <chr>
##  1 2/28/15 iRobot Roomba~     5 Five Stars         "You would not believe how w~
##  2 1/12/15 iRobot Roomba~     4 Four Stars         "You just walk away and it d~
##  3 12/26/~ iRobot Roomba~     5 Awesome love it.   "You have to Roomba proof yo~
##  4 8/4/13  iRobot Roomba~     3 Love-hate this va~ "Yes, it's a fascinating, al~
##  5 12/22/~ iRobot Roomba~     5 This vacuum is fa~ "Years ago I bought one of t~
##  6 12/27/~ iRobot Roomba~     5 Wow!               "Wow.Wow. I never knew my f~
##  7 8/17/15 iRobot Roomba~     1 Terrible Product ~ "Wow..
I don't know what to ~
##  8 12/28/~ iRobot Roomba~     5 Super-impressed b~ "Wow, wow, WOW! I wanted to~
##  9 1/19/14 iRobot Roomba~     5 LOVE THIS          "Wow, the Roomba is the best~
## 10 7/2/15  iRobot Roomba~     5 Stress is bad; Ro~ "Wow, it changes your life. ~
```

---

# Data overview

```r
# filter() + summarize()
recenzije_Dta %>%
  filter(Product == "iRobot Roomba 650 for Pets") %>%
  summarize(stars_mean = mean(Stars))
```

```
## # A tibble: 1 x 1
##   stars_mean
##        <dbl>
## 1       4.49
```

```r
# group_by() + summarize()
recenzije_Dta %>%
  group_by(Product) %>%
  summarize(stars_mean = mean(Stars))
```

```
## # A tibble: 2 x 2
##   Product                                  stars_mean
## * <chr>                                         <dbl>
## 1 iRobot Roomba 650 for Pets                     4.49
## 2 iRobot Roomba 880 for Pets and Allergies       4.42
```

---

# Data overview

<br>
<br>

```r
# unstructured data: mean() of a text column is undefined, so the result is NA
recenzije_Dta %>%
  group_by(Product) %>%
  summarize(review_mean = mean(Review))
```

```
## # A tibble: 2 x 2
##   Product                                  review_mean
## * <chr>                                          <dbl>
## 1 iRobot Roomba 650 for Pets                        NA
## 2 iRobot Roomba 880 for Pets and Allergies          NA
```

---

# Data overview

```r
glimpse(recenzije_Dta)
```

```
## Rows: 1,833
## Columns: 5
## $ Date    <chr> "2/28/15", "1/12/15", "12/26/13", "8/4/13", "12/22/15", "12/27~
## $ Product <chr> "iRobot Roomba 650 for Pets", "iRobot Roomba 650 for Pets", "i~
## $ Stars   <dbl> 5, 4, 5, 3, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 3, 5, 5, 4, 4, 2, 5,~
## $ Title   <chr> "Five Stars", "Four Stars", "Awesome love it.", "Love-hate thi~
## $ Review  <chr> "You would not believe how well this works", "You just walk aw~
```

```r
recenzije_Dta %>%
  summarize(number_rows = n())
```

```
## # A tibble: 1 x 1
##   number_rows
##         <int>
## 1        1833
```

---

# Data overview

```r
# summarize() + n()
recenzije_Dta %>%
  group_by(Product) %>%
  summarize(number_rows = n())
```

```
## # A tibble: 2 x 2
##   Product                                  number_rows
## * <chr>                                          <int>
## 1 iRobot Roomba 650 for Pets                       633
## 2 iRobot Roomba 880 for Pets and Allergies        1200
```

```r
# count()
recenzije_Dta %>%
  count(Product)
```

```
##
# A tibble: 2 x 2
##   Product                                      n
## * <chr>                                    <int>
## 1 iRobot Roomba 650 for Pets                 633
## 2 iRobot Roomba 880 for Pets and Allergies  1200
```

---

# Data overview

<br>
<br>

```r
recenzije_Dta %>%
  count(Product) %>%
  arrange(desc(n))
```

```
## # A tibble: 2 x 2
##   Product                                      n
##   <chr>                                    <int>
## 1 iRobot Roomba 880 for Pets and Allergies  1200
## 2 iRobot Roomba 650 for Pets                 633
```

---

# The tidytext package

<br>
<br>

<img src="../Foto/txt2.png" width="380px" style="display: block; margin: auto;" />

---

# Text tokenization

<br>
<br>

- Bag of words: the words in a document are treated as independent of one another
<br>
- each separate piece of text is a document
<br>
- each unique word is a term
<br>
- each occurrence of a term is a token
<br>
- building the *bag of words* is called tokenization

---

# Text tokenization

```r
# the unnest_tokens() function
library(tidytext)

tidy_review <- recenzije_Dta %>%
  unnest_tokens(word, Review)

head(tidy_review, 8)
```

```
## # A tibble: 8 x 5
##   Date    Product                    Stars Title      word
##   <chr>   <chr>                      <dbl> <chr>      <chr>
## 1 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars you
## 2 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars would
## 3 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars not
## 4 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars believe
## 5 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars how
## 6 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars well
## 7 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars this
## 8 2/28/15 iRobot Roomba 650 for Pets     5 Five Stars works
```

---

# Count the words

```r
tidy_review %>%
  count(word) %>%
  arrange(desc(n))
```

```
## # A tibble: 10,310 x 2
##    word      n
##    <chr> <int>
##  1 the   11785
##  2 it     7905
##  3 and    6794
##  4 to     6440
##  5 i      6034
##  6 a      5884
##  7 is     3347
##  8 of     3229
##  9 have   2470
## 10 that   2410
## # ...
with 10,300 more rows
```

---

# anti_join()

<br>
<br>
<br>

<img src="../Foto/txt3.png" width="500px" style="display: block; margin: auto;" />

---

# anti_join()

<br>

```r
tidy_review2 <- recenzije_Dta %>%
  unnest_tokens(word, Review) %>%
  anti_join(stop_words)

tidy_review2
```

```
## # A tibble: 78,868 x 5
##    Date     Product                    Stars Title                  word
##    <chr>    <chr>                      <dbl> <chr>                  <chr>
##  1 1/12/15  iRobot Roomba 650 for Pets     4 Four Stars             walk
##  2 1/12/15  iRobot Roomba 650 for Pets     4 Four Stars             rest
##  3 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       roomba
##  4 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       proof
##  5 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       house
##  6 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       awesome
##  7 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       pet
##  8 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.       cleans
##  9 8/4/13   iRobot Roomba 650 for Pets     3 Love-hate this vaccuum fascinating
## 10 8/4/13   iRobot Roomba 650 for Pets     3 Love-hate this vaccuum albeit
## # ... with 78,858 more rows
```

---

# Stop words

```r
head(stop_words, 12)
```

```
## # A tibble: 12 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## 11 afterwards  SMART
## 12 again       SMART
```

---

# Count the words (*again*)

```r
tidy_review2 %>%
  count(word) %>%
  arrange(desc(n))
```

```
## # A tibble: 9,672 x 2
##    word         n
##    <chr>    <int>
##  1 roomba    2286
##  2 clean     1204
##  3 vacuum     989
##  4 hair       900
##  5 cleaning   809
##  6 time       795
##  7 house      745
##  8 floors     657
##  9 day        578
## 10 floor      561
## # ...
with 9,662 more rows
```

---
class: inverse, center, middle
name: viz

# VISUALIZING TEXT

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(An overview of the text data)

---

# Prepare the data for plotting

```r
tidy_review <- recenzije_Dta %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Review) %>%
  anti_join(stop_words)

head(tidy_review, 8)
```

```
## # A tibble: 8 x 6
##   Date     Product                    Stars Title               id word
##   <chr>    <chr>                      <dbl> <chr>            <int> <chr>
## 1 1/12/15  iRobot Roomba 650 for Pets     4 Four Stars           2 walk
## 2 1/12/15  iRobot Roomba 650 for Pets     4 Four Stars           2 rest
## 3 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.     3 roomba
## 4 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.     3 proof
## 5 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.     3 house
## 6 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.     3 awesome
## 7 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.     3 pet
## 8 12/26/13 iRobot Roomba 650 for Pets     5 Awesome love it.
3 cleans
```

---

# Bar chart

```r
word_counts <- tidy_review %>%
  count(word) %>%
  arrange(desc(n)) %>%
  filter(n < 2000)

ggplot(word_counts, aes(x = word, y = n)) +
  geom_col()
```

<img src="08_TXT_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

# Filter

```r
word_counts2 <- tidy_review %>%
  count(word) %>%
  filter(n > 300) %>%
  arrange(desc(n))

head(word_counts2, 8)
```

```
## # A tibble: 8 x 2
##   word         n
##   <chr>    <int>
## 1 roomba    2286
## 2 clean     1204
## 3 vacuum     989
## 4 hair       900
## 5 cleaning   809
## 6 time       795
## 7 house      745
## 8 floors     657
```

---

# Visualization

```r
# flip the axes with coord_flip()
ggplot(word_counts2, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")
```

<img src="08_TXT_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

# Stop words

.pull-left[
```r
# default stop words
head(stop_words, 8)
```

```
## # A tibble: 8 x 2
##   word        lexicon
##   <chr>       <chr>
## 1 a           SMART
## 2 a's         SMART
## 3 able        SMART
## 4 about       SMART
## 5 above       SMART
## 6 according   SMART
## 7 accordingly SMART
## 8 across      SMART
```
]

.pull-right[
```r
# add custom stop words
tribble(
  ~word,    ~lexicon,
  "roomba", "CUSTOM",
  "2",      "CUSTOM"
)
```

```
## # A tibble: 2 x 2
##   word   lexicon
##   <chr>  <chr>
## 1 roomba CUSTOM
## 2 2      CUSTOM
```
]

---

# Stop words

```r
custom_stop_words <- tribble(
  ~word,    ~lexicon,
  "roomba", "CUSTOM",
  "2",      "CUSTOM"
)

stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)
```

```r
# Remove all stop words
tidy_review <- recenzije_Dta %>%
  mutate(id = row_number()) %>%
  select(id, Date, Product, Stars, Review) %>%
  unnest_tokens(word, Review) %>%
  anti_join(stop_words2)

tidy_review %>%
  filter(word == "roomba")
```

```
## # A tibble: 0 x 5
## # ...
with 5 variables: id <int>, Date <chr>, Product <chr>, Stars <dbl>,
## #   word <chr>
```

---

# Improve the plot

```r
# first reorder the words by their count with fct_reorder()
word_counts <- tidy_review %>%
  count(word) %>%
  filter(n > 300) %>%
  mutate(word2 = fct_reorder(word, n))

head(word_counts, 8)
```

```
## # A tibble: 8 x 3
##   word         n word2
##   <chr>    <int> <fct>
## 1 880        525 880
## 2 bin        428 bin
## 3 carpet     368 carpet
## 4 clean     1204 clean
## 5 cleaning   809 cleaning
## 6 day        578 day
## 7 dirt       384 dirt
## 8 dog        407 dog
```

---

# Improve the plot

```r
ggplot(word_counts, aes(x = word2, y = n)) +
  geom_col() +
  coord_flip() +
  ggtitle("Review Word Counts")
```

<img src="08_TXT_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

# Faceted plots

```r
tidy_review %>%
  count(word, Product) %>%
  arrange(desc(n))
```

```
## # A tibble: 12,719 x 3
##    word     Product                                      n
##    <chr>    <chr>                                    <int>
##  1 clean    iRobot Roomba 880 for Pets and Allergies   815
##  2 vacuum   iRobot Roomba 880 for Pets and Allergies   678
##  3 hair     iRobot Roomba 880 for Pets and Allergies   595
##  4 cleaning iRobot Roomba 880 for Pets and Allergies   560
##  5 880      iRobot Roomba 880 for Pets and Allergies   518
##  6 house    iRobot Roomba 880 for Pets and Allergies   494
##  7 time     iRobot Roomba 880 for Pets and Allergies   494
##  8 floors   iRobot Roomba 880 for Pets and Allergies   405
##  9 love     iRobot Roomba 880 for Pets and Allergies   403
## 10 dust     iRobot Roomba 880 for Pets and Allergies   399
## # ...
with 12,709 more rows
```

---

# Preparing for the faceted plot

```r
tidy_review %>%
  count(word, Product) %>%
  group_by(Product) %>%
  top_n(5, n)
```

```
## # A tibble: 10 x 3
## # Groups:   Product [2]
##    word     Product                                      n
##    <chr>    <chr>                                    <int>
##  1 880      iRobot Roomba 880 for Pets and Allergies   518
##  2 clean    iRobot Roomba 650 for Pets                 389
##  3 clean    iRobot Roomba 880 for Pets and Allergies   815
##  4 cleaning iRobot Roomba 880 for Pets and Allergies   560
##  5 floors   iRobot Roomba 650 for Pets                 252
##  6 hair     iRobot Roomba 650 for Pets                 305
##  7 hair     iRobot Roomba 880 for Pets and Allergies   595
##  8 time     iRobot Roomba 650 for Pets                 301
##  9 vacuum   iRobot Roomba 650 for Pets                 311
## 10 vacuum   iRobot Roomba 880 for Pets and Allergies   678
```

---

# Preparing for the faceted plot

<br>
<br>

```r
word_counts <- tidy_review %>%
  count(word, Product) %>%
  group_by(Product) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))
```

```r
gg <- ggplot(word_counts, aes(x = word2, y = n, fill = Product)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Product, scales = "free_y") +
  coord_flip() +
  ggtitle("Review Word Counts")
```

---

# Faceted plot

<img src="08_TXT_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

---

# Word cloud (*WordCloud*)

<br>
<br>

.left-code[
```r
library(wordcloud)

word_counts <- tidy_review %>%
  count(word)

wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  max.words = 30)
```
]

.right-plot[
<img src="08_TXT_files/figure-html/plot-label-out-1.png" style="display: block; margin: auto;" />
]

---

# Word cloud (*WordCloud*)

<br>
<br>

.left-code[
```r
# More words
wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  max.words = 70
)
```
]

.right-plot[
<img src="08_TXT_files/figure-html/plot-label1-out-1.png" style="display: block; margin: auto;" />
]

---

# Word cloud (*WordCloud*)

<br>
<br>

.left-code[
```r
# Color
wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  max.words = 30,
  colors = "blue"
)
```
]
.right-plot[
<img src="08_TXT_files/figure-html/plot-label2-out-1.png" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle
name: sent

# SENTIMENT ANALYSIS

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(It all depends on the lexicons)

---

# Load the sentiment lexicons

```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # ... with 6,776 more rows
```

---

# Inspect the lexicon

<br>
<br>

```r
get_sentiments("bing") %>%
  count(sentiment)
```

```
## # A tibble: 2 x 2
##   sentiment     n
## * <chr>     <int>
## 1 negative   4781
## 2 positive   2005
```

---

# Load a second lexicon

```r
library(textdata)
get_sentiments("afinn")
```

```
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ...
with 2,467 more rows
```

---

# Inspect the lexicon

<br>
<br>

```r
get_sentiments("afinn") %>%
  summarize(
    min = min(value),
    max = max(value)
  )
```

```
## # A tibble: 1 x 2
##     min   max
##   <dbl> <dbl>
## 1    -5     5
```

---

# Load a third lexicon

<br>
<br>

```r
sentiment_counts <- get_sentiments("loughran") %>%
  count(sentiment) %>%
  mutate(sentiment2 = fct_reorder(sentiment, n))

# labs() names the aesthetics, so x is still the sentiment after coord_flip()
gg2 <- ggplot(sentiment_counts, aes(x = sentiment2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Sentiment Counts in Loughran",
       x = "Sentiment",
       y = "Counts")
```

---

# Visual overview of the lexicon

<img src="08_TXT_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />

---

# Join the lexicon to the data

```r
tidy_review %>%
  inner_join(get_sentiments("loughran"))
```

```
## # A tibble: 3,960 x 6
##       id Date     Product                    Stars word      sentiment
##    <int> <chr>    <chr>                      <dbl> <chr>     <chr>
##  1     5 12/22/15 iRobot Roomba 650 for Pets     5 slow      negative
##  2     5 12/22/15 iRobot Roomba 650 for Pets     5 easily    positive
##  3     5 12/22/15 iRobot Roomba 650 for Pets     5 random    uncertainty
##  4     5 12/22/15 iRobot Roomba 650 for Pets     5 easy      positive
##  5     5 12/22/15 iRobot Roomba 650 for Pets     5 easy      positive
##  6     5 12/22/15 iRobot Roomba 650 for Pets     5 easy      positive
##  7     6 12/27/15 iRobot Roomba 650 for Pets     5 invention positive
##  8     7 8/17/15  iRobot Roomba 650 for Pets     1 damage    negative
##  9     7 8/17/15  iRobot Roomba 650 for Pets     1 damage    negative
## 10     7 8/17/15  iRobot Roomba 650 for Pets     1 justice   litigious
## # ...
with 3,950 more rows
```

---

# Overview of the joined data

<br>

```r
sentiment_review <- tidy_review %>%
  inner_join(get_sentiments("loughran"))

sentiment_review %>%
  count(sentiment)
```

```
## # A tibble: 6 x 2
##   sentiment        n
## * <chr>        <int>
## 1 constraining   170
## 2 litigious       53
## 3 negative      1795
## 4 positive      1568
## 5 superfluous      1
## 6 uncertainty    373
```

---

# Overview of the joined data

```r
sentiment_review %>%
  count(word, sentiment) %>%
  arrange(desc(n))
```

```
## # A tibble: 598 x 3
##    word      sentiment       n
##    <chr>     <chr>       <int>
##  1 easy      positive      297
##  2 happy     positive      107
##  3 easier    positive       97
##  4 easily    positive       92
##  5 perfect   positive       87
##  6 random    uncertainty    81
##  7 impressed positive       77
##  8 excellent positive       58
##  9 trouble   negative       58
## 10 fantastic positive       56
## # ... with 588 more rows
```

---

# Visualizing sentiment

<br>

```r
sentiment_review2 <- sentiment_review %>%
  filter(sentiment %in% c("positive", "negative"))

word_counts <- sentiment_review2 %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))
```

```r
gg3 <- ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiment Word Counts",
       x = "Words")
```

---

# Visualizing sentiment

<img src="08_TXT_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />

---

# Compute sentiment by star rating

```r
tidy_review %>%
  inner_join(get_sentiments("bing")) %>%
  count(Stars, sentiment)
```

```
## # A tibble: 10 x 3
##    Stars sentiment     n
##    <dbl> <chr>     <int>
##  1     1 negative    381
##  2     1 positive    241
##  3     2 negative    384
##  4     2 positive    247
##  5     3 negative    485
##  6     3 positive    432
##  7     4 negative    984
##  8     4 positive    973
##  9     5 negative   3705
## 10     5 positive   5083
```

---

# Wide data format

```r
tidy_review %>%
  inner_join(get_sentiments("bing")) %>%
  count(Stars, sentiment) %>%
  spread(sentiment, n)
```

```
## # A tibble: 5 x 3
##   Stars negative positive
##   <dbl>    <int>    <int>
## 1     1      381      241
## 2     2      384      247
## 3     3      485      432
## 4     4      984      973
## 5     5     3705     5083
```

---

# Compute overall sentiment

```r
tidy_review %>%
  inner_join(get_sentiments("bing")) %>%
  count(Stars, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative)
```

```
## # A tibble: 5 x 4
##   Stars negative positive overall_sentiment
##   <dbl>    <int>    <int>             <int>
## 1     1      381      241              -140
## 2     2      384      247              -137
## 3     3      485      432               -53
## 4     4      984      973               -11
## 5     5     3705     5083              1378
```

---

# Visualize

```r
sentiment_stars <- tidy_review %>%
  inner_join(get_sentiments("bing")) %>%
  count(Stars, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative,
         stars = fct_reorder(as.factor(Stars), overall_sentiment))
```

```r
gg4 <- ggplot(sentiment_stars,
              aes(x = stars, y = overall_sentiment, fill = as.factor(Stars))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Overall Sentiment by Stars",
       subtitle = "Reviews for Robotic Vacuums",
       x = "Stars",
       y = "Overall Sentiment")
```

---

# Visualize

<img src="08_TXT_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle
name: topi

# TOPIC MODELING

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(Algorithmic identification of topics)

---

# Terminology

<br>
<br>
<br>

- Latent Dirichlet allocation (LDA) is the most commonly used approach to topic modeling
<br>
- A collection of documents is called a corpus
<br>
- The *bag-of-words* approach treats each word in a document separately
<br>
- Topic modeling identifies patterns of words that occur together
<br>
- LDA is an unsupervised machine learning method

---

# Clustering vs. topic modeling

<br>
<br>

- **Clusters** are formed from a distance measure (*a continuous measure*)
  - Each object is assigned to one cluster

<br>
<br>

- **Topic modeling** is based on word frequencies (*a discrete measure*)
  - Each document is a mixture of all topics

---

# Document-term matrix (DTM)

<br>
<br>

```r
tidy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n)
```

```
## <<DocumentTermMatrix (documents: 1791, terms: 9670)>>
## Non-/sparse entries: 63116/17255854
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)
```

---

# Document-term matrix (DTM)

<br>
<br>

```r
dtm_review <- tidy_review %>%
  count(word, id) %>%
  cast_dtm(id, word, n) %>%
  as.matrix()

dtm_review[1:4, 2000:2004]
```

```
##       Terms
## Docs   conscious consecutive consensus consequences considerable
##   223          0           0         0            0            0
##   1069         0           0         0            0            0
##   425          0           0         0            0            0
##   615          0           0         0            0            0
```

---

# Run the LDA analysis

<br>
<br>

```r
library(topicmodels)

lda_out <- LDA(
  dtm_review,
  k = 2,
  method = "Gibbs",
  control = list(seed = 42))
```

```r
lda_out
```

```
## A LDA_Gibbs topic model with 2 topics.
```

---

# Inspect the object

<br>
<br>

.tiny[
```r
glimpse(lda_out)
```

```
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:76175] 2 1 1 1 1 2 1 1 1 1 ...
##   ..@ alpha          : num 25
##   ..@ call           : language LDA(x = dtm_review, k = 2, method = "Gibbs", control = list(seed = 42))
##   ..@ Dim            : int [1:2] 1791 9670
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   ..@ k              : int 2
##   ..@ terms          : chr [1:9670] "______________________________________________________________________first" "0.3" "0.5" "0_0" ...
##   ..@ documents      : chr [1:1791] "223" "1069" "425" "615" ...
##   ..@ beta           : num [1:2, 1:9670] -12.9 -10.4 -10.5 -12.8 -12.9 ...
##   ..@ gamma          : num [1:1791, 1:2] 0.388 0.564 0.444 0.395 0.335 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:63116] 1 1 1 1 1 1 1 1 1 1 ...
##   ..
..$ j   : int [1:63116] 1 11 19 38 127 130 131 134 170 189 ...
##   .. ..$ v   : num [1:63116] 2 1 1 1 2 1 1 1 1 1 ...
##   .. ..$ nrow: int 1791
##   .. ..$ ncol: int 9670
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -549950
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0)
##   ..@ n              : int 76175
```
]

---

# Inspect the object

```r
lda_topics <- lda_out %>%
  tidy(matrix = "beta")

lda_topics %>%
  arrange(desc(beta))
```

```
## # A tibble: 19,340 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     2 hair     0.0238
##  2     1 cleaning 0.0201
##  3     2 house    0.0197
##  4     1 clean    0.0192
##  5     2 floors   0.0174
##  6     2 day      0.0153
##  7     2 vacuum   0.0152
##  8     2 floor    0.0148
##  9     2 job      0.0142
## 10     2 love     0.0141
## # ... with 19,330 more rows
```

---

# Interpreting the topics

<br>
<br>

```r
# Two-topic model
lda_topics <- LDA(
  dtm_review,
  k = 2,
  method = "Gibbs",
  control = list(seed = 42)) %>%
  tidy(matrix = "beta")

word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
```

---

# Interpreting the topics

<br>
<br>

.left-code[
```r
ggplot(
  word_probs,
  aes(
    term2,
    beta,
    fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```
]

.right-plot[
<img src="08_TXT_files/figure-html/plot-label5-out-1.png" style="display: block; margin: auto;" />
]

---

# Interpreting the topics

<br>
<br>

```r
# Three-topic model
lda_topics2 <- LDA(
  dtm_review,
  k = 3,
  method = "Gibbs",
  control = list(seed = 42)) %>%
  tidy(matrix = "beta")

word_probs2 <- lda_topics2 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
```

---

# Interpreting the topics

<br>
<br>

.left-code[
```r
ggplot(
  word_probs2,
  aes(
    term2,
    beta,
    fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```
]

.right-plot[
<img src="08_TXT_files/figure-html/plot-label4-out-1.png" style="display: block; margin: auto;" />
]

---
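# Document-topic mixtures (*sketch*)

The `beta` matrix describes words within topics. The `gamma` slot of the fitted object holds the other half of the model: the share of each topic in each document. The following is a minimal sketch on a small made-up corpus, not on the Roomba reviews. The objects `toy_docs`, `toy_dtm`, `toy_lda` and `doc_topics` are hypothetical names introduced only for this illustration.

```r
library(tidyverse)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus, invented for illustration (not the review data)
toy_docs <- tibble(
  id = 1:4,
  text = c("dog hair on the carpet",
           "pet hair and dog dirt",
           "battery died after a week",
           "battery will not charge")
)

# Tokenize, drop stop words, count words per document, cast to a DTM
toy_dtm <- toy_docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

# Same LDA call as in the lecture, on the toy DTM
toy_lda <- LDA(toy_dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# gamma: one row per document-topic pair
doc_topics <- tidy(toy_lda, matrix = "gamma")

# The topic shares of every document add up to 1
doc_topics %>%
  group_by(document) %>%
  summarize(total = sum(gamma))
```

Under the same assumptions, calling `tidy(lda_out, matrix = "gamma")` on the model fitted earlier should return the topic mixture of each review.

---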
class: inverse, center, middle

# Thank you for your attention

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

(Next lecture: Survival analysis)