FETCHING DATA

Two main approaches:

web scraping (guerrilla)

takes some skill
everything can be scraped (tailor-made solutions)
takes more time and effort

API (gentleman's way)

often not available
usually not free
easier to implement in production

web scraping

let's scrape this article and this one from Večernji list

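# 0. load the packages used below
library(rvest)    # html_session(), html_nodes(), html_text()
library(stringr)  # str_replace_all(), str_extract(), str_flatten()
library(dplyr)    # %>%, glimpse() and the data verbs used later
# 1. copy urls
url1 <- "https://www.vecernji.hr/vijesti/umro-bivsi-austrijski-potkancelar-i-prijatelja-hrvatske-dr-erhard-busek-1570780"

url2 <- "https://www.vecernji.hr/vijesti/u-kijevu-pogodena-stambena-zgrada-objavljena-snimka-raketiranja-nebodera-u-mariupolju-1570614"
# 2. request page (html_session() is deprecated in rvest >= 1.0.0 in favour of session())
page1 <- html_session(url1)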

we need to write a function to grab parts of the article
this part takes some practice and revolves around the rvest package

# 3. write a function to take parts of the article
parseArticle <- function(webpage) {
  
  title <- html_nodes(webpage, xpath = '//h1[@class="article__title"]') %>%
    html_text() %>% 
    trimws() %>% 
    ifelse(length(.) == 0, NA, .)
  
  date <- html_nodes(webpage, xpath = '//*[@class="article__header_date"]') %>%
    html_text() %>% 
    str_replace_all(pattern = "\\\r\\\n| u", replacement = "") %>% 
    trimws() %>% 
    ifelse(length(.) == 0, NA, .)
  
  noComment <- html_nodes(webpage, xpath = '//*[@class="article__comments_number"]') %>%
    html_text() %>% 
    trimws() %>% 
    str_extract("\\d+") %>% 
    as.numeric(.) %>% 
    ifelse(length(.) == 0, NA, .)
  
  views <- html_nodes(webpage, xpath = '//*[@class="article__header_views"]') %>%
    html_text() %>% 
    trimws() %>% 
    str_extract("\\d+") %>% 
    as.numeric(.) %>% 
    ifelse(length(.) == 0, NA, .)
  
  articleLabel <- html_nodes(webpage, xpath = '//*[@class="article__label"]') %>%
    html_text() %>% 
    trimws() %>% 
    ifelse(length(.) == 0, NA, .)
  
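  # author is not reduced to a single value: an article with several authors
  # yields one row per author when the columns are bound together below
  author <- html_nodes(webpage, xpath = '//*[@class="article__author--link"]') %>%
    html_text() %>% 
    trimws()
  if (length(author) == 0) {author <- NA}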
  
  articletext <- html_nodes(webpage, xpath = '//*[@class="article__body--main_content"]/p') %>%
    html_text() %>% 
    str_flatten(., "\n") %>% 
    ifelse(length(.) == 0, NA, .)
  
  keywords <- html_nodes(webpage, xpath = '//*[@class="article__tag_name"]') %>%
    html_text() %>% 
    trimws() %>% 
    str_flatten(., ";") %>% 
    ifelse(length(.) == 0, NA, .)
  
  
  # note: articleLabel is passed twice here, hence the duplicated column in the output
  articles <- cbind.data.frame(title, date, noComment, views, articleLabel, 
                              articleLabel, author, articletext, keywords, stringsAsFactors = FALSE)
  return(articles)
}
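The extract-or-NA pattern that parseArticle() repeats for every field can be tried offline on a small hand-written page. This is only a sketch under assumptions: the HTML string and the helper name first_or_na() are made up for illustration, and only the class names are borrowed from the function above.

```r
library(rvest)

# hypothetical stand-in for a downloaded article page
html <- '<html><body>
  <h1 class="article__title">  Test title  </h1>
  <span class="article__author--link">Author One</span>
  <span class="article__author--link">Author Two</span>
</body></html>'
page <- read_html(html)

# hypothetical helper: first match as trimmed text, NA if nothing matched
first_or_na <- function(page, xpath) {
  txt <- trimws(html_text(html_nodes(page, xpath = xpath)))
  if (length(txt) == 0) NA_character_ else txt[1]
}

title <- first_or_na(page, '//h1[@class="article__title"]')
label <- first_or_na(page, '//*[@class="article__label"]')  # class is not on the page
# without the helper, every match is kept
authors <- trimws(html_text(html_nodes(page, xpath = '//*[@class="article__author--link"]')))
```

A field that is missing from the page comes back as NA instead of a zero-length vector, so it can still be bound into a one-row data frame.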

finally, let's apply the function and check the results

# 4. apply the function
data <- parseArticle(page1)
# 5. check the data
str(data)

## 'data.frame':    1 obs. of  9 variables:
##  $ title       : chr "Umro bivši austrijski potkancelar i prijatelja Hrvatske dr. Erhard Busek"
##  $ date        : chr "14. ožujka 2022. 16:48"
##  $ noComment   : num 0
##  $ views       : num 966
##  $ articleLabel: logi NA
##  $ articleLabel: logi NA
##  $ author      : chr "Snježana Herek"
##  $ articletext : chr "Bivši austrijski potkancelar i ministar znanosti i obrazovanja, dogradonačelnik Beča i čelnik Narodne stranke ("| __truncated__
##  $ keywords    : chr "Austrija;Erhard Busek"

data$title

## [1] "Umro bivši austrijski potkancelar i prijatelja Hrvatske dr. Erhard Busek"

data$views

## [1] 966

data$author

## [1] "Snježana Herek"

nchar(data$articletext)

## [1] 5148
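The date column is still a character string with a Croatian month name, so it cannot be sorted or filtered as a date yet. A minimal base R sketch of one way to convert it follows. The helper name parse_hr_date() and the time zone are assumptions, and the pattern only covers the format seen in the output above.

```r
# Croatian month names in the genitive case, as they appear in dates
hr_months <- c("siječnja", "veljače", "ožujka", "travnja", "svibnja", "lipnja",
               "srpnja", "kolovoza", "rujna", "listopada", "studenoga", "prosinca")

# hypothetical helper: "14. ožujka 2022. 16:48" -> POSIXct (one string at a time)
parse_hr_date <- function(x, tz = "Europe/Zagreb") {
  if (is.na(x)) return(as.POSIXct(NA, tz = tz))
  pattern <- "^([0-9]{1,2})\\. ([^ ]+) ([0-9]{4})\\. ([0-9]{1,2}):([0-9]{2})$"
  parts <- regmatches(x, regexec(pattern, x))[[1]]
  # no match: return a missing date instead of failing
  if (length(parts) == 0) return(as.POSIXct(NA, tz = tz))
  month <- match(parts[3], hr_months)
  if (is.na(month)) return(as.POSIXct(NA, tz = tz))
  ISOdatetime(year = as.integer(parts[4]), month = month, day = as.integer(parts[2]),
              hour = as.integer(parts[5]), min = as.integer(parts[6]), sec = 0, tz = tz)
}

parsed <- parse_hr_date("14. ožujka 2022. 16:48")
format(parsed, "%Y-%m-%d %H:%M")
```

Strings that do not match the expected layout return NA, which keeps the function safe to call from sapply() or lapply() over a whole column.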

normally we want to automate this approach and apply it to multiple articles
let's first find a few more interesting articles: I, II, III

# assign urls of the articles
url3 <- "https://www.vecernji.hr/vijesti/glavni-tajnik-un-a-nuklearni-sukob-ponovno-izgleda-moguc-1570823"
url4 <- "https://www.vecernji.hr/vijesti/sto-je-clanak-5-nato-a-aktiviran-je-samo-jednom-a-ne-moze-se-primijeniti-na-ukrajinu-1570810"
url5 <- "https://www.vecernji.hr/vijesti/kotromanovic-ako-je-bila-rijec-o-naoruzanom-dronu-napad-je-to-na-clanicu-nato-a-1570731"
# bind all articles together
urls <- c(url1,url2,url3,url4,url5)
# check
urls

## [1] "https://www.vecernji.hr/vijesti/umro-bivsi-austrijski-potkancelar-i-prijatelja-hrvatske-dr-erhard-busek-1570780"              
## [2] "https://www.vecernji.hr/vijesti/u-kijevu-pogodena-stambena-zgrada-objavljena-snimka-raketiranja-nebodera-u-mariupolju-1570614"
## [3] "https://www.vecernji.hr/vijesti/glavni-tajnik-un-a-nuklearni-sukob-ponovno-izgleda-moguc-1570823"                             
## [4] "https://www.vecernji.hr/vijesti/sto-je-clanak-5-nato-a-aktiviran-je-samo-jednom-a-ne-moze-se-primijeniti-na-ukrajinu-1570810" 
## [5] "https://www.vecernji.hr/vijesti/kotromanovic-ako-je-bila-rijec-o-naoruzanom-dronu-napad-je-to-na-clanicu-nato-a-1570731"

# check
str(urls)

##  chr [1:5] "https://www.vecernji.hr/vijesti/umro-bivsi-austrijski-potkancelar-i-prijatelja-hrvatske-dr-erhard-busek-1570780" ...

let's automate this procedure for multiple articles now and check the data

# read in urls
pages <- lapply(urls,html_session)
# grab all article parts
multipleArticles <- lapply(pages, parseArticle)
# make data.frame
dataArticles <- do.call(rbind, multipleArticles)
# check the data
dim(dataArticles)

## [1] 7 9

glimpse(dataArticles)

## Rows: 7
## Columns: 9
## $ title        <chr> "Umro bivši austrijski potkancelar i prijatelja Hrvatske ~
## $ date         <chr> "14. ožujka 2022. 16:48", "14. ožujka 2022. 23:03", "14. ~
## $ noComment    <dbl> 0, 106, 106, 8, 18, 129, 129
## $ views        <dbl> 966, 90714, 90714, 3111, 7753, 76573, 76573
## $ articleLabel <lgl> NA, NA, NA, NA, NA, NA, NA
## $ articleLabel <lgl> NA, NA, NA, NA, NA, NA, NA
## $ author       <chr> "Snježana Herek", "Vecernji.hr", "Hina", "Hina", "Vecernj~
## $ articletext  <chr> "Bivši austrijski potkancelar i ministar znanosti i obraz~
## $ keywords     <chr> "Austrija;Erhard Busek", "invazija;napad;Rusija;rat;Ukraj~

dataArticles$title

## [1] "Umro bivši austrijski potkancelar i prijatelja Hrvatske dr. Erhard Busek"                                        
## [2] "Teške borbe u Donbasu, američki dužnosnik tvrdi: Rusko napredovanje gotovo potpuno zaustavljeno"                 
## [3] "Teške borbe u Donbasu, američki dužnosnik tvrdi: Rusko napredovanje gotovo potpuno zaustavljeno"                 
## [4] "Glavni tajnik UN-a: Nuklearni sukob ponovno izgleda moguć"                                                       
## [5] "Što je članak 5. NATO-a? Aktiviran je samo jednom, a ne može se primijeniti na Ukrajinu"                         
## [6] "Kotromanović: Ako je dron bio naoružan, onda NATO mora aktivirati članak 5 bez obzira je li ruski ili ukrajinski"
## [7] "Kotromanović: Ako je dron bio naoružan, onda NATO mora aktivirati članak 5 bez obzira je li ruski ili ukrajinski"

dataArticles$views

## [1]   966 90714 90714  3111  7753 76573 76573

nchar(dataArticles$articletext)

## [1]  5148 27867 27867  1558  3970  5675  5675
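Five URLs produced seven rows. Rows 2 and 3 share a title but list different authors, which is consistent with parseArticle() keeping every matched author while cbind.data.frame() recycles the single-valued fields. The sketch below reproduces that effect on made-up values and shows one possible fix, collapsing the authors into one string in the same way the keywords are collapsed. The helper name collapse_or_na() is hypothetical.

```r
# hypothetical helper: join all matches into one string, NA if there are none
collapse_or_na <- function(x, sep = ";") {
  if (length(x) == 0) NA_character_ else paste(x, collapse = sep)
}

authors <- c("Vecernji.hr", "Hina")  # two authors matched on one page

# recycling: the title is repeated, one row per author
recycled <- cbind.data.frame(title = "Some title", author = authors,
                             stringsAsFactors = FALSE)
nrow(recycled)

# collapsing first: one row per article
collapsed <- cbind.data.frame(title = "Some title", author = collapse_or_na(authors),
                              stringsAsFactors = FALSE)
nrow(collapsed)
collapsed$author
```

An alternative that keeps the authors separate is to drop duplicates afterwards, but collapsing keeps the rule "one article, one row" inside the parser itself.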

API

let's quickly inspect the API documentation
then we need to compile the full API request and retrieve the data

# load private credentials (this file defines `token`)
source(here::here("Creds/api.R"))
# identify from your Mediatoolkit App
groups <- "182718"
keywords <- "6521533"
# select time period
from_time <- as.character(as.numeric(as.POSIXlt("2022-03-13", format="%Y-%m-%d")))
to_time <- as.character(as.numeric(as.POSIXlt("2022-03-14", format="%Y-%m-%d")))
# number of articles to retrieve
count <- 3000
# connect all parts into request string
requestString <- paste0("https://api.mediatoolkit.com/organizations/126686/groups/",groups,
              "/keywords/",keywords,
              "/mentions?access_token=",token,
              "&from_time=",from_time,
              "&to_time=",to_time,
              "&count=",count,
              "&sort=time&type=all&offset=0&ids_only=false")
# check the request string
requestString

## [1] "https://api.mediatoolkit.com/organizations/126686/groups/182718/keywords/6521533/mentions?access_token=ddms5s0l3gejlz2z42ydt0bnwmf6ssqd62bdxteu7t8sumv5ii&from_time=1647126000&to_time=1647212400&count=3000&sort=time&type=all&offset=0&ids_only=false"
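The timestamps in the request string depend on the time zone of the machine that runs the code, because as.POSIXlt() is called without one. A minimal sketch that pins the zone and builds the query from named parameters is given below. The zone "Europe/Zagreb" is an assumption (it reproduces the timestamps shown above), and "YOUR_TOKEN" is a placeholder, not a real token.

```r
# explicit time zone, so the window does not depend on the local machine
tz <- "Europe/Zagreb"  # assumption: the zone used in the original run
from_time <- as.character(as.numeric(as.POSIXct("2022-03-13", tz = tz)))
to_time   <- as.character(as.numeric(as.POSIXct("2022-03-14", tz = tz)))

# named query parameters, joined into a query string
params <- c(access_token = "YOUR_TOKEN",  # placeholder
            from_time = from_time, to_time = to_time,
            count = "3000", sort = "time", type = "all",
            offset = "0", ids_only = "false")
query <- paste0(names(params), "=", params, collapse = "&")

base_url <- "https://api.mediatoolkit.com/organizations/126686/groups/182718/keywords/6521533/mentions"
request_string <- paste0(base_url, "?", query)
request_string
```

Keeping the parameters in one named vector makes it easy to change a single value, such as offset, without touching the rest of the string.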

# make GET request to Mediatoolkit server API
API_request <- httr::GET(requestString)
# check the API request object
API_request

## Response [https://api.mediatoolkit.com/organizations/126686/groups/182718/keywords/6521533/mentions?access_token=ddms5s0l3gejlz2z42ydt0bnwmf6ssqd62bdxteu7t8sumv5ii&from_time=1647126000&to_time=1647212400&count=3000&sort=time&type=all&offset=0&ids_only=false]
##   Date: 2022-03-22 09:01
##   Status: 200
##   Content-Type: application/json;charset=utf-8
##   Size: 3.81 MB

# extract the response body as JSON text
jS_text <- httr::content(API_request, as = "text", type = "application/json", encoding = "UTF-8")
# make a list from JSON object
dataList <- jsonlite::fromJSON(jS_text, flatten = TRUE)
# make a data.frame from list
data <- data.frame(dataList$data)

now we have the retrieved data in a data.frame object
let's check what is inside

# size of the data
dim(data)

## [1] 3000   51
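The result has exactly 3000 rows, the same as count, so the time window may hold more mentions than one request returns. The request already carries offset and count parameters, which suggests paging. The sketch below shows the general loop only: fetch_all() and fake_fetch() are hypothetical names, the data source is faked, and whether the Mediatoolkit API pages exactly this way is an assumption to check against its documentation.

```r
# hypothetical: fetch_page(offset, count) returns a data.frame with up to `count` rows
fetch_all <- function(fetch_page, count, max_pages = 10) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    page <- fetch_page(offset = (i - 1) * count, count = count)
    if (nrow(page) == 0) break        # empty page: nothing was returned
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < count) break     # short page: nothing left to fetch
  }
  do.call(rbind, pages)
}

# fake data source standing in for the API, 7 rows in total
fake_source <- data.frame(id = 1:7)
fake_fetch <- function(offset, count) {
  idx <- seq_len(nrow(fake_source))
  fake_source[idx > offset & idx <= offset + count, , drop = FALSE]
}

all_rows <- fetch_all(fake_fetch, count = 3)
nrow(all_rows)
```

In a real run fetch_page() would wrap the GET request and the JSON parsing shown above, and max_pages guards against an endless loop.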

# variables and variable types
glimpse(data)

## Rows: 3,000
## Columns: 51
## $ response.comment_count              <int> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4~
## $ response.keywords                   <list> "i", "i", <"i", "I">, "i", "i", "~
## $ response.pinterest_count            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ response.reach                      <int> 697, 15, 1576, 1167, 411, 1734, 64~
## $ response.insert_time                <int> 1647212400, 1647212400, 1647212400~
## $ response.description                <chr> "Mediji u Srbiji pišu kako je novi~
## $ response.engagement_rate            <dbl> 4.5919283, 0.0000000, 1.2692537, 0~
## $ response.type                       <chr> "web", "web", "web", "web", "web",~
## $ response.title                      <chr> "PREDSJEDNIČKI IZBORI Škoro se zbo~
## $ response.original_photos            <list> "https://direktno.hr/upload/publi~
## $ response.photos                     <list> "https://mediatoolkit.com/img/0x5~
## $ response.mention                    <chr> "Ja sam čovek s biroa za nezaposle~
## $ response.original_photo             <chr> "https://direktno.hr/upload/publis~
## $ response.score                      <dbl> 1647212400, 1647212400, 1647212400~
## $ response.all_keyword_feed_locations <list> [<data.frame[1 x 2]>], [<data.fra~
## $ response.mozrank                    <dbl> 5.864264, 0.000000, 1.400000, 6.20~
## $ response.from                       <chr> "direktno.hr", "cazma.hr", "najbol~
## $ response.id                         <dbl> 9143912283, 9145111616, 9145128040~
## $ response.auto_sentiment             <chr> "negative", "positive", "positive"~
## $ response.database_insert_time       <int> 1647212792, 1647251035, 1647251371~
## $ response.keyword_name               <chr> "opće", "opće", "opće", "opće", "o~
## $ response.image                      <chr> "https://mediatoolkit.com/img/50x5~
## $ response.like_count                 <int> 24, 0, 6, 0, 6, 0, 34, 0, 0, 0, 0,~
## $ response.languages                  <list> "hr", "hr", "hr", "hr", "hr", "hr~
## $ response.group_name                 <chr> "Luka", "Luka", "Luka", "Luka", "L~
## $ response.pr_value                   <int> 10, 0, 8, 58, 2, 17, 3, 142, 58, 1~
## $ response.photo                      <chr> "https://mediatoolkit.com/img/0x50~
## $ response.influence_score            <int> 3, 1, 1, 7, 1, 2, 1, 7, 7, 3, 7, 2~
## $ response.url                        <chr> "https://direktno.hr/eu-i-svijet/p~
## $ response.virality                   <dbl> 0.65606254, 0.00000000, 0.70409712~
## $ response.share_count                <int> 6, 0, 12, 0, 2, 13, 3, 0, 0, 0, 0,~
## $ response.linkedin_count             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ response.source_reach               <int> 4300, 200, 10, 80000, 10, 800, 90,~
## $ response.domain                     <chr> "direktno.hr", "cazma.hr", "najbol~
## $ response.tag_feed_locations         <list> [], [], [], [], [], [], [], [], [~
## $ response.interaction                <int> 32, 0, 20, 0, 8, 13, 37, 0, 0, 0, ~
## $ response.locations                  <list> "HR", "HR", "HR", "HR", "HR", "HR~
## $ response.author                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.youtube_channel_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.full_mention               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.view_count                 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.twid                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_comment_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_parent_link_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.subreddit                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_type                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_score               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_fullname            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.is_placeholder             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_comment_count       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ response.reddit_link_id             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA~

# basic descriptives (again)
data %>% 
  group_by(response.type) %>%
  count %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
## # Groups:   response.type [6]
##   response.type     n
##   <chr>         <int>
## 1 web            1319
## 2 twitter         818
## 3 facebook        261
## 4 reddit          247
## 5 youtube         192
## 6 comment         115

# check if there are articles from Večernji list
data %>%
  filter(response.from == "vecernji.hr") %>%
  select(title = response.title,
         time = response.insert_time,
         author = response.author,
         url = response.url,
         comment = response.comment_count,
         text = response.mention) %>%
  arrange(desc(comment)) %>%
  head

##                                                                                                                 title
## 1                        Vučić o padu drona u Zagrebu: 'To se Srbiji ne bi dogodilo, mi bi to oborili za pet minuta!'
## 2                               Visoki izvor iz MORH-a: Bomba u letjelici težine do 120 kg, eksplodirala ispod zemlje
## 3  Putinovi specijalci upadaju u rat, kreće totalni napad na Kijev. Naš general: Moguće je da Rusi ovo neće izdržati!
## 4 Ljudi u strahu od rata dižu štednju i kupuju stanove. Može li se ponoviti katastrofa iz 2008. i kako spasiti novac?
## 5          Vojni stručnjak: Da je eksplodiralo 120 kg eksploziva, sumnjam da bismo imali bilo kakve ostatke bilo čega
## 6        Nastavljaju se žestoke borbe, najteža situacija je u Mariupolju gdje se ljudi tuku za komad kruha i kap vode
##         time author
## 1 1647199162   <NA>
## 2 1647196367   <NA>
## 3 1647209120   <NA>
## 4 1647201866   <NA>
## 5 1647202093   <NA>
## 6 1647206736   <NA>
##                                                                                                                                                        url
## 1                          https://www.vecernji.hr/vijesti/vucic-o-padu-drona-u-zagrebu-to-se-srbiji-ne-bi-dogodilo-mi-bi-to-oborili-za-pet-minuta-1570574
## 2                               https://www.vecernji.hr/vijesti/slijedi-istraga-avio-bomba-koja-je-pronadena-u-letjelici-teska-je-do-120-kilograma-1570559
## 3   https://www.vecernji.hr/vijesti/putinovi-specijalci-upadaju-u-rat-krece-totalni-napad-na-kijev-nas-general-moguce-je-da-rusi-ovo-nece-izdrzati-1570576
## 4 https://www.vecernji.hr/vijesti/ljudi-u-strahu-od-rata-dizu-stednju-i-kupuju-stanove-moze-li-se-ponoviti-katastrofa-iz-2008-i-kako-spasiti-novac-1570555
## 5         https://www.vecernji.hr/vijesti/vojni-strucnjak-da-je-eksplodiralo-120-kg-eksploziva-sumnjam-da-bismo-imali-bilo-kakve-ostatke-bilo-cega-1570584
## 6      https://www.vecernji.hr/vijesti/nastavljaju-se-zestoke-borbe-najteza-situacija-je-u-mariupolju-gdje-se-ljudi-tuku-za-komad-kruha-i-kap-vode-1570572
##   comment
## 1     820
## 2     421
## 3     230
## 4     181
## 5     122
## 6      37
##                                                                                                                                                                                                                                                        text
## 1    Vučić o padu drona u Zagrebu: 'To se Srbiji ne bi dogodilo, mi bi to oborili za pet minuta!' Na predizbornom skupu dotaknuo se i tema poput Ukrajine, NATO-a, pada drona u centru Zagreba, ali i spremnosti i opremljenosti vojske. Srpski predsjednik
## 2   Visoki izvor iz MORH-a: Bomba u letjelici težine do 120 kg, eksplodirala ispod zemlje Nakon što je izvučena postavljena je uz sami krater, pregledali su je policijski i vojni službenici. Jutros je izvučena olupina letjelice koja je pala u četvrtak
## 3  Rusi će pojačavati pritisak putem raketa srednjeg dometa, a avijacijom uništavati ceste Ruske su se snage obrušile su se tijekom vikenda na gradove i naselja na zapadu Ukrajine, gdje se do sada nisu vodile tako žestoke borbe. Projektili su pogodili
## 4 Ljudi u strahu od rata dižu štednju i kupuju stanove. Može li se ponoviti katastrofa iz 2008. i kako spasiti novac? Na tržištu nekretnina nije uočen strah od rata i krize. Štoviše, cijene kvadrata rastu, u dvije godine u Zagrebu su poskupjeli za oko
## 5  Vojni stručnjak Robert Barić gostovao je u Dnevniku HTV-a i komentirao pad letjelice na Zagreb. Naime, iz visokih izvora u MORH-u doznaje se da je bomba pronađena u letjelici koja je pala na Zagreb bila težine do 120 kilograma. Srećom, letjelica je
## 6   Nastavljaju se žestoke borbe, najteža situacija je u Mariupolju gdje se ljudi tuku za komad kruha i kap vode Baza Javoriv, 10 kilometara od poljske granice, obično se koristi za obuku i vježbe ukrajinske vojske i njezinih NATO partnera, uglavnom u
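The time column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read. A short base R sketch of the conversion follows, using the first three values from the output above. The display time zone is an assumption.

```r
# first three values of response.insert_time from the output above
insert_time <- c(1647199162, 1647196367, 1647209120)

# seconds since the epoch -> date-time, shown in an assumed local zone
published <- as.POSIXct(insert_time, origin = "1970-01-01", tz = "Europe/Zagreb")
format(published, "%Y-%m-%d %H:%M:%S")
```

The same call can be used inside mutate() to add a readable column next to the raw timestamp.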

# check title
data %>% 
  slice(1) %>% 
  select(title = response.title)

##                                                                  title
## 1 PREDSJEDNIČKI IZBORI Škoro se zbog nedostatka novca neće kandidirati

# check mention
data %>% 
  slice(1) %>%
  select(text = response.mention)

##                                                                                                                                                                                                                                               text
## 1 Ja sam čovek s biroa za nezaposlene gdje ću i ostati i posijle ovih izbora. Na žalost, nisam uspio da sakupim dovoljno novca da bih imao kakvu takvu skromnu kampanju, a istovrsremeno nisam želio da pristanem da budem dio nekih kombinacija i

# check url
data %>% 
  slice(1) %>%
  select(url = response.url)

##                                                                                                            url
## 1 https://direktno.hr/eu-i-svijet/predsjednicki-izbori-skoro-se-zbog-nedostatka-novca-nece-kandidirati-263796/

STORING DATA

MANIPULATING DATA

in R there are three major ways to manipulate data: base, tidyverse, data.table
you can also combine them
we are going to explore and use tidyverse and data.table syntax in this course
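To make the three styles concrete, here is the same small task (mean share per source for web rows, sorted in descending order) written in each of them. The toy data frame and its column names are made up for illustration, and the dplyr and data.table parts only run if those packages are installed.

```r
toy <- data.frame(
  type  = c("web", "web", "twitter", "web"),
  from  = c("a.hr", "b.hr", "c.hr", "a.hr"),
  share = c(10, 2, 5, 4),
  stringsAsFactors = FALSE
)

# base R: subset rows, aggregate by group, then order
web <- toy[toy$type == "web", ]
base_res <- aggregate(share ~ from, data = web, FUN = mean)
base_res <- base_res[order(-base_res$share), ]

# tidyverse (dplyr): one verb per step
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  tidy_res <- toy %>%
    filter(type == "web") %>%
    group_by(from) %>%
    summarise(share = mean(share)) %>%
    arrange(desc(share))
}

# data.table: filter, compute and group inside one pair of brackets
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt_res <- as.data.table(toy)[type == "web", .(share = mean(share)), by = from][order(-share)]
}
```

All three return a.hr first with a mean share of 7, followed by b.hr with 2.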

TIDY WAY

let's check some important functions
this famous paper (Hadley Wickham, 2014, JSS) motivates the tidy approach; the tidyverse ecosystem is also worth a look
basic dplyr (tidyverse) syntax includes the following:

filter: filter (i.e. subset) rows by value.

# get the number of web articles/activity: FILTER
data %>% 
  filter(response.type == "web") %>% # FILTER!
  summarise(NumberOfArticles = n())

##   NumberOfArticles
## 1             1319

arrange: order (i.e. reorder) rows by value.

# arrange by share
data %>% 
  filter(response.type == "web") %>%
  group_by(response.from) %>% 
  summarise(Share = mean(response.share_count),
            Reach = mean(response.reach),
            Virality = mean(response.virality),
            LikeCount = mean(response.like_count),
            Comment = mean(response.comment_count)) %>%
  arrange(desc(Share)) # ARRANGE!

## # A tibble: 282 x 6
##    response.from            Share  Reach Virality LikeCount Comment
##    <chr>                    <dbl>  <dbl>    <dbl>     <dbl>   <dbl>
##  1 platak.hr                563   10601     1           11      0  
##  2 logicno.com              117   15919     2.10       191     19  
##  3 geopolitika.news          99    7283     1.86       304    163  
##  4 ampeu.hr                  96   11616     1.00        51      0  
##  5 hkig.hr                   61    5141     6.00        21      1  
##  6 lisinski.hr               42    6895     2.40        44      5  
##  7 zagorje-international.hr  41    4515     1.11        26      6  
##  8 sloboda.hr                39   69720     2.78       497    544  
##  9 priznajem.hr              34.2 24550     1.64       422.   153. 
## 10 maxportal.hr              32.5  4460.    0.995      130.    69.8
## # ... with 272 more rows

# arrange by reach  
data %>% 
  filter(response.type == "web") %>%
  group_by(response.from) %>% 
  summarise(Reach = mean(response.reach),
            Share = mean(response.share_count),
            Virality = mean(response.virality),
            LikeCount = mean(response.like_count),
            Comment = mean(response.comment_count)) %>%
  arrange(desc(Reach)) # ARRANGE!

## # A tibble: 282 x 6
##    response.from  Reach Share Virality LikeCount Comment
##    <chr>          <dbl> <dbl>    <dbl>     <dbl>   <dbl>
##  1 sloboda.hr    69720   39     2.78        497    544  
##  2 dw.com        41956   17     0.0438        2      1  
##  3 bongacams.com 40517    0     0             0      0  
##  4 priznajem.hr  24550   34.2   1.64        422.   153. 
##  5 index.hr      17231.  14.9   2.98        324.   139. 
##  6 logicno.com   15919  117     2.10        191     19  
##  7 jutarnji.hr   15492.  14.8   1.25        156.    52.2
##  8 24sata.hr     15200.  10.9   1.37        210.    80.1
##  9 zara.com      14313    0     0             0      0  
## 10 teleskop.hr   13291    3     0.984       177    108  
## # ... with 272 more rows

# arrange by comment 
data %>% 
  filter(response.type == "web") %>%
  group_by(response.from) %>% 
  summarise(Comment = mean(response.comment_count)) %>%
  arrange(desc(Comment)) # ARRANGE!

## # A tibble: 282 x 2
##    response.from           Comment
##    <chr>                     <dbl>
##  1 sloboda.hr                544  
##  2 geopolitika.news          163  
##  3 priznajem.hr              153. 
##  4 index.hr                  139. 
##  5 morski.hr                 132. 
##  6 teleskop.hr               108  
##  7 novidani.com              105  
##  8 zagreb.info               103. 
##  9 radio-banovina.hr          98  
## 10 braniteljski-portal.com    83.3
## # ... with 272 more rows

select: Choose (i.e. subset) columns by name.

# get to know your data I
data %>% 
  select(response.from, response.title, response.url) %>% # SELECT!
  filter(response.from == "geopolitika.news")

##      response.from
## 1 geopolitika.news
##                                                         response.title
## 1 MMF više ne smatra nemogućim da Rusija prestane plaćati svoje dugove
##                                                                                                 response.url
## 1 https://www.geopolitika.news/vijesti/mmf-vise-ne-smatra-nemogucim-da-rusija-prestane-placati-svoje-dugove/

# get to know your data II
data %>% 
  select(response.from, response.title, response.url) %>% # SELECT!
  filter(response.from == "priznajem.hr")

##   response.from
## 1  priznajem.hr
## 2  priznajem.hr
## 3  priznajem.hr
## 4  priznajem.hr
## 5  priznajem.hr
## 6  priznajem.hr
## 7  priznajem.hr
## 8  priznajem.hr
##                                                                                                   response.title
## 1                          Vučić: Neću nikoga plašiti, ali situacija je sve teža. Pogledajte sad ruske medije...
## 2                             Kalinić: Od Banožićevih bisera je ostalo samo još da kaže da je Zemlja ravna ploča
## 3                                  Vučić: Da se nama dogodilo ono u Zagrebu, mi bismo dron oborili za pet minuta
## 4                  Rusi kroz ‘srpsku rupu‘ ulaze u Europu ‘sazad‘: Razgrabljene karte na letovima Moskva-Beograd
## 5              Specijalac koji je ubio Osamu bin Ladena o Putinu: 'Ne možete poraziti luđaka pokazujući slabost'
## 6 Ukrajina objavila nove podatke o ruskim vojnim gubicima, evo što su im Ukrajinci uništili od početka invazije!
## 7            Anonymousi objavili videoporuku specijalno namijenjenu građanima Rusije: "Uklonite Putina s vlasti"
## 8               POSTOJI ŠANSA ZA MIR: Veliki preokret, postignut veliki napredak u pregovorima Ukrajine i Rusije
##                                                                                                                                        response.url
## 1                                                                                                                    https://priznajem.hr/?p=179023
## 2                            https://priznajem.hr/novosti/kalinic-od-banozicevih-bisera-je-ostalo-samo-jos-da-kaze-da-je-zemlja-ravna-ploca/179020/
## 3                                  https://priznajem.hr/novosti/vucic-da-se-nama-dogodilo-ono-u-zagrebu-mi-bismo-dron-oborili-za-pet-minuta/179018/
## 4                     https://priznajem.hr/novosti/rusi-kroz-srpsku-rupu-ulaze-u-europu-sazad-razgrabljene-karte-na-letovima-moskva-beograd/179015/
## 5               https://priznajem.hr/novosti/specijalac-koji-je-ubio-osamu-bin-ladena-o-putinu-ne-mozete-poraziti-ludaka-pokazujuci-slabost/179012/
## 6 https://priznajem.hr/novosti/ukrajina-objavila-nove-podatke-o-ruskim-vojnim-gubicima-evo-sto-su-im-ukrajinci-unistili-od-pocetka-invazije/179009/
## 7             https://priznajem.hr/novosti/anonymousi-objavili-videoporuku-specijalno-namijenjenu-gradanima-rusije-uklonite-putina-s-vlasti/179006/
## 8                                                                                                                    https://priznajem.hr/?p=179004

# get to know your data III
data %>% 
  select(response.from, response.title, response.url) %>% # SELECT!
  filter(response.from == "sloboda.hr")

##   response.from
## 1    sloboda.hr
##                                                                                                           response.title
## 1 LOŠ(O) ANALITIČAR: "Ako bi izbio rat, a neće, 60% ukrajinske vojske prelazi Rusima i sve se rješava u 48 sati" (VIDEO)
##                                                                                                                           response.url
## 1 https://www.sloboda.hr/loso-analiticar-ako-bi-izbio-rat-a-nece-60-ukrajinske-vojske-prelazi-rusima-i-sve-se-rjesava-u-48-sati-video/

mutate: Create new columns.

# select biggest portals
data %>% 
  filter(response.type == "web") %>%
  group_by(response.from) %>% 
  count() %>%
  arrange(desc(n)) %>%
  head(10) %>%
  select(response.from) %>%
  pull -> largePortals
# check biggest portals
largePortals

##  [1] "novine.hr"            "index.hr"             "dnevnik.hr"          
##  [4] "360hr.news"           "vecernji.hr"          "ljekarnaonline.hr"   
##  [7] "slobodnadalmacija.hr" "24sata.hr"            "jutarnji.hr"         
## [10] "glas-slavonije.hr"

# create negation operator
`%!in%` <- Negate(`%in%`)
# Create new column and check some descriptives
data %>% 
  filter(response.type == "web") %>%
  mutate(PortalSize = case_when(response.from %in% largePortals ~ "Large", # MUTATE!
                                response.from %!in% largePortals ~ "Small")) %>% 
  group_by(PortalSize) %>%
  count

## # A tibble: 2 x 2
## # Groups:   PortalSize [2]
##   PortalSize     n
##   <chr>      <int>
## 1 Large        375
## 2 Small        944
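A two-way split like Large versus Small needs only one test, so the custom negation operator is optional here. A base R sketch on made-up values is shown below. In dplyr the same idea is a case_when() with a final `TRUE ~ "Small"` branch.

```r
from <- c("index.hr", "platak.hr", "vecernji.hr", "logicno.com")
largePortals <- c("index.hr", "vecernji.hr")  # shortened list, for illustration only

# one test: everything that is not in the list falls into "Small"
PortalSize <- ifelse(from %in% largePortals, "Large", "Small")
table(PortalSize)

# the negated operator is still handy for filtering
`%!in%` <- Negate(`%in%`)
small_only <- from[from %!in% largePortals]
small_only
```

A catch-all branch also covers missing values of the source column, which two explicit `%in%` and `%!in%` tests would handle as well, but less visibly.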

summarise: Make a descriptive summary.

we already saw this function in action :-)
let's see another one

# note: n() counts rows, so "Average" holds the number of web mentions, not a mean
data %>% 
  filter(response.type == "web") %>%
  summarise(Average = n())

##   Average
## 1    1319

# select() with no arguments keeps no columns, only the rows
data %>%
  filter(response.type == "web") %>% 
  select()

## data frame with 0 columns and 1319 rows

DATA.TABLE WAY

advantages of data.table include:

Concise syntax

# read in library
library(data.table)
# check class
class(data)

## [1] "data.frame"

# set the data.table object
dataDT <- as.data.table(data)
# check class again
class(dataDT)

## [1] "data.table" "data.frame"

# do some descriptive statistics
dataDT[response.type == "web",
       .(minShare = min(response.share_count),
         maxShare = max(response.share_count),
         avgShare = mean(response.share_count),
         stdShare = sd(response.share_count))][]

##    minShare maxShare avgShare stdShare
## 1:        0      563 5.802881 20.41818

# how many letters in a title
dataDT[response.type == "web",
       .(Avg = mean(nchar(response.title)),
         STD = sd(nchar(response.title)),
         min = min(nchar(response.title)),
         max = max(nchar(response.title)))][]

##         Avg      STD min max
## 1: 70.51403 33.66524   4 160

# how many letters in a text
dataDT[response.type == "web",
       .(Avg = mean(nchar(response.mention)),
         STD = sd(nchar(response.mention)),
         min = min(nchar(response.mention)),
         max = max(nchar(response.mention)))][]

##         Avg      STD min max
## 1: 241.7392 20.11047  54 250

Very fast

library(tictoc)

# how many letters in a text by DT
tic()
dataDT[response.type == "web",
       .(Avg = mean(nchar(response.mention)),
         min = min(nchar(response.mention)),
         max = max(nchar(response.mention)))]

##         Avg min max
## 1: 241.7392  54 250

toc()

## 0 sec elapsed

tic()
# how many letters in a text by tidy
data %>% 
  group_by(response.type) %>%
  summarise(Avg = mean(nchar(response.mention)),
         min = min(nchar(response.mention)),
         max = max(nchar(response.mention)))

## # A tibble: 7 x 4
##   response.type   Avg   min   max
##   <chr>         <dbl> <int> <int>
## 1 comment        168.    34   250
## 2 facebook        49     49    49
## 3 instagram       50     50    50
## 4 reddit         108.     5   250
## 5 twitter         NA     NA    NA
## 6 web            242.    54   250
## 7 youtube        190.     9   250

toc()

## 0.01 sec elapsed
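A single run that takes 0 or 0.01 seconds is too short to compare, and the two snippets above also do different work (one filters web rows, the other groups all types). A fairer approach is to time the same task many times. The base R sketch below shows only the timing harness, on synthetic data with made-up column names. The dplyr and data.table versions of a task can be passed through the same loop.

```r
set.seed(1)
n <- 1e4
# synthetic stand-in for the mentions data
toy <- data.frame(type = sample(c("web", "twitter", "reddit"), n, replace = TRUE),
                  mention = strrep("a", sample(5:250, n, replace = TRUE)),
                  stringsAsFactors = FALSE)

# the task to time: mean text length per type
mean_len <- function(d) tapply(nchar(d$mention), d$type, mean)

# repeat the call so the total is large enough to measure
timing <- system.time(for (i in 1:50) res <- mean_len(toy))
timing["elapsed"]
```

Dividing the elapsed time by the number of repetitions gives a per-call estimate; dedicated packages such as microbenchmark automate this and report the spread across runs.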

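# READ IN FULL MEDIATOOLKIT DATA SAMPLE
library(purrr)   # map_df()
library(readxl)  # read_excel()
path <- "D:/LUKA/Freelance/Mediatoolkit/FULLDATA"
raw <- list.files(path = path, pattern = "xlsx")
raw_path <- paste0(path, "/", raw)
# read every file and bind the rows into one data frame
all_raw <- map_df(raw_path, read_excel)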
# make data.table object
allDT <- as.data.table(all_raw)


# let's check the average title length across source types
allDT[,
       .(Avg = mean(nchar(TITLE)),
         STD = sd(nchar(TITLE)),
         min = min(nchar(TITLE)),
         max = max(nchar(TITLE))),
       by = SOURCE_TYPE]

##    SOURCE_TYPE       Avg         STD min  max
## 1:       forum        NA          NA  NA   NA
## 2:         web        NA          NA  NA   NA
## 3:     twitter 178.27823 114.1908158   4 6030
## 4:      reddit 110.32908  79.2360687   1  350
## 5:     youtube  51.40096  23.5850392   1  100
## 6:     comment  22.04194   0.7407619  22   53
## 7:    facebook 127.80191  39.3007556   4  160
## 8:   instagram 176.66035  72.6184769   5  350

allDT[SOURCE_TYPE == "twitter",.(Avg = mean(nchar(TITLE)),
         STD = sd(nchar(TITLE)),
         min = min(nchar(TITLE)),
         max = max(nchar(TITLE)))]

##         Avg      STD min  max
## 1: 178.2782 114.1908   4 6030
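The NA rows for forum and web most likely come from missing titles: nchar() returns NA for a missing string, and mean(), sd(), min() and max() pass it on. A base R sketch on made-up data is shown below. In data.table the same fix is `mean(nchar(TITLE), na.rm = TRUE)` inside the brackets.

```r
toy <- data.frame(SOURCE_TYPE = c("web", "web", "twitter", "twitter"),
                  TITLE = c("Naslov", NA, "abc", "abcde"),
                  stringsAsFactors = FALSE)

len <- nchar(toy$TITLE)  # NA for the missing title

# default: one NA makes the whole group NA
with_na <- tapply(len, toy$SOURCE_TYPE, mean)
with_na

# na.rm = TRUE drops missing values before averaging
avg <- tapply(len, toy$SOURCE_TYPE, mean, na.rm = TRUE)
avg
```

It is worth counting the missing values per group as well (for example with sum(is.na(TITLE))), so that a mean based on very few titles is not over-read.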

Memory efficiency

Measuring memory efficiency is relatively complicated. For details, check the overview of data.table functionality (after the 12th slide).

Lots of possibilities, stability, and low dependency

tidyverse works step by step, while data.table does it in one step
one operation is one fluid thought
chaining in DT is also possible
let's try some of these possibilities

# check articles from vecernji.hr
allDT[SOURCE_TYPE == "web" & FROM == "vecernji.hr",
      .(TITLE,URL,COMMENT_COUNT)]

##                                                                                                                            TITLE
##     1:         Pričali smo s Beograđanima: 'Nema šanse da ga je netko napao, mi volimo Splićane. Rijeka je na tom dijelu duboka'
##     2:                                                                              Boga korak do rekordnog transfera u Atalantu
##     3:                                                     Željka Kamenov: Neke nove navike iz doba korone vrijedilo bi zadržati
##     4:                                                                   Monaco bez Kovača izborio osminu finala, kraj za Rennes
##     5:                        Pirotehničar: Ove sam godine u Zagrebu za Novu čuo bombe i rafale automastkog oružja. Znam i zašto
##    ---                                                                                                                          
## 32402:                                              Svi radimo ovih 20 grešaka u kuhanju i tako uništavamo hranu - Ordinacija.hr
## 32403: Ovo je najčešći želučani problem zbog kojeg idemo liječniku, a evo i kako lijekovi utječu na vašu probavu - Ordinacija.hr
## 32404:                       Ne-Hodgkinov limfom - koji su simptomi i što sve morate znati o ovoj teškoj bolesti - Ordinacija.hr
## 32405:                                                 VIDEO Šok za PSG! U posljednjih deset godina nije tako rano ispao iz Kupa
## 32406:                           Čekamo novi "Oz" i pitamo se što je sljedeće - gay drvosječa? Možda Dorothy koja želi biti Don?
##                                                                                                                                                                    URL
##     1:            https://www.vecernji.hr/vijesti/pricali-smo-s-beogradanima-nema-sanse-da-ga-je-netko-napao-mi-volimo-splicane-rijeka-je-na-tom-dijelu-duboka-1552186
##     2:                                                                              https://www.vecernji.hr/sport/boga-korak-do-rekordnog-transfera-u-atalantu-1552174
##     3:                                                    https://www.vecernji.hr/vijesti/zeljka-kamenov-neke-nove-navike-iz-doba-korone-vrijedilo-bi-zadrzati-1551978
##     4:                                                                    https://www.vecernji.hr/sport/monaco-bez-kovaca-izborio-osminu-finala-kraj-za-rennes-1552184
##     5:                         https://www.vecernji.hr/zagreb/pirotehnicar-ove-sam-godine-u-zagrebu-za-novu-cuo-bombe-i-rafale-automastkog-oruzja-znam-i-zasto-1552183
##    ---                                                                                                                                                                
## 32402:                                          https://ordinacija.vecernji.hr/zdravi-tanjur/jedi-zdravo/svi-radimo-ovih-20-gresaka-u-kuhanju-i-tako-unistavamo-hranu/
## 32403: https://ordinacija.vecernji.hr/zdravlje/ohr-savjetnik/ovo-je-najcesci-zelucani-problem-zbog-kojeg-idemo-lijecniku-a-evo-i-kako-lijekovi-utjecu-na-vasu-probavu/
## 32404:                        https://ordinacija.vecernji.hr/zdravlje/ohr-savjetnik/ne-hodgkinov-limfom-koji-su-simptomi-i-sto-sve-morate-znati-o-ovoj-teskoj-bolesti/
## 32405:                                                  https://www.vecernji.hr/sport/video-sok-za-psg-u-posljednjih-deset-godina-nije-tako-rano-ispao-iz-kupa-1559697
## 32406:                               https://www.vecernji.hr/kultura/cekamo-novi-oz-i-pitamo-se-sto-je-sljedece-gay-drvosjeca-mozda-dorothy-koja-zeli-biti-don-1559483
##        COMMENT_COUNT
##     1:           573
##     2:             0
##     3:             0
##     4:             0
##     5:            59
##    ---              
## 32402:             0
## 32403:             0
## 32404:             0
## 32405:             0
## 32406:             0
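
chaining in DT means appending one [ ] call directly to another, so each step works on the result of the previous one. A minimal sketch on a made-up toy table (toyDT and its values are assumptions, not the Mediatoolkit data):

```r
library(data.table)

# Toy stand-in for allDT; the values are invented for illustration only
toyDT <- data.table(
  SOURCE_TYPE   = c("web", "web", "forum", "twitter", "twitter", "twitter"),
  COMMENT_COUNT = c(3, 0, 5, 1, 1, 0)
)

# Three chained steps: filter rows, aggregate by group, sort the result
chained <- toyDT[COMMENT_COUNT > 0][
  , .(N = .N, total = sum(COMMENT_COUNT)), by = SOURCE_TYPE][
  order(-total)]
chained
```

the filter and the aggregation could also be written as a single DT[i, j, by] call; the chain is mainly useful when a later step needs a column created by an earlier one, as the final sort on total does here.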

changing columns looks like this

# check the date column
str(allDT$DATE)

##  chr [1:3476130] "2022-01-02" "2022-01-02" "2022-01-02" "2022-01-02" ...

# change the date column into date format
allDT[, DateColumn := as.Date(DATE,"%Y-%m-%d" )] # modify by reference
# check the new date column
str(allDT$DateColumn)

##  Date[1:3476130], format: "2022-01-02" "2022-01-02" "2022-01-02" "2022-01-02" "2022-01-02" ...

# show the results
allDT[1:5,.(DateColumn,TITLE,SOURCE_TYPE)][]

##    DateColumn
## 1: 2022-01-02
## 2: 2022-01-02
## 3: 2022-01-02
## 4: 2022-01-02
## 5: 2022-01-02
##                                                                                                         TITLE
## 1:                                                                                                       <NA>
## 2:                                                                                                       <NA>
## 3:                                                                                                       <NA>
## 4:                                                               EU fit for 55 plan - smanjivanje emisije CO2
## 5: LMS 993-Kit Teler: SMARAGDNA OGRLICA (5) (#19086776) - Aukcije - www.stripovi.com - Prozor u svijet stripa
##    SOURCE_TYPE
## 1:       forum
## 2:       forum
## 3:       forum
## 4:       forum
## 5:         web
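
"modify by reference" means that := changes the existing table in place instead of creating a modified copy. A minimal sketch of the consequence, on a made-up toy table (toyDT, alias and snapshot are hypothetical names used only for this illustration):

```r
library(data.table)

# Toy stand-in for allDT; the values are invented for illustration only
toyDT <- data.table(DATE = c("2022-01-02", "2022-01-03"))

# Plain assignment does not copy a data.table: both names refer to one table
alias <- toyDT

# := adds the column in place, so it is visible through the alias as well
toyDT[, DateColumn := as.Date(DATE, format = "%Y-%m-%d")]
"DateColumn" %in% names(alias)

# copy() creates an independent table that later := calls do not touch
snapshot <- copy(toyDT)
toyDT[, DATE := NULL]   # drop the original character column by reference
names(snapshot)
names(toyDT)
```

this is part of why data.table is memory efficient on a table of several million rows, but it also means that copy() is needed whenever the original table has to be preserved.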

combining tidy and DT syntax is also possible

allDT[SOURCE_TYPE == "facebook",.(AUTHOR,COMMENT_COUNT)] %>%
  filter(COMMENT_COUNT  > 0) %>%
  distinct(.) %>%
  arrange(desc(COMMENT_COUNT)) %>%
  head(15)

##                    AUTHOR COMMENT_COUNT
##  1:          Teta Violeta         33178
##  2:                Net.hr         10905
##  3:       Violeta We Care          9023
##  4:       Andrea Andrassy          8948
##  5:      Marijana Batinić          8724
##  6:                 Karla          8437
##  7:   Violeta Double Care          8216
##  8:       Andrea Andrassy          8153
##  9:        Violeta Srbija          7787
## 10:                Mustra          7191
## 11:       Andrea Andrassy          7136
## 12:       Andrea Andrassy          6972
## 13:             muzika.hr          6622
## 14: Katarina Mamić Design          6294
## 15:            Elegant.hr          6243

grouping in data.table

# .N counts the rows in each group, so the column holds a count, not an average
allDT[, .(NoArticles = .N), by = SOURCE_TYPE][order(-NoArticles)]

##    SOURCE_TYPE NoArticles
## 1:         web    1443570
## 2:       forum     816445
## 3:    facebook     343927
## 4:     twitter     330532
## 5:      reddit     213029
## 6:     youtube     165681
## 7:     comment      92857
## 8:   instagram      70089

ANALYTICS

we will cover that in the following three lectures
methods used are statistical analysis, machine learning and textual analysis

REPORTING

an example of an (almost) production-ready .Rmd report

Learning Social Media Analytics

Lecture 3: Data Science Prerequisites

Luka Sikic, PhD
Faculty of Croatian Studies | LSMA
