Web scraping 1

Advanced Crime Analysis UCL

Bennett Kleinberg

14 Jan 2019

Getting data from the Internet

Today

  • Types of webscraping
  • Using APIs: Twitter + YouTube
  • Crime data ‘wrappers’
  • “Real” webscraping: basics of a webpage

What is webscraping anyway?

The game changer!

  • direct broadcasting of ideas
  • “unfiltered” and “uncensored” (?)
  • location-enabled
  • and: en masse

Types of webscraping

                   Data shared    Data not shared
Ready-made table   Download       closed source
Not ready-made     API            Real webscraping

Application programming interfaces (APIs)

API: basics

Goal:

  • help developers interact with the platform
  • facilitates interaction in an automatable manner
  • analogous to the GUI
  • part of it: enabling data access
  • contains precise documentation

What an API does not do:

  • give you all the data
  • be free forever
  • give you full control

There’s no free lunch!

Using an API

Core elements of an API:

  • GET requests
  • POST requests

Implementable in different ways…

Using an API

Classes of APIs:

  1. Web APIs
    • send requests through the browser
    • add URL parameters https://data.police.uk/api/crimes-at-location?date=2017-08&location_id=884227
  2. Libraries/packages for APIs
    • depending on the API: python, js, php, ruby
    • = frameworks to access the API
    • = methods implemented in different languages
  3. API wrappers
    • R packages that use the API
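
The web-API request shown above is just a base address plus named URL parameters. A minimal sketch of how such a URL could be assembled in base R follows; the helper name build_api_url is hypothetical and not part of any package.

```r
# Sketch: assemble a web-API request URL from a base address and named parameters.
# build_api_url is a hypothetical helper written for illustration.
build_api_url = function(base_url, params) {
  # percent-encode each value so special characters survive in the URL
  encoded = vapply(params
                   , function(value) utils::URLencode(as.character(value), reserved = TRUE)
                   , character(1)
                   )
  # join as key=value pairs separated by '&'
  query = paste(names(params), encoded, sep = '=', collapse = '&')
  paste0(base_url, '?', query)
}

crime_url = build_api_url(base_url = 'https://data.police.uk/api/crimes-at-location'
                          , params = list(date = '2017-08', location_id = 884227)
                          )
crime_url
```

The resulting string can be pasted into a browser. Assuming the jsonlite package is installed, jsonlite::fromJSON(crime_url) is one way to read the response into R.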

Using an API

Identify API capabilities

Official API docs Twitter

Useful websites that have an API

  • Twitter
  • YouTube
  • Instagram
  • Facebook
  • Reddit

Case 1: Twitter’s API

Getting access

Basic steps

  1. Twitter account
  2. Apply for a developer’s account
  3. Create project
  4. Obtain access credentials

Tutorial here

The twitteR package

library(twitteR)

Note: check out the newer rtweet package.

Authentication through R

my_consumer_key = "5tc2oAVLyO8DkCKW1k8ny2H6e"
my_consumer_secret = "qEQYGX6IKs6NiSUsENprBZlOOdoM9lWkoIht3p1sVnAMraQpq2"
my_access_token = "858383409986625537-Fy9Ai5eFyf23VZHguRJEdXqell6Q8Jl"
my_access_secret = "nT5Z0eQjAvBdf2ZjxMgiaoRb7hiHVxB8jYh7lT74CW1Um"

setup_twitter_oauth(consumer_key = my_consumer_key
                    , consumer_secret = my_consumer_secret
                    , access_token = my_access_token
                    , access_secret = my_access_secret
                    )
## [1] "Using direct authentication"
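
Credentials written directly into a script are easy to leak when the script is shared. One possible alternative, sketched below, is to read them from environment variables. The helper read_credential and the variable names (TWITTER_CONSUMER_KEY etc.) are assumptions made for illustration, not something twitteR prescribes.

```r
# Sketch: read an API credential from an environment variable instead of the script.
# read_credential and the variable names used below are hypothetical.
read_credential = function(var_name) {
  value = Sys.getenv(var_name, unset = NA)
  # fail early with a clear message if the variable is missing or empty
  if (is.na(value) || value == '') {
    stop('Environment variable not set: ', var_name)
  }
  value
}

# Not run here: needs the variables to be set and the twitteR package loaded.
# setup_twitter_oauth(consumer_key = read_credential('TWITTER_CONSUMER_KEY')
#                     , consumer_secret = read_credential('TWITTER_CONSUMER_SECRET')
#                     , access_token = read_credential('TWITTER_ACCESS_TOKEN')
#                     , access_secret = read_credential('TWITTER_ACCESS_SECRET')
#                     )
```

The variables could be set, for example, in a user-level .Renviron file that is kept out of version control.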

How to search?

Depends on the problem:

  • Tweets in a certain time frame (e.g. December 2018)
  • Tweets with a certain key-word (e.g. “#metoo”)
  • Tweets by a certain author (e.g. Elon Musk)
  • Tweets in a certain location (e.g. London)
  • Tweets in a certain language
  • Combined search queries

API possibilities

Always look at two sources:

  1. The original API (Twitter’s API docs)
  2. The API interface (twitteR R package)

Note: the original API usually offers more options than the R interface to it.

Tweets by date

Search:

  • tweets since December 2018
  • with #metoo

metoo_tweets_december = searchTwitter(searchString = '#metoo'
                                , n = 10
                                , since = '2018-12-01'
                                )
metoo_tweets_december
## [[1]]
## [1] "zee45427557: #RajkumarHirani @aamir_khan #AamirKhan #aamir #3idiots #pk #MeToo #MeTooMovement #sanju #chopra #joshi https://t.co/76F422oOSM"
## 
## [[2]]
## [1] "metoozoo: #MeToo Merch - YellowMaps Beaver Island MI topo map, 1:100000 Scale, 30 X 60 Minute, Historical, 1984, Updated 1989… https://t.co/DsklJ4FZrA"
## 
## [[3]]
## [1] "Mirbia3: RT @la_patilla: Los aspirantes demócratas a la Casa Blanca bajo examen del #MeToo https://t.co/DYj2t5OFcS       .     ."
## 
## [[4]]
## [1] "gulfkannadiga: RT @timesofindia: #MeToo movement: Filmmaker #RajkummarHirani's Assistant Director  of #Sanju accuses him of sexual harassment \n\nvia @etime…"
## 
## [[5]]
## [1] "WeForNews: Rajkumar Hirani accused of sexual assault during making of Sanju\n\n#MeToo #MeTooMovement #RajuBhai #RajKumarHirani… https://t.co/rw3zpitdUd"
## 
## [[6]]
## [1] "worldwidetoto10: RT @12ji10pun: 怒りが収まらない\U0001f4a2\n海外では問題になりそうな #松本人志 の発言。\nこのセクハラ発言を寛容で場の雰囲気を壊さないような対応をするのが日本式の「大人でいいオンナ」\n だから #MeToo 運動は日本では無縁。\nなんてったって被害に遭ったメンバーが謝罪…"
## 
## [[7]]
## [1] "Neli_Ngqulana: RT @Moosa_Kaula: Are girlies gonna pretend like Arthur Mafokate isn't the face of #MeToo and pose happily with him? \U0001f62c https://t.co/9yMDmDUL…"
## 
## [[8]]
## [1] "Rohitpatil_24: RT @NEWS9TWEETS: #BIGNEWS: #Bollywood's famed director, @RajkumarHirani allegedly accused of sexual harassment by an assistant director of…"
## 
## [[9]]
## [1] "jackejones123: @keithellison #MeToo"
## 
## [[10]]
## [1] "Hun_Aram_e: RT @Bollyhungama: #MeToo: #SANJU director #RajkumarHirani accused of SEXUAL HARASSMENT by his assistant\nhttps://t.co/duGjUEaDI7"

Tweets by date

Display as dataframe with meta information:

twListToDF(metoo_tweets_december)
##                                                                                                                                                                                                                                                                    text
## 1                                                                                                                                        #RajkumarHirani @aamir_khan #AamirKhan #aamir #3idiots #pk #MeToo #MeTooMovement #sanju #chopra #joshi https://t.co/76F422oOSM
## 2                                                                                                                          #MeToo Merch - YellowMaps Beaver Island MI topo map, 1:100000 Scale, 30 X 60 Minute, Historical, 1984, Updated 1989… https://t.co/DsklJ4FZrA
## 3                                                                                                                                               RT @la_patilla: Los aspirantes demócratas a la Casa Blanca bajo examen del #MeToo https://t.co/DYj2t5OFcS       .     .
## 4                                                                                                                        RT @timesofindia: #MeToo movement: Filmmaker #RajkummarHirani's Assistant Director  of #Sanju accuses him of sexual harassment \n\nvia @etime…
## 5                                                                                                                          Rajkumar Hirani accused of sexual assault during making of Sanju\n\n#MeToo #MeTooMovement #RajuBhai #RajKumarHirani… https://t.co/rw3zpitdUd
## 6  RT @12ji10pun: 怒りが収まらない\U0001f4a2\n海外では問題になりそうな #松本人志 の発言。\nこのセクハラ発言を寛容で場の雰囲気を壊さないような対応をするのが日本式の「大人でいいオンナ」\n だから #MeToo 運動は日本では無縁。\nなんてったって被害に遭ったメンバーが謝罪…
## 7                                                                                                                 RT @Moosa_Kaula: Are girlies gonna pretend like Arthur Mafokate isn't the face of #MeToo and pose happily with him? \U0001f62c https://t.co/9yMDmDUL…
## 8                                                                                                                           RT @NEWS9TWEETS: #BIGNEWS: #Bollywood's famed director, @RajkumarHirani allegedly accused of sexual harassment by an assistant director of…
## 9                                                                                                                                                                                                                                                  @keithellison #MeToo
## 10                                                                                                                                     RT @Bollyhungama: #MeToo: #SANJU director #RajkumarHirani accused of SEXUAL HARASSMENT by his assistant\nhttps://t.co/duGjUEaDI7
##    favorited favoriteCount    replyToSN             created truncated
## 1      FALSE             0         <NA> 2019-01-13 11:38:59     FALSE
## 2      FALSE             0         <NA> 2019-01-13 11:38:51      TRUE
## 3      FALSE             0         <NA> 2019-01-13 11:38:47     FALSE
## 4      FALSE             0         <NA> 2019-01-13 11:38:44     FALSE
## 5      FALSE             0         <NA> 2019-01-13 11:38:33      TRUE
## 6      FALSE             0         <NA> 2019-01-13 11:38:24     FALSE
## 7      FALSE             0         <NA> 2019-01-13 11:38:19     FALSE
## 8      FALSE             0         <NA> 2019-01-13 11:38:16     FALSE
## 9      FALSE             0 keithellison 2019-01-13 11:38:13     FALSE
## 10     FALSE             0         <NA> 2019-01-13 11:38:10     FALSE
##             replyToSID                  id replyToUID
## 1                 <NA> 1084414501452242944       <NA>
## 2                 <NA> 1084414467520253952       <NA>
## 3                 <NA> 1084414453029003264       <NA>
## 4                 <NA> 1084414438751461376       <NA>
## 5                 <NA> 1084414394514038790       <NA>
## 6                 <NA> 1084414353770573824       <NA>
## 7                 <NA> 1084414333168291842       <NA>
## 8                 <NA> 1084414323722579971       <NA>
## 9  1084300045342728197 1084414308174446592   14135426
## 10                <NA> 1084414298548588544       <NA>
##                                                                            statusSource
## 1    <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2                         <a href="http://metoozoo.com" rel="nofollow">metoozoo.com</a>
## 3                  <a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>
## 4  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 5   <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
## 6    <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 7  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 8  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 9                    <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 10   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##         screenName retweetCount isRetweet retweeted longitude latitude
## 1      zee45427557            0     FALSE     FALSE        NA       NA
## 2         metoozoo            0     FALSE     FALSE        NA       NA
## 3          Mirbia3            1      TRUE     FALSE        NA       NA
## 4    gulfkannadiga           15      TRUE     FALSE        NA       NA
## 5        WeForNews            0     FALSE     FALSE        NA       NA
## 6  worldwidetoto10            2      TRUE     FALSE        NA       NA
## 7    Neli_Ngqulana           76      TRUE     FALSE        NA       NA
## 8    Rohitpatil_24            3      TRUE     FALSE        NA       NA
## 9    jackejones123            0     FALSE     FALSE        NA       NA
## 10      Hun_Aram_e           12      TRUE     FALSE        NA       NA

Tweets by date

Example: Popular crime tweet in 2019?

crime_tweets_2019 = searchTwitter(searchString = 'crime'
                                , n = 1000
                                , since = '2019-01-01'
                                , resultType = 'popular'
                                )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 63
df.crime_tweets_2019 = twListToDF(crime_tweets_2019)

df.crime_tweets_2019[order(df.crime_tweets_2019$created, decreasing = FALSE), ][1:3, 'text']
## [1] "#NewsUpdate ออกประกาศด่วน! หยุดเดินเรือข้ามเกาะสมุย ชาวบ้านแห่กักตุนอาหาร พร้อมรับมือพายุปาปึก #เรื่องเล่าเช้านี้… https://t.co/dgurGOK3OW"                                   
## [2] "Reform-minded prosecutors can repair our broken #CriminalJustice system\n  \nWesley Bell in Missouri: \n\U0001f4ccEnded prosecu… https://t.co/gfbfT7bPM6"
## [3] "Lembrando sempre que não gostar de alguém, além de não ser crime, não exige nenhum pré-requisito. Pode-se apelar ap… https://t.co/SRG4rOcN6V"

Tweets by keyword

Search:

  • “fake news” tweets
  • since the start of the year

fakenews_tweets_2019 = searchTwitter(searchString = 'fake+news'
                                , n = 1000
                                , since = '2019-01-01'
                                , resultType = 'popular'
                                )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 58

df.fakenews_tweets_2019 = twListToDF(fakenews_tweets_2019)
df.fakenews_tweets_2019[order(df.fakenews_tweets_2019$retweetCount, decreasing = TRUE), ][1:5, ]
##                                                                                                                                            text
## 21 The Mainstream Media has NEVER been more dishonest than it is now. NBC and MSNBC are going Crazy. They report stori… https://t.co/zLh9zOR1J1
## 22 ....The Fake News Media in our Country is the real Opposition Party. It is truly the Enemy of the People! We must b… https://t.co/Y3KuJpWBAQ
## 31 With all of the success that our Country is having, including the just released jobs numbers which are off the char… https://t.co/Urm1LOV1bb
## 1  The Fake News Media keeps saying we haven’t built any NEW WALL. Below is a section just completed on the Border. An… https://t.co/2RqbrNEznu
## 30  The story in the New York Times regarding Jim Webb being considered as the next Secretary of Defense is FAKE NEWS.… https://t.co/1wwN10V5Pz
##    favorited favoriteCount replyToSN             created truncated
## 21     FALSE        125233      <NA> 2019-01-10 03:43:13      TRUE
## 22     FALSE        130273      <NA> 2019-01-07 13:31:00      TRUE
## 31     FALSE        135224      <NA> 2019-01-07 12:56:19      TRUE
## 1      FALSE        101249      <NA> 2019-01-11 17:50:04      TRUE
## 30     FALSE        105462      <NA> 2019-01-04 21:45:27      TRUE
##    replyToSID                  id replyToUID
## 21       <NA> 1083207607412760576       <NA>
## 22       <NA> 1082268365081767936       <NA>
## 31       <NA> 1082259636227620865       <NA>
## 1        <NA> 1083783112973320192       <NA>
## 30       <NA> 1081305634115674112       <NA>
##                                                                          statusSource
## 21 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 22 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 31 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 1  <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 30 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##         screenName retweetCount isRetweet retweeted longitude latitude
## 21 realDonaldTrump        31694     FALSE     FALSE        NA       NA
## 22 realDonaldTrump        31683     FALSE     FALSE        NA       NA
## 31 realDonaldTrump        30337     FALSE     FALSE        NA       NA
## 1  realDonaldTrump        28196     FALSE     FALSE        NA       NA
## 30 realDonaldTrump        23408     FALSE     FALSE        NA       NA

Tweets by keyword

Search:

  • knife crime
  • yesterday

knife_crime_yesterday = searchTwitter(searchString = 'knife+crime'
                                , since = '2019-01-08'
                                )

knife_crime_yesterday[1:10]
## [[1]]
## [1] "IsmailRahiman: RT @DailyMailUK: Police are armed with metal detectors in latest bid to crackdown on knife epidemic sweeping streets of Britain https://t.c…"
## 
## [[2]]
## [1] "johnbrissenden: RT @natalieisonline: There are a few alarming things about the Jayden Moodie reporting we’ve seen from the Evening Standard that go far and…"
## 
## [[3]]
## [1] "Bob4719: RT @JuanDiablo4d: @MayorofLondon @BBCSPLondon @Jo_Coburn The rampant knife crime and killings of our youth should be your priority Mr Mayor…"
## 
## [[4]]
## [1] "ot7_trash: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[5]]
## [1] "amitysv: RT @incorrectbucko: bucky, singing to himself: coming out of my cage and i’ve been doing some crime\n\nsteve: what \n\nbucky [tucking a knife i…"
## 
## [[6]]
## [1] "steer266: If May stays at No 10, I’m starting Project Hope, the positive case for remain | Sadiq Khan \nMy Project Hope is No… https://t.co/5nMLmcQijl"
## 
## [[7]]
## [1] "ikran: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[8]]
## [1] "Exhausted33: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[9]]
## [1] "rachelharger: RT @Jules_Carey: Excellent overview by @simonisrael  on the massive rise in stop and search where there is no reasonable suspicion (s.60 se…"
## 
## [[10]]
## [1] "iancadman4: RT @White_Hart_Spur: @DVATW @notasothers3 Grooming gangs rampant, knife crime epidemic... but the police take the easy route and arrest som…"
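
The search above hard-codes the date for "yesterday". As a small sketch, the date string could instead be computed relative to the day the script runs; the helper name date_days_ago is hypothetical.

```r
# Sketch: build a 'YYYY-MM-DD' date string relative to today.
# date_days_ago is a hypothetical helper; 'today' can be overridden for reproducibility.
date_days_ago = function(n_days, today = Sys.Date()) {
  format(as.Date(today) - n_days, '%Y-%m-%d')
}

since_yesterday = date_days_ago(1)
since_yesterday
```

The value can then be passed as the since argument, e.g. since = date_days_ago(1). Note that since only sets the start of the window, so the results also include tweets from today.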

Tweets by combined keywords

Example: Tweets about knife killings in London in 2019

knife_killings_london = searchTwitter(searchString = 'knife+killing+london'
                                , since = '2019-01-01'
                                )
knife_killings_london[1:5]
## [[1]]
## [1] "Isabel_Cavazos: RT @Condor_Law: Khan's London: 14-Year-Old 'Butchered in Knife Murder, Girl Slashed in Face’\n\nThis would be any US city which would elect @…"
## 
## [[2]]
## [1] "Brasssneck: @I_R_DyLaNe @joshwoolcott Last week a man was killed in front of his son on a train to London and a 14-year-old boy… https://t.co/ov24hnOp5m"
## 
## [[3]]
## [1] "sterling_poetry: RT @Condor_Law: Khan's London: 14-Year-Old 'Butchered in Knife Murder, Girl Slashed in Face’\n\nThis would be any US city which would elect @…"
## 
## [[4]]
## [1] "WantBigHammer: RT @Condor_Law: Khan's London: 14-Year-Old 'Butchered in Knife Murder, Girl Slashed in Face’\n\nThis would be any US city which would elect @…"
## 
## [[5]]
## [1] "joeyd541: RT @Condor_Law: Khan's London: 14-Year-Old 'Butchered in Knife Murder, Girl Slashed in Face’\n\nThis would be any US city which would elect @…"

Tweets by combined keywords

Zoom in further…

Example: Tweets about murder on Surrey train (4th of Jan)

surrey_train_killings = searchTwitter(searchString = 'surrey+murder+knife'
                                , since = '2019-01-01'
                                , n = 100
                                )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100 tweets were requested but the
## API can only return 82
df.surrey_train_killings = twListToDF(surrey_train_killings)

#how many are retweets?
table(df.surrey_train_killings$isRetweet)
## 
## FALSE  TRUE 
##    16    66
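
Counts are often easier to read as proportions. The sketch below rebuilds the retweet indicator from the counts printed above (16 original tweets, 66 retweets) so that it runs without API access; with the real data one would use df.surrey_train_killings$isRetweet directly.

```r
# Sketch: share of retweets, using the counts from the table above as stand-in data.
is_retweet = c(rep(FALSE, 16), rep(TRUE, 66))

# proportions instead of raw counts
retweet_share = prop.table(table(is_retweet))
round(retweet_share, 2)

# keep only the original tweets; with the real data frame this would be
# df.surrey_train_killings[!df.surrey_train_killings$isRetweet, ]
originals_only = is_retweet[!is_retweet]
length(originals_only)
```

Dropping retweets before further analysis avoids counting the same text many times.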

Tweets by author

Search:

  • tweets in 2019
  • by the Mayor of London

mol_2019 = searchTwitter(searchString = 'from:MayorofLondon'
                         , since = '2019-01-01'
                         , n = 100
                         )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100 tweets were requested but the
## API can only return 46

df.mol_2019 = twListToDF(mol_2019)
head(df.mol_2019)
##                                                                                                                                           text
## 1 Tune in to @BBCSPLondon from 11am where I’ll be speaking live with @Jo_Coburn about my priorities for London and wh… https://t.co/vBf78ymhR8
## 2                           I’ll be on @BBC5Live with @JPonpolitics today talking about the issues facing Londoners. Listen in live from 10am.
## 3 No one should be sleeping rough tonight. We’re doing everything we can to help rough sleepers get the care they nee… https://t.co/RWNZw5JOVO
## 4 Can you guess how many Hopper journeys are made every day? Join Talk London - our online community - to take the Lo… https://t.co/fstPJjpSlZ
## 5 Violent crime has no place in our city, and I’m determined to do everything I can to keep Londoners safe. Our Viole… https://t.co/vf7ZiQO1og
## 6           What an incredible celebration of Waltham Forest this evening - our first-ever London Borough of Culture.… https://t.co/QofzUfkyBH
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE            41      <NA> 2019-01-13 10:49:23      TRUE
## 2     FALSE            36      <NA> 2019-01-13 09:01:32     FALSE
## 3     FALSE           152      <NA> 2019-01-12 17:14:08      TRUE
## 4     FALSE            43      <NA> 2019-01-12 14:06:36      TRUE
## 5     FALSE           239      <NA> 2019-01-12 10:10:16      TRUE
## 6     FALSE           191      <NA> 2019-01-11 20:14:01      TRUE
##   replyToSID                  id replyToUID
## 1       <NA> 1084402019174096896       <NA>
## 2       <NA> 1084374879321894915       <NA>
## 3       <NA> 1084136458137583618       <NA>
## 4       <NA> 1084089261786390528       <NA>
## 5       <NA> 1084029787222523906       <NA>
## 6       <NA> 1083819341127258113       <NA>
##                                                                         statusSource
## 1                 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3                 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 4                 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6                 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
##      screenName retweetCount isRetweet retweeted longitude latitude
## 1 MayorofLondon            6     FALSE     FALSE        NA       NA
## 2 MayorofLondon           10     FALSE     FALSE        NA       NA
## 3 MayorofLondon           53     FALSE     FALSE        NA       NA
## 4 MayorofLondon            9     FALSE     FALSE        NA       NA
## 5 MayorofLondon           55     FALSE     FALSE        NA       NA
## 6 MayorofLondon           59     FALSE     FALSE        NA       NA

Tweets by author

plot(df.mol_2019$retweetCount, df.mol_2019$favoriteCount)

The “outlier”?

df.mol_2019[which.max(df.mol_2019$favoriteCount), 'text']
## [1] "The PM must finally rule out a no deal Brexit once and for all to avoid catastrophic consequences for Londoners and… https://t.co/kxvf1r7LJy"

Tweets by location

Search:

  • most retweeted tweets -> sort by retweetCount
  • yesterday -> set date to 8 Jan 2019
  • in London -> ???
  • about: …

Solution: geo coordinates.

Tweets by location

(51.507824, -0.127654)

We also add a radius around that point location.
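
The point and the radius are combined into a single 'latitude,longitude,radius' string. A minimal sketch of a helper that builds this string follows; make_geocode is a hypothetical name, and the 15 km radius is simply an example value.

```r
# Sketch: combine a point location and a radius into one geocode string.
# make_geocode is a hypothetical helper, not part of twitteR.
make_geocode = function(latitude, longitude, radius_km) {
  paste(latitude, longitude, paste0(radius_km, 'km'), sep = ',')
}

london_geocode = make_geocode(latitude = 51.507824, longitude = -0.127654, radius_km = 15)
london_geocode
```

The resulting string is in the form that the geocode argument of searchTwitter is given in these slides.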

Tweets by location

tweets_in_london = searchTwitter(searchString = ' '
                         , since = '2019-01-01'
                         , n = 100
                         , geocode='51.507824,-0.127654,15km'
                         )
head(tweets_in_london)
## [[1]]
## [1] "niamhethandarcy: RT @matthewsyed: British tennis was amateurish, parochial and about the old boy network. Then the Murray family arrived and drove a coach a…"
## 
## [[2]]
## [1] "gtrlucie: RT @aveirjapan: https://t.co/VI1jTcLU2t"
## 
## [[3]]
## [1] "preshn9: RT @donnyc1975: @JolyonMaugham @damocrat With one party you will have disaster capitalism - with the other you’ll have disaster socialism .…"
## 
## [[4]]
## [1] "flawlessftmalik: RT @esnysophie: LIAM HAS BEEN NOMINATED FOR A BRIT. DO NOT LET HIM DOWN GUYS. THIS MAN GAVE US A FREE CONCERT, MET ALL OF HIS FANS OUTSIDE…"
## 
## [[5]]
## [1] "FrauVonHerzL: RT @streetartmagic: Zabou - French Street Artist - Paris (F) - 02/2015 https://t.co/QihGPaD8y4"
## 
## [[6]]
## [1] "Ege2503: RT @NZ_Bazou: T’es en couple? — Célibataire depuis 1956 https://t.co/VK3nU3XW1L"

Tweets by language

Common problem:

  • you search for a keyword
  • but it’s in multiple languages

?searchTwitter

lang:

If not NULL, restricts tweets to the given language, given by an ISO 639-1 code

Tweets by language

ISO 639-1 language codes: https://www.loc.gov/standards/iso639-2/php/code_list.php

searchTwitter(searchString = '#metoo'
                         , since = '2019-01-01'
                         , n = 5
                         , lang = 'en'
                         )
## [[1]]
## [1] "LeadingWPassion: How to Tap Into Your Greatest Leadership Potential: https://t.co/z7ChRwi9xS\n#feminism #metoo"
## 
## [[2]]
## [1] "Tudumonstu: RT @RoshanKrRai: So #RajkumarHirani , the biggest image laundry is now himself stained. \n\nLet's see who makes a movie on him making him loo…"
## 
## [[3]]
## [1] "metoozoo: #MeToo Merch - YellowMaps Beaver Island MI topo map, 1:100000 Scale, 30 X 60 Minute, Historical, 1984, Updated 1989… https://t.co/DsklJ4FZrA"
## 
## [[4]]
## [1] "gulfkannadiga: RT @timesofindia: #MeToo movement: Filmmaker #RajkummarHirani's Assistant Director  of #Sanju accuses him of sexual harassment \n\nvia @etime…"
## 
## [[5]]
## [1] "WeForNews: Rajkumar Hirani accused of sexual assault during making of Sanju\n\n#MeToo #MeTooMovement #RajuBhai #RajKumarHirani… https://t.co/rw3zpitdUd"

Tweets by language

searchTwitter(searchString = '#metoo'
                         , since = '2019-01-01'
                         , n = 5
                         , lang = 'es'
                         )
## [[1]]
## [1] "Mirbia3: RT @la_patilla: Los aspirantes demócratas a la Casa Blanca bajo examen del #MeToo https://t.co/DYj2t5OFcS       .     ."
## 
## [[2]]
## [1] "77diegoleon: RT @NunkMasKKs: Las del hashtag #JuntoAActricesArgentinas que se llenan la boca hablando de sororidad, son las mismas que compartían actos…"
## 
## [[3]]
## [1] "Marlyndreams: RT @NunkMasKKs: Las del hashtag #JuntoAActricesArgentinas que se llenan la boca hablando de sororidad, son las mismas que compartían actos…"
## 
## [[4]]
## [1] "lucaluna2015: RT @NunkMasKKs: Las del hashtag #JuntoAActricesArgentinas que se llenan la boca hablando de sororidad, son las mismas que compartían actos…"
## 
## [[5]]
## [1] "Remanso4: RT @NunkMasKKs: Las Griseldas Sicilianis de la vida dijeron: \"A las pibas se les cree\", cuando Thelma Fardin denunció a Darthés.\n\nAhora, ot…"

Combined search queries

Example: Burglaries since Christmas?

combined_query_1 = searchTwitter(searchString = 'burgled'
              , since = '2018-12-24'
              #, until = '2019-01-07'
              , n = 100
              , lang = 'en'
              )
head(combined_query_1)
## [[1]]
## [1] "RobertCongreve: RT @K9Finn: Riding school charity for the disabled were burgled and all their equipment stolen. Anyone have any spare kit they can donate t…"
## 
## [[2]]
## [1] "jwoodford74: RT @paul_samouelle: Anybody seen this piece of scum. He burgled my children's nursery in Wanstead 1am on the 11th of Jan. https://t.co/ZSfj…"
## 
## [[3]]
## [1] "RobertAlanWint2: RT @K9Finn: Riding school charity for the disabled were burgled and all their equipment stolen. Anyone have any spare kit they can donate t…"
## 
## [[4]]
## [1] "PhoebeBellend: can’t believe my gaffs been burgled haha what a day"
## 
## [[5]]
## [1] "RobertAlanWint2: RT @neenaw: My mother (72) was burgled last week. They took her watch, which is nearly 40 years old and was a present from my father, who d…"
## 
## [[6]]
## [1] "oconnke: RT @RichardWellings: When ordinary members of the public get robbed or burgled the police often don't even bother to turn up. How different…"

Combined search queries

Example: Reactions to the Soubry issue in two cities?

soubry_london = searchTwitter(searchString = 'soubry'
              , since = '2019-01-01'
              , n = 100
              , lang = 'en'
              , geocode='51.507824,-0.127654,15km'
              )
head(soubry_london, 2)
## [[1]]
## [1] "baneman21: RT @Frankhaviland: If calling Anna Soubry a ‘Nazi’ is a crime, we’re gonna need another 17.4 million prison places. #JamesGoddard https://t…"
## 
## [[2]]
## [1] "journopoly: @Anna_Soubry @carolecadwalla The citizens see past the pretence of Parliamentarians and Parliawomentarians working… https://t.co/OAflmxtNJd"

soubry_manchester = searchTwitter(searchString = 'soubry'
              , since = '2019-01-01'
              , n = 100
              , lang = 'en'
              , geocode='53.480874,-2.242588,15km'
              )
head(soubry_manchester, 2)
## [[1]]
## [1] "jameslynn38: RT @StephenWadswor2: Absolutely shameful. Crocodile tears, then when the direct question of her party's responsibility for this situation i…"
## 
## [[2]]
## [1] "JamesIsherwoo15: BBC News - Brexit failure a catastrophic breach of trust, says May https://t.co/juUQrAxmu3. Grieve and Soubry give… https://t.co/0cGz7hqEbS"
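
The two searches above differ only in the geocode. One way to avoid repeating the call, sketched here under the assumption that the search function accepts searchString and geocode arguments as in the slides, is to loop over a named vector of geocodes. The helper search_by_city is hypothetical; the search function is passed in so that the sketch can be tried with a stand-in function and without API access.

```r
# Sketch: run the same query once per city.
# search_by_city is a hypothetical helper; by default it would call twitteR::searchTwitter.
city_geocodes = c(london = '51.507824,-0.127654,15km'
                  , manchester = '53.480874,-2.242588,15km'
                  )

search_by_city = function(query, geocodes, search_fun = twitteR::searchTwitter, ...) {
  # one search per city; the result is a named list with one element per city
  lapply(geocodes, function(gc) search_fun(searchString = query, geocode = gc, ...))
}

# Stand-in search function for a dry run: it only echoes its inputs.
echo_search = function(searchString, geocode, ...) paste(searchString, geocode)
dry_run = search_by_city('soubry', city_geocodes, search_fun = echo_search)
names(dry_run)
```

With valid credentials, search_by_city('soubry', city_geocodes, since = '2019-01-01', n = 100, lang = 'en') would be expected to return one list of tweets per city.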

Combined search queries

Your turn: what does this search do?

searchTwitter(searchString = '@Anna_Soubry + nazi'
              , since = '2019-01-01'
              , n = 5
              , lang = 'en'
              , resultType = 'popular'
              )
## [[1]]
## [1] "SuzanneEvans1: And after screaming blue murder when @Anna_Soubry is called a ‘Nazi’, tonight @bbcnews is...silent. https://t.co/UrQm6GpuD2"
## 
## [[2]]
## [1] "BBCNormanS: Is this what its come to ...? @Anna_Soubry faces \"nazi\" taunts..... https://t.co/NHNMULtbEK"
## 
## [[3]]
## [1] "BBCPolitics: \"This is astonishing. This is what has happened to our country\" \n\nConservative MP @Anna_Soubry briefly stops live B… https://t.co/c0hZ22swyl"
## 
## [[4]]
## [1] "AngelaRayner: What has our Country come to when watching @BBCNews all you can hear is chants from protesters calling @Anna_Soubry… https://t.co/rCjXUS8ksR"
## 
## [[5]]
## [1] "davidkurten: I agree with @Anna_Soubry - it's wrong to call someone a Nazi with no justification. I wonder if her mate… https://t.co/x8dXYZUHnY"

getTrends…

San Francisco, California -> woeid = 2487956

trends_sf = getTrends(woeid = 2487956)

head(trends_sf)
##          name                                           url
## 1     Cowboys           http://twitter.com/search?q=Cowboys
## 2   #dalvslar       http://twitter.com/search?q=%23dalvslar
## 3 CJ Anderson http://twitter.com/search?q=%22CJ+Anderson%22
## 4        Rams              http://twitter.com/search?q=Rams
## 5    #INDvsKC        http://twitter.com/search?q=%23INDvsKC
## 6       Colts             http://twitter.com/search?q=Colts
##               query   woeid
## 1           Cowboys 2487956
## 2       %23dalvslar 2487956
## 3 %22CJ+Anderson%22 2487956
## 4              Rams 2487956
## 5        %23INDvsKC 2487956
## 6             Colts 2487956
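
The woeid has to be known before getTrends can be called. The sketch below looks it up by place name in a data frame of locations. It assumes a table with name, country and woeid columns, which is the shape I would expect from twitteR's availableTrendLocations(); the helper find_woeid is hypothetical, and the one-row stand-in table only repeats the San Francisco value from above.

```r
# Sketch: look up a woeid by place name in a table of trend locations.
# find_woeid is a hypothetical helper; the column names are an assumption.
find_woeid = function(locations, place_name) {
  # case-insensitive exact match on the place name
  matches = locations[tolower(locations$name) == tolower(place_name), , drop = FALSE]
  matches$woeid
}

# Stand-in table with the single location used in these slides.
example_locations = data.frame(name = 'San Francisco'
                               , country = 'United States'
                               , woeid = 2487956
                               , stringsAsFactors = FALSE
                               )
find_woeid(example_locations, 'san francisco')
```

With an authenticated session, find_woeid(availableTrendLocations(), 'London') would be the corresponding call, provided the returned table has the assumed columns.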

Additional stuff…

Twitter API/twitteR package

  • URLs
  • followers
  • user info

Twitter Search API vs Twitter Stream API

  • query quality
  • result quantity

Problems with Twitter/YouTube data?

Some issues:

  • Sample representativeness
  • Location accuracy
  • Location availability
  • Sampling through API

Case 2: YouTube’s API

Getting access

Basic steps

  1. Google account
  2. Log in to the Google Developers Console
  3. Create project/app
  4. Obtain access credentials

Tutorial here

The tuber package

library(tuber)

Authentication

client_secret = 'rwHJJDPf_xdvIWmQ4TL00HKz'
client_id = '625618111946-mf44nomvi5m9ot668b59k7koq122jmaa.apps.googleusercontent.com'

yt_oauth(app_id = client_id, app_secret = client_secret, token='')

Meta data per video

First step: get the video ID.
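For a standard watch URL, the video ID is the value of the `v` parameter. A small sketch of how that could be pulled out in base R; `extract_video_id` is a hypothetical helper and it assumes URLs of the form `https://www.youtube.com/watch?v=...` (shortened or embedded links would need a different pattern):

```r
# Hypothetical helper: capture everything after "v=" up to the next "&"
# (or the end of the URL). If the pattern is absent, sub() returns the
# input unchanged, so check the result before using it.
extract_video_id = function(url) {
  sub(".*[?&]v=([^&]+).*", "\\1", url)
}

example_url = "https://www.youtube.com/watch?v=_H-UnmiMc3s"
video_identifier_from_url = extract_video_id(example_url)
print(video_identifier_from_url)
```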

Meta data per video

Get the “meta” stats for this video:

video_identifier = '_H-UnmiMc3s'

video_stats = get_stats(video_id = video_identifier)
as.data.frame(video_stats)
##            id viewCount likeCount dislikeCount favoriteCount commentCount
## 1 _H-UnmiMc3s   1248936     41311         1041             0         3921

Detailed data per video

Want more detail?

video_details = get_video_details(video_id = video_identifier)
video_details
## $kind
## [1] "youtube#videoListResponse"
## 
## $etag
## [1] "\"XpPGQXPnxQJhLgs6enD_n8JR4Qk/jSWIMSiXSwvh-NUIY6h8ErCkZhw\""
## 
## $pageInfo
## $pageInfo$totalResults
## [1] 1
## 
## $pageInfo$resultsPerPage
## [1] 1
## 
## 
## $items
## $items[[1]]
## $items[[1]]$kind
## [1] "youtube#video"
## 
## $items[[1]]$etag
## [1] "\"XpPGQXPnxQJhLgs6enD_n8JR4Qk/QXED6LnSMrVngC72qDrXUvb2znc\""
## 
## $items[[1]]$id
## [1] "_H-UnmiMc3s"
## 
## $items[[1]]$snippet
## $items[[1]]$snippet$publishedAt
## [1] "2018-12-14T14:37:30.000Z"
## 
## $items[[1]]$snippet$channelId
## [1] "UCtinbF-Q-fVthA0qrFQTgXQ"
## 
## $items[[1]]$snippet$title
## [1] "Never Ride an ELECTRIC SCOOTER in the Rain"
## 
## $items[[1]]$snippet$description
## [1] "thank you for such an amazing visit Poland.  looking forward to coming back to Warsaw.\n\nmusic by; https://youtube.com/joakimkarud\nlast song by; http://smarturl.it/venturamusix"
## 
## $items[[1]]$snippet$thumbnails
## $items[[1]]$snippet$thumbnails$default
## $items[[1]]$snippet$thumbnails$default$url
## [1] "https://i.ytimg.com/vi/_H-UnmiMc3s/default.jpg"
## 
## $items[[1]]$snippet$thumbnails$default$width
## [1] 120
## 
## $items[[1]]$snippet$thumbnails$default$height
## [1] 90
## 
## 
## $items[[1]]$snippet$thumbnails$medium
## $items[[1]]$snippet$thumbnails$medium$url
## [1] "https://i.ytimg.com/vi/_H-UnmiMc3s/mqdefault.jpg"
## 
## $items[[1]]$snippet$thumbnails$medium$width
## [1] 320
## 
## $items[[1]]$snippet$thumbnails$medium$height
## [1] 180
## 
## 
## $items[[1]]$snippet$thumbnails$high
## $items[[1]]$snippet$thumbnails$high$url
## [1] "https://i.ytimg.com/vi/_H-UnmiMc3s/hqdefault.jpg"
## 
## $items[[1]]$snippet$thumbnails$high$width
## [1] 480
## 
## $items[[1]]$snippet$thumbnails$high$height
## [1] 360
## 
## 
## $items[[1]]$snippet$thumbnails$standard
## $items[[1]]$snippet$thumbnails$standard$url
## [1] "https://i.ytimg.com/vi/_H-UnmiMc3s/sddefault.jpg"
## 
## $items[[1]]$snippet$thumbnails$standard$width
## [1] 640
## 
## $items[[1]]$snippet$thumbnails$standard$height
## [1] 480
## 
## 
## $items[[1]]$snippet$thumbnails$maxres
## $items[[1]]$snippet$thumbnails$maxres$url
## [1] "https://i.ytimg.com/vi/_H-UnmiMc3s/maxresdefault.jpg"
## 
## $items[[1]]$snippet$thumbnails$maxres$width
## [1] 1280
## 
## $items[[1]]$snippet$thumbnails$maxres$height
## [1] 720
## 
## 
## 
## $items[[1]]$snippet$channelTitle
## [1] "CaseyNeistat"
## 
## $items[[1]]$snippet$tags
## $items[[1]]$snippet$tags[[1]]
## [1] "warsaw"
## 
## $items[[1]]$snippet$tags[[2]]
## [1] "poland"
## 
## $items[[1]]$snippet$tags[[3]]
## [1] "lime scooter"
## 
## $items[[1]]$snippet$tags[[4]]
## [1] "bird scooter"
## 
## $items[[1]]$snippet$tags[[5]]
## [1] "byrd"
## 
## 
## $items[[1]]$snippet$categoryId
## [1] "22"
## 
## $items[[1]]$snippet$liveBroadcastContent
## [1] "none"
## 
## $items[[1]]$snippet$localized
## $items[[1]]$snippet$localized$title
## [1] "Never Ride an ELECTRIC SCOOTER in the Rain"
## 
## $items[[1]]$snippet$localized$description
## [1] "thank you for such an amazing visit Poland.  looking forward to coming back to Warsaw.\n\nmusic by; https://youtube.com/joakimkarud\nlast song by; http://smarturl.it/venturamusix"
## 
## 
## $items[[1]]$snippet$defaultAudioLanguage
## [1] "en"

Detailed data per video

A closer look:

video_details$items[[1]]$snippet$thumbnails$high$url

https://i.ytimg.com/vi/_H-UnmiMc3s/hqdefault.jpg

Think of:

  • thumbnails for propaganda
  • clickbait understanding
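The thumbnail URLs in the output above all follow the same pattern, so they can probably be built from the video ID alone. The sketch below assumes that pattern holds in general (it is inferred from the output, not taken from the API documentation); `thumbnail_url` is a hypothetical helper, and not every size is guaranteed to exist for every video:

```r
# Hypothetical helper: build a thumbnail URL from a video ID.
# `size` is one of the file stems seen in the output above,
# e.g. "default", "mqdefault", "hqdefault", "sddefault", "maxresdefault".
thumbnail_url = function(video_id, size = "hqdefault") {
  sprintf("https://i.ytimg.com/vi/%s/%s.jpg", video_id, size)
}

thumb = thumbnail_url("_H-UnmiMc3s")
print(thumb)

# To store the image for later analysis, uncomment:
# download.file(thumb, destfile = "thumbnail.jpg", mode = "wb")
```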

Comments per video

Comments made below the video:

New video: Introducing the Numberphile Podcast

video_identifier_2 = '0GzhWPj4-cw'

video_comments = get_all_comments(video_id = video_identifier_2)
names(video_comments)
##  [1] "authorDisplayName"     "authorProfileImageUrl"
##  [3] "authorChannelUrl"      "authorChannelId.value"
##  [5] "videoId"               "textDisplay"          
##  [7] "textOriginal"          "canRate"              
##  [9] "viewerRating"          "likeCount"            
## [11] "publishedAt"           "updatedAt"            
## [13] "id"                    "parentId"             
## [15] "moderationStatus"

Comments per video

A closer look:

head(video_comments)
##   authorDisplayName
## 1         For Phone
## 2      Martin Wujet
## 3        DarknessFX
## 4       vignesh war
## 5       Knives Town
## 6   Leonardo Castro
##                                                                                         authorProfileImageUrl
## 1 https://yt3.ggpht.com/-zzteRRx5IS4/AAAAAAAAAAI/AAAAAAAAAAA/n5vSdvluSPk/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
## 2 https://yt3.ggpht.com/-M-LYAfoZxEw/AAAAAAAAAAI/AAAAAAAAAAA/YSgUM7jnmJ0/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
## 3 https://yt3.ggpht.com/-LCk4tCJGpyw/AAAAAAAAAAI/AAAAAAAAAAA/i9d94vyWM8o/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
## 4 https://yt3.ggpht.com/-Vp8GoRNPR3s/AAAAAAAAAAI/AAAAAAAAAAA/qwchRgCi_b0/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
## 5 https://yt3.ggpht.com/-xOvjHZvnw2c/AAAAAAAAAAI/AAAAAAAAAAA/BFjyPiiydRU/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
## 6 https://yt3.ggpht.com/-2Vp8jMULXk8/AAAAAAAAAAI/AAAAAAAAAAA/oH74EQ4onzw/s28-c-k-no-mo-rj-c0xffffff/photo.jpg
##                                          authorChannelUrl
## 1 http://www.youtube.com/channel/UC5VzcYt1NpagomAQ1I6gksw
## 2 http://www.youtube.com/channel/UCIF-cT20nIjlamcp2qny_Xg
## 3 http://www.youtube.com/channel/UCZSYMF-QGU4tZCIj2BcG6FA
## 4 http://www.youtube.com/channel/UCWOnFa83rPhj1Trg-m_9TCA
## 5 http://www.youtube.com/channel/UCFX6IKRPAGxaOUe5XhPorPQ
## 6 http://www.youtube.com/channel/UCMA0m1bf2TqR_ffSqbjT6jg
##      authorChannelId.value     videoId
## 1 UC5VzcYt1NpagomAQ1I6gksw 0GzhWPj4-cw
## 2 UCIF-cT20nIjlamcp2qny_Xg 0GzhWPj4-cw
## 3 UCZSYMF-QGU4tZCIj2BcG6FA 0GzhWPj4-cw
## 4 UCWOnFa83rPhj1Trg-m_9TCA 0GzhWPj4-cw
## 5 UCFX6IKRPAGxaOUe5XhPorPQ 0GzhWPj4-cw
## 6 UCMA0m1bf2TqR_ffSqbjT6jg 0GzhWPj4-cw
##                                                                                                                                                                                           textDisplay
## 1                                                                                     Nice, thanks!! didn.t know what podcasts were before now!! They are so awesome!! THanks for this big discovery!
## 2                                                                                                                                  Hi !! how many possible wordings Einstein / Lewis Caroll riddle ??
## 3 If there is a link to download it would be lovely to upload to my offline headphones, if it&#39;s online only then I would stick with the main channel and watch daily when I have internet access.
## 4                                                                                                                                                               I could listen to 3b1b voice all day!
## 5                                                                                                                                                      If blue was a number, what number would it be?
## 6                                                                                                                                                  Could you make a video about what a dual space is?
##                                                                                                                                                                                      textOriginal
## 1                                                                                 Nice, thanks!! didn.t know what podcasts were before now!! They are so awesome!! THanks for this big discovery!
## 2                                                                                                                              Hi !! how many possible wordings Einstein / Lewis Caroll riddle ??
## 3 If there is a link to download it would be lovely to upload to my offline headphones, if it's online only then I would stick with the main channel and watch daily when I have internet access.
## 4                                                                                                                                                           I could listen to 3b1b voice all day!
## 5                                                                                                                                                  If blue was a number, what number would it be?
## 6                                                                                                                                              Could you make a video about what a dual space is?
##   canRate viewerRating likeCount              publishedAt
## 1    TRUE         none         0 2019-01-09T16:49:08.000Z
## 2    TRUE         none         0 2019-01-09T05:53:53.000Z
## 3    TRUE         none         0 2019-01-08T04:33:58.000Z
## 4    TRUE         none         0 2019-01-06T09:45:04.000Z
## 5    TRUE         none         0 2019-01-04T02:19:10.000Z
## 6    TRUE         none         0 2019-01-03T18:33:49.000Z
##                  updatedAt                         id parentId
## 1 2019-01-09T16:49:08.000Z UgwdprwF8zOtUL4F9Od4AaABAg     <NA>
## 2 2019-01-09T05:53:53.000Z UgzTmRkGTsfSHxRePgh4AaABAg     <NA>
## 3 2019-01-08T04:33:58.000Z UgyG5bqAlAEHK0jyuCp4AaABAg     <NA>
## 4 2019-01-06T09:45:04.000Z UgyrU_R1hdP0_bpqk2d4AaABAg     <NA>
## 5 2019-01-04T02:19:10.000Z UgxSxeLYV0UoAO8BQP94AaABAg     <NA>
## 6 2019-01-03T18:34:11.000Z Ugzotl5B9AXCiPkhQ3V4AaABAg     <NA>
##   moderationStatus
## 1               NA
## 2               NA
## 3               NA
## 4               NA
## 5               NA
## 6               NA
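The `publishedAt` column is a plain ISO 8601 string, so it needs parsing before any time-based analysis. A minimal sketch that counts comments per day; the timestamps are copied by hand from the output above, and in practice `video_comments$publishedAt` would take their place:

```r
# Timestamps copied from the `publishedAt` column printed above
comment_times = c("2019-01-09T16:49:08.000Z", "2019-01-09T05:53:53.000Z",
                  "2019-01-08T04:33:58.000Z", "2019-01-06T09:45:04.000Z",
                  "2019-01-04T02:19:10.000Z", "2019-01-03T18:33:49.000Z")

# %OS reads seconds including the fractional part; the trailing "Z" means UTC
parsed_times = as.POSIXct(comment_times,
                          format = "%Y-%m-%dT%H:%M:%OSZ",
                          tz = "UTC")

# Reduce each timestamp to its calendar day and count comments per day
comment_days = format(parsed_times, "%Y-%m-%d")
comments_per_day = table(comment_days)
print(comments_per_day)
```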

Transcript (captions)

YouTube transcripts: a yet untapped source!

Few exceptions:

Transcript (captions)

Tricky to get transcripts.

  • uploader needs to provide proper transcript (high quality, low availability)
  • workaround approach (moderate/low quality, high availability)

?get_captions
?list_caption_tracks

Meta data per channel

First: identify the channel ID.

Meta data per channel

channel_identifier = 'UCLXo7UDZvByw2ixzpQCufnA'
channel_stats = get_channel_stats(channel_id = channel_identifier)
## Channel Title: Vox 
## No. of Views: 1209094639 
## No. of Subscribers: 5426703 
## No. of Videos: 946

channel_stats
## $kind
## [1] "youtube#channel"
## 
## $etag
## [1] "\"XpPGQXPnxQJhLgs6enD_n8JR4Qk/u5LBm16TDKiRba0uYloNu5kz1aQ\""
## 
## $id
## [1] "UCLXo7UDZvByw2ixzpQCufnA"
## 
## $snippet
## $snippet$title
## [1] "Vox"
## 
## $snippet$description
## [1] "Vox helps you cut through the noise and understand what's driving events in the headlines and in our lives.\n\nVox Video is Joe Posner, Mona Lalwani, Valerie Lapinski, Joss Fong, Estelle Caswell, Johnny Harris, Phil Edwards, Carlos Waters, Gina Barton, Liz Scheltens, Christophe Haubursin, Carlos Maza, Coleman Lowndes, Dion Lee, Mac Schneider, Sam Ellis, Ellen Rolfes, Mallory Brangan, Ranjani Chakraborty, Madeline Marshall, Kimberly Mas, Danush Parveneh, Christina Thornell, Alvin Chang, Agnes Mazur, Tian Wang, Rachel Abady, and the staff of Vox.com\n\nTo show us some love, get closer to our work, and creators and get exclusive access to our creators and a peek behind-the-scenes access, become a member of the Vox Video Lab today: http://www.vox.com/join \n\nDon’t forget to subscribe so you don't miss a video: http://goo.gl/0bsAjO. For even more Vox, head over to http://www.vox.com \n\nTo write us: joe@vox.com\nTo request permission to use our videos: permissions@voxmedia.com"
## 
## $snippet$customUrl
## [1] "voxdotcom"
## 
## $snippet$publishedAt
## [1] "2014-03-04T20:30:22.000Z"
## 
## $snippet$thumbnails
## $snippet$thumbnails$default
## $snippet$thumbnails$default$url
## [1] "https://yt3.ggpht.com/a-/AAuE7mBlnA9KlCyHqQzT6DIpVGM3e0_gSv3nKdwgsA=s88-mo-c-c0xffffffff-rj-k-no"
## 
## $snippet$thumbnails$default$width
## [1] 88
## 
## $snippet$thumbnails$default$height
## [1] 88
## 
## 
## $snippet$thumbnails$medium
## $snippet$thumbnails$medium$url
## [1] "https://yt3.ggpht.com/a-/AAuE7mBlnA9KlCyHqQzT6DIpVGM3e0_gSv3nKdwgsA=s240-mo-c-c0xffffffff-rj-k-no"
## 
## $snippet$thumbnails$medium$width
## [1] 240
## 
## $snippet$thumbnails$medium$height
## [1] 240
## 
## 
## $snippet$thumbnails$high
## $snippet$thumbnails$high$url
## [1] "https://yt3.ggpht.com/a-/AAuE7mBlnA9KlCyHqQzT6DIpVGM3e0_gSv3nKdwgsA=s800-mo-c-c0xffffffff-rj-k-no"
## 
## $snippet$thumbnails$high$width
## [1] 800
## 
## $snippet$thumbnails$high$height
## [1] 800
## 
## 
## 
## $snippet$localized
## $snippet$localized$title
## [1] "Vox"
## 
## $snippet$localized$description
## [1] "Vox helps you cut through the noise and understand what's driving events in the headlines and in our lives.\n\nVox Video is Joe Posner, Mona Lalwani, Valerie Lapinski, Joss Fong, Estelle Caswell, Johnny Harris, Phil Edwards, Carlos Waters, Gina Barton, Liz Scheltens, Christophe Haubursin, Carlos Maza, Coleman Lowndes, Dion Lee, Mac Schneider, Sam Ellis, Ellen Rolfes, Mallory Brangan, Ranjani Chakraborty, Madeline Marshall, Kimberly Mas, Danush Parveneh, Christina Thornell, Alvin Chang, Agnes Mazur, Tian Wang, Rachel Abady, and the staff of Vox.com\n\nTo show us some love, get closer to our work, and creators and get exclusive access to our creators and a peek behind-the-scenes access, become a member of the Vox Video Lab today: http://www.vox.com/join \n\nDon’t forget to subscribe so you don't miss a video: http://goo.gl/0bsAjO. For even more Vox, head over to http://www.vox.com \n\nTo write us: joe@vox.com\nTo request permission to use our videos: permissions@voxmedia.com"
## 
## 
## $snippet$country
## [1] "US"
## 
## 
## $statistics
## $statistics$viewCount
## [1] "1209094639"
## 
## $statistics$commentCount
## [1] "0"
## 
## $statistics$subscriberCount
## [1] "5426703"
## 
## $statistics$hiddenSubscriberCount
## [1] FALSE
## 
## $statistics$videoCount
## [1] "946"

Meta data per channel

Some statistics per channel:

channel_stats$statistics
## $viewCount
## [1] "1209094639"
## 
## $commentCount
## [1] "0"
## 
## $subscriberCount
## [1] "5426703"
## 
## $hiddenSubscriberCount
## [1] FALSE
## 
## $videoCount
## [1] "946"
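Note that the counts above are returned as character strings (they are printed in quotes), so they have to be converted before doing arithmetic. A small sketch using a hand-copied version of the list above; with a live result, `channel_stats$statistics` would replace `channel_statistics_example`:

```r
# Hand-copied from the output above; stands in for channel_stats$statistics
channel_statistics_example = list(viewCount = "1209094639",
                                  commentCount = "0",
                                  subscriberCount = "5426703",
                                  hiddenSubscriberCount = FALSE,
                                  videoCount = "946")

# as.numeric rather than as.integer: view counts of large channels can
# exceed the integer range
views = as.numeric(channel_statistics_example$viewCount)
videos = as.numeric(channel_statistics_example$videoCount)
subscribers = as.numeric(channel_statistics_example$subscriberCount)

views_per_video = views / videos
views_per_subscriber = views / subscribers
print(round(views_per_video))
```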

Activity data per channel

New channel (smaller): JSNation

channel_identifier_2 = 'UCQM428Hwrvxla8DCgjGONSQ'
channel_activity = list_channel_activities(filter = c(channel_id = channel_identifier_2),
                                           max_results = 50)
names(channel_activity)
##  [1] "publishedAt"                "channelId"                 
##  [3] "title"                      "description"               
##  [5] "thumbnails.default.url"     "thumbnails.default.width"  
##  [7] "thumbnails.default.height"  "thumbnails.medium.url"     
##  [9] "thumbnails.medium.width"    "thumbnails.medium.height"  
## [11] "thumbnails.high.url"        "thumbnails.high.width"     
## [13] "thumbnails.high.height"     "thumbnails.standard.url"   
## [15] "thumbnails.standard.width"  "thumbnails.standard.height"
## [17] "thumbnails.maxres.url"      "thumbnails.maxres.width"   
## [19] "thumbnails.maxres.height"   "channelTitle"              
## [21] "type"                       "groupId"

Activity data per channel

head(channel_activity)
##                publishedAt                channelId
## 1 2018-12-10T09:58:22.000Z UCQM428Hwrvxla8DCgjGONSQ
## 2 2018-12-13T16:07:13.000Z UCQM428Hwrvxla8DCgjGONSQ
## 3 2018-12-10T09:57:31.000Z UCQM428Hwrvxla8DCgjGONSQ
## 4 2018-12-13T16:07:05.000Z UCQM428Hwrvxla8DCgjGONSQ
## 5 2018-12-10T09:54:19.000Z UCQM428Hwrvxla8DCgjGONSQ
## 6 2018-12-13T16:06:52.000Z UCQM428Hwrvxla8DCgjGONSQ
##                                                                     title
## 1 How to refactor JavaScript with JavaScript on a massive scale – Kersjes
## 2 How to refactor JavaScript with JavaScript on a massive scale – Kersjes
## 3           Creating IoT Applications with Web Bluetooth – Martin Woolley
## 4           Creating IoT Applications with Web Bluetooth – Martin Woolley
## 5                The impostor syndrome aka I'm a fraud – Claudio Semeraro
## 6                The impostor syndrome aka I'm a fraud – Claudio Semeraro
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       description
## 1 Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\nRefactoring on a massive scale is a different beast. What to do when "find and replace" simply isn't enough? We faced this challenge when we needed to unify the way an initial state of a React component was set across the codebase consisting of thousands of files. This is a story about how we created a faultless commit that touched around 100,000 lines of code. Our solution was to write a program that did the required modifications for us. These kind of programs are often called codemods. Languages and concepts are like tools in a toolbox. Codemods are a new tool to your toolbox.\n\nAbout Tijn\n\nTijn is a software engineer at Reaktor. He mainly writes Node.js/React applications, is interested in anything functional or reactive, and is rarely seen without a cup of coffee. After office hours he likes to play around with esoteric compilers.
## 2 Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\nRefactoring on a massive scale is a different beast. What to do when "find and replace" simply isn't enough? We faced this challenge when we needed to unify the way an initial state of a React component was set across the codebase consisting of thousands of files. This is a story about how we created a faultless commit that touched around 100,000 lines of code. Our solution was to write a program that did the required modifications for us. These kind of programs are often called codemods. Languages and concepts are like tools in a toolbox. Codemods are a new tool to your toolbox.\n\nAbout Tijn\n\nTijn is a software engineer at Reaktor. He mainly writes Node.js/React applications, is interested in anything functional or reactive, and is rarely seen without a cup of coffee. After office hours he likes to play around with esoteric compilers.
## 3                     Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\n10 million Bluetooth devices ship every day, and that figure is rising. Regarded as one of the key, enabling technologies of the IoT, Bluetooth is everywhere and in the summer of 2017, a new Bluetooth technology, Bluetooth mesh networking was released. Bluetooth mesh is used in enterprise and industrial IoT systems and in these environments, web technologies and cloud-based architectures are king.\n\nIn this session, we'll review key Bluetooth concepts and capabilities and the Web Bluetooth APIs which let you exploit them. There may even be demos!\n\nAbout Martin\n\nMartin Woolley is an industry veteran with over 30 years' experience working with computers large, small and ….. getting smaller. He still has a Sinclair ZX81 somewhere. He was a part of the BBC micro:bit team and designed the micro:bit's Bluetooth profile.
## 4                     Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\n10 million Bluetooth devices ship every day, and that figure is rising. Regarded as one of the key, enabling technologies of the IoT, Bluetooth is everywhere and in the summer of 2017, a new Bluetooth technology, Bluetooth mesh networking was released. Bluetooth mesh is used in enterprise and industrial IoT systems and in these environments, web technologies and cloud-based architectures are king.\n\nIn this session, we'll review key Bluetooth concepts and capabilities and the Web Bluetooth APIs which let you exploit them. There may even be demos!\n\nAbout Martin\n\nMartin Woolley is an industry veteran with over 30 years' experience working with computers large, small and ….. getting smaller. He still has a Sinclair ZX81 somewhere. He was a part of the BBC micro:bit team and designed the micro:bit's Bluetooth profile.
## 5                                                                                                                                                                                                                                                                                                                                                                          Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\nThe frontend world is moving at an incredibly fast pace, there is just so much to know that feeling overwhelmed may just be the norm. It doesn't matter if you're a seasoned developer or a junior just starting out, comparing yourself to others will trigger many biases and feeling like a fraud is way more common than you may think. It even has a name: the impostor syndrome.\n\nAbout Claudio\n\nFull stack JavaScript developer, successfully pretending to know how to code for 15 years now.
## 6                                                                                                                                                                                                                                                                                                                                                                          Talk recording from AmsterdamJS December 2018 Meetup https://www.meetup.com/AmsterdamJS/events/255511327/\n\nThe frontend world is moving at an incredibly fast pace, there is just so much to know that feeling overwhelmed may just be the norm. It doesn't matter if you're a seasoned developer or a junior just starting out, comparing yourself to others will trigger many biases and feeling like a fraud is way more common than you may think. It even has a name: the impostor syndrome.\n\nAbout Claudio\n\nFull stack JavaScript developer, successfully pretending to know how to code for 15 years now.
##                           thumbnails.default.url thumbnails.default.width
## 1 https://i.ytimg.com/vi/xS7UrNPmYX8/default.jpg                      120
## 2 https://i.ytimg.com/vi/xS7UrNPmYX8/default.jpg                      120
## 3 https://i.ytimg.com/vi/6p_LJFNbJZk/default.jpg                      120
## 4 https://i.ytimg.com/vi/6p_LJFNbJZk/default.jpg                      120
## 5 https://i.ytimg.com/vi/mmXcW2x06ho/default.jpg                      120
## 6 https://i.ytimg.com/vi/mmXcW2x06ho/default.jpg                      120
##   thumbnails.default.height
## 1                        90
## 2                        90
## 3                        90
## 4                        90
## 5                        90
## 6                        90
##                              thumbnails.medium.url thumbnails.medium.width
## 1 https://i.ytimg.com/vi/xS7UrNPmYX8/mqdefault.jpg                     320
## 2 https://i.ytimg.com/vi/xS7UrNPmYX8/mqdefault.jpg                     320
## 3 https://i.ytimg.com/vi/6p_LJFNbJZk/mqdefault.jpg                     320
## 4 https://i.ytimg.com/vi/6p_LJFNbJZk/mqdefault.jpg                     320
## 5 https://i.ytimg.com/vi/mmXcW2x06ho/mqdefault.jpg                     320
## 6 https://i.ytimg.com/vi/mmXcW2x06ho/mqdefault.jpg                     320
##   thumbnails.medium.height
## 1                      180
## 2                      180
## 3                      180
## 4                      180
## 5                      180
## 6                      180
##                                thumbnails.high.url thumbnails.high.width
## 1 https://i.ytimg.com/vi/xS7UrNPmYX8/hqdefault.jpg                   480
## 2 https://i.ytimg.com/vi/xS7UrNPmYX8/hqdefault.jpg                   480
## 3 https://i.ytimg.com/vi/6p_LJFNbJZk/hqdefault.jpg                   480
## 4 https://i.ytimg.com/vi/6p_LJFNbJZk/hqdefault.jpg                   480
## 5 https://i.ytimg.com/vi/mmXcW2x06ho/hqdefault.jpg                   480
## 6 https://i.ytimg.com/vi/mmXcW2x06ho/hqdefault.jpg                   480
##   thumbnails.high.height                          thumbnails.standard.url
## 1                    360 https://i.ytimg.com/vi/xS7UrNPmYX8/sddefault.jpg
## 2                    360 https://i.ytimg.com/vi/xS7UrNPmYX8/sddefault.jpg
## 3                    360 https://i.ytimg.com/vi/6p_LJFNbJZk/sddefault.jpg
## 4                    360 https://i.ytimg.com/vi/6p_LJFNbJZk/sddefault.jpg
## 5                    360 https://i.ytimg.com/vi/mmXcW2x06ho/sddefault.jpg
## 6                    360 https://i.ytimg.com/vi/mmXcW2x06ho/sddefault.jpg
##   thumbnails.standard.width thumbnails.standard.height
## 1                       640                        480
## 2                       640                        480
## 3                       640                        480
## 4                       640                        480
## 5                       640                        480
## 6                       640                        480
##                                  thumbnails.maxres.url
## 1 https://i.ytimg.com/vi/xS7UrNPmYX8/maxresdefault.jpg
## 2 https://i.ytimg.com/vi/xS7UrNPmYX8/maxresdefault.jpg
## 3 https://i.ytimg.com/vi/6p_LJFNbJZk/maxresdefault.jpg
## 4 https://i.ytimg.com/vi/6p_LJFNbJZk/maxresdefault.jpg
## 5 https://i.ytimg.com/vi/mmXcW2x06ho/maxresdefault.jpg
## 6 https://i.ytimg.com/vi/mmXcW2x06ho/maxresdefault.jpg
##   thumbnails.maxres.width thumbnails.maxres.height channelTitle
## 1                    1280                      720     JSNation
## 2                    1280                      720     JSNation
## 3                    1280                      720     JSNation
## 4                    1280                      720     JSNation
## 5                    1280                      720     JSNation
## 6                    1280                      720     JSNation
##           type                              groupId
## 1       upload VTE1NDQ0MzU5MDIxNDAzOTY0OTAwMDI3Njg=
## 2 playlistItem VTE1NDQ0MzU5MDIxNDAzOTY0OTAwMDI3Njg=
## 3       upload VTE1NDQ0MzU4NTExNDAzOTY0OTAwMDQ0MzI=
## 4 playlistItem VTE1NDQ0MzU4NTExNDAzOTY0OTAwMDQ0MzI=
## 5       upload VTE1NDQ0MzU2NTkxNDAzOTY0OTAwMDM3OTI=
## 6 playlistItem VTE1NDQ0MzU2NTkxNDAzOTY0OTAwMDM3OTI=
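In the output above every video shows up twice: once as an `upload` and once as a `playlistItem`, with both rows sharing the same `groupId`. A sketch of two ways to keep one row per video, using a hand-copied subset of the columns; with live data, `channel_activity` would replace `activity_example`:

```r
# Subset of columns copied from the output above
activity_example = data.frame(
  publishedAt = c("2018-12-10T09:58:22.000Z", "2018-12-13T16:07:13.000Z",
                  "2018-12-10T09:57:31.000Z", "2018-12-13T16:07:05.000Z",
                  "2018-12-10T09:54:19.000Z", "2018-12-13T16:06:52.000Z"),
  type = c("upload", "playlistItem", "upload",
           "playlistItem", "upload", "playlistItem"),
  groupId = c("VTE1NDQ0MzU5MDIxNDAzOTY0OTAwMDI3Njg=",
              "VTE1NDQ0MzU5MDIxNDAzOTY0OTAwMDI3Njg=",
              "VTE1NDQ0MzU4NTExNDAzOTY0OTAwMDQ0MzI=",
              "VTE1NDQ0MzU4NTExNDAzOTY0OTAwMDQ0MzI=",
              "VTE1NDQ0MzU2NTkxNDAzOTY0OTAwMDM3OTI=",
              "VTE1NDQ0MzU2NTkxNDAzOTY0OTAwMDM3OTI="),
  stringsAsFactors = FALSE
)

# Option 1: keep only the upload events
uploads_only = activity_example[activity_example$type == "upload", ]

# Option 2: keep the first row of each activity group
one_per_group = activity_example[!duplicated(activity_example$groupId), ]

print(nrow(uploads_only))
```

Which option is appropriate depends on the question: option 1 drops activity types other than uploads entirely, option 2 keeps whichever event is listed first in each group.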

Video stats per channel

Stats for all videos in a channel:

all_video_stats = get_all_channel_video_stats(channel_id = channel_identifier_2)
names(all_video_stats)

Video stats per channel

head(all_video_stats)
##            id
## 1 _4nrh6mTt4E
## 2 _iIxC8ziZNM
## 3 -BGxJn3c7NA
## 4 -CGpVrydTyg
## 5 0t9FERJRShQ
## 6 1eH9-cLMXQg
##                                                                              title
## 1                                   Amsterdam JSNation Conference 2018 Live stream
## 2                                Smart Contracts in JavaScript - Mikhail Kuznetcov
## 3                         TypeScript Ruined My Life (In a Good Way) - Andy Mockler
## 4 SonarJS: How To Build a Static Code Analyzer - Elena Vilchik & Carlo Bottiglieri
## 5                                         The dark ages of IoT - Sebastian Golasch
## 6                       In the Ocean of Angular Web Applications - Yaprak Ayazoglu
##           publication_date viewCount likeCount dislikeCount favoriteCount
## 1 2018-06-01T16:58:23.000Z      2886        61            0             0
## 2 2018-04-08T17:11:16.000Z       351         8            0             0
## 3 2018-06-08T14:34:50.000Z       332         1            1             0
## 4 2017-09-20T20:30:04.000Z       532         7            0             0
## 5 2018-06-08T14:31:14.000Z        60         2            0             0
## 6 2018-06-17T17:42:55.000Z       133         4            0             0
##   commentCount                                         url
## 1            7 https://www.youtube.com/watch?v=_4nrh6mTt4E
## 2            0 https://www.youtube.com/watch?v=_iIxC8ziZNM
## 3            2 https://www.youtube.com/watch?v=-BGxJn3c7NA
## 4            0 https://www.youtube.com/watch?v=-CGpVrydTyg
## 5            0 https://www.youtube.com/watch?v=0t9FERJRShQ
## 6            0 https://www.youtube.com/watch?v=1eH9-cLMXQg
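With one row per video, simple engagement measures are easy to derive. The sketch below ranks videos by views and adds likes per 100 views, using values copied by hand from the output above. The column types of the real `all_video_stats` are not visible in that output, so the counts are converted defensively; `likes_per_100_views` is a name introduced here for illustration:

```r
# Values copied from the output above; stands in for all_video_stats
video_stats_example = data.frame(
  id = c("_4nrh6mTt4E", "_iIxC8ziZNM", "-BGxJn3c7NA",
         "-CGpVrydTyg", "0t9FERJRShQ", "1eH9-cLMXQg"),
  viewCount = c("2886", "351", "332", "532", "60", "133"),
  likeCount = c("61", "8", "1", "7", "2", "4"),
  stringsAsFactors = FALSE
)

# Convert via as.character so the code also works if a column is a factor
video_stats_example$viewCount = as.numeric(as.character(video_stats_example$viewCount))
video_stats_example$likeCount = as.numeric(as.character(video_stats_example$likeCount))

# Illustrative engagement measure
video_stats_example$likes_per_100_views =
  100 * video_stats_example$likeCount / video_stats_example$viewCount

# Most viewed videos first
ranked_videos = video_stats_example[order(-video_stats_example$viewCount), ]
print(ranked_videos)
```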

Additional queries

  • subscriber info: get_subscriptions
  • list all videos: get_

Problems with Twitter/YouTube data?

Some issues:

  • Sample representativeness
  • Location accuracy
  • Location availability
  • ~Sampling through API~
  • Transcript quality
  • Real representations (filtered)
  • Censored?

Crime data interfaces

police.uk as data repository

Public database of “open police data”

police.uk as data repository

API?

police.uk as data repository

Using the API:

  • different method
  • calls directly from the browser
  • no R implementation

The API is implemented as a standard JSON web service using HTTP GET and POST requests. Full request and response examples are provided in the documentation.

police.uk as data repository

“Crimes at location”

Search query example:

https://data.police.uk/api/crimes-at-location?date=2017-08&location_id=884227
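The same query can be assembled programmatically, which helps when looping over many months or locations. A minimal sketch in base R; `build_query_url` is a hypothetical helper, it assumes the parameter values need no URL-encoding, and the actual request is switched off by default and relies on the jsonlite package (an assumption, since no dedicated R wrapper is used here):

```r
# Hypothetical helper: glue an endpoint and named parameters into a URL
build_query_url = function(endpoint, params) {
  query = paste(names(params), unlist(params), sep = "=", collapse = "&")
  paste0(endpoint, "?", query)
}

request_url = build_query_url(
  "https://data.police.uk/api/crimes-at-location",
  list(date = "2017-08", location_id = "884227")
)
print(request_url)

# Set to TRUE to actually contact the API (needs internet and jsonlite)
run_request = FALSE
if (run_request && requireNamespace("jsonlite", quietly = TRUE)) {
  crimes = jsonlite::fromJSON(request_url)
  str(crimes, max.level = 1)
}
```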

police.uk as data repository

crimedata package

library(crimedata)

crimedata package

Aim:

Gives convenient access to publicly available police-recorded open crime data from large cities in the United States that are included in the Crime Open Database

Crime Open Database

crimedata package

Which data are available?

list_crime_data(quiet = FALSE)
## Downloading list of URLs for data files. This takes a few seconds but is only done once per session.
## # A tibble: 11 x 2
##    city           years       
##    <chr>          <chr>       
##  1 all cities     2007 to 2017
##  2 Chicago        2007 to 2017
##  3 Detroit        2009 to 2017
##  4 Fort Worth     2007 to 2017
##  5 Kansas City    2009 to 2017
##  6 Los Angeles    2010 to 2017
##  7 Louisville     2009 to 2017
##  8 New York       2007 to 2017
##  9 San Francisco  2007 to 2017
## 10 Tucson         2009 to 2017
## 11 Virginia Beach 2013 to 2017
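Because coverage differs per city, it can be worth checking which cities have data for a given year before requesting it. A sketch that splits the `years` text into numeric bounds, using three rows copied by hand from the table above; with live data, the result of `list_crime_data()` would replace `available`, and `first_year`/`last_year` are column names introduced here:

```r
# Three rows copied from the table above
available = data.frame(
  city = c("Chicago", "Detroit", "Virginia Beach"),
  years = c("2007 to 2017", "2009 to 2017", "2013 to 2017"),
  stringsAsFactors = FALSE
)

# Split "2007 to 2017" into its two year components
year_bounds = strsplit(available$years, " to ", fixed = TRUE)
available$first_year = as.integer(vapply(year_bounds, `[`, character(1), 1))
available$last_year = as.integer(vapply(year_bounds, `[`, character(1), 2))

# Which of these cities cover 2008?
target_year = 2008
covers_target = available$first_year <= target_year &
  available$last_year >= target_year
print(available$city[covers_target])
```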

crimedata package

Getting data:

crime_data_ny_2017 = get_crime_data(years = 2017
                            , cities = c("New York"))
## Using cached URLs to get data from server. These URLs rarely change and this is almost certainly safe.
## Downloading sample data for New York in 2017

names(crime_data_ny_2017)
##  [1] "uid"               "city_name"         "offense_code"     
##  [4] "offense_type"      "offense_group"     "offense_against"  
##  [7] "date_single"       "date_start"        "date_end"         
## [10] "longitude"         "latitude"          "location_type"    
## [13] "location_category" "census_block"
head(crime_data_ny_2017)
## # A tibble: 6 x 14
##      uid city_name offense_code offense_type offense_group offense_against
##    <int> <chr>     <chr>        <chr>        <chr>         <chr>          
## 1 1.38e7 New York  90Z          all other o… all other of… other          
## 2 1.38e7 New York  22B          non-residen… burglary/bre… property       
## 3 1.38e7 New York  13B          simple assa… assault offe… persons        
## 4 1.38e7 New York  90Z          all other o… all other of… other          
## 5 1.38e7 New York  12A          personal ro… robbery       property       
## 6 1.38e7 New York  90Z          all other o… all other of… other          
## # ... with 8 more variables: date_single <chr>, date_start <chr>,
## #   date_end <chr>, longitude <dbl>, latitude <dbl>, location_type <chr>,
## #   location_category <chr>, census_block <chr>
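
Once the data are loaded, a first step is often to count offences per category. The sketch below does this on a small data frame typed in by hand from the six rows printed above, using only the two columns that are fully visible (offense_code and offense_against). The object name crime_sample is made up for this illustration.

```r
# The six rows shown by head(), restricted to the fully visible columns
crime_sample <- data.frame(
  offense_code    = c("90Z", "22B", "13B", "90Z", "12A", "90Z"),
  offense_against = c("other", "property", "persons",
                      "other", "property", "other"),
  stringsAsFactors = FALSE
)

# Count how often each category occurs
counts <- table(crime_sample$offense_against)

# Sort so the most frequent category comes first
counts_sorted <- sort(counts, decreasing = TRUE)
print(counts_sorted)
```

The same call should work on the full data set, e.g. table(crime_data_ny_2017$offense_against).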

crimedata package

Multiple cities, multiple years:

crime_data_2010_2015 = get_crime_data(years = 2010:2015
                            , cities = c("Chicago", "Detroit"))
## Warning in if (cities != "all" & !all(cities %in% unique(urls$city))) {:
## the condition has length > 1 and only the first element will be used
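
The warning appears because cities holds two values here, so the comparison cities != "all" returns two results, while if() expects a single TRUE or FALSE. From R 4.2.0 onwards this situation is an error rather than a warning. The sketch below shows one way to write such a check so that it works for one or several cities. It is not the package's actual code: check_cities and available_cities are hypothetical names used only for illustration.

```r
# Hypothetical validation helper: works for a single city or several cities
check_cities <- function(cities, available_cities) {
  # identical() always returns exactly one TRUE or FALSE
  if (identical(cities, "all")) {
    return(TRUE)
  }
  # Requested cities that are not in the list of available cities
  unknown <- setdiff(cities, available_cities)
  if (length(unknown) > 0) {
    stop("Unknown cities: ", paste(unknown, collapse = ", "))
  }
  TRUE
}

# Illustrative subset of the cities listed by list_crime_data()
available_cities <- c("Chicago", "Detroit", "New York")

# Two valid cities pass the check
ok <- check_cities(c("Chicago", "Detroit"), available_cities)

# An unavailable city triggers an error, captured here as a message
bad <- tryCatch(
  check_cities(c("Chicago", "London"), available_cities),
  error = function(e) conditionMessage(e)
)
print(ok)
print(bad)
```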
  |=================================================================|  99%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
  |                                                                       
  |=====================================================            |  81%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |===============================================================  |  96%
  |                                                                       
  |================================================================ |  99%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |=================================                                |  50%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |=========================================                        |  63%
  |                                                                       
  |=============================================                    |  69%
  |                                                                       
  |================================================                 |  75%
  |                                                                       
  |====================================================             |  81%
  |                                                                       
  |========================================================         |  87%
  |                                                                       
  |============================================================     |  93%
  |                                                                       
  |================================================================ |  99%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |===========================                                      |  41%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |========================================                         |  61%
  |                                                                       
  |============================================                     |  67%
  |                                                                       
  |=============================================                    |  70%
  |                                                                       
  |==================================================               |  76%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |===========================                                      |  41%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |=======================================                          |  61%
  |                                                                       
  |============================================                     |  67%
  |                                                                       
  |=============================================                    |  70%
  |                                                                       
  |==================================================               |  76%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==                                                               |   2%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |===========                                                      |  16%
  |                                                                       
  |===============                                                  |  23%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |========================                                         |  37%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |=================================                                |  51%
  |                                                                       
  |======================================                           |  58%
  |                                                                       
  |==========================================                       |  65%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |============================================================     |  93%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |===========                                                      |  18%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |===============================                                  |  47%
  |                                                                       
  |====================================                             |  55%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |=============================================                    |  70%
  |                                                                       
  |==================================================               |  77%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |==========================                                       |  41%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=========================================                        |  64%
  |                                                                       
  |==============================================                   |  71%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |===============================================================  |  97%
  |                                                                       
  |=================================================================| 100%
table(crime_data_2010_2015$city_name, crime_data_2010_2015$offense_against)
##          
##           other persons property society
##   Chicago  1434    4811     9658    2992
##   Detroit   480    1445     4096     492
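
The raw counts are hard to compare because the sample holds far more Chicago records than Detroit records. A minimal sketch of one way around this is to convert the counts into row-wise shares with base R's prop.table(). The matrix below is typed in by hand from the output above so that the example runs on its own; with the real data you would pass the table() result directly.

```r
# Counts copied by hand from the table output above
# (rows: city_name, columns: offense_against)
counts <- matrix(
  c(1434, 4811, 9658, 2992,
     480, 1445, 4096,  492),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    city_name = c("Chicago", "Detroit"),
    offense_against = c("other", "persons", "property", "society")
  )
)

# margin = 1 -> proportions are computed within each row (i.e. within each city)
shares <- prop.table(counts, margin = 1)

# Rounded for display; each row sums to 1
round(shares, 3)
```

Each row now shows how a city's recorded offences are split across the four categories, which makes the two cities comparable despite their different totals.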

crimedata package

Additional features:

  • nycvehiclethefts: Dataset containing records of thefts of motor vehicles in New York City from 2014 to 2017
  • homicides15: Dataset containing records of homicides in nine large US cities in 2015

APIs: Pros & Cons

Pros

  • easy to access
  • well documented
  • works even if website changes

Cons

  • quota limits ($ $ $)
  • under the platforms’ control
  • only for few platforms

Don’t let the data determine your research!

COOL

But what about:

No APIs

  • incels.me
  • stormfront
  • 4chan

  • APIs are restrictive!

… what about:

Your research question –> no API?

Main problem:

Really ‘juicy’ data of the Internet vs APIs

“Real” webscraping: basics of a webpage

Three elements of a webpage

  1. Structure
    • HTML (hypertext markup language)
    • structured with <tags>
    • contains the pure content of the webpage
  2. Behaviour
    • JavaScript (!= Java)
    • user interaction
    • examples: alerts, popups, server-interaction
  3. Style
    • CSS (Cascading Style Sheets)
    • formatting, design, responsiveness
    • examples: submit buttons, app interfaces

For now: HTML

The very basics of HTML:

Raw architecture of a webpage

<!DOCTYPE html>
<html>
<body>

HERE COMES THE VISIBLE PART!!

</body>
</html>
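
To see that this skeleton really is machine-readable, here is a minimal sketch that parses it in R and pulls out the visible part. It assumes the rvest package is installed; rvest is not used in these slides, so treat it as one possible tool rather than the course's prescribed one.

```r
library(rvest)  # assumption: rvest is installed (not part of the slides so far)

# The raw page from above, stored as a plain character string
raw_page <- "<!DOCTYPE html>
<html>
<body>

HERE COMES THE VISIBLE PART!!

</body>
</html>"

# Parse the string into a document we can query
page <- read_html(raw_page)

# Select the <body> element and extract its text, trimming surrounding whitespace
body <- html_element(page, "body")
visible_text <- trimws(html_text(body))
visible_text
```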

Note: Most tags come in pairs: an opening tag <tag> is closed by </tag>, and the content sits between the two. A few elements, such as <img>, have no closing tag.

HTML basics

Ways to put content in the <body> ... </body> tag:

  • headings: <h1>I'm a heading at level 1</h1>

Content in the body tag

  • paragraphs: <p>This is a paragraph</p>

Content in the body tag

  • images: <img src="./img/ucl.jpg">

Content in the body tag

  • links: <a href="https://www.ucl.ac.uk/">Click here to go to UCL's website</a>
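
Headings and paragraphs carry their content as text between the tags, whereas images and links carry the interesting part in an attribute (src, href). A short sketch of reading both kinds of content, again assuming the rvest package (an assumption, not something the slides have introduced):

```r
library(rvest)  # assumption: rvest is installed

# A tiny page built from the four snippets shown on the previous slides
page <- read_html('<html><body>
  <h1>I\'m a heading at level 1</h1>
  <p>This is a paragraph</p>
  <img src="./img/ucl.jpg">
  <a href="https://www.ucl.ac.uk/">Click here to go to UCL\'s website</a>
</body></html>')

# Text content: what sits between the opening and closing tag
heading   <- html_text(html_element(page, "h1"))
paragraph <- html_text(html_element(page, "p"))
link_text <- html_text(html_element(page, "a"))

# Attribute content: values stored inside the opening tag
img_src  <- html_attr(html_element(page, "img"), "src")
link_url <- html_attr(html_element(page, "a"), "href")
```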

Content in the body tag

  • tables
<table>
  <tr>
    <th>Department</th>
    <th>Location</th>
  </tr>
  <tr>
    <td>Dept. of Security and Crime Science</td>
    <td>35 Tavistock Square</td>
  </tr>
  <tr>
    <td>Division of Psychology and Language Sciences</td>
    <td>26 Bedford Way</td>
  </tr>
</table> 

Html <table>...</table>
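
Because <tr>, <th> and <td> already encode rows, headers and cells, an HTML table can be turned into a data frame almost directly. The sketch below assumes the rvest package and a table laid out with one department per row; both are assumptions made for illustration.

```r
library(rvest)  # assumption: rvest is installed

# One department per row, with a header row made of <th> cells
page <- read_html("<table>
  <tr><th>Department</th><th>Location</th></tr>
  <tr><td>Dept. of Security and Crime Science</td><td>35 Tavistock Square</td></tr>
  <tr><td>Division of Psychology and Language Sciences</td><td>26 Bedford Way</td></tr>
</table>")

# html_table() converts a <table> element into a data frame;
# the <th> row is used for the column names
departments <- html_table(html_element(page, "table"))
departments
```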

Content in the body tag

  • lists
<ul>
  <li>Terrorism</li>
  <li>Cyber Crime</li>
  <li>Data Science</li>
</ul> 
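
Each <li> is one list entry, so selecting all <li> elements gives the list back as a character vector. A minimal sketch, assuming the rvest package (not introduced in the slides):

```r
library(rvest)  # assumption: rvest is installed

page <- read_html("<ul>
  <li>Terrorism</li>
  <li>Cyber Crime</li>
  <li>Data Science</li>
</ul>")

# html_elements() returns every match, html_text() extracts the text of each
topics <- html_text(html_elements(page, "li"))
topics
```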

HTML basics

Elements (can) have IDs:

<p id='paragraph1'>This is a paragraph</p>
<img id='ucl_image' src="./img/ucl.jpg">

Same for tables, links, etc.

Every element can have an ID.

You need unique IDs! Two elements cannot have the same ID.
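
Unique IDs are what make it possible to point at exactly one element. In CSS selector syntax an ID is written with a leading #. A short sketch, assuming the rvest package, which accepts such selectors:

```r
library(rvest)  # assumption: rvest is installed

page <- read_html("<p id='paragraph1'>This is a paragraph</p>
<img id='ucl_image' src='./img/ucl.jpg'>")

# '#paragraph1' selects the single element whose id is 'paragraph1'
first_par <- html_text(html_element(page, "#paragraph1"))

# The same works for any element type, here the image
image_src <- html_attr(html_element(page, "#ucl_image"), "src")
```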

HTML basics

Common elements (can) have CLASSES:

<p id="paragraph1" class="paragraph_class">I am the first paragraph</p>
<p class="paragraph_class">I am the second paragraph</p>
<p class="paragraph_class">I am the third paragraph</p>

Multiple elements can have the same class.
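
Classes are the counterpart to IDs: one selector returns all elements that share the class. In CSS selector syntax a class is written with a leading dot. A minimal sketch, assuming the rvest package:

```r
library(rvest)  # assumption: rvest is installed

page <- read_html('<p id="paragraph1" class="paragraph_class">I am the first paragraph</p>
<p class="paragraph_class">I am the second paragraph</p>
<p class="paragraph_class">I am the third paragraph</p>')

# '.paragraph_class' matches every element with that class (three here)
same_class <- html_elements(page, ".paragraph_class")
class_text <- html_text(same_class)

# Tag, id and class can be combined to narrow the selection to one element
first_only <- html_text(html_elements(page, "p#paragraph1.paragraph_class"))
```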

Now what?

Web scraping logic

If all webpages are built in this structure…

… then we could access this structure programmatically.

But where do I find that structure?

Is it just “there”?

YES!!

How to see the html structure?

Example 1: Missing persons

Example 2: FBI most wanted

Webscraping in a nutshell

  1. understand the structure of a webpage
  2. exploit that structure for web-scraping
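
The two steps can be sketched on a toy page. Everything in the markup below (the class names case, name and last_seen, and the placeholder entries) is invented for illustration and is not taken from any real website; the sketch also assumes the rvest package.

```r
library(rvest)  # assumption: rvest is installed

# Hypothetical markup: class names and entries are made up for this example
page <- read_html('<div class="case">
  <h2 class="name">Person A</h2><span class="last_seen">Location A</span>
</div>
<div class="case">
  <h2 class="name">Person B</h2><span class="last_seen">Location B</span>
</div>')

# Step 1: understand the structure -> every record is a <div class="case">
cases <- html_elements(page, "div.case")

# Step 2: exploit the structure -> pull the same fields out of every record
records <- data.frame(
  name      = html_text(html_element(cases, ".name")),
  last_seen = html_text(html_element(cases, ".last_seen")),
  stringsAsFactors = FALSE
)
records
```

The point of the design is that the code never mentions a specific person: it only relies on the repeated structure, so it would work unchanged on a page with two records or two thousand.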

RECAP

  • Always: problem first, never the method first!
  • Method follows problem!
  • APIs give you structured access
  • R packages twitteR and tuber
  • HTML structure key to ‘real’ webscraping

Outlook

Next week

  • Webscraping on the html structure
  • Tutorial: APIs + webscraping in R

Homework

  • Get access to the YouTube and Twitter APIs.

END