Web scraping 1

Advanced Crime Analysis UCL

Bennett Kleinberg

14 Jan 2019

Getting data from the Internet

Webscraping 1

Today

  • Types of webscraping
  • Using APIs: Twitter + YouTube
  • Crime data ‘wrappers’
  • “Real” webscraping: basics of a webpage

What is webscraping anyway?

The game changer!

  • direct broadcasting of ideas
  • “unfiltered” and “uncensored” (?)
  • location-enabled
  • and: en masse

Types of webscraping

                   Data shared   Data not shared
Ready-made table   Download      closed source
Not ready-made     API           Real webscraping

Application programming interfaces (APIs)

API: basics

Goal:

  • help developers interact with the platform
  • facilitate interaction in an automatable manner
  • analogous to the GUI
  • part of it: enabling data access
  • contains precise documentation

What an API does not do:

  • give you all the data
  • be free forever
  • give you full control

There’s no free lunch!

Using an API

Core elements of an API:

  • GET requests
  • POST requests

Implementable in different ways…

Using an API

Classes of APIs:

  1. Web APIs
    • send requests through the browser
    • add URL parameters, e.g. https://data.police.uk/api/crimes-at-location?date=2017-08&location_id=884227
  2. Libraries/packages for APIs
    • depending on the API: Python, JavaScript, PHP, Ruby
    • = frameworks to access the API
    • = methods implemented in different languages
  3. API wrappers
    • R packages that use the API
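
A minimal sketch of the "URL parameters" idea, using base R only. The helper name build_api_url is made up for illustration; the base address and the two parameters are taken from the example URL above.

```r
# Assemble a web API URL from a base address and a named list of parameters.
# build_api_url is a hypothetical helper, not part of any package.
build_api_url = function(base_url, params) {
  # URL-encode each value so that special characters are transmitted safely
  encoded_values = vapply(params
                          , function(x) utils::URLencode(as.character(x), reserved = TRUE)
                          , character(1)
                          )
  # Join name=value pairs with '&' and attach them to the base address with '?'
  query_string = paste0(names(params), '=', encoded_values, collapse = '&')
  paste0(base_url, '?', query_string)
}

# Rebuild the police.uk example URL from its parts
crime_url = build_api_url(base_url = 'https://data.police.uk/api/crimes-at-location'
                          , params = list(date = '2017-08', location_id = 884227)
                          )
crime_url
```

Changing a value in params (for example the date) gives a new request without editing the URL by hand. If the jsonlite package is installed, jsonlite::fromJSON(crime_url) is one way to read the response into R.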

Using an API

Identify API capabilities

Official API docs: Twitter

Useful websites that have an API

  • Twitter
  • YouTube
  • Instagram
  • Facebook
  • Reddit

Case 1: Twitter’s API

Getting access

Basic steps

  1. Twitter account
  2. Apply for a developer’s account
  3. Create project
  4. Obtain access credentials

Tutorial here

The twitteR package

library(twitteR)

Note: check out the newer rtweet package.

Authentication through R

my_consumer_key = "5tc2oAVLyO8DkCKW1k8ny2H6e"
my_consumer_secret = "qEQYGX6IKs6NiSUsENprBZlOOdoM9lWkoIht3p1sVnAMraQpq2"
my_access_token = "858383409986625537-Fy9Ai5eFyf23VZHguRJEdXqell6Q8Jl"
my_access_secret = "nT5Z0eQjAvBdf2ZjxMgiaoRb7hiHVxB8jYh7lT74CW1Um"

setup_twitter_oauth(consumer_key = my_consumer_key
                    , consumer_secret = my_consumer_secret
                    , access_token = my_access_token
                    , access_secret = my_access_secret
                    )
## [1] "Using direct authentication"
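
Writing credentials directly into a script makes them easy to leak when the script is shared. A possible alternative, sketched below, reads them from environment variables. The helper get_twitter_credentials and the four variable names are assumptions made for this example; they are not part of twitteR.

```r
# Read the four credentials from environment variables.
# The helper and the variable names (TWITTER_...) are hypothetical.
get_twitter_credentials = function() {
  var_names = c(consumer_key = 'TWITTER_CONSUMER_KEY'
                , consumer_secret = 'TWITTER_CONSUMER_SECRET'
                , access_token = 'TWITTER_ACCESS_TOKEN'
                , access_secret = 'TWITTER_ACCESS_SECRET'
                )
  # unset = NA marks variables that have not been defined
  credentials = Sys.getenv(var_names, unset = NA)
  names(credentials) = names(var_names)

  # Stop early with a readable message if anything is missing
  missing_vars = var_names[is.na(credentials)]
  if (length(missing_vars) > 0) {
    stop('Missing environment variables: ', paste(missing_vars, collapse = ', '))
  }
  as.list(credentials)
}

# With the variables set (e.g. in ~/.Renviron), authentication would become:
# do.call(setup_twitter_oauth, get_twitter_credentials())
```

The list names match the argument names of setup_twitter_oauth, which is why do.call can pass them on directly.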

How to search?

Depends on the problem:

  • Tweets in a certain time frame (e.g. December 2018)
  • Tweets with a certain key-word (e.g. “#metoo”)
  • Tweets by a certain author (e.g. Elon Musk)
  • Tweets in a certain location (e.g. London)
  • Tweets in a certain language
  • Combined search queries
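
A rough sketch of how these search types map onto arguments of searchTwitter (keyword, language, location, start date). The helper make_geocode, the approximate London coordinates and the 20 km radius are illustrative assumptions. The actual search calls are left as comments because they need an authenticated session.

```r
# Build the 'geocode' string that searchTwitter() expects: "latitude,longitude,radius".
# make_geocode is a hypothetical helper for this example.
make_geocode = function(lat, lon, radius_km) {
  paste0(lat, ',', lon, ',', radius_km, 'km')
}

# Approximate centre of London, 20 km radius (assumed values)
london_geocode = make_geocode(51.5074, -0.1278, 20)

# Arguments for a combined query: keyword + language + location + start date
search_args = list(searchString = '#metoo'
                   , n = 100
                   , lang = 'en'
                   , geocode = london_geocode
                   , since = '2018-12-01'
                   )

# With an authenticated session, the query would be run as:
# london_tweets = do.call(twitteR::searchTwitter, search_args)

# Tweets by a certain author come from a different function:
# musk_tweets = twitteR::userTimeline('elonmusk', n = 50)
```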

API possibilities

Always look at two sources:

  1. The original API (Twitter’s API docs)
  2. The API interface (twitteR R package)

Note: the original API usually offers more options than the API interface.

Tweets by date

Search:

  • tweets since December 2018
  • with #metoo

metoo_tweets_december = searchTwitter(searchString = '#metoo'
                                , n = 10
                                , since = '2018-12-01'
                                )
metoo_tweets_december
## [[1]]
## [1] "zee45427557: #RajkumarHirani @aamir_khan #AamirKhan #aamir #3idiots #pk #MeToo #MeTooMovement #sanju #chopra #joshi https://t.co/76F422oOSM"
## 
## [[2]]
## [1] "metoozoo: #MeToo Merch - YellowMaps Beaver Island MI topo map, 1:100000 Scale, 30 X 60 Minute, Historical, 1984, Updated 1989… https://t.co/DsklJ4FZrA"
## 
## [[3]]
## [1] "Mirbia3: RT @la_patilla: Los aspirantes demócratas a la Casa Blanca bajo examen del #MeToo https://t.co/DYj2t5OFcS       .     ."
## 
## [[4]]
## [1] "gulfkannadiga: RT @timesofindia: #MeToo movement: Filmmaker #RajkummarHirani's Assistant Director  of #Sanju accuses him of sexual harassment \n\nvia @etime…"
## 
## [[5]]
## [1] "WeForNews: Rajkumar Hirani accused of sexual assault during making of Sanju\n\n#MeToo #MeTooMovement #RajuBhai #RajKumarHirani… https://t.co/rw3zpitdUd"
## 
## [[6]]
## [1] "worldwidetoto10: RT @12ji10pun: 怒りが収まらない\U0001f4a2\n海外では問題になりそうな #松本人志 の発言。\nこのセクハラ発言を寛容で場の雰囲気を壊さないような対応をするのが日本式の「大人でいいオンナ」\n だから #MeToo 運動は日本では無縁。\nなんてったって被害に遭ったメンバーが謝罪…"
## 
## [[7]]
## [1] "Neli_Ngqulana: RT @Moosa_Kaula: Are girlies gonna pretend like Arthur Mafokate isn't the face of #MeToo and pose happily with him? \U0001f62c https://t.co/9yMDmDUL…"
## 
## [[8]]
## [1] "Rohitpatil_24: RT @NEWS9TWEETS: #BIGNEWS: #Bollywood's famed director, @RajkumarHirani allegedly accused of sexual harassment by an assistant director of…"
## 
## [[9]]
## [1] "jackejones123: @keithellison #MeToo"
## 
## [[10]]
## [1] "Hun_Aram_e: RT @Bollyhungama: #MeToo: #SANJU director #RajkumarHirani accused of SEXUAL HARASSMENT by his assistant\nhttps://t.co/duGjUEaDI7"

Tweets by date

Display as dataframe with meta information:

twListToDF(metoo_tweets_december)
##                                                                                                                                                                                                                                                                    text
## 1                                                                                                                                        #RajkumarHirani @aamir_khan #AamirKhan #aamir #3idiots #pk #MeToo #MeTooMovement #sanju #chopra #joshi https://t.co/76F422oOSM
## 2                                                                                                                          #MeToo Merch - YellowMaps Beaver Island MI topo map, 1:100000 Scale, 30 X 60 Minute, Historical, 1984, Updated 1989… https://t.co/DsklJ4FZrA
## 3                                                                                                                                               RT @la_patilla: Los aspirantes demócratas a la Casa Blanca bajo examen del #MeToo https://t.co/DYj2t5OFcS       .     .
## 4                                                                                                                        RT @timesofindia: #MeToo movement: Filmmaker #RajkummarHirani's Assistant Director  of #Sanju accuses him of sexual harassment \n\nvia @etime…
## 5                                                                                                                          Rajkumar Hirani accused of sexual assault during making of Sanju\n\n#MeToo #MeTooMovement #RajuBhai #RajKumarHirani… https://t.co/rw3zpitdUd
## 6  RT @12ji10pun: 怒りが収まらない\U0001f4a2\n海外では問題になりそうな #松本人志 の発言。\nこのセクハラ発言を寛容で場の雰囲気を壊さないような対応をするのが日本式の「大人でいいオンナ」\n だから #MeToo 運動は日本では無縁。\nなんてったって被害に遭ったメンバーが謝罪…
## 7                                                                                                                 RT @Moosa_Kaula: Are girlies gonna pretend like Arthur Mafokate isn't the face of #MeToo and pose happily with him? \U0001f62c https://t.co/9yMDmDUL…
## 8                                                                                                                           RT @NEWS9TWEETS: #BIGNEWS: #Bollywood's famed director, @RajkumarHirani allegedly accused of sexual harassment by an assistant director of…
## 9                                                                                                                                                                                                                                                  @keithellison #MeToo
## 10                                                                                                                                     RT @Bollyhungama: #MeToo: #SANJU director #RajkumarHirani accused of SEXUAL HARASSMENT by his assistant\nhttps://t.co/duGjUEaDI7
##    favorited favoriteCount    replyToSN             created truncated
## 1      FALSE             0         <NA> 2019-01-13 11:38:59     FALSE
## 2      FALSE             0         <NA> 2019-01-13 11:38:51      TRUE
## 3      FALSE             0         <NA> 2019-01-13 11:38:47     FALSE
## 4      FALSE             0         <NA> 2019-01-13 11:38:44     FALSE
## 5      FALSE             0         <NA> 2019-01-13 11:38:33      TRUE
## 6      FALSE             0         <NA> 2019-01-13 11:38:24     FALSE
## 7      FALSE             0         <NA> 2019-01-13 11:38:19     FALSE
## 8      FALSE             0         <NA> 2019-01-13 11:38:16     FALSE
## 9      FALSE             0 keithellison 2019-01-13 11:38:13     FALSE
## 10     FALSE             0         <NA> 2019-01-13 11:38:10     FALSE
##             replyToSID                  id replyToUID
## 1                 <NA> 1084414501452242944       <NA>
## 2                 <NA> 1084414467520253952       <NA>
## 3                 <NA> 1084414453029003264       <NA>
## 4                 <NA> 1084414438751461376       <NA>
## 5                 <NA> 1084414394514038790       <NA>
## 6                 <NA> 1084414353770573824       <NA>
## 7                 <NA> 1084414333168291842       <NA>
## 8                 <NA> 1084414323722579971       <NA>
## 9  1084300045342728197 1084414308174446592   14135426
## 10                <NA> 1084414298548588544       <NA>
##                                                                            statusSource
## 1    <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2                         <a href="http://metoozoo.com" rel="nofollow">metoozoo.com</a>
## 3                  <a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>
## 4  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 5   <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
## 6    <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 7  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 8  <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 9                    <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 10   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##         screenName retweetCount isRetweet retweeted longitude latitude
## 1      zee45427557            0     FALSE     FALSE        NA       NA
## 2         metoozoo            0     FALSE     FALSE        NA       NA
## 3          Mirbia3            1      TRUE     FALSE        NA       NA
## 4    gulfkannadiga           15      TRUE     FALSE        NA       NA
## 5        WeForNews            0     FALSE     FALSE        NA       NA
## 6  worldwidetoto10            2      TRUE     FALSE        NA       NA
## 7    Neli_Ngqulana           76      TRUE     FALSE        NA       NA
## 8    Rohitpatil_24            3      TRUE     FALSE        NA       NA
## 9    jackejones123            0     FALSE     FALSE        NA       NA
## 10      Hun_Aram_e           12      TRUE     FALSE        NA       NA
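
The full data frame is wide, so it often helps to work with a few columns only. The sketch below uses a small stand-in data frame (three columns, first four rows copied from the output above) instead of a live API call, and shows one way to drop retweets.

```r
# Stand-in for twListToDF() output: three columns, first four rows from the table above
tweets_df = data.frame(screenName = c('zee45427557', 'metoozoo', 'Mirbia3', 'gulfkannadiga')
                       , retweetCount = c(0, 0, 1, 15)
                       , isRetweet = c(FALSE, FALSE, TRUE, TRUE)
                       , stringsAsFactors = FALSE
                       )

# Keep original tweets only (drop retweets)
original_tweets = tweets_df[!tweets_df$isRetweet, ]

# Share of retweets in the sample
retweet_share = mean(tweets_df$isRetweet)

original_tweets
retweet_share
```

On real data the same two lines would be applied to the result of twListToDF(metoo_tweets_december).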

Tweets by date

Example: Popular crime tweet in 2019?

crime_tweets_2019 = searchTwitter(searchString = 'crime'
                                , n = 1000
                                , since = '2019-01-01'
                                , resultType = 'popular'
                                )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 63
df.crime_tweets_2019 = twListToDF(crime_tweets_2019)

df.crime_tweets_2019[order(df.crime_tweets_2019$created, decreasing = FALSE), ][1:3, 'text']
## [1] "#NewsUpdate ออกประกาศด่วน! หยุดเดินเรือข้ามเกาะสมุย ชาวบ้านแห่กักตุนอาหาร พร้อมรับมือพายุปาปึก #เรื่องเล่าเช้านี้… https://t.co/dgurGOK3OW"                                   
## [2] "Reform-minded prosecutors can repair our broken #CriminalJustice system\n  \nWesley Bell in Missouri: \n\U0001f4ccEnded prosecu… https://t.co/gfbfT7bPM6"
## [3] "Lembrando sempre que não gostar de alguém, além de não ser crime, não exige nenhum pré-requisito. Pode-se apelar ap… https://t.co/SRG4rOcN6V"

Tweets by keyword

Search:

  • “fake news” tweets
  • since the start of the year

fakenews_tweets_2019 = searchTwitter(searchString = 'fake+news'
                                , n = 1000
                                , since = '2019-01-01'
                                , resultType = 'popular'
                                )
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 58

df.fakenews_tweets_2019 = twListToDF(fakenews_tweets_2019)
df.fakenews_tweets_2019[order(df.fakenews_tweets_2019$retweetCount, decreasing = TRUE), ][1:5, ]
##                                                                                                                                            text
## 21 The Mainstream Media has NEVER been more dishonest than it is now. NBC and MSNBC are going Crazy. They report stori… https://t.co/zLh9zOR1J1
## 22 ....The Fake News Media in our Country is the real Opposition Party. It is truly the Enemy of the People! We must b… https://t.co/Y3KuJpWBAQ
## 31 With all of the success that our Country is having, including the just released jobs numbers which are off the char… https://t.co/Urm1LOV1bb
## 1  The Fake News Media keeps saying we haven’t built any NEW WALL. Below is a section just completed on the Border. An… https://t.co/2RqbrNEznu
## 30  The story in the New York Times regarding Jim Webb being considered as the next Secretary of Defense is FAKE NEWS.… https://t.co/1wwN10V5Pz
##    favorited favoriteCount replyToSN             created truncated
## 21     FALSE        125233      <NA> 2019-01-10 03:43:13      TRUE
## 22     FALSE        130273      <NA> 2019-01-07 13:31:00      TRUE
## 31     FALSE        135224      <NA> 2019-01-07 12:56:19      TRUE
## 1      FALSE        101249      <NA> 2019-01-11 17:50:04      TRUE
## 30     FALSE        105462      <NA> 2019-01-04 21:45:27      TRUE
##    replyToSID                  id replyToUID
## 21       <NA> 1083207607412760576       <NA>
## 22       <NA> 1082268365081767936       <NA>
## 31       <NA> 1082259636227620865       <NA>
## 1        <NA> 1083783112973320192       <NA>
## 30       <NA> 1081305634115674112       <NA>
##                                                                          statusSource
## 21 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 22 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 31 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 1  <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 30 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
##         screenName retweetCount isRetweet retweeted longitude latitude
## 21 realDonaldTrump        31694     FALSE     FALSE        NA       NA
## 22 realDonaldTrump        31683     FALSE     FALSE        NA       NA
## 31 realDonaldTrump        30337     FALSE     FALSE        NA       NA
## 1  realDonaldTrump        28196     FALSE     FALSE        NA       NA
## 30 realDonaldTrump        23408     FALSE     FALSE        NA       NA
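
The ranking depends on the chosen measure. As a sketch, the five tweets above are re-ranked by likes (favoriteCount) rather than retweets. The data frame is a stand-in with ids and counts copied from the output above; ids are stored as character strings because they are too long to be held exactly as numbers.

```r
# Stand-in for the top five rows: ids and counts copied from the output above
top_tweets = data.frame(id = c('1083207607412760576', '1082268365081767936'
                               , '1082259636227620865', '1083783112973320192'
                               , '1081305634115674112')
                        , favoriteCount = c(125233, 130273, 135224, 101249, 105462)
                        , retweetCount = c(31694, 31683, 30337, 28196, 23408)
                        , stringsAsFactors = FALSE
                        )

# Rank by likes instead of retweets
by_likes = top_tweets[order(top_tweets$favoriteCount, decreasing = TRUE), ]

# Likes per retweet as a simple derived measure
by_likes$likes_per_retweet = by_likes$favoriteCount / by_likes$retweetCount

by_likes
```

The most retweeted tweet is not the most liked one here: the third row of the retweet ranking moves to the top.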

Tweets by keyword

Search:

  • knife crime
  • yesterday

knife_crime_yesterday = searchTwitter(searchString = 'knife+crime'
                                , since = '2019-01-08'
                                )

knife_crime_yesterday[1:10]
## [[1]]
## [1] "IsmailRahiman: RT @DailyMailUK: Police are armed with metal detectors in latest bid to crackdown on knife epidemic sweeping streets of Britain https://t.c…"
## 
## [[2]]
## [1] "johnbrissenden: RT @natalieisonline: There are a few alarming things about the Jayden Moodie reporting we’ve seen from the Evening Standard that go far and…"
## 
## [[3]]
## [1] "Bob4719: RT @JuanDiablo4d: @MayorofLondon @BBCSPLondon @Jo_Coburn The rampant knife crime and killings of our youth should be your priority Mr Mayor…"
## 
## [[4]]
## [1] "ot7_trash: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[5]]
## [1] "amitysv: RT @incorrectbucko: bucky, singing to himself: coming out of my cage and i’ve been doing some crime\n\nsteve: what \n\nbucky [tucking a knife i…"
## 
## [[6]]
## [1] "steer266: If May stays at No 10, I’m starting Project Hope, the positive case for remain | Sadiq Khan \nMy Project Hope is No… https://t.co/5nMLmcQijl"
## 
## [[7]]
## [1] "ikran: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[8]]
## [1] "Exhausted33: RT @SeemaChandwani: The child is dead. Murdered. \n\n@standardnews @George_Osborne you’re totally sick. Get help. \n\n https://t.co/IL30LoR9Jp"
## 
## [[9]]
## [1] "rachelharger: RT @Jules_Carey: Excellent overview by @simonisrael  on the massive rise in stop and search where there is no reasonable suspicion (s.60 se…"
## 
## [[10]]
## [1] "iancadman4: RT @White_Hart_Spur: @DVATW @notasothers3 Grooming gangs rampant, knife crime epidemic... but the police take the easy route and arrest som…"
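
Several results above are retweets of the same tweet. One possible way to spot and remove such duplicates is sketched below on a stand-in vector; the texts are shortened copies of tweets from the output above, not the full originals.

```r
# Stand-in vector: shortened copies of tweet texts from the output above
tweet_texts = c('RT @DailyMailUK: Police are armed with metal detectors'
                , 'RT @SeemaChandwani: The child is dead. Murdered.'
                , 'If May stays at No 10'
                , 'RT @SeemaChandwani: The child is dead. Murdered.'
                , 'RT @SeemaChandwani: The child is dead. Murdered.'
                )

# Retweets start with 'RT @'
is_retweet = grepl('^RT @', tweet_texts)

# Keep the first occurrence of each text only
unique_texts = tweet_texts[!duplicated(tweet_texts)]

sum(is_retweet)
unique_texts
```

On a data frame from twListToDF, the isRetweet column already flags retweets, so the grepl step is only needed when working with raw text.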

Tweets by combined keywords

Example: Tweets about knife killings in London in 2019