Recap + peer-feedback
Advanced Crime Analysis UCL
Bennett Kleinberg
11 March 2019
Today
- 2 case studies
- module recap
- your feedback
- peer-feedback
Sentiment styles: RQ
Are there patterns in sentiment usage in popular YouTube vlogs?
Needed
- vlog data
- sentiment “trajectories”
- cluster analysis
Data pipeline
- retrieved list of most popular vloggers
- excluded those that were no real vloggers and non-English speaking
- scraped transcripts for each vlog
- 27,333 transcripts
- 40m tokens
- 24b views
Analysis
- unsupervised learning
- k-means method
- assigned each vlog to its cluster
Findings
- 7 distinct sentiment styles
- preferance was moderated by gender
Findings
Downhill from here |
2.23 |
1.26 |
-2.88* |
Mood swings |
-2.31 |
1.96 |
1.25 |
Rags to riches |
2.13 |
-1.95 |
-1.08 |
Riches to rags |
-2.05 |
4.88* |
-0.56 |
Bump in the road |
1.69 |
-1.12 |
-1.08 |
End on a high note |
-5.16* |
-6.03* |
8.32* |
Twin peaks |
3.83* |
2.25 |
-4.99* |
Findings
- 7 distinct sentiment styles
- preferance was moderated by gender
- no effect on view count
- no effect on vlog length
Fake news paper: RQ
Can we detect fake news based on linguistic features in the news article?
Needed
- fake and real news
- different domains of news
- predictive analysis
Data pipeline
- sourcing “real” news from mainstream news sites
- crowdsourcing approach for fake news
- news domain (sports, business, politics, technology, entertainment, education)
Data pipeline
Data pipeline
Data pipeline
- sourcing “real” news from mainstream news sites
- crowdsourcing approach for fake news
- news domain (sports, business, politics, technology, entertainment, education)
- celebrity domain
- Entertainment Weekly, People Magazine,RadarOnline
- checked with GossipCop.com and other online sources
Analysis
- extracted linguistic features
- ngrams, punctuation, LIWC, readability, syntax
- supervised learning
- SVM classifier
- 5-fold CV
- 50% baseline
Findings
Findings
Findings
Table 9: Performance of two annotators (A1, A2) and the developed automatic system (Sys) on the fakenews datasets
You’ve learned a lot:
- data generation/collection
- text analysis
- machine learning
Data generation/collection
- APIs
- custom-made webscraping
Text analysis
- sentiment
- ngrams
- LIWC
- TFIDF
Machine learning
- supervised learning
- unsupervised learning
- cross-validation
- performance metrics
Next week
CLASS TEST
- Monday, 18 March 2019
- 1-3pm
- 60 min
- 10 questions (5 MC, 5 open)