Using the Twitter statuses sample API to build a name⇔country database
Guess the Country
Playing with Twitter Streaming API
Chris Birchall
#m3dev Tech Talk 2014/7/11
It started with an idle tweet...
https://twitter.com/cbirchall/status/466197512143912961
Let’s use Twitter for something (slightly) useful!
The plan:
● Collect geo-tagged tweets from Twitter Streaming API
● Use them to build a name⇔country DB
● Build a simple search UI as a proof of concept
● (crowbar Spark in there somewhere because it’s cool)
Implementation
[Diagram] Twitter Streaming API → EC2 (Twitter4j) → .log → Fluentd → S3 → EC2 (Spark) → Postgres (RDS) → Heroku (Rails)
https://github.com/cb372/guess-the-country
Collecting tweets
● Ran the collector for 13 days
● Collected 285,340 geo-tagged tweets
● 205,798 distinct users
● Only collected names and countries, threw everything else away
Processing
● Used Spark to filter out duplicate users
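The duplicate-user filter can be illustrated without a cluster. The following is only a minimal sketch in plain Python, not the Spark job from the talk: the `UserRecord` fields, the choice of user ID as the deduplication key, and the sample rows are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class UserRecord(NamedTuple):
    user_id: int   # hypothetical field: ID of the tweet's author
    name: str      # display name from the user's profile
    country: str   # country taken from the tweet's geo tag


def distinct_users(records: Iterable[UserRecord]) -> list[UserRecord]:
    """Keep the first record seen for each user ID, preserving input order."""
    seen: dict[int, UserRecord] = {}
    for record in records:
        # setdefault only stores the record if this user ID is new
        seen.setdefault(record.user_id, record)
    return list(seen.values())


# Invented sample data: user 1 tweeted twice
sample = [
    UserRecord(1, "Alex Smith", "United States"),
    UserRecord(2, "Sarah Jones", "United Kingdom"),
    UserRecord(1, "Alex Smith", "United States"),
]
unique = distinct_users(sample)
```

Keeping the first record per user is one possible policy; a user who tweets from several countries would need a different rule, such as picking their most frequent country.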
Stats
Distinct countries = 204
Distinct first names = 40,689
Distinct last names = 81,674
Top 10 countries by user count
           country           | percentage
-----------------------------+------------
 United States               |       39.4
 United Kingdom              |       10.1
 Indonesia                   |        8.9
 Brasil                      |        8.1
 Türkiye                     |        3.9
 España                      |        2.4
 México                      |        2.2
 Republic of the Philippines |        2.0
 Canada                      |        1.8
 Malaysia                    |        1.8
Most popular first names
 first_name
------------
 chris
 alex
 david
 michael
 sarah
Most popular surnames
 second_name
-------------
 smith
 jones
 garcia
 williams
 johnson
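A per-country percentage of this kind can be computed from a list with one country per distinct user. This is a hedged sketch rather than the query used for the slides; the function name `country_percentages` and the sample list are invented for illustration.

```python
from collections import Counter


def country_percentages(countries, top_n=10):
    """Return (country, percentage of users) for the top_n countries."""
    counts = Counter(countries)
    total = sum(counts.values())
    return [
        (country, round(100.0 * n / total, 1))
        for country, n in counts.most_common(top_n)
    ]


# Invented sample: one entry per distinct user
sample_countries = [
    "United States", "United States", "United States",
    "United Kingdom", "Indonesia",
]
top = country_percentages(sample_countries)
```

An empty input returns an empty list, because `most_common` yields nothing and the division is never evaluated.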
Results
It works surprisingly well!
(well, it worked for my name, anyway)
Note for the pedantic: Since the original data is geo-tagged tweets, strictly speaking we only know where a user is, not where they come from.
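One way a name⇔country lookup could work is to count, for each name token, how many users with that token were seen in each country, then add up the counts for the tokens in the query. The sketch below assumes that approach; it is not taken from the project's code, and `build_index`, `guess_countries`, and the sample users are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(users):
    """Map each lowercased name token to a Counter of countries."""
    index = defaultdict(Counter)
    for name, country in users:
        for token in name.lower().split():
            index[token][country] += 1
    return index


def guess_countries(index, name, top_n=3):
    """Sum the per-token country counts and return the highest scores."""
    scores = Counter()
    for token in name.lower().split():
        # .get avoids creating empty entries for unknown tokens
        scores.update(index.get(token, {}))
    return scores.most_common(top_n)


# Invented sample of (name, country) pairs
users = [
    ("Alex Garcia", "España"),
    ("David Garcia", "México"),
    ("Maria Garcia", "México"),
    ("Alex Smith", "United States"),
]
index = build_index(users)
guesses = guess_countries(index, "Garcia")
```

Raw counts favour countries with many users, so a real ranking would probably need to normalise by each country's share of the data set.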
Try for yourself
Demo: http://guess-the-country.herokuapp.com/
Code: https://github.com/cb372/guess-the-country