Guess the Country - Playing with Twitter Streaming API

Using the Twitter statuses sample API to build a name⇔country database

Chris Birchall
#m3dev Tech Talk 2014/7/11

It started with an idle tweet...

https://twitter.com/cbirchall/status/466197512143912961

Let’s use Twitter for something (slightly) useful!

The plan:

● Collect geo-tagged tweets from the Twitter Streaming API
● Use them to build a name⇔country DB
● Build a simple search UI as a proof of concept
● (crowbar Spark in there somewhere because it’s cool)

Implementation

[Architecture diagram, components in order: Twitter Streaming API → EC2 (Twitter4j) → .log → Fluentd → S3 → EC2 (Spark) → Postgres (RDS) → Heroku (Rails)]

https://github.com/cb372/guess-the-country

Collecting tweets

● Ran the collector for 13 days
● Collected 285,340 geo-tagged tweets
● 205,798 distinct users
● Only collected names and countries, threw everything else away

Processing

● Used Spark to filter out duplicate users
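
The slides do not show the Spark job itself. As a rough illustration of the deduplication step, here is a plain-Python sketch (not Spark, and not the author's code) that keeps one record per user. The `Record` type and its field names are hypothetical; the only thing taken from the slides is that each record carries a name and a country.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    # Hypothetical record shape: the slides only say that names and
    # countries were kept; the user_id field is an assumption.
    user_id: int
    name: str
    country: str


def distinct_users(records: Iterable[Record]) -> List[Record]:
    """Keep the first record seen for each user_id, preserving order."""
    seen = {}
    for record in records:
        # setdefault leaves an existing entry untouched, so later
        # tweets from the same user are ignored.
        seen.setdefault(record.user_id, record)
    return list(seen.values())


sample = [
    Record(1, "Chris Smith", "United Kingdom"),
    Record(2, "Maria Garcia", "España"),
    Record(1, "Chris Smith", "United Kingdom"),  # second tweet, same user
]
unique = distinct_users(sample)
```

Keeping the first record per user is one possible policy; a user who tweets from several countries would need a different rule, such as picking their most frequent country.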

Stats

Distinct countries = 204
Distinct first names = 40,689
Distinct last names = 81,674

Top 10 countries by user count

           country           | percentage
-----------------------------+------------
 United States               |       39.4
 United Kingdom              |       10.1
 Indonesia                   |        8.9
 Brasil                      |        8.1
 Türkiye                     |        3.9
 España                      |        2.4
 México                      |        2.2
 Republic of the Philippines |        2.0
 Canada                      |        1.8
 Malaysia                    |        1.8

Most popular first names

 first_name
------------
 chris
 alex
 david
 michael
 sarah

Most popular surnames

 second_name
-------------
 smith
 jones
 garcia
 williams
 johnson
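
A percentage-by-country ranking like the one in the stats can be computed with a simple count. The sketch below is a minimal Python illustration under the assumption that there is one country value per distinct user; the function name and the sample data are made up for the example and are not the talk's real numbers.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def country_percentages(countries: Iterable[str], top: int = 10) -> List[Tuple[str, float]]:
    """Return the `top` most common countries with their share of users (%)."""
    counts = Counter(countries)
    total = sum(counts.values())
    return [
        (country, round(100.0 * n / total, 1))
        for country, n in counts.most_common(top)
    ]


# Illustrative input only: one entry per distinct user.
sample = ["United States"] * 5 + ["United Kingdom"] * 3 + ["Indonesia"] * 2
ranking = country_percentages(sample)
```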

Results

It works surprisingly well!

(well, it worked for my name, anyway)

Note for the pedantic: Since the original data is geo-tagged tweets, strictly speaking we only know where a user is, not where they come from.
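
The slides do not describe how the search UI ranks its guesses. One simple way a name⇔country lookup could work is to count, for each name token, how many users with that token were seen in each country, then add up the counts for the tokens in the query. The Python sketch below makes that assumption; `build_index` and `guess_countries` are hypothetical names, and the real application (Rails on Postgres) may do something different.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(users: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each lower-cased name token to a Counter of countries."""
    index: Dict[str, Counter] = defaultdict(Counter)
    for full_name, country in users:
        for token in full_name.lower().split():
            index[token][country] += 1
    return index


def guess_countries(index: Dict[str, Counter], name: str, top: int = 3) -> List[Tuple[str, int]]:
    """Rank countries by summed counts over the tokens of `name`."""
    scores: Counter = Counter()
    for token in name.lower().split():
        # .get avoids creating empty entries for unknown tokens.
        scores.update(index.get(token, {}))
    return scores.most_common(top)


# Illustrative data only.
users = [
    ("Chris Smith", "United Kingdom"),
    ("Chris Jones", "United Kingdom"),
    ("Chris Garcia", "United States"),
    ("Maria Garcia", "España"),
]
index = build_index(users)
guesses = guess_countries(index, "Chris Smith")
```

As the note for the pedantic points out, any such ranking reflects where users tweet from, not where they come from.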

Try for yourself

Demo: http://guess-the-country.herokuapp.com/

Code: https://github.com/cb372/guess-the-country
