On Real-Time Twitter Analysis

Apache Hadoop Get Together, April 18, 2012, Berlin © 2012 TWIMPACT

Mikio L. Braun, http://blog.mikiobraun.de, mikiobraun, twimpact UG (haftungsbeschränkt), http://twimpact.com

with Matthias Jugel thinkberg

Apache Hadoop Get Together, Berlin, April 28, 2012



Big Data and Data Science and Social Media

● There's a lot you can do with social media data

● Trend analysis (“trending topics”)

● Sentiment analysis

● Impact analysis (Klout, Kred, etc.)

● More general studies (diameter of network, distribution patterns, etc.)

● Types of data

● Event streams (Twitter stream)

● Graph data (user relationships, retweet networks)

● Text data (sentiment analysis, word clouds)

● URLs

● …


Social Media Streaming Data

● Examples

● Twitter firehose/sprinkler

● Click-through data

● bit.ly URL resolution requests

● Some numbers:

● up to a few thousand events per second

● events are small (up to a few kilobytes)


What's in a Tweet?

[Figure: annotated tweet with the labels timestamp, retweeting user, retweeted user, hashtag, link, user mention, keywords, tweet, and retweeted tweet]
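As a rough illustration of the parts named on this slide, the following sketch pulls hashtags, links, user mentions, and the retweeted user out of a tweet's text. The regular expressions, the `TweetParts` class, and the "RT @user:" convention are assumptions made for this example; they are not taken from the TWIMPACT code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical patterns, chosen for illustration only.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
LINK = re.compile(r"https?://\S+")
RETWEET = re.compile(r"^RT @(\w+):?\s*(.*)$", re.DOTALL)


@dataclass
class TweetParts:
    retweeted_user: Optional[str]  # None if the text is not an "RT @user" retweet
    retweeted_text: Optional[str]
    hashtags: List[str]
    mentions: List[str]
    links: List[str]


def split_tweet(text: str) -> TweetParts:
    """Split a tweet's text into the parts labeled on the slide."""
    match = RETWEET.match(text)
    return TweetParts(
        retweeted_user=match.group(1) if match else None,
        retweeted_text=match.group(2) if match else None,
        hashtags=HASHTAG.findall(text),
        mentions=MENTION.findall(text),
        links=LINK.findall(text),
    )


parts = split_tweet("RT @alice: Big data talk today #hadoop http://example.com/x cc @bob")
```

In this example `parts.mentions` includes the retweeted user as well, because the "RT @alice" prefix is also a mention.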


TWIMPACT - Retweet trends

● Trending by retweet activity

● Robust matching of tweets even if shortened, edited (slightly)

● Compute trends for links, hashtags, URLs

● Aggregate TWIMPACT score for users
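The slides do not say how the robust matching is done. One possible approach, shown here purely as a sketch, is to normalize both texts and accept either a prefix match (for shortened retweets) or a high similarity ratio (for slightly edited ones). The function names, the threshold of 0.8, and the minimum prefix length of 20 characters are assumptions for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Strip retweet prefixes, links, and trailing ellipses, then lowercase."""
    text = re.sub(r"^(RT\s+@\w+:?\s*)+", "", text.strip(), flags=re.IGNORECASE)
    text = re.sub(r"https?://\S+", "", text)
    text = text.rstrip(". …")
    return re.sub(r"\s+", " ", text).strip().lower()


def same_retweet(a: str, b: str, threshold: float = 0.8, min_prefix: int = 20) -> bool:
    """Guess whether two tweets carry the same underlying text."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return False
    short, long_ = sorted((na, nb), key=len)
    # A shortened retweet is a prefix of the original text.
    if len(short) >= min_prefix and long_.startswith(short):
        return True
    # A slightly edited retweet is still very similar overall.
    return SequenceMatcher(None, na, nb).ratio() >= threshold


original = "Stream mining keeps only the hot set of retweets in memory http://example.com/a"
shortened = "RT @alice: Stream mining keeps only the hot set of retw..."
edited = "RT @bob: stream mining keeps only the hot set of retweets in RAM"
unrelated = "Completely different text about football tonight"
```

Comparing every incoming tweet against every known one this way would be too slow for a live stream, so a real system would need an index in front of it; the sketch only shows the comparison step.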


How to scale stream processing?


History of approaches

● Started in June 2009

● Free Twitter stream (capped at 50 tweets/s)

[Table: Versions 1 to 3, with the columns Language and Storage backend; the only surviving entry is “Stream mining + in memory”]


Putting it all in a database

● Insert millions of rows into the database

● Get reports by

    SELECT id, COUNT(*)
    FROM events
    WHERE created_at > … AND created_at < …
    GROUP BY id
    ORDER BY COUNT(*) DESC
    LIMIT 100;

● Hardly real-time. Also, databases will become slower and slower...
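To make the reporting query concrete, here is a minimal sketch that runs it against an in-memory SQLite database. The two-column `events` schema, the integer timestamps, and the sample rows are assumptions for this example; the slides only show the query itself.

```python
import sqlite3

# Assumed schema: one row per retweet event, "id" identifies the retweeted tweet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_at INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 12), (1, 13), (3, 14), (2, 99)],
)


def top_retweets(conn, start, end, limit=100):
    """Count events per id inside the open time window (start, end)."""
    return conn.execute(
        "SELECT id, COUNT(*) AS n FROM events "
        "WHERE created_at > ? AND created_at < ? "
        "GROUP BY id ORDER BY n DESC, id LIMIT ?",
        (start, end, limit),
    ).fetchall()


report = top_retweets(conn, 9, 20)
```

Every report has to scan and group the rows of the whole time window again, which is one reason this approach is hardly real-time once the table holds millions of rows.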


NoSQL: Cassandra

● Structure: Keyspaces → Column families → Rows → Key-value pairs

● Easy clustering (peer-to-peer configuration)

● Flexible consistency, read-repair, hinted handoff, etc.

● No locking, (in 0.6.x:) no support for indices, counters → complete rewrite

● Operations profile (about 50:50 read/write)


Cassandra: Multithreading

● Multithreading helps (but without locking support?)

[Figure: tweets per second over seconds for 1, 2, 4, 8, 16, 32, and 64 threads, on a Core i7 with 4 cores (2 + 2 HT)]


Cassandra: Configuration

[Figure: annotations for flush, compaction, and memtables, indexes, etc.]

Size of Memtable: 128M, JVM Heap: 3G, #CF: 12


Cassandra: Configuration

[Figure: tweets per second, with annotations marking compaction and “big” GC]


NoSQL/Cassandra - Summary

● Works quite well, faster than PostgreSQL (from 200 to 600 tps)

● Lack of locking/indices requires a lot of manual management

● Configuration messy

● 4 node cluster vs. single node: Single node consistently 1.5 – 3 times faster!

● Ultimately, becomes slower and slower

● Doesn't handle deletions gracefully


Stream processing frameworks

● Stream processing = scalable actor based concurrency

● For example:

● Twitter's (backtype's) Storm https://github.com/nathanmarz/storm

● Yahoo's S4 http://incubator.apache.org/s4/

● Esper http://esper.codehaus.org/

● Streambase http://www.streambase.com



Stream processing: some thoughts

● Maximum throughput hard to estimate

● Not everything can be parallelized

● Scalable storage system still necessary

● How to deal with failure/congestion?

● Persistent messaging middleware not what you might want.


The DataSift infrastructure

http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

● C++, PHP, Java/Scala, Ruby

● MySQL on SSDs, HBase (30 nodes, 400TB), memcached, Redis for some queues

● 0MQ, Kafka (LinkedIn)

● 936 CPU cores

● Analyzes 250 million tweets per day

● Peak throughput: 120,000 t/s

● monitoring & accounting

[Figure: processing pipeline with the stages Parse, Augment Content, Custom Filters, Delivery]

Throughput: 120,000 tweets per second

but: 120,000 / 936 = 128.2 tweets per second per core



Principles of Stream Processing

● Keep resource needs constant

● Control maximum processing rates

● Disks too slow, keep data in RAM
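One way to read the first two principles is a fixed-size in-memory buffer that refuses new events when it is full and hands them to the consumer in capped batches. The sketch below is an assumption about how that could look; the `BoundedIntake` class and its method names are made up for this example and do not come from the talk.

```python
import queue


class BoundedIntake:
    """Fixed-capacity event buffer: memory use stays constant under load."""

    def __init__(self, capacity: int) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # events rejected because the buffer was full

    def offer(self, event) -> bool:
        """Accept the event if there is room, otherwise drop it."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_events: int) -> list:
        """Hand out at most max_events, capping the work done per step."""
        batch = []
        while len(batch) < max_events:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


intake = BoundedIntake(capacity=3)
accepted = [intake.offer(n) for n in range(5)]
first_batch = intake.drain(max_events=2)
```

Dropping on overflow is only one policy. Blocking the producer is the other obvious choice, but it moves the congestion upstream instead of bounding it.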


Stream mining

[Figure: a fixed number of slots, each holding an item and its count]

● Focus on relevant data, discard the rest

● Provably approximates true counts

● Keep data in memory

Space Saving algorithm (Metwally, Agrawal, Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, International Conference on Database Theory, 2005.)
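The core update rule of Space Saving is short enough to sketch. The version below keeps a fixed number of slots in a plain dictionary and finds the smallest counter with a linear scan; the paper uses a dedicated "Stream-Summary" structure to make that step constant time. The class and attribute names are chosen for this example.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Approximate top-k counting with a fixed number of slots."""

    def __init__(self, num_slots: int) -> None:
        if num_slots < 1:
            raise ValueError("num_slots must be at least 1")
        self.num_slots = num_slots
        self.counts: Dict[Hashable, int] = {}  # estimated count per monitored item
        self.errors: Dict[Hashable, int] = {}  # maximum overestimation per item

    def add(self, item: Hashable) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.num_slots:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Evict the item with the smallest count. The newcomer inherits
            # that count plus one, so counts never underestimate.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k: int) -> List[Tuple[Hashable, int]]:
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


sketch = SpaceSaving(num_slots=3)
for item in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
    sketch.add(item)
```

With three slots, "c" is evicted when "d" arrives, and "d" starts at a count of 2 with an error of 1: its true count lies between `counts - errors` and `counts`. The frequent items "a" and "b" keep their exact counts.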



TWIMPACT: Real-time Twitter Retweet Analysis

● Stream mining to keep “hot set” of a few hundred thousand most active retweets in memory

● Secondary indices, bipartite graphs, object stores

● Write snapshots to disk for later analysis

● Up to several thousand tweets per second in single threaded operation.


2011 in Retweets



Our Analysis Pipeline

[Figure: analysis pipeline. Tweets → JSON parsing (synchronized worker threads, Thread 1 to Thread k) → retweet matching & retweet trends (single threaded) → snapshots (Day 1, Day 2, ..., Day n) → analyzing dependent trends (links/hashtags/etc.), map reduce like → Trends]
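The first two stages of this pipeline can be sketched as a pool of worker threads that parse JSON, feeding a single thread that updates the trend counters. This is a structural sketch only: the field names `retweet_of` and `hashtags` are made up for this example, the snapshot and map-reduce-like stages are left out, and in CPython the worker threads would not actually speed up JSON parsing.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(line: str) -> dict:
    """Stage 1: runs on the worker threads."""
    return json.loads(line)


def run_pipeline(lines, workers: int = 4):
    """Parse in parallel, then update all counters from one thread."""
    retweet_counts = Counter()
    hashtag_counts = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so stage 2 sees the
        # tweets in stream order and needs no locking.
        for tweet in pool.map(parse, lines):
            original = tweet.get("retweet_of")
            if original is None:
                continue  # not a retweet: discard
            retweet_counts[original] += 1
            for tag in tweet.get("hashtags", []):
                hashtag_counts[tag] += 1
    return retweet_counts, hashtag_counts


lines = [
    '{"id": 1, "retweet_of": 10, "hashtags": ["hadoop"]}',
    '{"id": 2, "retweet_of": 10, "hashtags": ["hadoop", "berlin"]}',
    '{"id": 3, "retweet_of": null, "hashtags": ["other"]}',
    '{"id": 4, "retweet_of": 11}',
]
retweets, hashtags = run_pipeline(lines)
```

Keeping the counting stage on one thread avoids synchronization around the in-memory data structures, at the cost of that stage becoming the throughput limit.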


Most retweeted users


Most retweeted tweets


Social network buzz


Summary

● Many interesting challenges in social media.

● Many different data types, including streams.

● MapReduce doesn't really fit stream processing.

● You can't just scale into real-time.

● Principles of Stream Processing

● Bounded “hot set” of data in memory

● Mine stream, discard irrelevant data

● Real world applications often include a mixture of multithreading, stream processing, map reduce and single thread stages.
