Slides for the talk given at the Apache Hadoop Get Together in Berlin on April 18, 2012.
Apache Hadoop Get Together, April 18, 2012, Berlin © 2012 TWIMPACT
On Real-Time Twitter Analysis
Mikio L. Braun, http://blog.mikiobraun.de, @mikiobraun
TWIMPACT UG (haftungsbeschränkt), http://twimpact.com
with Matthias Jugel, @thinkberg
Apache Hadoop Get Together, Berlin, April 18, 2012
Big Data and Data Science and Social Media
● There's a lot you can do with social media data
● Trend analysis (“trending topics”)
● Sentiment analysis
● Impact analysis (Klout, Kred, etc.)
● More general studies (diameter of network, distribution patterns, etc.)
● Types of data
● Event streams (Twitter stream)
● Graph data (user relationships, retweet networks)
● Text data (sentiment analysis, word clouds)
● URLs
● …
Social Media Streaming Data
● Examples
  ● Twitter firehose/sprinkler
  ● Click-through data
  ● bit.ly URL resolution requests
● Some numbers:
  ● up to a few thousand events per second
  ● events are small, up to a few kilobytes
What's in a Tweet?
[Figure: annotated tweet and retweeted tweet, with labels for Timestamp, Retweeting User, Retweeted User, Hashtag, Link, User Mention, and Keywords]
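The fields labeled on this slide can be pulled out of raw tweet text with a few regular expressions. The following is a minimal sketch, not TWIMPACT's actual parser: the function name `extract_entities`, the patterns, and the reliance on the old-style "RT @user: text" retweet convention are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; real tweets need more careful tokenization.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
LINK = re.compile(r"https?://\S+")
# Assumed old-style manual retweet: "RT @user: original text"
RETWEET = re.compile(r"^RT @(\w+):?\s*(.*)$", re.DOTALL)


def extract_entities(user, text):
    """Split one tweet into the fields a retweet analysis might use."""
    entities = {
        "retweeting_user": None,
        "retweeted_user": None,
        "retweeted_text": None,
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "links": LINK.findall(text),
    }
    match = RETWEET.match(text)
    if match:
        entities["retweeting_user"] = user
        entities["retweeted_user"] = match.group(1)
        entities["retweeted_text"] = match.group(2)
    return entities
```

Calling `extract_entities("alice", "RT @bob: Big data talk #hadoop")` would mark `alice` as the retweeting user and `bob` as the retweeted user.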
TWIMPACT - Retweet trends
● Trending by retweet activity
● Robust matching of tweets even if shortened, edited (slightly)
● Compute trends for links, hashtags, URLs
● Aggregate TWIMPACT score for users
History of approaches
● Started in June 2009
● Free Twitter stream (capped at 50 tweets/s)
[Table: language and storage backend for Version 1, Version 2, and Version 3; the only surviving cell is "Stream mining + in memory"]
Putting it all in a database
● Insert millions of rows into the database
● Get reports by
    SELECT id, COUNT(*) FROM events
    WHERE created_at > … AND created_at < …
    GROUP BY id
    ORDER BY COUNT(*) DESC
    LIMIT 100;
● Hardly real-time. Also, databases become slower and slower as they grow...
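To make the report query concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. The `events` table layout (one row per retweet event, with `id` identifying the retweeted tweet), the function name `top_retweets`, and the sample rows are assumptions; the slide does not show the schema.

```python
import sqlite3


def top_retweets(conn, start, end, limit=100):
    """Return (id, count) pairs for the most frequent ids in a time window."""
    query = """
        SELECT id, COUNT(*) AS n
        FROM events
        WHERE created_at > ? AND created_at < ?
        GROUP BY id
        ORDER BY n DESC
        LIMIT ?
    """
    return conn.execute(query, (start, end, limit)).fetchall()


conn = sqlite3.connect(":memory:")
# Assumed schema: every retweet event is stored as one row.
conn.execute("CREATE TABLE events (id TEXT, created_at INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5), ("c", 6)],
)
report = top_retweets(conn, 0, 10, limit=2)
```

Each report re-aggregates every row in the requested window, so the cost of a single report grows with the volume of stored events.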
NoSQL: Cassandra
● Structure: Families → Tables → Rows → Key Value pairs
● Easy clustering (peer-to-peer configuration)
● Flexible consistency, read-repair, hinted handoff, etc.
● No locking, (in 0.6.x:) no support for indices, counters → complete rewrite
● Operations profile (about 50:50 read/write)
Cassandra: Multithreading
● Multithreading helps (but without locking support?)
[Figure: tweets per second over time (seconds) on a Core i7 with 4 cores (2 + 2 HT); series labeled 1, 2, 4, 8, 16, 32, and 64]
Cassandra: Configuration
[Figure: plot annotated with "Flush", "Compaction", and "Memtables, indexes, etc."]
Size of Memtable: 128M, JVM heap: 3G, #CF: 12
Cassandra: Configuration
[Figure: tweets per second, annotated with "Compaction" and "Big" GC]
NoSQL/Cassandra - Summary
● Works quite well, faster than PostgreSQL (from 200 to 600 tps)
● Lack of locking/indices requires a lot of manual management
● Configuration messy
● 4 node cluster vs. single node: single node consistently 1.5 to 3 times faster!
● Ultimately, becomes slower and slower
● Doesn't handle deletions gracefully
Stream processing frameworks
● Stream processing = scalable actor based concurrency
● For example:
  ● Twitter's (backtype's) Storm https://github.com/nathanmarz/storm
  ● Yahoo's S4 http://incubator.apache.org/s4/
  ● Esper http://esper.codehaus.org/
  ● Streambase http://www.streambase.com
Stream processing: some thoughts
● Maximum throughput hard to estimate
● Not everything can be parallelized
● Scalable storage system still necessary
● How to deal with failure/congestion?
● Persistent messaging middleware not what you might want.
The DataSift infrastructure
http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
● C++, PHP, Java/Scala, Ruby
● MySQL on SSDs, HBase (30 nodes, 400TB), memcached, Redis for some queues
● 0MQ, Kafka (LinkedIn)
● 936 CPU cores
● Analyzes 250 million tweets per day
● Peak throughput: 120,000 t/s
● monitoring & accounting
[Figure: processing stages: Parse, Augment Content, Custom Filters, Delivery]
Throughput: 120,000 tweets per second
but: 120,000 / 936 = 128.2 tweets per second per core
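The per-core figure follows directly from the numbers on this slide. As a quick check, the sketch below recomputes it and also derives an average rate, which is not on the slide and assumes the 250 million daily tweets are spread evenly over 24 hours. All variable names are illustrative.

```python
peak_tweets_per_second = 120_000
cpu_cores = 936
tweets_per_day = 250_000_000
seconds_per_day = 24 * 60 * 60

# Peak throughput divided evenly across all cores.
per_core_peak = peak_tweets_per_second / cpu_cores

# Assumption: tweets arrive uniformly over the day.
average_rate = tweets_per_day / seconds_per_day
```

Rounded, `per_core_peak` is about 128.2 tweets per second per core, and `average_rate` is roughly 2,900 tweets per second.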
Principles of Stream Processing
● Keep resource needs constant
● Control maximum processing rates
● Disks too slow, keep data in RAM
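One way to read these principles is as a fixed-capacity in-memory buffer that sheds load rather than growing, and that hands out work in capped batches. This is a minimal sketch under that reading; the class name `BoundedBuffer` and the drop-when-full policy are assumptions, not something the talk prescribes.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity in-memory buffer that sheds load instead of growing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        """Accept the event if there is room; otherwise drop and count it."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(event)
        return True

    def drain(self, max_events):
        """Hand out at most max_events per call, capping the processing rate."""
        batch = []
        while self.items and len(batch) < max_events:
            batch.append(self.items.popleft())
        return batch
```

Memory use stays bounded by `capacity`, and the consumer decides how many events it takes per cycle through `drain`.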
Stream mining
[Figure: a fixed number of slots, each holding an item and its count]
● Focus on relevant data, discard the rest
● Provably approximates true counts
● Keep data in memory
Space Saving algorithm (Metwally, Agrawal, El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams”, International Conference on Database Theory, 2005)
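The idea of a fixed number of slots can be sketched as follows. This is a simplified illustration of Space Saving, not TWIMPACT's implementation: the class and method names are made up, and the linear scan for the smallest counter stands in for the more efficient summary structure described in the paper.

```python
class SpaceSaving:
    """Approximate top-k counting with a fixed number of slots."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.counts = {}  # item -> estimated count (never below the true count)
        self.errors = {}  # item -> maximum possible overestimation

    def add(self, item):
        if item in self.counts:
            # Already monitored: just increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.num_slots:
            # A free slot is available: start an exact count.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # All slots taken: evict the item with the smallest count and
            # let the newcomer inherit that count as its error bound.
            # (Linear scan for simplicity.)
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        """Return the k items with the highest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]
```

Memory stays constant at `num_slots` entries regardless of stream length, and each item's true count lies between `counts[item] - errors[item]` and `counts[item]`.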
TWIMPACT: Real-time Twitter Retweet Analysis
● Stream mining to keep “hot set” of few hundred thousand most active retweets in memory
● Secondary indices, bipartite graphs, object stores
● Write snapshots to disk for later analysis
● Up to several thousand tweets per second in single threaded operation.
Our Analysis Pipeline
[Figure: pipeline from Tweets to Trends: JSON parsing in synchronized worker threads (Thread 1 … Thread k); Retweet Matching & Retweet Trends (single threaded); Snapshots (Day 1, Day 2, …, Day n); analyzing dependent trends (links/hashtags/etc.; map reduce like)]
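The mix of stages in this pipeline can be sketched in a few lines. This is only an illustration of the structure, with the assignment of labels to stages inferred from the slide. The function names and the `retweeted_id` field are hypothetical, and the real retweet matching is more robust than a lookup by id.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(line):
    """Stage 1 (worker threads): decode one raw JSON tweet."""
    return json.loads(line)


def match_retweets(tweets):
    """Stage 2 (single threaded): count retweets per original tweet id."""
    counts = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_id")  # hypothetical field name
        if original is not None:
            counts[original] += 1
    return counts


def run_day(lines, workers=4):
    """Parse one day of raw tweets in parallel, then build its snapshot."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tweets = list(pool.map(parse, lines))  # map() preserves input order
    return match_retweets(tweets)


def merge_snapshots(snapshots):
    """Stage 3 (map-reduce like): combine per-day snapshots into trends."""
    total = Counter()
    for snapshot in snapshots:
        total.update(snapshot)
    return total.most_common()
```

In CPython the thread pool mainly illustrates the shape of the pipeline; the point is that only the parsing stage is parallel, while matching runs in a single thread over the parsed stream.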
Summary
● Many interesting challenges in social media.
● Many different data types, including streams.
● MapReduce doesn't really fit stream processing.
● You can't just scale into real-time.
● Principles of Stream Processing
  ● Bounded “hot set” of data in memory
  ● Mine stream, discard irrelevant data
● Real-world applications often include a mixture of multithreading, stream processing, map reduce and single-threaded stages.