A talk that Ted Dunning gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.
1©MapR Technologies - Confidential
Real-time and Long-time with Storm and Hadoop
Contact:
– [email protected]
– @ted_dunning
Slides and such (available late tonight):
– http://info.mapr.com/ted-pjug
Hash tags: #mapr #pjug
The Challenge
Hadoop is great at processing vats of data
– But sucks for real-time (by design!)
Storm is great for real-time processing
– But lacks any way to deal with batch processing
It sounds like there isn’t a solution
– Neither fashionable solution handles everything
This is not a problem.
It’s an opportunity!
What is Map-Reduce?
Map-reduce programs are defined (mostly) by
– A map function that does independent record transformations (and deletions and replications)
– Reduce functions that do aggregation
Map-reduce programs run in a framework that
– Schedules and re-runs tasks
– Splits the input
– Moves map outputs to reduce inputs
– Receives the results
Inside Map-Reduce
Input → Map → Combine → Shuffle and sort → Reduce → Output
Input:
  "The time has come," the Walrus said,
  "To talk of many things:
  Of shoes—and ships—and sealing-wax
After map: the, 1; time, 1; has, 1; come, 1; …
After shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
After reduce: come, 6; has, 8; the, 4; time, 14; …
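The word-count flow above can be imitated on one machine in a few lines of R. This is only a sketch under assumed names (`map_fn`, `reduce_fn`, and `run_map_reduce` are hypothetical, not Hadoop APIs). It shows the map, shuffle-and-sort, and reduce stages, but none of the scheduling, splitting, or re-running that the framework provides.

```r
# Map: one input line -> a list of (word, 1) pairs.
map_fn <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nchar(words) > 0]
  lapply(words, function(w) list(key = w, value = 1))
}

# Reduce: one key and all of its values -> a single count.
reduce_fn <- function(key, values) sum(values)

run_map_reduce <- function(lines, map_fn, reduce_fn) {
  # Map stage: every record is transformed independently.
  pairs <- unlist(lapply(lines, map_fn), recursive = FALSE)
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, numeric(1))
  # Shuffle and sort: group the values by key, keys in sorted order.
  grouped <- split(values, keys)
  # Reduce stage: aggregate each group.
  vapply(names(grouped), function(k) reduce_fn(k, grouped[[k]]), numeric(1))
}

lines <- c("The time has come, the Walrus said,",
           "To talk of many things")
counts <- run_map_reduce(lines, map_fn, reduce_fn)
print(counts[c("the", "time", "come")])
```

Because `map_fn` looks at one line at a time and `reduce_fn` at one key at a time, both stages could be spread over many machines, which is the property the framework exploits.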
Not Just Text
Counting words is easy
Many other problems work as well
– Sessionize user logs
– Very large scale joins
– Large scale matrix recommendations
– Computing the quadrillionth digit of π
Map-reduce is inherently batch oriented
Inside Map-Reduce
Input → Map → Shuffle and sort → Reduce → Output
Input: road1, polyline(p1, p2, p3, …); lake1, polygon(p4, p5, p7, p9, …); road2, polyline(p6, p7, p9, …)
After map: tile0918-1412, road1; tile1082-8143, road1; tile0014-3284, lake1; tile1082-8143, lake1; …
After shuffle and sort: tile0918-1412, [road1]; tile1082-8143, [road1, lake1]; tile0014-3284, [lake1]; …
Output: tile1082-8143, img#1; tile0014-3284, img#2; tile1082-8143, img#3; …
What is Storm?
A Storm program is called a topology
– Spouts inject data into a topology
– Bolts process data
The units of data are called tuples
All processing is flow-through
Bolts can buffer or persist
Output tuples can be anchored
Bolts that fail are restarted and un-acked tuples are replayed
Hadoop is Not Very Real-time
[Figure: timeline along t ending at "now", with regions labeled "Fully processed", "Latest full period", and "Unprocessed Data"; annotation: "Hadoop job takes this long for this data"]
Real-time and Long-time together
[Figure: timeline along t ending at "now", annotated "Hadoop works great back here" and "Storm works here"; the two feed a "Blended view"]
One Alternative
[Figure: Search Engine, NoSQL du Jour, and Consumer, with regions labeled "Real-time" and "Long-time" and a "?"]
Problems
Simply dumping into a NoSQL engine doesn’t quite work
Insert rate is limited
No load isolation
– Big retrospective jobs kill real-time
Low scan performance
– HBase pretty good, but not stellar
Difficult to set boundaries
– Where does real-time end and long-time begin?
Rough Design – Data Flow
[Figure: data-flow diagram with the components Search Engine, Query Event Spout, Counter Bolt, Logger Bolt, Raw Logs, Semi Agg, Hadoop Aggregator, Snap, and Long agg; the spout and bolts appear more than once]
Closer Look
Critical design goals:
– fast ack for all tuples
– fast restart of counter
Ack happens when tuple hits the replay log (100s of milliseconds)
Restart involves replaying semi-aggs + replay log (very fast)
[Figure: Counter Bolt with Incoming records, Replay Log, and Semi-aggregated records; regions labeled "Real-time" and "Long-time"]
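The restart path can be sketched as a single-process R simulation. This is not Storm code: `new_store`, `ack_tuple`, `flush_semi_agg`, and `restart_counter` are hypothetical names, and in-memory vectors stand in for the durable replay log and semi-aggregated records. It also assumes the replay log is truncated whenever a semi-aggregate is written, which the slide implies but does not state.

```r
# In-memory stand-ins for durable storage (an assumption for illustration).
new_store <- function() list(replay_log = character(0), semi_aggs = list())

# A tuple counts as acked once its label is appended to the replay log.
ack_tuple <- function(store, label) {
  store$replay_log <- c(store$replay_log, label)
  store
}

# Count labels in a character vector, returning a named numeric vector.
count_labels <- function(labels) {
  if (length(labels) == 0) return(numeric(0))
  tab <- table(labels)
  setNames(as.numeric(tab), names(tab))
}

# Add two named count vectors label by label.
add_counts <- function(a, b) {
  labels <- union(names(a), names(b))
  out <- setNames(numeric(length(labels)), labels)
  for (x in list(a, b)) {
    if (length(x) > 0) out[names(x)] <- out[names(x)] + x
  }
  out
}

# Write a semi-aggregated record for everything logged so far,
# then truncate the replay log.
flush_semi_agg <- function(store) {
  store$semi_aggs <- c(store$semi_aggs, list(count_labels(store$replay_log)))
  store$replay_log <- character(0)
  store
}

# Restart: rebuild the counter from the semi-aggregates plus the replay log.
restart_counter <- function(store) {
  total <- Reduce(add_counts, store$semi_aggs, numeric(0))
  add_counts(total, count_labels(store$replay_log))
}

store <- new_store()
for (q in c("hadoop", "storm", "hadoop")) store <- ack_tuple(store, q)
store <- flush_semi_agg(store)
for (q in c("storm", "mapr")) store <- ack_tuple(store, q)

recovered <- restart_counter(store)
print(recovered[c("hadoop", "storm", "mapr")])
```

Restart only has to read a handful of small semi-aggregated records and the short tail of the log, which is why it can be fast.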
A Frozen Moment in Time
Snapshot defines the dividing line
All data in the snap is long-time, all after is real-time
Semi-agg strategy allows clean query
[Figure: Semi Agg, Hadoop Aggregator, Snap, and Long agg]
Guarantees
Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuples/s
– 1 tuple/s/label/bolt for this exercise
Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append
HDFS is distributed, but provides no guarantees and strange semantics
MapRfs is distributed, provides all necessary guarantees
Presentation Layer
Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records
User will see
– counts that increment within 0-2 s of events
– seamless meld of short and long-term data
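As a rough illustration of the combining step, the R sketch below adds a long-time aggregate to the recent semi-aggregated records. The function name `blend_counts` and the sample counts are invented for the example. In the design described here the inputs would be the Hadoop job output for data up to the snapshot and the semi-aggregated records written after it.

```r
# Combine long-time counts with recent semi-aggregated records.
# All inputs are named numeric vectors keyed by label.
blend_counts <- function(long_agg, semi_aggs) {
  flat <- unlist(c(list(long_agg), semi_aggs))
  if (length(flat) == 0) return(numeric(0))
  # tapply sums the values that share a label.
  totals <- tapply(flat, names(flat), sum)
  setNames(as.numeric(totals), names(totals))
}

long_agg <- c(hadoop = 1200, storm = 300)          # made-up long-time counts
semi_aggs <- list(c(hadoop = 2, storm = 1),        # made-up recent records
                  c(storm = 1, mapr = 1))
blended <- blend_counts(long_agg, semi_aggs)
print(blended)
```

Since semi-aggregated records are already sums, the presentation layer only adds a few small vectors per query, so the blend stays cheap however much raw data lies behind the long-time side.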
Example 2 – AB testing in real-time
I have 15 versions of my landing page
Each visitor is assigned to a version
– Which version?
A conversion or sale or whatever can happen
– How long to wait?
Some versions of the landing page are horrible
– Don’t want to give them traffic
A Quick Diversion
You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?
A Philosophical Conclusion
Probability as expressed by humans is subjective and depends on information and experience
I Dunno
5 heads out of 10 throws
2 heads out of 12 throws
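These three states of knowledge can be put into numbers. The sketch below assumes a uniform Beta(1, 1) prior on the probability of heads, which is consistent with the `+1` terms in the sampling code later in the talk; the helper name `posterior_summary` is made up for this example. Under that assumption, h heads in n throws gives a Beta(h + 1, n - h + 1) posterior.

```r
# Posterior mean and central 95% interval for the probability of heads,
# assuming a uniform Beta(1, 1) prior.
posterior_summary <- function(heads, throws) {
  a <- heads + 1
  b <- throws - heads + 1
  c(mean = a / (a + b),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

cases <- rbind(no_data = posterior_summary(0, 0),
               five_of_ten = posterior_summary(5, 10),
               two_of_twelve = posterior_summary(2, 12))
print(round(cases, 3))
```

With no data the interval covers nearly the whole range from 0 to 1; each batch of throws narrows it and, in the last case, moves it well below one half.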
Bayesian Bandit
Compute distributions based on data
Sample p1 and p2 from these distributions
Put a coin in bandit 1 if p1 > p2
Else, put the coin in bandit 2
And it works!
Video Demo
The Code
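Select an alternative
  select = function(k) {
    # k[i,1] and k[i,2] hold the two outcome counts for alternative i
    n = dim(k)[1]
    p0 = rep(0, length.out=n)
    for (i in 1:n) {
      p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
    }
    return (which(p0 == max(p0)))
  }
Select and learn
  # wrapper name "learn" is assumed; the slide shows only the body
  learn = function(k, steps) {
    for (z in 1:steps) {
      i = select(k)
      j = test(i)
      k[i,j] = k[i,j]+1
    }
    return (k)
  }
But we already know how to count!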
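To try the two fragments end to end, here is a self-contained R simulation of this scheme, which is often called Thompson sampling. It restates both fragments as functions (the name `learn` is hypothetical) and supplies a `test` function, which the slide does not show. The conversion rates in `true_rates` are invented for the demonstration. Column 1 of `k` is taken to count failures and column 2 successes, an inference from the order of the `rbeta` arguments.

```r
set.seed(42)

# Hypothetical true conversion rates for three alternatives.
true_rates <- c(0.02, 0.05, 0.10)

# Draw one sample per alternative from its Beta posterior, pick the largest.
select <- function(k) {
  p <- rbeta(nrow(k), k[, 2] + 1, k[, 1] + 1)
  which.max(p)
}

# Simulated trial: returns column 2 on success, column 1 on failure.
test <- function(i) if (runif(1) < true_rates[i]) 2 else 1

# Repeatedly select an alternative, observe the outcome, update the counts.
learn <- function(k, steps) {
  for (z in seq_len(steps)) {
    i <- select(k)
    j <- test(i)
    k[i, j] <- k[i, j] + 1
  }
  k
}

k <- learn(matrix(0, nrow = length(true_rates), ncol = 2), steps = 5000)
colnames(k) <- c("failures", "successes")
print(k)
print(rowSums(k))  # number of trials given to each alternative
```

The only state is the count matrix `k`, which is why the counting machinery from the first half of the talk is enough to run the bandit.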
The Basic Idea
We can encode a distribution by sampling
Sampling allows unification of exploration and exploitation
Can be extended to more general response models
Thank You