Real-time and Long-time Together

DESCRIPTION

A talk that Ted Dunning gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

Page 1: Real-time and Long-time Together

1©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop

Page 2: Real-time and Long-time Together

Real-time and Long-time with Storm and Hadoop MapR

Page 3: Real-time and Long-time Together

Contact:
– [email protected]
– @ted_dunning

Slides and such (available late tonight):
– http://info.mapr.com/ted-pjug

Hash tags: #mapr #pjug

Page 4: Real-time and Long-time Together

The Challenge

Hadoop is great at processing vats of data
– But sucks for real-time (by design!)

Storm is great for real-time processing
– But lacks any way to deal with batch processing

It sounds like there isn’t a solution
– Neither fashionable solution handles everything

Page 5: Real-time and Long-time Together

This is not a problem.

It’s an opportunity!

Page 6: Real-time and Long-time Together

What is Map-Reduce?

Map-reduce programs are defined (mostly) by
– A map function that does independent record transformations (and deletions and replications)
– Reduce functions that do aggregation

Map-reduce programs run in a framework that
– Schedules and re-runs tasks
– Splits the input
– Moves map outputs to reduce inputs
– Receives the results

Page 7: Real-time and Long-time Together

Inside Map-Reduce

Input → Map → Combine → Shuffle and sort → Reduce → Output

Example input:

"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-wax

After map:
the, 1
time, 1
has, 1
come, 1
…

After combine, shuffle and sort:
come, [3,2,1]
has, [1,5,2]
the, [1,2,1]
time, [10,1,3]
…

After reduce:
come, 6
has, 8
the, 4
time, 14
…
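The word-count flow above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle-and-sort, and reduce stages running in one process, not Hadoop code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this sketch, and the combine step is left out for brevity.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        word = word.strip('.,:;"!?')
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped))
```

In a real framework the map and reduce calls run on different machines and the shuffle moves data between them; here everything happens in one process so the three stages are easy to see.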

Page 8: Real-time and Long-time Together

Not Just Text

Counting words is easy

Many other problems work as well
– Sessionize user logs
– Very large scale joins
– Large scale matrix recommendations
– Computing the quadrillionth digit of π

Map-reduce is inherently batch oriented

Page 9: Real-time and Long-time Together

Inside Map-Reduce

Input → Map → Shuffle and sort → Reduce → Output

Input:
road1, polyline(p1, p2, p3, …)
lake1, polygon(p4, p5, p7, p9, …)
road2, polyline(p6, p7, p9, …)

After map:
tile0918-1412, road1
tile1082-8143, road1
tile0014-3284, lake1
tile1082-8143, lake1
…

After shuffle and sort:
tile0918-1412, [road1]
tile1082-8143, [road1, lake1]
tile0014-3284, [lake1]
…

Output:
tile1082-8143, img#1
tile0014-3284, img#2
tile1082-8143, img#3
…

Page 10: Real-time and Long-time Together

What is Storm?

A Storm program is called a topology
– Spouts inject data into a topology
– Bolts process data

The units of data are called tuples
All processing is flow-through
Bolts can buffer or persist
Output tuples can be anchored
Bolts that fail are restarted and un-acked tuples are replayed
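The replay behaviour described above can be illustrated with a toy, single-process model. This is not the Storm API: the class `ReplayingSpout` and the driver `run` are hypothetical names for this sketch, and the method names only loosely echo a spout's next-tuple, ack, and fail callbacks. A real topology tracks acks across many bolts and machines.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout: remembers emitted tuples until they are acked."""

    def __init__(self, records):
        self.queue = deque(enumerate(records))  # (tuple_id, payload) pairs
        self.pending = {}                       # emitted but not yet acked

    def next_tuple(self):
        """Emit the next tuple, or None when nothing is left to emit."""
        if not self.queue:
            return None
        tuple_id, payload = self.queue.popleft()
        self.pending[tuple_id] = payload
        return tuple_id, payload

    def ack(self, tuple_id):
        """The tuple was fully processed; forget it."""
        self.pending.pop(tuple_id, None)

    def fail(self, tuple_id):
        """The tuple was not processed; queue it to be emitted again."""
        self.queue.append((tuple_id, self.pending.pop(tuple_id)))


def run(spout, bolt):
    """Drive the spout until every tuple has been processed and acked."""
    while True:
        item = spout.next_tuple()
        if item is None:
            return
        tuple_id, payload = item
        try:
            bolt(payload)
            spout.ack(tuple_id)
        except RuntimeError:
            # Stand-in for a bolt that crashed before acking.
            spout.fail(tuple_id)
```

Because a failed tuple is simply emitted again, a bolt may see the same tuple more than once, so this style of replay gives at-least-once processing.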

Page 11: Real-time and Long-time Together

Hadoop is Not Very Real-time

[Figure: timeline along t, ending at "now". Regions: Fully processed; Latest full period; Unprocessed data. Annotation: "Hadoop job takes this long for this data".]

Page 12: Real-time and Long-time Together

Real-time and Long-time together

[Figure: timeline along t, ending at "now". Annotations: "Hadoop works great back here"; "Storm works here"; a blended view covers both.]

Page 13: Real-time and Long-time Together

One Alternative

[Figure: boxes labeled Search Engine, NoSql de Jour, and Consumer, with regions marked Real-time and Long-time and a "?".]

Page 14: Real-time and Long-time Together

Problems

Simply dumping into a NoSQL engine doesn’t quite work
Insert rate is limited
No load isolation
– Big retrospective jobs kill real-time

Low scan performance
– HBase is pretty good, but not stellar

Difficult to set boundaries
– Where does real-time end and long-time begin?

Page 15: Real-time and Long-time Together

Rough Design – Data Flow

[Figure: data flow diagram. Components shown: Search Engine, Query Event Spout, Counter Bolt, Logger Bolt, Raw Logs, Semi Agg, Snap, Hadoop Aggregator, Long agg.]

Page 16: Real-time and Long-time Together

Closer Look

Critical design goals:
– fast ack for all tuples
– fast restart of counter

Ack happens when tuple hits the replay log (100s of milliseconds)
Restart involves replaying semi-aggs + replay log (very fast)

[Figure: Counter Bolt receiving Incoming records, writing to the Replay Log, and emitting Semi-aggregated records, with regions marked Real-time and Long-time.]
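A minimal sketch of the log-then-ack idea, assuming a Python list stands in for the durable replay log and another list for the semi-aggregated records already written out. The class `CounterBolt` and its methods are illustrative only, and truncating the log at each flush is an assumption of this sketch rather than something the slides state.

```python
from collections import Counter


class CounterBolt:
    """Toy counter: log first, then ack, then count in memory."""

    def __init__(self, replay_log, semi_aggs):
        self.replay_log = replay_log  # list standing in for a durable log
        self.semi_aggs = semi_aggs    # list of already-flushed Counter objects
        self.counts = Counter()       # in-memory counts since the last flush

    def process(self, label):
        """Record the tuple in the replay log, count it, and ack it."""
        self.replay_log.append(label)  # the write happens before the ack
        self.counts[label] += 1
        return "ack"

    def flush(self):
        """Emit a semi-aggregated record and truncate the replay log."""
        self.semi_aggs.append(self.counts)
        self.counts = Counter()
        self.replay_log.clear()  # assumption: the log only covers un-flushed data

    @classmethod
    def restart(cls, replay_log, semi_aggs):
        """Rebuild the un-flushed counts after a crash from the replay log."""
        bolt = cls(replay_log, semi_aggs)
        bolt.counts = Counter(replay_log)
        return bolt

    def totals(self):
        """Semi-aggregated records plus whatever has not been flushed yet."""
        return sum(self.semi_aggs, Counter()) + self.counts
```

Restart is cheap in this model because only the tuples since the last flush have to be re-read; everything older is already summarized in the semi-aggregated records.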

Page 17: Real-time and Long-time Together

A Frozen Moment in Time

Snapshot defines the dividing line

All data in the snap is long-time, all after is real-time

Semi-agg strategy allows clean query

[Figure: detail of the data flow showing Semi Agg, Snap, Hadoop Aggregator, and Long agg.]

Page 18: Real-time and Long-time Together

Guarantees

Counter output volume is small-ish
– the greater of k tuples per 100K inputs or k tuples/s
– 1 tuple/s/label/bolt for this exercise

Persistence layer must provide guarantees
– distributed against node failure
– must have either readable flush or closed-append

HDFS is distributed, but provides no guarantees and strange semantics

MapRfs is distributed and provides all necessary guarantees

Page 19: Real-time and Long-time Together

Presentation Layer

Presentation must
– read recent output of Logger bolt
– read relevant output of Hadoop jobs
– combine semi-aggregated records

User will see
– counts that increment within 0-2 s of events
– seamless meld of short and long-term data
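One way the combining step could look, as a rough sketch. It assumes the long-time aggregate and each semi-aggregated record are simple label-to-count maps and that every semi-aggregated record carries a timestamp to compare with the snapshot time. The function name `blended_counts` and its parameters are hypothetical.

```python
from collections import Counter


def blended_counts(long_agg, semi_aggs, snapshot_time):
    """Combine the long-time aggregate with newer semi-aggregated records.

    long_agg:      Counter covering everything up to snapshot_time
    semi_aggs:     iterable of (timestamp, Counter) from the real-time side
    snapshot_time: the dividing line between long-time and real-time
    """
    total = Counter(long_agg)  # copy, so the caller's aggregate is untouched
    for timestamp, counts in semi_aggs:
        if timestamp > snapshot_time:  # older records are already in long_agg
            total.update(counts)
    return total
```

Because counts simply add, the presentation layer does not care how many semi-aggregated records there are or how they were batched, only which side of the snapshot they fall on.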

Page 20: Real-time and Long-time Together

Example 2 – AB testing in real-time

I have 15 versions of my landing page
Each visitor is assigned to a version
– Which version?

A conversion or sale or whatever can happen
– How long to wait?

Some versions of the landing page are horrible
– Don’t want to give them traffic

Page 21: Real-time and Long-time Together

A Quick Diversion

You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?

Page 22: Real-time and Long-time Together

A Philosophical Conclusion

Probability as expressed by humans is subjective and depends on information and experience

Page 23: Real-time and Long-time Together

I Dunno

Page 24: Real-time and Long-time Together

5 heads out of 10 throws

Page 25: Real-time and Long-time Together

2 heads out of 12 throws
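The three coin slides presumably showed how belief about the probability of heads tightens as throws accumulate. Assuming a uniform prior (an assumption here, though it matches the "+1" terms in the R code later in the deck), the belief after h heads in n throws is a Beta distribution:

```latex
\begin{align*}
p \mid (h \text{ heads in } n \text{ throws}) &\sim \mathrm{Beta}(h + 1,\; n - h + 1),
  & \mathbb{E}[p] &= \frac{h + 1}{n + 2} \\[4pt]
\text{no data:}\quad p &\sim \mathrm{Beta}(1, 1) \text{ (uniform)},
  & \mathbb{E}[p] &= \tfrac{1}{2} \\
\text{5 heads out of 10:}\quad p &\sim \mathrm{Beta}(6, 6),
  & \mathbb{E}[p] &= \tfrac{6}{12} = 0.5 \\
\text{2 heads out of 12:}\quad p &\sim \mathrm{Beta}(3, 11),
  & \mathbb{E}[p] &= \tfrac{3}{14} \approx 0.214
\end{align*}
```

With no data every value of p is equally plausible; with more throws the distribution narrows around the observed rate.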

Page 26: Real-time and Long-time Together

Bayesian Bandit

Compute distributions based on data
Sample p1 and p2 from these distributions
Put a coin in bandit 1 if p1 > p2
Else, put the coin in bandit 2
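The two-bandit rule above can be tried out with a small simulation. This sketch assumes a uniform prior, so each bandit's belief is a Beta distribution over its win and loss counts, and it invents the payout rates passed to `play`; the function names are hypothetical and the code is not taken from the talk.

```python
import random


def choose_bandit(wins, losses, rng):
    """Sample p1 and p2 from the two posteriors; return 0 for bandit 1, 1 for bandit 2."""
    p1 = rng.betavariate(wins[0] + 1, losses[0] + 1)
    p2 = rng.betavariate(wins[1] + 1, losses[1] + 1)
    return 0 if p1 > p2 else 1


def play(true_rates, steps, seed=0):
    """Spend `steps` coins; return the per-bandit win and loss counts."""
    rng = random.Random(seed)
    wins, losses = [0, 0], [0, 0]
    for _ in range(steps):
        i = choose_bandit(wins, losses, rng)
        if rng.random() < true_rates[i]:  # simulated payout, unknown to the player
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses
```

Calling `play([0.2, 0.8], 1000)` should send most of the coins to the second bandit: early on the samples are noisy, so both bandits get tried, and as counts accumulate the samples for the better bandit win the comparison more and more often.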

Page 27: Real-time and Long-time Together

And it works!

Page 28: Real-time and Long-time Together

Video Demo

Page 29: Real-time and Long-time Together

The Code

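Select an alternative

# k is an n x 2 matrix of counts, one row per alternative;
# column 2 feeds the first Beta parameter, column 1 the second
select = function(k) {
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a plausible success rate for alternative i
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which(p0 == max(p0)))
}

Select and learn

# the wrapper name "learn" is assumed; the slide shows only the body
# test(i) is not shown on the slide; it tries alternative i and
# returns the column of k to increment
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)
    k[i,j] = k[i,j]+1
  }
  return (k)
}

But we already know how to count!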

Page 30: Real-time and Long-time Together

The Basic Idea

We can encode a distribution by sampling
Sampling allows unification of exploration and exploitation

Can be extended to more general response models

Page 31: Real-time and Long-time Together

Contact:
– [email protected]
– @ted_dunning

Slides and such (available late tonight):
– http://info.mapr.com/ted-pjug

Hash tags: #mapr #pjug

Page 32: Real-time and Long-time Together

Thank You