31
1 ©MapR Technologies - Confidential Real-time and Long- time with Storm and Hadoop

Real-time and long-time together

Embed Size (px)

DESCRIPTION

A talk that I gave at the Big Data Analytics meetup hosted by Klout about how real-time and long-time can be integrated into a single computation.

Citation preview

Page 1: Real-time and long-time together

1©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop

Page 2: Real-time and long-time together

2©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop MapR

Page 3: Real-time and long-time together

3©MapR Technologies - Confidential

Contact:– [email protected]– @ted_dunning

Slides and such (available late tonight):– http://info.mapr.com/ted-utahjug

Hash tags: #mapr #storm

Page 4: Real-time and long-time together

4©MapR Technologies - Confidential

The Challenge

Hadoop is great of processing vats of data– But sucks for real-time (by design!)

Storm is great for real-time processing– But lacks any way to deal with batch processing

It sounds like there isn’t a solution– Neither fashionable solution handles everything

Page 5: Real-time and long-time together

5©MapR Technologies - Confidential

This is not a problem.

It’s an opportunity!

Page 6: Real-time and long-time together

6©MapR Technologies - Confidential

What is Map-Reduce?

Map-reduce programs are defined (mostly) by– A map function that does independent record transformations (and

deletions and replications)– Reduce functions that do aggregation

Map-reduce programs run in framework that– Schedules and re-runs tasks– Splits the input– Moves map outputs to reduce inputs– Receives the results

6

Page 7: Real-time and long-time together

7©MapR Technologies - Confidential

Inside Map-Reduce

7

Input Map CombineShuffleand sort

Reduce Output

Reduce

"The time has come," the Walrus said,"To talk of many things:Of shoes—and ships—and sealing-wax

the, 1time, 1has, 1come, 1…

come, [3,2,1]has, [1,5,2]the, [1,2,1]time, [10,1,3]…

come, 6has, 8the, 4time, 14…

Page 8: Real-time and long-time together

8©MapR Technologies - Confidential

Not Just Text

Counting words is easy

Many other problems work as well– Sessionize user logs– Very large scale joins– Large scale matrix recommendations– Computing the quadrillionth digit of π

Map-reduce is inherently batch oriented

Page 9: Real-time and long-time together

9©MapR Technologies - Confidential

Inside Map-Reduce

9

Input Map Shuffleand sort

Reduce Output

road1, polyline(p1, p2, p3, …)lake1, polygon(p4, p5, p7, p9, …)road2, polyline(p6, p7, p9, …)

tile0918-1412, road1tile1082-8143, road1tile0014-3284, lake1tile1082-8143, lake1…

tile0918-1412, [road1]tile1082-8143, [road1, lake1]tile0014-3284, [lake1]…

tile1082-8143, img#1tile0014-3284, img#2tile1082-8143, img#3…

Page 10: Real-time and long-time together

10©MapR Technologies - Confidential

What is Storm?

A Storm program is called a topology– Spouts inject data into a topology– Bolts process data

The units of data are called tuples All processing is flow-through Bolts can buffer or persist Output tuples can be anchored If a tuple and all descendants are not ack’ed in time, bolt has died Bolts that fail are restarted and un-acked tuples are replayed

Page 11: Real-time and long-time together

11©MapR Technologies - Confidential

t

now

Hadoop is Not Very Real-time

UnprocessedData

Fully processed

Latest full period

Hadoop job takes this long for this data

Page 12: Real-time and long-time together

12©MapR Technologies - Confidential

t

now

Hadoop works great back here

Storm workshere

Real-time and Long-time together

Blended view

Blended view

Blended View

Page 13: Real-time and long-time together

13©MapR Technologies - Confidential

One Alternative

Search Engine

NoSqlde Jour

Consumer

Real-time Long-time

?

Page 14: Real-time and long-time together

14©MapR Technologies - Confidential

Problems

Simply dumping into noSql engine doesn’t quite work Insert rate is limited No load isolation– Big retrospective jobs kill real-time

Low scan performance– Hbase pretty good, but not stellar

Difficult to set boundaries– where does real-time end and long-time begin?

Page 15: Real-time and long-time together

15©MapR Technologies - Confidential

Data Sources

Catcher Cluster

Rough Design – Data Flow

Catcher Cluster

Query Event Spout

Logger Bolt

Counter Bolt

Raw Logs

LoggerBolt

Semi Agg

Hadoop Aggregator

Snap

Long agg

ProtoSpout Counter Bolt

Logger Bolt

Data Sources

Page 16: Real-time and long-time together

16©MapR Technologies - Confidential

Closer Look – Catcher Protocol

Data Sources

Catcher ClusterCatcher Cluster

Data Sources

The data sources and catchers communicate with a very simple protocol.

Hello() => list of catchersLog(topic,message) => (OK|FAIL, redirect-to-catcher)

Page 17: Real-time and long-time together

17©MapR Technologies - Confidential

Closer Look – Catcher Queues

Catcher Cluster

Catcher Cluster

The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop.

Each topic file is appended by exactly one catcher.

Topic files are kept in shared file storage.

TopicFile

TopicFile

Page 18: Real-time and long-time together

18©MapR Technologies - Confidential

Closer Look – ProtoSpout

The ProtoSpout tails the topic files,parses log records into tuples andinjects them into the Storm topology.

Last fully acked position stored in shared file system.

TopicFile

TopicFile

ProtoSpout

Page 19: Real-time and long-time together

19©MapR Technologies - Confidential

Closer Look – Counter Bolt

Critical design goals:– fast ack for all tuples– fast restart of counter

Ack happens when tuple hits the replay log (10’s of milliseconds) Restart involves replaying semi-agg’s + replay log (very fast) Replay log only lasts until next semi-aggregate goes out

Counter Bolt

Replay Log

Semi-aggregated records

Incoming records

Real-time Long-time

Page 20: Real-time and long-time together

20©MapR Technologies - Confidential

A Frozen Moment in Time

Snapshot defines the dividing line

All data in the snap is long-time, all after is real-time

Semi-agg strategy allows clean query

Semi Agg

Hadoop Aggregator

Snap

Long agg

Page 21: Real-time and long-time together

21©MapR Technologies - Confidential

Guarantees

Counter output volume is small-ish– the greater of k tuples per 100K inputs or k tuple/s– 1 tuple/s/label/bolt for this exercise

Persistence layer must provide guarantees– distributed against node failure– must have either readable flush or closed-append

HDFS is distributed, but provides no guarantees and strange semantics

MapRfs is distributed, provides all necessary guarantees

Page 22: Real-time and long-time together

22©MapR Technologies - Confidential

Presentation Layer

Presentation must– read recent output of Logger bolt– read relevant output of Hadoop jobs– combine semi-aggregated records

User will see– counts that increment within 0-2 s of events– seamless and accurate meld of short and long-term data

Page 23: Real-time and long-time together

23©MapR Technologies - Confidential

Example 2 – AB testing in real-time

I have 15 versions of my landing page Each visitor is assigned to a version– Which version?

A conversion or sale or whatever can happen– How long to wait?

Some versions of the landing page are horrible– Don’t want to give them traffic

Page 24: Real-time and long-time together

29©MapR Technologies - Confidential

Bayesian Bandit

Compute distributions based on data Sample p1 and p2 from these distributions

Put a coin in bandit 1 if p1 > p2

Else, put the coin in bandit 2

Page 25: Real-time and long-time together

30©MapR Technologies - Confidential

And it works!

Page 26: Real-time and long-time together

31©MapR Technologies - Confidential

Video Demo

Page 27: Real-time and long-time together

32©MapR Technologies - Confidential

The Code

Select an alternative

Select and learn

But we already know how to count!

n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0)))

for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)

Page 28: Real-time and long-time together

33©MapR Technologies - Confidential

The Basic Idea

We can encode a distribution by sampling Sampling allows unification of exploration and exploitation

Can be extended to more general response models

Page 29: Real-time and long-time together

34©MapR Technologies - Confidential

The Bigger Basic Idea

Online algorithms generally have relatively small state (like counting)

Online algorithms generally have a simple update (like counting) If we can do this with counting, we can do it with all kinds of

algorithms

Page 30: Real-time and long-time together

37©MapR Technologies - Confidential

Contact:– [email protected]– @ted_dunning

Slides and such (available late tonight):– Send email !!

Hash tags: #mapr #bigdata

We are hiring!

Page 31: Real-time and long-time together

38©MapR Technologies - Confidential

Thank You