40
1 ©MapR Technologies - Confidential Real-time and Long- time with Storm and Hadoop

1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

Embed Size (px)

Citation preview

Page 1: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

1©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop

Page 2: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

2©MapR Technologies - Confidential

Real-time and Long-time with Storm and Hadoop MapR

Page 3: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

3©MapR Technologies - Confidential

Contact:– [email protected]– @ted_dunning

Slides and such:– http://info.mapr.com/ted-uk-05-2012

Hash tag: #mapr_uk

Collective notes: http://bit.ly/JDCRhc

Page 4: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

4©MapR Technologies - Confidential

Company Background

MapR provides the industry’s best Hadoop Distribution– Combines the best of the Hadoop community

contributions with significant internally financed infrastructure development

Background of Team– Deep management bench with extensive analytic,

storage, virtualization, and open source experience– Google, EMC, Cisco, VMWare, Network Appliance, IBM,

Microsoft, Apache Foundation, Aster Data, Brio, ParAccel Proven – MapR used across industries (Financial Services, Media,

Telcom, Health Care, Internet Services, Government) – Strategic OEM relationship with EMC and Cisco– Over 1,000 installs

Page 5: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

5©MapR Technologies - Confidential

Expanding Hadoop Use Cases

NFS for file-based

applications

Hadoop APIsfor HadoopApplications

ODBC (JDBC) for SQL-based applications

Blue = MapR Innovations

Real-time Applications

MissionCritical and SLA

dependent Applications

Page 6: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

6©MapR Technologies - Confidential

MapR’s Complete Distribution for Apache Hadoop

MapR Heatmap™

LDAP, NISIntegration

Quotas, Alerts, Alarms

CLI, REST APT

Hive Pig Oozle Sqoop HBase Whirr

Mahout Cascading NaglosIntegration

GangliaIntegration

Flume Zoo-keeper

MapR Control System

DirectAccess

NFS

Real-Time Streaming Volumes Mirrors Snap-

shotsData

Placement

No NameNodeArchitecture

High PerformanceDirect Shuffle

Stateful Failoverand Self Healing

2.7MapR’s Storage Services™

Integrated, tested, hardened andSupported

100% Hadoop, HBase, HDFS API compatible

Easy portability/migration between distributions

Unique advanced features

No changes required to Hadoop applications

Runs on commodity hardware

Page 7: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

7©MapR Technologies - Confidential

So what about that real-time stuff?

Page 8: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

8©MapR Technologies - Confidential

The Challenge

Hadoop is great of processing vats of data– But sucks for real-time (by design!)

Storm is great for real-time processing– But lacks any way to deal with batch processing

It sounds like there isn’t a solution– Neither fashionable solution handles everything

Page 9: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

9©MapR Technologies - Confidential

This is not a problem.

It’s an opportunity!

Page 10: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

10©MapR Technologies - Confidential

t

now

Hadoop is Not Very Real-time

UnprocessedData

Fully processed

Latest full period

Hadoop job takes this long for this data

Page 11: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

11©MapR Technologies - Confidential

Need to Plug the Hole in Hadoop

We have real-time data with limited state– Exactly what Storm does– And what Hadoop does not

We also have long-term analytics with lots of state– Exactly what Hadoop does– And what Storm does not

Can Storm and Hadoop be combined?

Page 12: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

12©MapR Technologies - Confidential

t

now

Hadoop works great back here

Storm workshere

Real-time and Long-time together

Blended view

Blended view

Blended View

Page 13: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

13©MapR Technologies - Confidential

An Example

I want to know how many queries I get– Per second, minute, day, week

Results should be available– within <2 seconds 99.9+% of the time– within 30 seconds almost always

History should last >3 years Should work for 0.001 q/s up to 100,000 q/s Failure tolerant, yadda, yadda

Page 14: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

14©MapR Technologies - Confidential

Rough Design – Data Flow

Search Engine

Query Event Spout

Logger Bolt

Counter Bolt

Raw Logs

LoggerBolt

Semi Agg

Hadoop Aggregator

Snap

Long agg

Query Event Spout

Counter Bolt

Logger Bolt

Page 15: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

15©MapR Technologies - Confidential

Counter Bolt Detail

Input: Labels to count Output: Short-term semi-aggregated counts– (time-window, label, count)

Input is logged until next flush Non-zero counts emitted on flush if– event count reaches threshold (typical 100K)– time since last count reaches threshold (typical 1-10s)

Tuples acked when counts emitted Double count probability is > 0 but very small

Page 16: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

16©MapR Technologies - Confidential

Counter Bolt Counterintuitivity

Counts are emitted for same label, same time window many times– these are semi-aggregated– this is a feature– tuples can be acked within 1s– time windows can be much longer than 1s

No need to send same label to same bolt– speeds failure recovery

Page 17: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

17©MapR Technologies - Confidential

Design Flexibility

Tuples can be ack’ed as soon as they hit the log– counter can recover state on failure– log is burn after write

Count flush interval can be extended without extending tuple timeout– Decreases currency of counts in semi-aggregates

Total bandwidth for log is typically not huge– All of twitter @10,000 messages per second = 10K x 2KB = 20MB/s

Page 18: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

18©MapR Technologies - Confidential

Counter Bolt No-nos

Cannot accumulate entire period in-memory– Tuples must be ack’ed much sooner– State must be persisted before ack’ing– State can easily grow too large to handle without disk access

Cannot persist entire count table at once – Incremental persistence required

Page 19: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

19©MapR Technologies - Confidential

Guarantees

Counter output volume is small-ish– the greater of k tuples per 100K inputs or k tuple/s– 1 tuple/s/label/bolt for this exercise

Persistence layer must provide guarantees– distributed against node failure– must have either readable flush or closed-append

HDFS is distributed, but provides no guarantees and strange semantics

MapRfs is distributed, provides all necessary guarantees

Page 20: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

20©MapR Technologies - Confidential

Failure Modes

Bolt failure– buffered tuples will go un’acked– after timeout, tuples will be resent– timeout ≈ 10s– if failure occurs after persistence, before acking, then double-counting is

possible Storage (with MapR)– most failures invisible– a few continue within 0-2s, some take 10s– catastrophic cluster restart can take 2-3 min– logger can buffer this much easily

Page 21: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

21©MapR Technologies - Confidential

Presentation Layer

Presentation must– read recent output of Logger bolt– read relevant output of Hadoop jobs– combine semi-aggregated records

User will see– counts that increment within 0-2 s of events– seamless meld of short and long-term data

Page 22: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

22©MapR Technologies - Confidential

Example 2 – Real-time learning

My system has to– learn a response model

and– select training data– in real-time

Data rate up to 100K queries per second

Page 23: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

23©MapR Technologies - Confidential

Door Number 3 – AB testing in real-time

I have 15 versions of my landing page Each visitor is assigned to a version– Which version?

A conversion or sale or whatever can happen– How long to wait?

Some versions of the landing page are horrible– Don’t want to give them traffic

Page 24: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

24©MapR Technologies - Confidential

Real-time Constraints

Selection must happen in <20 ms almost all the time Training events must be handled in <20 ms Failover must happen within 5 seconds Client should timeout and back-off– no need for an answer after 500ms

State persistence required

Page 25: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

25©MapR Technologies - Confidential

Rough Design

DRPC Spout Query Event Spout

Logger Bolt

Counter Bolt

Raw Logs

Model State

Timed Join Model

Logger Bolt

Conversion Detector

Selector Layer

Page 26: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

26©MapR Technologies - Confidential

A Quick Diversion

You see a coin– What is the probability of heads?– Could it be larger or smaller than that?

I flip the coin and while it is in the air ask again I catch the coin and ask again I look at the coin (and you don’t) and ask again Why does the answer change?– And did it ever have a single value?

Page 27: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

27©MapR Technologies - Confidential

A First Conclusion

Probability as expressed by humans is subjective and depends on information and experience

Page 28: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

28©MapR Technologies - Confidential

A Second Diversion

What is the mass of the moon?– 1/2 degree @ 385 Mm = ~ 3.8 Mm diameter (really about 3.4-ish)– V = 1/6 x pi x 3.83 x 1018 m3 = ~ 29 x 1018 m3 (really about 22)– m = rho V = 4 Mg/m3 x 29 x 1018 m3 = 1.2 x 1023 kg (really about 0.7)

Is that the exact number?– Shouldn’t we have confidence bounds?

Wikipedia says: 7.3477 × 1022 kg– Is that the exact number?– Shouldn’t they have confidence bounds?

Page 29: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

29©MapR Technologies - Confidential

A Second Conclusion

A single number is a bad way to express uncertain knowledge

A distribution of values might be better

Page 30: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

30©MapR Technologies - Confidential

I Dunno

Page 31: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

31©MapR Technologies - Confidential

5 and 5

Page 32: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

32©MapR Technologies - Confidential

2 and 10

Page 33: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

33©MapR Technologies - Confidential

Bayesian Bandit

Compute distributions based on data Sample p1 and p2 from these distributions

Put a coin in bandit 1 if p1 > p2

Else, put the coin in bandit 2

Page 34: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

34©MapR Technologies - Confidential

And it works!

Page 35: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

35©MapR Technologies - Confidential

Video Demo

Page 36: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

36©MapR Technologies - Confidential

The Code

Select an alternative

Select and learn

But we already know how to count!

n = dim(k)[1] p0 = rep(0, length.out=n) for (i in 1:n) { p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1) } return (which(p0 == max(p0)))

for (z in 1:steps) { i = select(k) j = test(i) k[i,j] = k[i,j]+1 } return (k)

Page 37: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

37©MapR Technologies - Confidential

The Basic Idea

We can encode a distribution by sampling Sampling allows unification of exploration and exploitation

Can be extended to more general response models

Page 38: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

39©MapR Technologies - Confidential

Contact:– [email protected]– @ted_dunning

Slides and such:– http://info.mapr.com/ted-uk-05-2012

Page 39: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

40©MapR Technologies - Confidential

MapR’s Innovations

Page 40: 1 ©MapR Technologies - Confidential Real-time and Long-time with Storm and Hadoop

41©MapR Technologies - Confidential

Thank You