44
Scaling near-realtime analytics with Kafka and HBase OSCON 2012 Dave Revell & Nate Putnam Urban Airship

Near-realtime analytics with Kafka and HBase

Embed Size (px)

Citation preview

Page 1: Near-realtime analytics with Kafka and HBase

Scaling near-realtime analytics with Kafka and HBaseOSCON 2012Dave Revell & Nate PutnamUrban Airship

Page 2: Near-realtime analytics with Kafka and HBase

About Us

• Nate Putnam• Team Lead, Core Data and Analytics (1 year)• Previously Engineer at Jive Software (4 years)• Contributor to HBase/Zookeeper

• Dave Revell• Database Engineer, Core Data and Analytics (1 year)• Previously Engineer at Meebo (1 year)• HBase contributor

Page 3: Near-realtime analytics with Kafka and HBase

In this Talk

• About Urban Airship• Why near-realtime?• About Kafka• Data Consumption • Scheduling • HBase / High speed counting• Questions

Page 4: Near-realtime analytics with Kafka and HBase

What is an Urban Airship?• Hosting for mobile services that developers should not

build themselves• Unified API for services across platforms• SLAs for throughput, latency

Page 5: Near-realtime analytics with Kafka and HBase

By The Numbers

Page 6: Near-realtime analytics with Kafka and HBase

By The Numbers

• Hundreds of millions devices

Page 7: Near-realtime analytics with Kafka and HBase

By The Numbers

• Hundreds of millions devices• Front end API sustains thousands of requests per

second

Page 8: Near-realtime analytics with Kafka and HBase

By The Numbers

• Hundreds of millions devices• Front end API sustains thousands of requests per

second• Millions of Android devices online all the time

Page 9: Near-realtime analytics with Kafka and HBase

By The Numbers

• Hundreds of millions devices• Front end API sustains thousands of requests per

second• Millions of Android devices online all the time• 6 months for the company to deliver 1M messages,

hundred million plus a day now.

Page 10: Near-realtime analytics with Kafka and HBase

Pretty Graphs

Page 11: Near-realtime analytics with Kafka and HBase

Near-Realtime?

• Realtime or Really fast?• Events happen async • Realtime failure scenarios are hard • In practice a few minutes is all right for analytics

Page 12: Near-realtime analytics with Kafka and HBase

Overview

Kafka-0

Kafka-1

Kafka-2

Consumer-0

Memory Queue Worker-0

Indexer-0

Indexer-1

Indexer-2

HBase

Page 13: Near-realtime analytics with Kafka and HBase

Kafka

• Created by the SNA Team @ LinkedIn• Publish-Subscribe system with minimal features• Can keep as many messages buffered as you have disk

space• Fast with some trade offs

Page 14: Near-realtime analytics with Kafka and HBase

Kafka

• Topics are partitioned across servers• Partitioning scheme is customizable

Topic_0partition_0

partition_1

partition_2

partition_3

Topic_0partition_4

partition_5

partition_6

partition_7

Topic_0partition_8

partition_9

partition_10

partition_11

Page 15: Near-realtime analytics with Kafka and HBase

Kafka

• Consumption is 1 thread per distinct partition• Consumers keep track of their own offsets

Topic_0, Consumer Group 1Partition 0, Offset : 7

Time

Topic_0, Consumer Group 1Partition 1, Offset : 10

Topic_0, Consumer Group 1Partition 3, Offset : 2

Page 16: Near-realtime analytics with Kafka and HBase

Kafka

• Manipulation of time based indexes is powerful• Monitor offsets and lag• Throw as much disk at this as you can• http://incubator.apache.org/kafka/

Page 17: Near-realtime analytics with Kafka and HBase

Consumers

• Mirror Kafka design• Lots of parallelism to increase throughput• Share nothing• No ordering guarantees

Page 18: Near-realtime analytics with Kafka and HBase

Consumers

Kafka-0

Kafka-1

Kafka-2

Consumer-0

Memory Queue Worker-0

Indexer-0

Indexer-1

Indexer-2

HBase

Page 19: Near-realtime analytics with Kafka and HBase

Consumers

Kafka-0

Kafka-1

Kafka-2

Consumer-0

Memory Queue Worker-0

Indexer-0

Indexer-1

Indexer-2

HBase

Page 20: Near-realtime analytics with Kafka and HBase

Consumers

Consumer-0

partition_9

partition_8

partition_11

partition_1

partition_0

partition_2

partition_3

partition_5

partition_4

partition_6

partition_7

Consumer-1

Consumer-2

Consumer-3

partition_10

Queue

Queue

Queue

Queue

Page 21: Near-realtime analytics with Kafka and HBase

Consumers

Kafka-0

Kafka-1

Kafka-2

Consumer-0

Memory Queue Worker-0

Indexer-0

Indexer-1

Indexer-2

HBase

Page 22: Near-realtime analytics with Kafka and HBase

Consumers

Queue

Worker-0

Worker-1

Worker-2

Worker-3

Worker-4

Worker-5

Indexer-0

Indexer-1

Indexer-2

Indexer-3

Indexer-4

Indexer-5

Indexer-6

Indexer-7

HBase

Page 23: Near-realtime analytics with Kafka and HBase

Scheduled aggregation tasks

• Challenge: aggregate values that arrive out of order• Example: sessions/clickstream• Two steps:

• Quickly write into HBase• Periodically scan to

calculate aggregates

t6 t4 t8t3t7t5 t1 t0t2t9

Page 24: Near-realtime analytics with Kafka and HBase

t6 t4 t8t3t7t5 t1 t0t2t9

Events arrive in arbitrary order

t0 t1 t5t4t3t2 t6 t9t8t7

Time-ordered on disk

Initially we store in an

HBase table with timestamp

as key

S E EESS S SSES = session startE = session end

Scheduled task scans sequentially and infers

sessions

Page 25: Near-realtime analytics with Kafka and HBase

t6 t4 t8t3t7t5 t1 t0t2t9

Events arrive in arbitrary order

t0 t1 t5t4t3t2 t6 t9t8t7

Time-ordered on disk

Initially we store in an

HBase table with timestamp

as key

S E EESS S SSES = session startE = session end

Scheduled task scans sequentially and infers

sessions

Page 26: Near-realtime analytics with Kafka and HBase

t6 t4 t8t3t7t5 t1 t0t2t9

Events arrive in arbitrary order

t0 t1 t5t4t3t2 t6 t9t8t7

Time-ordered on disk

Initially we store in an

HBase table with timestamp

as key

S E EESS S SSES = session startE = session end

Scheduled task scans sequentially and infers

sessions

Page 27: Near-realtime analytics with Kafka and HBase

Strengths

• Efficient with disk and memory• Can tradeoff response time for disk usage• Fine granularity, 10Ks of jobs

Page 28: Near-realtime analytics with Kafka and HBase

Compared to MapReduce

• Similar to MapReduce shuffle: sequential IO, external sort

• Fine grained failures, scheduling, resource allocation

• Can’t do lots of jobs, can’t do big jobs

• But MapReduce is easier to use 0

25

50

75

100

Alice Bob Charlie Dave

Input data size

Page 29: Near-realtime analytics with Kafka and HBase

Pro/con vs. realtime streaming

• For example, a Storm topology• Challenge: avoid random reads

(disk seeks) without keeping too much state in RAM

• Sorting minimizes state• But latency would be good

Bob's app,Devices 10-20

Page 30: Near-realtime analytics with Kafka and HBase

HBase

• What it is• Why it’s good for low-latency big data

Page 31: Near-realtime analytics with Kafka and HBase

HBase

• A database that uses HDFS for storage• Based on Google’s BigTable• Solves the problem “how do I query my Hadoop data?”

• Operations typically measured in milliseconds• MapReduce is not suitable for real time queries

• Scales well by adding servers (if you do everything right)• Not partition tolerant or eventually consistent

Page 32: Near-realtime analytics with Kafka and HBase

Why we like HBase

• Scalable• Fast: millions of ops/sec• Open source, ASF top-level project• Strong community

Page 33: Near-realtime analytics with Kafka and HBase

HBase difficulties

• Low level features, harder to use than RDBMS

• Hard to avoid accidentally introducing bottlenecks

• Garbage collection, JVM tuning• HDFS

Is it fast enough?

Identify bottleneck

Rethink access patterns

Page 34: Near-realtime analytics with Kafka and HBase

How to fail at HBase

• Schema can limit scalability

Region 1

Region 2

Region 3

Region 4

Region 5

Region 6

Region 7

RegionServer1

RegionServer2

KeyAKeyB

KeyCKeyD

KeyEKeyF

KeyGKeyH

KeyIKeyJ

KeyKKeyL

KeyMKeyN

Datanode 1

Datanode 2

Datanode 3

Datanode 4

HBase HDFS

Page 35: Near-realtime analytics with Kafka and HBase

Troubleshooting• Isolate slow regions or servers with statshtable

• http://github.com/urbanairship/statshtable

Region 1

Region 2

Region 3

Region 4

Region 5

Region 6

Region 7

RegionServer1

RegionServer2

KeyAKeyB

KeyCKeyD

KeyEKeyF

KeyGKeyH

KeyIKeyJ

KeyKKeyL

KeyMKeyN

Datanode 1

Datanode 2

Datanode 3

Datanode 4

HBase HDFS

Page 36: Near-realtime analytics with Kafka and HBase

Counting

• The main thing that we do• Scaling dimensions:

• Many counters of interest per event• Many events• Many changes to counter definitions

Page 37: Near-realtime analytics with Kafka and HBase

for event in stream:user_id = extract_user_id(event)timestamp = extract_timestamp(event)event_type = extract_event_type(event)client_type = extract_client_type(event)location = extract_location(event)

increment_event_type_count(event_type)increment_client_and_event_type_count(event_type, client_type)increment_user_id_and_event_type_count(user_id, event_type)increment_user_id_and_client_type_count(user_id, client_type)

for time_precision in {HOURLY, DAILY, MONTHLY}:increment_time_count(time_precision, timestamp)increment_time_client_type_event_type_count(time_precision, ..)increment_(...)

for location_precision in {CITY, STATE, COUNTRY}:increment_time_location_event_type_client_type_user_id(...)

for bucket in yet_another_dimension: ....

A naive attempt

Page 38: Near-realtime analytics with Kafka and HBase

Counting with datacubes

• Challenge: count items in a stream matching various criteria, when criteria may change

• github.com/urbanairship/datacube• A Java library for turning

streams into OLAP cubes• Especially multidimensional counters

Page 39: Near-realtime analytics with Kafka and HBase

The data cubeabstraction

Time Dimension

User ID dimension

User agent d

imension

Alice Bob Charlie

7/15/12 00:007/15/12 01:00

… more hourly rows...

7/15/12 00:00:007/16/12 00:00:00

… more daily rows...

5

8

3

...

0

8

11 10242

5

Hourly buckets

Daily buckets...

AndroidIOS

Not shown: many more dimensions

Page 40: Near-realtime analytics with Kafka and HBase

Why datacube?

• Handles exponential number of writes• Async IO with batching• Declarative interface: say what to count, not how• Pluggable database backend (currently HBase)• Bulk loader• Easily change/extend online cubes

Page 41: Near-realtime analytics with Kafka and HBase

Datacube isn’t just for counters

Courtesy Tim Robertson, GBIF

github.com/urbanairship/datacube

Page 42: Near-realtime analytics with Kafka and HBase

Courtesy Tim Robertson, GBIF

github.com/urbanairship/datacube

Page 43: Near-realtime analytics with Kafka and HBase

Questions?

Page 44: Near-realtime analytics with Kafka and HBase

Thanks!

• HBase and Kafka for being awesome• We’re hiring! urbanairship.com/jobs/• @nateputnam @dave_revell