Extending the Yahoo! Streaming Benchmark. Jamie Grier, @jamiegrier

Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Page 1: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Extending the Yahoo! Streaming Benchmark

Jamie Grier, @jamiegrier

Page 2: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Who am I?
• Director of Applications Engineering at data Artisans
• Previously working on streaming computation at Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for about a decade
• High-speed video, social media streaming, general frameworks for stream processing

Page 3: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Overview
• Yahoo! performed a benchmark comparing Apache Flink, Storm and Spark
• The benchmark never actually pushed Flink to its throughput limits but stopped at Storm's limits
• I knew Flink was capable of much more, so I repeated the benchmarks myself
• I wrote a follow-up blog post explaining my findings and will summarize them here

Page 4: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit current value of window aggregates to Redis every second for query
• Map ads to campaigns using Redis as well
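The workload described on this slide can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the benchmark's Storm or Flink code: the dictionary `AD_TO_CAMPAIGN` is a hypothetical stand-in for the Redis lookup, the event tuples are invented, the window is assumed to be a tumbling (non-overlapping) one, and the once-per-second emit to Redis is left out.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # window length stated on the slide

# Hypothetical stand-in for the Redis ad -> campaign lookup table.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}


def count_impressions(events, ad_to_campaign=AD_TO_CAMPAIGN):
    """Count impressions per (campaign, window start).

    `events` is an iterable of (timestamp_seconds, ad_id) pairs.
    Windows are assumed to be tumbling 10 second windows.
    """
    counts = defaultdict(int)
    for timestamp, ad_id in events:
        campaign = ad_to_campaign.get(ad_id)
        if campaign is None:
            continue  # unknown ad: nothing to join against, so skip it
        # Align the timestamp to the start of its 10 second window.
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(campaign, window_start)] += 1
    return dict(counts)


# Invented sample events: (timestamp in seconds, ad id).
events = [(0.5, "ad-1"), (3.0, "ad-2"), (9.9, "ad-3"), (10.0, "ad-1"), (12.0, "ad-9")]
result = count_impressions(events)
```

Here `result` holds one count per campaign and window; the event for the unknown `ad-9` is dropped at the join step.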

Page 5: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Any questions so far?

Page 6: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Storm Code

Page 7: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Flink Code

Page 8: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Hardware Specs
• 10 Kafka brokers with 2 partitions each
• 10 compute nodes (Flink / Storm)
• Each machine has 1 Xeon [email protected] CPU
• 4 cores, 8 vCores (hyperthreading)
• 32 GB RAM (only 8GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and compute nodes

Page 9: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Logical Deployment

[Diagram: Data Generator → Kafka → Stream Processor (Source → Filter → Project → Join → Window → Sink) → Redis. A second Redis box is attached to the Join stage.]

Page 10: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Apache Storm Deployment

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Storm running Source → Filter → Project → Join → Window → Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 11: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the stages Source, Filter, Project, Join, Window, Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 12: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the stages Source / Filter (merged), Project, Join, Window, Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 13: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the stages Source / Filter / Project (merged), Join, Window, Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 14: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the stages Source / Filter / Project / Join (merged), Window, Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 15: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 16: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

[Diagram: Flink Data Generator, Kafka (three brokers shown), the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 17: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Apache Flink Deployment

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]
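The deployment diagrams show Source, Filter, Project and Join being merged step by step into a single stage ahead of the Shuffle. The slides do not spell out the reason, but the general idea of fusing adjacent stages can be sketched as below. This is a toy model in plain Python, not Flink code: the event fields, the `AD_TO_CAMPAIGN` table and the notion of counting "hand-offs" between stages are all assumptions made for illustration.

```python
# Hypothetical lookup table and event shape, for illustration only.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-B"}


def is_view(event):
    return event["type"] == "view"


def project(event):
    return (event["ad_id"], event["time"])


def join(pair):
    ad_id, time = pair
    return (AD_TO_CAMPAIGN[ad_id], time)


def run_unchained(events):
    """Run each stage separately and count records handed between stages."""
    handoffs = len(events)                      # source -> filter
    filtered = [e for e in events if is_view(e)]
    handoffs += len(filtered)                   # filter -> project
    projected = [project(e) for e in filtered]
    handoffs += len(projected)                  # project -> join
    joined = [join(p) for p in projected]
    handoffs += len(joined)                     # join -> shuffle
    return joined, handoffs


def run_chained(events):
    """Apply filter, project and join back to back on each record."""
    joined = [join(project(e)) for e in events if is_view(e)]
    return joined, len(joined)                  # only join -> shuffle remains


events = [
    {"type": "view", "ad_id": "ad-1", "time": 1},
    {"type": "click", "ad_id": "ad-1", "time": 2},
    {"type": "view", "ad_id": "ad-2", "time": 3},
]
unchained_out, unchained_handoffs = run_unchained(events)
chained_out, chained_handoffs = run_chained(events)
```

Both variants produce the same records; in this toy model the merged version hands over 2 records instead of 9.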

Page 18: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Processing Guarantees: Apples and Oranges
• Apache Storm: at-least-once semantics; double counting after failures; lost state after failures
• Apache Flink: exactly-once semantics; no double counting; no state loss
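The difference between the two columns can be illustrated with a toy counter that crashes once and then replays its input from the last checkpoint. This is a deliberately simplified model, not how Storm or Flink are implemented: the function name, the single checkpoint and the `restore_state` flag are all assumptions made for the sketch.

```python
def count_with_one_failure(events, checkpoint_at, crash_at, restore_state):
    """Count events, crash once, then replay from the checkpointed offset.

    checkpoint_at: input offset at which the running count is snapshotted.
    crash_at:      number of events processed before the crash
                   (must be greater than checkpoint_at).
    restore_state: True  -> the count is rolled back to the snapshot
                            before replay (exactly-once effect on state).
                   False -> the count already recorded is kept and the
                            replayed events are added again (at-least-once).
    """
    count = 0
    snapshot = 0
    for offset in range(crash_at):
        if offset == checkpoint_at:
            snapshot = count  # taken before the event at this offset is counted
        count += 1

    # Crash and recovery: the source rewinds to the checkpointed offset.
    if restore_state:
        count = snapshot

    for _ in events[checkpoint_at:]:
        count += 1
    return count


events = list(range(10))
exactly_once = count_with_one_failure(events, checkpoint_at=4, crash_at=7, restore_state=True)
at_least_once = count_with_one_failure(events, checkpoint_at=4, crash_at=7, restore_state=False)
```

With ten input events, the rolled-back run ends at 10, while the run without rollback ends at 13 because the three events between the checkpoint and the crash are counted twice.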

Page 19: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmark: Baseline

[Bar chart, throughput in msgs/sec (axis in millions, 0 to 4): Storm (Kafka, 1 GigE) labeled 0M; Flink (Kafka, 1 GigE) labeled 3M.]

Page 20: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Bottleneck Analysis: Apache Storm

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Storm running Source → Filter → Project → Join → Window → Sink with a Shuffle step, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 21: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

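Bottleneck Analysis: Apache Storm

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Storm running Source → Filter → Project → Join → Window → Sink with a Shuffle step, and Redis (two boxes). The bottleneck is marked with the label CPU. Legend: 10 GigE link, 1 GigE link.]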

Page 22: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Bottleneck Analysis: Apache Flink

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 23: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

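Bottleneck Analysis: Apache Flink

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). The bottleneck is marked with the label Network. Legend: 10 GigE link, 1 GigE link.]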
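To get a feel for what a network bottleneck means here, one can compute how many bytes per message a link can carry at a given message rate. The sketch below is only back-of-the-envelope arithmetic: the helper name is made up, protocol overhead is ignored, and the slides do not state the message size or how many 1 GigE links sit between Kafka and the compute nodes, so the link count is left as a parameter and two hypothetical values are tried.

```python
def bytes_per_msg_budget(link_gbits, msgs_per_sec, links=1):
    """Bytes available per message if `links` links of `link_gbits` Gbit/s
    are fully used at `msgs_per_sec` messages per second.

    Ignores protocol overhead; 1 Gbit/s is taken as 1e9 bits per second.
    """
    bytes_per_sec = link_gbits * 1e9 / 8 * links
    return bytes_per_sec / msgs_per_sec


# 3 million msgs/sec is the rate labeled for Flink (Kafka, 1 GigE).
# The link counts 1 and 10 are hypothetical, not taken from the slides.
budget_one_link = bytes_per_msg_budget(link_gbits=1, msgs_per_sec=3e6, links=1)
budget_ten_links = bytes_per_msg_budget(link_gbits=1, msgs_per_sec=3e6, links=10)
```

At 3 million msgs/sec this gives a budget of roughly 42 bytes per message over a single 1 GigE link, or roughly 417 bytes per message over ten such links.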

Page 24: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Eliminate the Bottleneck

[Diagram: Flink Data Generator, Kafka (three brokers shown), Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]

Page 25: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

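Eliminate the Bottleneck

[Diagram: the Kafka brokers are no longer shown. Remaining: Flink Data Generator, Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link, 1 GigE link.]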

Page 26: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

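Eliminate the Bottleneck

[Diagram: Data Generator, Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes); no Kafka brokers. Legend: 10 GigE link, 1 GigE link.]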

Page 27: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

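Apache Flink Deployment: Round 2

[Diagram: Data Generator, Apache Flink running the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes); no Kafka brokers. Legend: 10 GigE link, 1 GigE link.]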

Page 28: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmark: Baseline

[Bar chart, throughput in msgs/sec (axis in millions, 0 to 4): Storm (Kafka, 1 GigE) labeled 0M; Flink (Kafka, 1 GigE) labeled 3M.]

Page 29: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmark: Round 2

[Bar chart, throughput in msgs/sec (axis in millions, 0 to 16): Storm (Kafka, 1 GigE) labeled 0M; Flink (Kafka, 1 GigE) labeled 3M; Flink (DataGen, 10 GigE) labeled 15M. Annotation: 10 GigE end-to-end.]

Page 30: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Results
• Apache Flink achieved 15 million messages / sec on the Yahoo! benchmark
• Much stronger processing guarantees: exactly once
• 80x higher than what was reported in the original Yahoo! benchmark on similar hardware

Page 31: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Questions?

Page 32: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Apache Flink and MapR Streams

[Diagram: a MapR Cluster containing the Data Generator, MapR Streams (three shown), the merged stages Source / Filter / Project / Join and Window / Sink separated by a Shuffle, and Redis (two boxes). Legend: 10 GigE link.]

Page 33: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

MapR Benchmark Hardware Specs
• 10 MapR nodes, 3X data replication
• Each node has 1 Xeon E5-2660-v3 @ 2.60GHz CPU
• 10 cores, 20 vCores (hyperthreading)
• 16 vCores used for Flink on each node
• 256 GB RAM (only 8GB allocated to Flink)
• 40 GigE Ethernet between compute nodes

Page 34: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmarking on MapR HPC Cluster

[Bar chart, throughput in msgs/sec, 40 GigE end-to-end: a single series (labeled Series1) at 10 million msgs/sec (with 3x replication).]

Page 35: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmarking on MapR HPC Cluster

[Bar chart, throughput in msgs/sec (axis in millions, 0 to 80), 40 GigE end-to-end: Flink (MapR Streams) labeled 10M; Flink (w/ Data Generator) labeled 72M.]

Page 36: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Benchmarking Summary

[Bar chart, throughput in msgs/sec (axis in millions, 0 to 80): Storm (Kafka, 1 GigE) labeled 0M; Flink (Kafka, 1 GigE) labeled 3M; Flink (MapR, 40 GigE) labeled 10M; Flink (DataGen, 10 GigE) labeled 15M; Flink (DataGen, 40 GigE) labeled 72M.]
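The Flink figures on the summary chart can be compared directly as ratios. The short sketch below only does that arithmetic on the labeled values (in millions of msgs/sec); the dictionary and helper names are made up for illustration. The Storm bar is left out because its label, 0M, gives no usable ratio, and the runs differ in hardware and data source, so the ratios are not like-for-like comparisons.

```python
# Throughput labels from the summary chart, in millions of msgs/sec.
THROUGHPUT_M = {
    "Flink (Kafka, 1 GigE)": 3,
    "Flink (MapR, 40 GigE)": 10,
    "Flink (DataGen, 10 GigE)": 15,
    "Flink (DataGen, 40 GigE)": 72,
}


def ratio(numerator, denominator):
    """Ratio between the labeled throughputs of two chart series."""
    return THROUGHPUT_M[numerator] / THROUGHPUT_M[denominator]


# Removing Kafka and the 1 GigE link on the first cluster.
datagen_vs_kafka = ratio("Flink (DataGen, 10 GigE)", "Flink (Kafka, 1 GigE)")
# Direct data generation versus MapR Streams on the MapR cluster.
datagen_vs_mapr = ratio("Flink (DataGen, 40 GigE)", "Flink (MapR, 40 GigE)")
```

On these labels, direct data generation is 5x the Kafka run on the first cluster and 7.2x the MapR Streams run on the MapR cluster.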

Page 37: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

What’s missing?

[Bar chart, throughput in msgs/sec: Flink (Kafka, 10 GigE) and Flink (Kafka, 40 GigE), both marked ???.]

Page 38: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Results
• Apache Flink achieved 10 million messages / sec on the Yahoo! benchmark when paired with MapR Streams and a high-performance 10 node cluster
• On the same cluster hardware Apache Flink achieved 72 million messages / sec when using direct data generation

Page 39: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Storm Compatibility
• Lots of companies already have applications written using the Storm API
• Flink provides a Storm compatibility layer
• Run your Storm jobs on Flink with a one-line code change
• Flink also allows you to reuse your existing Storm spout and bolt code from a Flink job
• Give it a try!

Page 40: Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Thanks to MapR!

Special thanks to:
• Terry He
• Ted Dunning