
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And Kafka



Page 1: Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And Kafka

Implementing Back-Pressure With Akka Streams & Kafka

Akara Sucharitakul, PayPal

Page 2:

Intro & Agenda

Crawler Intro & Problem Statements

Crawler Architecture

Infrastructure: Akka Streams, Kafka, etc.

The Goodies

Page 3:

High-Level View (architecture diagram): Crawl Jobs, a Job DB, Validate, a URL Cache, and the Download Process, with URLs and timestamps flowing between the stages.

Page 4:

Requirements

Ever-expanding # of URLs

Can’t crawl all URLs at once

Control over concurrent web GETs

Efficient resource usage

Resilient under high burst

Scales horizontally & vertically

Page 5:

Sizing the Crawl Job

Let:
i = number of seed URLs in a job
n = average number of links per page
d = the crawl depth (how many layers to follow links)
u = the max number of URLs to process

Then: u = i × n^d

(Charts: totalURLs vs. depth at initialURLs = 1 and outLinks = 5, growing from 10^0 to ~10^7 over depths 0–12; and totalURLs vs. initialURLs at depth = 5 and outLinks = 5, growing from ~10^3 to ~10^11.)
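The exponential blow-up in the charts is easy to reproduce from the formula above; a minimal sketch in plain Scala (CrawlSizing and totalUrls are names invented here for illustration):

```scala
object CrawlSizing {
  // u = i * n^d: max URLs from i seed URLs, n out-links per page, depth d
  def totalUrls(i: Long, n: Long, d: Int): Long =
    i * math.pow(n.toDouble, d.toDouble).toLong

  def main(args: Array[String]): Unit = {
    println(totalUrls(1, 5, 5))   // 3125
    println(totalUrls(100, 5, 5)) // 312500
  }
}
```

With 100 seeds, 5 out-links per page, and depth 5 this already gives 312,500 URLs, the totalCrawled figure in the back-pressure test later in the deck.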

Page 6:

The Reactive Manifesto

Responsive

Message Driven

Elastic

Resilient

Page 7:

Why Does it Matter?

Responds in a deterministic, timely manner

Stays responsive in the face of failure – even cascading failures

Stays responsive under workload spikes

Basic building block for responsive, resilient, and elastic systems


Page 8:

The Right Ingredients

• Kafka
  • Huge persistent buffer for the bursts
  • Load distribution to a very large number of processing nodes
  • Enables horizontal scalability

• Akka Streams
  • High-performance, highly efficient processing pipeline
  • Resilient, with end-to-end back-pressure
  • Fully asynchronous; utilizes mapAsyncUnordered with an async HTTP client

• Async HTTP client
  • Non-blocking; consumes no threads while waiting
  • Integrates with Akka Streams for a high-parallelism, low-resource solution

(Diagram: Akka Streams, Async HTTP, and Reactive Kafka overlapping at Efficient, Resilient, and Scale.)
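A minimal sketch of this pairing (illustrative only, not the crawler's actual code; fetch is a hypothetical stand-in for a non-blocking HTTP GET through an async HTTP client):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object CrawlSketch extends App {
  implicit val system = ActorSystem("crawl-sketch")
  implicit val mat = ActorMaterializer()
  import system.dispatcher

  // Hypothetical stand-in: the real call returns a Future without
  // parking a thread while waiting on the network.
  def fetch(url: String): Future[String] = Future(s"<html>$url</html>")

  Source(List("http://a", "http://b", "http://c"))
    // At most 100 GETs in flight at once. Results are emitted as they
    // complete, so a slow URL never holds up fast ones, and back-pressure
    // pauses the source whenever all 100 slots are busy.
    .mapAsyncUnordered(100)(fetch)
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())
}
```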

Page 9:

Adding Kafka & Akka Streams (architecture diagram): the same high-level view – Crawl Jobs, Job DB, Validate, URL Cache, Download Process – with Kafka carrying the URLs and Akka Streams driving the download process.

Page 10:

Akka Streams, what???

High performance, pure async, stream processing

Conforms to reactive streams

Simple, yet powerful GraphDSL allows clear stream topology declaration

Central point to understand processing pipeline

Page 11:

Crawl Stream

Actual Stream Declaration in Code

  prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
                                                         bCast ~> dataSinkFlow ~> kafkaDataSink
                                                         bCast ~> hdfsDataSink
                                                         bCast ~> graphFlow ~> merge ~> graphSink
                                     bCast0 ~> maxPage ~> merge
                                     bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
                                                        bCastRetry ~> errorSink

(Stream topology diagram: Prioritized Source → Crawl → Result, branching through Max Page Reached, Retry, Out Links, Data, Graph, Check Fail, and Check Err into the OutLinks Sink, Kafka Data Sink, HDFS Data Sink, Graph Sink, and Error Sink.)

Page 12:

Resulting Characteristics

Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High-latency URLs do not block low-latency URLs, thanks to mapAsyncUnordered
• Well-controlled download concurrency using mapAsyncUnordered
• Thread per concurrent crawl job

Resilient
• Processes only what can be processed – no resource overload
• Kafka as a short-term, persistent queue

Scale
• Kafka feeds the next batch of URLs to an available client
• Pull model – only processes that have capacity will get the load
• Kafka distributes work to a large number of processing nodes in the cluster

Page 13:

Back-Pressure

(Charts: queue size climbing to ~120,000 entries and crawl throughput ranging 0–400 URLs/sec, both over a ~700-second run.)

Test parameters:
seedURLs : 100
parallelism : 1000
processTime : 1 – 5 s
outLinks : 0 – 10
depth : 5
totalCrawled : 312500

Page 14:

Challenges

Training
• Developers not used to E2E stream definitions
• More familiar with deeply nested function calls

Maturity of Infrastructure
• Kafka 0.9 uses fetch as the heartbeat
• Slow nodes cause timeouts & rebalancing
• Solved in 0.10.0.1

Page 15:

What it would have been…

Bloated, ineffective concurrency control

Lack of well-thought-out and visible processing pipeline

Clumsy code, hard to manage & understand

Low training cost, but high project TCO (Dev / Support / Maintenance)

Page 16:

Bottom Line

Crawl Time Reduced to 1/10th (compared to thread-based architecture)

Page 17:

Standardized Reactive Platform For Large-Scale Internet Deployments

Page 18:

Efficiency & Resilience meets Standardization

• Monitoring
  • Need to collect metrics, consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.

Consistency in the face of Heterogeneity

Page 19:

squbs is not…

A framework of its own

A programming model – use Akka

Take all or none – Components/patterns can mostly be used independently

Page 20:

squbs: Akka for large-scale deployments

Bootstrap

Lifecycle management

Loosely-coupled module system

Integration hooks for logging, monitoring, ops integration

Page 21:

squbs: Akka for large-scale deployments

JSON console

HttpClient with pluggable resolver and monitoring/logging hooks

Test tools and interfaces

Goodies:
- Activators for Scala & Java
- Programming patterns and helpers for Akka and Akka Streams use cases
…, and growing

Page 22:

Performance

Akka is principally designed for great performance & scalability

Actor scheduling, dispatcher, message batching
• throughput parameter

The job for squbs is adding ops functionality without impacting performance

Page 23:

squbs Performance for Akka HTTP

Pipeline (auth, logging, etc.) built as Akka Streams BidiFlow

Pipeline supplied by infra or app

App code supplies Flow or Route

Pipeline and app flow baked into single Request/Response Flow

Fully utilizes Akka-Stream fusing

Zero Overhead for Given Functionality
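A sketch of the BidiFlow idea (illustrative types only; in the real pipeline these would be HttpRequest/HttpResponse, and the stage would do auth, logging, etc.):

```scala
import akka.NotUsed
import akka.stream.scaladsl.{BidiFlow, Flow}

object PipelineSketch {
  // Illustrative aliases standing in for HttpRequest/HttpResponse.
  type Request = String
  type Response = String

  // One pipeline stage (here: logging) as a BidiFlow: requests pass
  // through the top flow, responses through the bottom flow.
  val logging: BidiFlow[Request, Request, Response, Response, NotUsed] =
    BidiFlow.fromFlows(
      Flow[Request].map { req => println(s"--> $req"); req },
      Flow[Response].map { resp => println(s"<-- $resp"); resp })

  // The app supplies a plain request-to-response Flow (or a Route).
  val app: Flow[Request, Response, NotUsed] =
    Flow[Request].map(req => s"200 OK for $req")

  // Pipeline and app baked into a single request/response Flow;
  // with Akka Streams fusing, it runs as one stage.
  val handler: Flow[Request, Response, NotUsed] = logging.join(app)
}
```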

Page 24:

Application Performance Tips

Kafka – a great firehose, but with a bit of latency

Only convert byte ↔ char when absolutely needed

Work with ByteString if you can

Need better facilities (ByteString JSON parser, etc.)

Remember: the app has the biggest impact on performance, not tuning
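The byte-vs-char tip with Akka's ByteString, as a small illustrative sketch:

```scala
import akka.util.ByteString

object ByteStringSketch extends App {
  // Network chunks arrive as bytes; keep them as ByteString.
  val chunks = List(ByteString("{\"url\":"), ByteString("\"http://a\"}"))

  // Concatenation composes the underlying byte arrays;
  // no byte-to-char decoding happens here.
  val body: ByteString = chunks.foldLeft(ByteString.empty)(_ ++ _)

  // Convert byte -> char exactly once, at the edge, when truly needed.
  println(body.utf8String) // {"url":"http://a"}
}
```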

Page 25:

Akka Performance Tuning

Parallelism Factor: 1.0 is optimal
• Minimizes context switches

Use other dispatcher for blocking

Test your throughput setting

Try and test different GC settings
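In application.conf terms, these suggestions look roughly like this (a sketch; my-blocking-dispatcher is a made-up name for the dispatcher that takes the blocking calls, and the pool size depends on your workload):

```hocon
akka.actor.default-dispatcher {
  fork-join-executor {
    # 1.0 means roughly one thread per core: fewer context switches.
    parallelism-factor = 1.0
  }
  # Messages an actor may process before yielding its thread;
  # test your own throughput setting.
  throughput = 5
}

# Keep blocking calls off the default dispatcher.
my-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor.fixed-pool-size = 16
}
```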

Page 26:

The Goodies

Page 27:

PerpetualStream

• Provides a convenience trait to help write streams controlled by the system lifecycle
  • Minimal/no message losses
• Register the PerpetualStream to make the stream start/stop with the system
• Provides customization hooks – especially for how to stop the stream
• Provides killSwitch (from Akka) to be embedded into the stream
• Implementers – just provide your stream!

A non-stop stream; starts and stops with the system:

class MyStream extends PerpetualStream[Future[Done]] {

  def generator = Iterator.iterate(0) { p =>
    if (p == Int.MaxValue) 0 else p + 1
  }

  val source = Source.fromIterator(generator _)

  // Sink.ignore takes no type parameter; it materializes a Future[Done],
  // hence PerpetualStream[Future[Done]] above.
  val ignoreSink = Sink.ignore

  override def streamGraph = RunnableGraph.fromGraph(
    GraphDSL.create(ignoreSink) { implicit builder => sink =>
      import GraphDSL.Implicits._
      source ~> killSwitch.flow[Int] ~> sink
      ClosedShape
    })
}

Page 28:

PersistentBuffer / BroadcastBuffer

• Data & indexes in rotating memory-mapped files
  • Off-heap rotating file buffer – very large buffers
  • Restarts gracefully with no or minimal message loss
  • Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer – a FanOutShape that decouples the output ports, making each downstream independent
  • Useful if a downstream stage is blocked or unavailable, e.g. Kafka is unavailable/rebalancing but the system cannot back-pressure/deny incoming traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue

A buffer of virtually unlimited size
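A usage sketch, assuming the squbs-pattern API (PersistentBuffer and QueueSerializer from org.squbs.pattern.stream; the exact signatures here are assumptions, so consult the squbs docs):

```scala
import java.io.File
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import org.squbs.pattern.stream.{PersistentBuffer, QueueSerializer}

object BufferSketch extends App {
  implicit val system = ActorSystem("buffer-sketch")
  implicit val mat = ActorMaterializer()

  // Serializer for the element type (assumed API).
  implicit val serializer = QueueSerializer[Long]()

  // Buffer backed by rotating memory-mapped files in the given directory.
  val buffer = new PersistentBuffer[Long](new File("/tmp/crawl-buffer"))

  // Upstream is only back-pressured by the speed of the file writes;
  // a stalled downstream (e.g. a rebalancing Kafka sink) does not
  // stall ingest.
  Source(1L to 1000000L)
    .via(buffer.async)
    .runWith(Sink.ignore)
}
```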

Page 29:

Summary

• Kafka + Akka Streams + Async I/O = ideal architecture for high bursts & high efficiency
• Akka Streams
  • Clear view of the stream topology
  • Back-pressure & Kafka allow buffering of load bursts
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: have the cake, and eat it too
  • Functionality without sacrificing performance
  • Goodies like PerpetualStream, PersistentBuffer, & BroadcastBuffer

Page 30:

Q&A – Feedback Appreciated

Join us on – link from https://github.com/paypal/squbs

@squbs, @S_Akara

Page 31: