
Kafka at Peak Performance




Todd Palino
Staff Site Reliability Engineer
LinkedIn, Data Infrastructure Streaming


Who Am I?


Kafka At LinkedIn

1100+ Kafka brokers, over 32,000 topics, 350,000+ partitions
875 billion messages per day, 185 terabytes in, 675 terabytes out
Peak load (whole site):
– 10.5 million messages/sec
– 18.5 gigabits/sec inbound
– 70.5 gigabits/sec outbound

1800+ Kafka brokers, over 79,000 topics, 1,130,000+ partitions
1.3 trillion messages per day, 330 terabytes in, 1.2 petabytes out
Peak load (single cluster):
– 2 million messages/sec
– 4.7 gigabits/sec inbound
– 15 gigabits/sec outbound


What Will We Talk About?

Picking Your Hardware

Monitoring the Cluster

Triaging Broker Performance Problems

Conclusion


Hardware Selection


What’s Important To You?

Message Retention - Disk size

Message Throughput - Network capacity

Producer Performance - Disk I/O

Consumer Performance - Memory


Go Wide

Kafka is well-suited to horizontal scaling

RAIS - Redundant Array of Inexpensive Servers

Also helps with CPU utilization
– Kafka needs to decompress and recompress every message batch
– KIP-31 will help with this by eliminating recompression

Don’t co-locate Kafka


Disk Layout

RAID
– Can survive a single disk failure (not RAID 0)
– Provides the broker with a single log directory
– Eats up disk I/O

JBOD (see the log.dirs sketch below)
– Gives Kafka all the disk I/O available
– Broker is not smart about balancing partitions
– If one disk fails, the entire broker stops

Amazon EBS performance works!
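A minimal sketch of how the two layouts differ in server.properties; the mount points here are hypothetical:

  # RAID: one volume, so the broker gets a single log directory
  log.dirs=/mnt/kafka-data

  # JBOD: one log directory per disk; the broker spreads new partitions across them
  log.dirs=/mnt/disk1/kafka,/mnt/disk2/kafka,/mnt/disk3/kafka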


Operating System Tuning

Filesystem Options
– EXT4 or XFS
– Using unsafe mount options (see the fstab sketch below)

Virtual Memory
– Swappiness
– Dirty Pages

Networking
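To illustrate the “unsafe mount options” point, a hypothetical fstab entry for a Kafka data volume, trading some crash safety for lower latency (validate against your own durability needs):

  /dev/sdb1  /mnt/kafka-data  ext4  defaults,noatime,data=writeback,nobarrier  0 2

The virtual memory and networking sysctls we use are listed in the OS Tuning Parameters appendix.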


Java

Only use JDK 8 now

Keep heap size small
– Even our largest brokers use a 6 GB heap
– Save the rest for page cache

Garbage Collection - G1 all the way
– Basic tuning only
– Watch for humongous allocations
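Kafka’s start scripts honor the KAFKA_HEAP_OPTS and KAFKA_JVM_PERFORMANCE_OPTS environment variables, so a sketch of applying a small heap plus G1 looks like this (the full flag set is in the JDK Options appendix):

  export KAFKA_HEAP_OPTS="-Xmx6g -Xms6g"
  export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"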


How Much Do You Need?


Buy The Book!

Early Access available now.

Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.

Also discusses stream processing and other use cases.


Kafka Cluster Sizing

How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O

How big for your aggregate cluster?
– In general, multiply the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
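A back-of-the-envelope disk sizing sketch, with every number assumed for illustration:

  100 MB/sec average inbound × 86,400 sec/day × 7 days retention ≈ 60 TB of log data
  × 2 replication factor ≈ 120 TB on disk
  ÷ 6 TB usable per broker (10 TB of disk, staying under ~60% full) ≈ 20 brokers

Run the same arithmetic for network: peak inbound plus replication and consumer traffic outbound must fit within each broker’s interfaces.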


Topic Configuration

Partition Counts for Local
– Many theories on how to do this correctly, but the answer is “it depends”
– How many consumers do you have?
– Do you have specific partition requirements?
– Keep partition sizes manageable

Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters

Message Retention
– If aggregate is where you really need the messages, only retain them in local long enough to cover mirror maker problems
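Two illustrative partition-count checks, numbers assumed: a 100 MB/sec topic whose consumers each process about 10 MB/sec needs at least 10 partitions to let consumers keep up, and at 10 MB/sec per partition a 4-day retention means roughly 3.5 TB per partition, which argues for more partitions or less retention. A local-cluster retention override might look like this (0.9-era syntax, names hypothetical):

  kafka-topics.sh --zookeeper zk.example.com:2181 --alter --topic page-views --config retention.ms=3600000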


Possible Broker Improvements

Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings

JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure


Administrative Improvements

Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths

Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)

End-to-end availability monitoring


Keeping An Eye On Things


Monitoring The Foundation

CPU Load

Network inbound and outbound

Filehandle usage for Kafka

Disk
– Free space - where you write logs, and where Kafka stores messages
– Free inodes
– I/O performance - at least average wait and percent utilization (commands sketched below)

Garbage Collection
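A quick sketch of the shell checks behind the disk items above (iostat is from the sysstat package; paths and devices are assumptions):

  df -h /var/log/kafka /mnt/kafka-data   # free space for application logs and message logs
  df -i /mnt/kafka-data                  # free inodes
  iostat -x 5                            # per-device average wait (await) and %util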


Broker Ground Rules

Tuning (see the server.properties sketch below)
– Stick (mostly) with the defaults
– Set default cluster retention as appropriate
– Default partition count should be at least the number of brokers

Monitoring
– Watch the right things
– Don’t try to alert on everything

Triage and Resolution
– Solve problems, don’t mask them
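A sketch of those tuning defaults in server.properties; the values are placeholders, not recommendations:

  log.retention.hours=168      # default cluster retention (one week)
  num.partitions=12            # default partition count, at least the number of brokers
  default.replication.factor=2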


Too Much Information!

Monitoring teams hate Kafka
– Per-Topic metrics
– Per-Partition metrics
– Per-Client metrics

Capture as much as you can
– Many metrics are useful while triaging an issue

Clients want metrics on their own topics

Only alert on what is needed to signal a problem


Broker Monitoring

Bytes In and Out, Messages In
– Why not messages out?

Partitions
– Count and Leader Count
– Under Replicated and Offline

Threads
– Network pool, Request pool
– Max Dirty Percent

Requests
– Rates and times - total, queue, local, and send


Topic Monitoring

Bytes In, Bytes Out
Messages In, Produce Rate, Produce Failure Rate
Fetch Rate, Fetch Failure Rate

Partition Bytes

Log End Offset
– Why bother?
– KIP-32 will make this unnecessary

Quota Throttling

Provide this to your customers for them to alert on


Client Monitoring

For consumers, use Burrow
– Monitor all partitions for all consumers
– Provides an easy-to-digest “good, warning, bad” state, with detail available
– Fast and free

Producers are a little harder
– Several internal implementations of message auditing
– The community needs a good open source standard

Cluster availability monitoring
– kafka-monitoring is coming soon from LinkedIn!
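Burrow is queried over HTTP; a sketch of checking one group’s status (the path follows Burrow’s v2 API at the time, and the host, port, cluster, and group names are all assumptions):

  curl http://burrow.example.com:8000/v2/kafka/local-cluster/consumer/my-group/status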


It’s Broken! Now What?


All The Best Ops People…

Know more of what is happening than their customers

Are proactive

Fix bugs, don’t work around them

This applies to our developers too!


Anticipating Trouble

Trend cluster utilization and growth over time

Use default configurations for quotas and retention that require customers to talk to you before they get more

Monitor request times
– If you can develop a consistent baseline, this is an early warning


Under Replicated Partitions

A count of the partitions that are not fully replicated within the cluster

Also referred to as “replica lag”

Primary indicator of problems within the cluster
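Besides the UnderReplicatedPartitions MBean (listed in the appendix), the stock admin tool can enumerate the affected partitions; the ZooKeeper address is a placeholder:

  kafka-topics.sh --zookeeper zk.example.com:2181 --describe --under-replicated-partitions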


Broker Performance Checks

Are you still running 0.8?

Are all the brokers in the cluster working?

Are the network interfaces saturated? (remediation tools sketched below)
– Reelect partition leaders
– Rebalance partitions in the cluster
– Spread out traffic more (increase partitions or brokers)

Is the CPU utilization high? (especially iowait)
– Is another process competing for resources?
– Look for a bad disk

Do you have really big messages?
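The leader-reelection and rebalance remedies map to the stock tools, sketched here with 0.9-era invocations and placeholder names; the kafka-assigner script in linkedin/kafka-tools (linked later) can generate reassignments for you:

  kafka-preferred-replica-election.sh --zookeeper zk.example.com:2181
  kafka-reassign-partitions.sh --zookeeper zk.example.com:2181 --reassignment-json-file move.json --execute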


Kafka’s OK, Now What?

If Kafka is working properly, it’s probably a client issue
– Don’t throw it over the fence. Help your customers understand

Common producer issues (configs sketched below)
– Batch size and linger time
– Receive and send buffers
– Sync vs. async, and acknowledgements

Common consumer issues
– Garbage collection problems
– Min fetch bytes and max wait time
– Not enough partitions
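A sketch of the client knobs above as properties files; every value is an assumed illustration of the trade-off, not a recommendation:

  # producer.properties
  batch.size=65536            # bigger batches amortize compression and request overhead
  linger.ms=10                # wait briefly so batches can fill
  acks=all                    # durability vs. latency; acks=1 is faster but riskier
  send.buffer.bytes=1048576
  receive.buffer.bytes=1048576

  # consumer.properties
  fetch.min.bytes=1024        # don't respond until this much data is ready...
  fetch.max.wait.ms=500       # ...or this much time has passed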


Conclusion


One Ecosystem

Kafka can scale to millions of messages per second, and more
– Operations must scale the cluster appropriately
– Developers must use the right tuning and go parallel

Few problems are owned by only one side
– Expanding partitions often requires coordination
– Applications that need higher reliability drive cluster configurations

Either we work together, or we fail separately


Would You Like To Know More?

Presentations: http://www.slideshare.net/toddpalino
– More Datacenters, More Problems
– Kafka As A Service
– Always download the originals for slide notes!

Blog Posts: https://engineering.linkedin.com/blog
– Development and SRE blogs on Kafka and other topics

LinkedIn Open Source: https://github.com/linkedin/streaming
– Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
– Kafka Admin Tools - https://github.com/linkedin/kafka-tools


Getting Involved With Kafka

http://kafka.apache.org

Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org

irc.freenode.net - #apache-kafka

Meetups
– Apache Kafka - http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/

Contribute code


Data @ LinkedIn is Hiring!

Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next Generation change capture technology (incubating)

LinkedIn
– Strong commitment to open source
– Do cool things and work with awesome people

Join us in working on cutting edge stream processing infrastructures
– Please contact [email protected]
– Software developers and Site Reliability Engineers at all levels


Appendix


JDK Options

Heap Size:
-Xmx6g -Xms6g

Metaspace:
-XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80

G1 Tuning:
-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M

GC Logging:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc

Error Handling:
-XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log


OS Tuning Parameters

Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024


OS Tuning Parameters (cont.)

Virtual Memory:
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5


Kafka Broker Sensors

kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics


Kafka Broker Sensors - Topics

kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*