Twitter @instaclustr [email protected] instaclustr.com Lessons Learned from Building an Apache Kafka Managed Service


Introduction

● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra

● Our platform provides automated provisioning, monitoring and management

● Available on AWS, GCP, Azure and IBM Cloud

● Managed Apache Kafka released May 21st


Agenda

● Context - our offering and development process

● Hardware choice and benchmarking

● Topic and user management

● Broker security configuration

● Monitoring

● Backup and Restore


Instaclustr Managed Kafka - Key Features

● Preview Release available:
  ○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
  ○ Broker monitoring
  ○ Instaclustr monitoring and provisioning API support
  ○ Private network clusters (AWS only)
  ○ Run in your cloud provider account or ours
  ○ Topic management via a custom CLI tool


Instaclustr Managed Kafka - Key Features

● For GA (end June):
  ○ SOC2 compliant
  ○ User & credential management
  ○ More cluster config options
  ○ Topic-level and synthetic transaction monitoring
  ○ Infrastructure config tuning


Instaclustr Managed Kafka - Development Process

● First customer requests 2016

● Internal infrastructure deployment and usage of Kafka mid 2017

● Managed service platform development commenced November 2017

● Early access program with 4 customers commenced December 2017

● Public preview release 21 May 2018

● GA expected 25 June 2018


Hardware Choice and Benchmarking - GP2 vs ST1

● Disk Type
  ○ AWS benchmark - r4.large with 500 GB of disk
    ■ 1 x 500 GB ST1 volume
    ■ 10 x 50 GB GP2 volumes in RAID0 configuration
  ○ Avg 10% improved throughput with ST1 vs GP2 EBS
  ○ ST1 is 45% of the cost of GP2
  ○ Non-RAIDed mount simplifies re-sizing EBS volumes

Type   Writes (m/s)   Reads (m/s)   Mixed (m/s)
ST1    223,851        149,506       W: 171,305 / R: 49,898
GP2    203,409        127,127       W: 162,966 / R: 44,869
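
The relative gains in the table above can be checked with a few lines of arithmetic. This is a minimal sketch that takes the published figures at face value and assumes "m/s" means messages per second; the names `ST1`, `GP2` and `pct_gain` are illustrative and not part of the original benchmark tooling.

```python
# Throughput figures copied from the benchmark table (assumed to be messages/second).
ST1 = {"writes": 223_851, "reads": 149_506, "mixed_w": 171_305, "mixed_r": 49_898}
GP2 = {"writes": 203_409, "reads": 127_127, "mixed_w": 162_966, "mixed_r": 44_869}


def pct_gain(new: float, base: float) -> float:
    """Percentage by which `new` exceeds `base`."""
    return (new - base) / base * 100.0


# Relative advantage of ST1 over GP2 for each workload in the table.
gains = {workload: pct_gain(ST1[workload], GP2[workload]) for workload in ST1}

if __name__ == "__main__":
    for workload, gain in gains.items():
        print(f"{workload}: ST1 is {gain:.1f}% faster than GP2")
```

The per-workload gains differ, so the "Avg 10%" figure on the slide is best read as a rounded summary rather than an exact value.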


[Figure panels: ST1, GP2]


Provider Comparison


Hardware Choice and Benchmarking - SSL vs non-SSL

● Encryption enabled on broker-to-broker and client-to-broker
  ○ AWS benchmark - r4.large with 1500 GB ST1 disk
  ○ 512 byte messages
  ○ ~30% decrease in throughput with Broker and Client SSL enabled

● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
  ○ 50% increased throughput in writes
  ○ 80% increased throughput in reads


Hardware Choice and Benchmarking - Number of Topics

● Possible urban myth that increasing topics reduces performance

● However, more topics = more partitions

● A high partition count significantly slows recovery time from node failure
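
The link between topic count and recovery time can be illustrated by counting how many partition replicas each broker ends up hosting. This is a rough sketch only: it assumes replicas are spread evenly, and the partitions-per-topic, replication factor and broker count used below are hypothetical values, not the settings used in the benchmark.

```python
def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas hosted by each broker,
    assuming replicas are spread evenly across the cluster."""
    if brokers < replication_factor:
        raise ValueError("need at least as many brokers as the replication factor")
    return topics * partitions_per_topic * replication_factor / brokers


if __name__ == "__main__":
    # Hypothetical cluster: 6 brokers, 3 partitions per topic, replication factor 3.
    for topics in (10, 100, 1000, 5000):
        n = replicas_per_broker(topics, 3, 3, 6)
        print(f"{topics:>5} topics -> {n:>7.0f} replicas per broker")
```

Under these assumptions the number of replicas a failed broker leaves behind grows linearly with the number of topics, which is consistent with the slower recovery noted above.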

[Figure panels: 10 Topics, 100 Topics, 1000 Topics, 5000 Topics]


Hardware Choice and Benchmarking - Colocated Zookeeper

● Often recommended to host Zookeeper separately to Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
  ○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters


Hardware Choice and Benchmarking - Colocated Zookeeper

[Figure panels: Consumer Rate - Separate, Consumer Rate - Colocated]

● 6 node cluster with broker restart
  ○ Similar results with dedicated Zookeeper disk vs. shared


Topic and User Configuration Management

● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● Felt that providing access to Zookeeper was a risk

● Solutions
  ○ Developed command line tool to use Kafka API for topic configuration
    https://github.com/instaclustr/ic-kafka-tools
    ■ Future: Console UI support?
    ■ Value topic configuration versioning and management
  ○ Adding user management to Instaclustr Console
    ■ Additional authentication required
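
Managing topics through the Kafka admin API, rather than through Zookeeper, can be sketched as follows. This is not the implementation of ic-kafka-tools; it is an illustration that assumes the third-party kafka-python package is installed, and the helper names `topic_request` and `create_topic` are hypothetical.

```python
def topic_request(name, partitions, replication_factor, configs=None):
    """Describe a topic as plain data so it can be versioned alongside code."""
    if partitions < 1 or replication_factor < 1:
        raise ValueError("partitions and replication_factor must be >= 1")
    return {
        "name": name,
        "num_partitions": partitions,
        "replication_factor": replication_factor,
        "topic_configs": dict(configs or {}),
    }


def create_topic(bootstrap_servers, request):
    """Create the topic by talking to the brokers; no Zookeeper connection is needed."""
    # Imported lazily so topic_request() works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([NewTopic(**request)])
    finally:
        admin.close()
```

A call such as `create_topic("broker1:9092", topic_request("events", 6, 3, {"retention.ms": "604800000"}))` would then create the topic, where the broker address and topic name are placeholders. Keeping the request as plain data is one way to get the configuration versioning mentioned above.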


Broker Security Configuration

● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
  ○ Used for client->broker
  ○ Broker->broker uses SASL plaintext

● Using SASL plaintext authentication
  ○ Used for broker->broker
  ○ Were planning on integrating SCRAM authentication, but dynamic configuration still requires broker restart
  ○ Instead planning on short-lived signed broker keys as dynamic configuration does not require restart
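
On the client side, SCRAM authentication comes down to a handful of standard Kafka client properties. The sketch below renders them as a properties file. The slides do not say which SCRAM variant or transport is used, so the defaults here (`SCRAM-SHA-256` over `SASL_SSL`) are assumptions, and the function name is hypothetical.

```python
def scram_client_properties(username: str, password: str,
                            mechanism: str = "SCRAM-SHA-256",
                            security_protocol: str = "SASL_SSL") -> str:
    """Render Kafka client properties for SASL/SCRAM authentication."""
    if mechanism not in ("SCRAM-SHA-256", "SCRAM-SHA-512"):
        raise ValueError(f"unsupported SCRAM mechanism: {mechanism}")
    # JAAS entry naming Kafka's SCRAM login module and the client credentials.
    jaas = (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )
    lines = [
        f"security.protocol={security_protocol}",
        f"sasl.mechanism={mechanism}",
        f"sasl.jaas.config={jaas}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(scram_client_properties("example-user", "example-password"))
```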


Broker Security Configuration

● Access to managed clusters
  ○ Public IPs and whitelisting in firewall (security group or equivalent)
  ○ Private IPs with VPC Peering (or equivalent in other cloud providers)
  ○ Private Network Clusters where nodes are not allocated public IPs and gateway box is used for admin access
  ○ Don’t expose Zookeeper through firewall due to weak security model
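
The whitelisting rule above can be expressed as a small decision function. This is a simplified model rather than any cloud provider's security group API: it assumes the default Kafka and Zookeeper ports (9092 and 2181), and the function name and example addresses are illustrative.

```python
import ipaddress

KAFKA_PORT = 9092       # default broker listener port (assumed)
ZOOKEEPER_PORT = 2181   # default Zookeeper client port (assumed)


def is_allowed(client_ip: str, port: int, whitelist: list[str]) -> bool:
    """Return True if the firewall should accept this inbound connection."""
    if port == ZOOKEEPER_PORT:
        return False  # Zookeeper is never exposed, whatever the source address
    if port != KAFKA_PORT:
        return False  # only the broker port is opened to clients
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in whitelist)


if __name__ == "__main__":
    allowed = ["203.0.113.0/24"]  # documentation range used as a placeholder
    print(is_allowed("203.0.113.10", KAFKA_PORT, allowed))
    print(is_allowed("203.0.113.10", ZOOKEEPER_PORT, allowed))
```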


Monitoring

● Metrics exposed via JMX
  ○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
  ○ Basics: service state, disk usage / free space, server still exists
  ○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
    ■ Active controller alert is very sensitive; we are re-assessing alert thresholds
  ○ Synthetic transactions: publish and consume message to controlled topic, measure success and latency
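
The three Kafka-level alert conditions listed above can be sketched as a simple rule check over a cluster-wide snapshot. The values would typically come from the JMX metrics `OfflinePartitionsCount`, `ActiveControllerCount` and `UnderReplicatedPartitions`, summed across brokers; the dictionary keys and function name below are hypothetical, and this is not the collection agent described in the slides.

```python
def evaluate_alerts(cluster_metrics: dict) -> list[str]:
    """Return the alerts that should fire for one cluster-wide metrics snapshot.

    Expected keys (each summed across all brokers):
      offline_partitions, active_controllers, under_replicated_partitions
    """
    alerts = []
    if cluster_metrics["offline_partitions"] > 0:
        alerts.append("offline partitions")
    if cluster_metrics["active_controllers"] != 1:
        alerts.append("active controller count != 1")
    if cluster_metrics["under_replicated_partitions"] > 0:
        alerts.append("under-replicated partitions")
    return alerts
```

A rule evaluated on a single snapshot like this fires on any momentary deviation, which may be part of why the active controller alert is described as sensitive; requiring the condition to persist over several samples is one possible adjustment.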


Monitoring

● Central Logging
  ○ Fleet logs transferred via Kafka to an Elassandra cluster
  ○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
  ○ Kafka experience in this project has been very positive

● Only issue
  ○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
  ○ We weren’t monitoring consumer lag closely enough
  ○ Increased consumer session and request timeouts
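
Consumer lag, the metric that was not being watched closely enough, is the gap between each partition's log end offset and the group's committed offset. A minimal sketch of that calculation follows; the function names and the idea of a fixed threshold are illustrative assumptions. The timeouts mentioned above correspond to the consumer settings `session.timeout.ms` and `request.timeout.ms`, whose chosen values are not given in the slides.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset.

    A partition with no committed offset is treated as fully lagging.
    """
    return {
        partition: end - committed.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(end_offsets: dict, committed: dict, threshold: int) -> list:
    """Partitions whose lag exceeds `threshold` messages, in sorted order."""
    lag = consumer_lag(end_offsets, committed)
    return sorted(p for p, n in lag.items() if n > threshold)
```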


Backup and Restore

● Internet wisdom = Kafka backups are not a thing
  ○ Rely on replication within cluster or MirrorMaker replication to another cluster
● Cassandra experience says backups are valuable
  ○ Hardware failure is not an issue but corruption due to app bugs or user error can occur and be spread by replication

● Future
  ○ Working on regular automated backup and restore of topic and security configuration
  ○ Consider using Kafka Connect to write important messages to offline backup
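
A backup of topic configuration could be as simple as a versionable JSON snapshot plus a comparison step before restore. The sketch below shows that idea only: it is not the planned Instaclustr feature, the function names are hypothetical, and it assumes the topic settings have already been read from the cluster into a dictionary.

```python
import json


def backup_topic_config(topics: dict) -> str:
    """Serialise topic settings to a stable JSON document suitable for offline storage."""
    return json.dumps(topics, indent=2, sort_keys=True)


def restore_plan(backup_json: str, current: dict) -> dict:
    """Compare a backup with the live cluster and list what a restore would change."""
    wanted = json.loads(backup_json)
    return {
        "create": sorted(t for t in wanted if t not in current),
        "update": sorted(t for t in wanted if t in current and wanted[t] != current[t]),
    }
```

Producing a plan before applying it gives an operator the chance to review changes, which matters when the goal is to recover from user error rather than repeat it.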


Thanks for listening!

● Currently in Preview
● Would love any feedback, suggestions or just telling us what we missed
● 14-day free trial option (no CC needed) - console.instaclustr.com