New age Distributed Messaging Kafka & Concepts explored !! Dileep Varma Kalidindi Nov 2014

Distributed messaging through Kafka


Page 1: Distributed messaging through Kafka

New age Distributed Messaging

Kafka & Concepts explored !!
Dileep Varma Kalidindi
Nov 2014

Page 2: Distributed messaging through Kafka

05/03/2023 Confidential 2

Who Am I ?

Name: Dileep Varma Kalidindi

Status: Senior Engineer @Responsys (since Apr’14), Circles Team.

Fascination: Problem solving, distributed & BigData churning systems.

Past: 8+yrs with VeriSign, Informatica Labs, NTT Data.

Hobbies: Jumping (Water & Air)

Page 3: Distributed messaging through Kafka


What is brewing today ?

Responsys Technology Road Map.

Data off the limits - Handling & Processing BigData

Scope for New Age capabilities (in distributed msg’ng) – Architecture peek through

Existing System bottlenecks & shortfalls

Rethinking from fundamentals – Distributed Commit Log

Kafka Messaging – Concept, Architecture, API & Demo

Kafka Internals – ZooKeeper in depth, Atomic broadcast & Quorum

Performance & feature comparisons – Traditional vs New Age

Page 4: Distributed messaging through Kafka


Are we good ?

Page 5: Distributed messaging through Kafka


Data off the limits – Handling larger Data sets

Kafka on Responsys technology Road map - Antonio

Data evolution from Traditional to BigData

Characterized by Volume, Variety, Velocity, Variability, Veracity & Complexity

Volume -> Quantity of data. Storage & Processing (Hadoop, NoSQL)
Variety -> Diversity of data sets, OLTP, OLAP (NoSQL, NewSQL)
Velocity -> Speed of data handling in real time (Kafka, Storm, Flume)

Deeper market penetration implicitly transforms Data

Our focus is on Velocity

Need of the hour is Systems to handle – BigData Technologies

Page 6: Distributed messaging through Kafka


BigData Technologies – MindMap view

Page 7: Distributed messaging through Kafka

Identifying Scope – Architecture Peek-in

[Architecture diagram. Tier labels: Uber, Application, Database. Component labels include UI, PUB, WS, CN, JMS, SPAM, Apache, MTA, SMTP, Fileserver, Content, Short URL and SMS. Database labels include EventDB, CustDB, ReportDB, SysAdmDB, DataWarehouse, AuditDB and UsageDB. Callout: REAL TIME PROCESSING]

Page 8: Distributed messaging through Kafka


Is there a problem with my current system?

Existing systems (IBM MQ) are good in the traditional sense.

Delivery guarantees are good for emails, but what about events (PubWeb, Bounce, AB)?

Focus on throughput. Existing brokers have limitations.

Scaling and Replication, cost of Cluster maintenance in existing MQ.

Dynamic rebalancing of Brokers, Consumers

Page 9: Distributed messaging through Kafka


Rethink from Fundamentals

LOGS

Page 10: Distributed messaging through Kafka


Logs – fundamental System blocks

• Log (as a foundation):

  Append-only, totally ordered sequence of records, ordered by time.

  Each entry gets a unique, sequential entry number that acts as a timestamp decoupled from any physical clock. Deterministic.

• Logging (as a core process):

  • IS machine-readable logging. Ex: write-ahead logs, commit logs & transaction logs

  • IS NOT application logging (human readable). Ex: Log4j, slf4j, etc.

• Backbone of Distributed Messaging, Databases, NoSQL, Key-Value stores, replication, Hadoop, Version Control…

• Logs for Data Integration, Real time processing & System building.
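A minimal sketch of the log-as-foundation idea, assuming an in-memory list as storage. The class name `AppendOnlyLog` and its methods are hypothetical and chosen only for illustration; the point is that records are only ever appended and each one is identified by its position rather than by a wall-clock time.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AppendOnlyLog:
    """Hypothetical in-memory log: records can be appended but never changed."""
    _records: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append a record and return its sequence number (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, seq: int) -> List[Tuple[int, bytes]]:
        """Return (sequence number, record) pairs starting at seq, in log order."""
        return list(enumerate(self._records[seq:], start=seq))


log = AppendOnlyLog()
first = log.append(b"user-created")
second = log.append(b"user-updated")
```

A reader that remembers the last sequence number it saw can resume with `log.read_from(last + 1)` without the log keeping any per-reader state.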

Page 11: Distributed messaging through Kafka


Logs – solving Problems

• Logs are not new in databases!! They started with IBM System R.

  Physical logging: the values of the rows changed. Logical logging: the SQL queries.

  Log implementations range from ACID to replication (GoldenGate).

• State Machine Replication Principle: if two identical, deterministic processes begin in the same state and get the same inputs in the same order, they produce the same output and end in the same state.

• In distributed systems, logs solve core problems:

  Ordering changes
  Distributing data

• Processing and replication

  Active-Passive
  Active-Active
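The state machine replication principle can be sketched as follows. The command format `(operation, key, amount)` and the function `apply_log` are assumptions made for this example; any deterministic replay function would illustrate the same point.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (operation, key, amount): hypothetical format


def apply_log(commands: List[Command]) -> Dict[str, int]:
    """Deterministically replay a log of commands, starting from an empty state."""
    state: Dict[str, int] = {}
    for op, key, amount in commands:
        if op == "set":
            state[key] = amount
        elif op == "add":
            state[key] = state.get(key, 0) + amount
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


shared_log: List[Command] = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]

# Two replicas that consume the same log in the same order end in the same state.
replica_a = apply_log(shared_log)
replica_b = apply_log(shared_log)

# Feeding the same commands in a different order can produce a different state.
reordered = apply_log([shared_log[1], shared_log[0], shared_log[2]])
```

The reordered run shows why the log's total order matters: the inputs are the same, but the final value of `x` differs.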

Page 12: Distributed messaging through Kafka


Logs – driving Architecture

• Log-structured data flow
  Cache system
  Asynchronous Production & Consumption

• Kafka Log Centric approach:
  Not a Database, Log file collection, Typical messaging system

• Event driven architecture:
  Kafka – event driven, Multi-subscriber system (Topic)
  Example – which performs multiple ops on one event job

Page 13: Distributed messaging through Kafka


Logs in ACTION

APACHE KAFKA

Page 14: Distributed messaging through Kafka

Kafka

Introducing Kafka
“Should I wake up now? ...why?”

Kafka Core Concepts
Topics, partitions, replicas, producers, consumers, brokers

Operating Kafka
Architecture, deploying, monitoring, P&S tuning

Page 15: Distributed messaging through Kafka

Introducing Kafka

http://kafka.apache.org/
Originated at LinkedIn, open sourced in early 2011
Implemented in Scala, some Java
9 core committers, plus ~ 20 contributors

Kafka is a distributed, partitioned, replicated commit log service. A uniquely designed pub-sub messaging system.

Designed for:
• High throughput to support high-volume event feeds.
• Real-time processing of these feeds to create new, derived feeds.
• Low-latency delivery to handle traditional messaging use cases.
• Guaranteed fault tolerance.

Page 16: Distributed messaging through Kafka

Kafka in Real business

Page 17: Distributed messaging through Kafka


Kafka is Amazingly fast – How ?

• “Up to 2 million writes/sec on 3 cheap machines”
  • Using 3 producers on 3 different machines, 3x async replication
  • Only 1 producer/machine because NIC already saturated

• Sustained throughput as stored data grows
  • Slightly different test config than 2M writes/sec above.

Page 18: Distributed messaging through Kafka


Kafka is Amazingly fast – Why ?

• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
  • Cf. hardware specs and OS tuning (we cover this later)

• Fast reads:
  • Very efficient to transfer data from page cache to a network socket
  • Linux: sendfile() system call

• Combination of the two = fast Kafka!
  • Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.
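The file-to-socket transfer behind the fast read path can be sketched with Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it. This illustrates the system call only and is not Kafka's broker code; the helper `send_segment`, the temporary file and the local socket pair are stand-ins assumed for the demo.

```python
import os
import socket
import tempfile


def send_segment(sock: socket.socket, path: str, offset: int, count: int) -> int:
    """Send `count` bytes of the file at `path`, starting at `offset`, over `sock`."""
    with open(path, "rb") as segment:
        # socket.sendfile() hands the copy to the kernel via os.sendfile() when
        # available and falls back to plain send() calls otherwise.
        return sock.sendfile(segment, offset=offset, count=count)


# Demo: a temporary file plays the log segment, a socket pair plays the connection.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"msg-0|msg-1|msg-2|")
    segment_path = tmp.name

broker_side, consumer_side = socket.socketpair()
sent = send_segment(broker_side, segment_path, offset=6, count=12)
received = consumer_side.recv(1024)

broker_side.close()
consumer_side.close()
os.remove(segment_path)
```

When the kernel path is used, the bytes go from the page cache to the socket without first being copied into a buffer owned by the application.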

Page 19: Distributed messaging through Kafka


Kafka Core Concepts - A first look

• The who is who
  • Producers write data to brokers.
  • Consumers read data from brokers.
  • All this is distributed.

• The data
  • Data is stored in topics.
  • Topics are split into partitions, which are replicated.

Page 20: Distributed messaging through Kafka


Kafka Concepts - Topics

• Topic: feed name to which messages are published
  • Example: “pubweb.event.2”

Page 21: Distributed messaging through Kafka


Kafka Concepts - Topics

Page 22: Distributed messaging through Kafka


Kafka Concepts - Creating a Topic

• Creating a topic
  • CLI

    $ kafka-topics.sh --zookeeper zookeeper1:2181 --create --topic zerg.hydra \
        --partitions 3 --replication-factor 2 \
        --config x=y

  • API
    https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala
  • Auto-create via auto.create.topics.enable = true

• Modifying a topic
  - Add partitions
  - Add configs
  - Remove configs
  - Deleting topics

Page 23: Distributed messaging through Kafka


Kafka Concepts - Partitions

• A topic consists of partitions
  • Partition: ordered + immutable sequence of messages that is continually appended to

• The number of partitions of a topic is configurable

Page 24: Distributed messaging through Kafka


Kafka Concepts - Partition Offset

• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset

• Consumers track their pointers via (offset, partition, topic) tuples

Consumer group C1
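A small in-memory sketch of partitions, offsets and consumer-side position tracking. The classes `Topic` and `Consumer` are hypothetical and are not the Kafka client API; choosing a partition by CRC32 of the key is also an assumption made for the example, not a description of Kafka's partitioner.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class Topic:
    """Hypothetical in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, message: str) -> Tuple[int, int]:
        """Append to the partition picked by hashing the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Keeps its own read position per (topic, partition)."""

    def __init__(self) -> None:
        self.positions: Dict[Tuple[str, int], int] = defaultdict(int)

    def poll(self, topic: Topic, partition: int) -> List[str]:
        """Return the unread messages of one partition and advance the stored offset."""
        start = self.positions[(topic.name, partition)]
        batch = topic.partitions[partition][start:]
        self.positions[(topic.name, partition)] = start + len(batch)
        return batch


topic = Topic("pubweb.event.2", num_partitions=3)
p1, o1 = topic.publish("user-1", "click")
p2, o2 = topic.publish("user-1", "open")

consumer = Consumer()
first_poll = consumer.poll(topic, p1)
second_poll = consumer.poll(topic, p1)
```

Offsets are unique only within a partition, which is why the consumer's position is keyed by topic and partition together.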

Page 25: Distributed messaging through Kafka


Kafka Concepts - Partition Replicas

• Replicas: “backups” of a partition
  • They exist solely to prevent data loss.
  • Replicas are never read from or written to by clients; producers and consumers talk only to the partition leader.

• They do NOT help to increase producer or consumer parallelism!

Page 26: Distributed messaging through Kafka

Topics vs Partitions vs Replicas

Page 27: Distributed messaging through Kafka


Kafka Concepts - Topic inspection

• --describe the topic

• Leader: broker ID of the currently elected leader broker
  • Replica IDs = broker IDs

• ISR = “in-sync replica”, replicas that are in sync with the leader

• In this example:
  • Broker 0 is leader for partition 1.
  • Broker 1 is leader for partitions 0 and 2.
  • All replicas are in-sync with their respective leader partitions.

$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic zerg.hydra
Topic:zerg2.hydra   PartitionCount:3   ReplicationFactor:2   Configs:
    Topic: zerg2.hydra   Partition: 0   Leader: 1   Replicas: 1,0   Isr: 1,0
    Topic: zerg2.hydra   Partition: 1   Leader: 0   Replicas: 0,1   Isr: 0,1
    Topic: zerg2.hydra   Partition: 2   Leader: 1   Replicas: 1,0   Isr: 1,0
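As a rough illustration of how to read the describe output, the sketch below parses the sample text from this slide and derives the leader per partition and any under-replicated partitions. The parser `parse_partitions` is a hypothetical helper written for this exact layout; it is not part of Kafka's tooling, and real output may differ between versions.

```python
from typing import Dict, List

DESCRIBE_OUTPUT = """\
Topic:zerg2.hydra PartitionCount:3 ReplicationFactor:2 Configs:
    Topic: zerg2.hydra Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
    Topic: zerg2.hydra Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
    Topic: zerg2.hydra Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
"""


def parse_partitions(text: str) -> List[Dict[str, object]]:
    """Turn each per-partition line into a dict; the summary line is skipped."""
    rows: List[Dict[str, object]] = []
    for line in text.splitlines():
        tokens = line.split()
        if "Partition:" not in tokens:
            continue  # summary line or blank line
        fields = dict(zip(tokens[0::2], tokens[1::2]))  # "Leader:" -> "1", etc.
        rows.append({
            "partition": int(fields["Partition:"]),
            "leader": int(fields["Leader:"]),
            "replicas": [int(b) for b in fields["Replicas:"].split(",")],
            "isr": [int(b) for b in fields["Isr:"].split(",")],
        })
    return rows


rows = parse_partitions(DESCRIBE_OUTPUT)

# Partitions led by each broker.
leaders: Dict[int, List[int]] = {}
for row in rows:
    leaders.setdefault(row["leader"], []).append(row["partition"])

# A partition is under-replicated when some assigned replica is missing from the ISR.
under_replicated = [r["partition"] for r in rows if set(r["isr"]) != set(r["replicas"])]
```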

Page 28: Distributed messaging through Kafka

Kafka Concepts - Consumers & Producers

Page 29: Distributed messaging through Kafka

Kafka Concepts - Producer

• Code
• Start Producer

Page 30: Distributed messaging through Kafka

Kafka Concepts - Consumers

• Code
• Start Consumer
• Multithreaded Consumer for multiple partitions
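The demo code from the talk is not part of this transcript. As a stand-in, here is a minimal sketch of the one-thread-per-partition pattern, using in-memory lists instead of a Kafka client; the `partitions` data and the function `consume_partition` are hypothetical.

```python
import threading
from typing import Dict, List

# Hypothetical in-memory partitions standing in for a real topic.
partitions: Dict[int, List[str]] = {
    0: ["a0", "a1", "a2"],
    1: ["b0", "b1"],
    2: ["c0"],
}
consumed: Dict[int, List[str]] = {p: [] for p in partitions}


def consume_partition(partition: int) -> None:
    """Read one partition sequentially; only this thread writes consumed[partition]."""
    for message in partitions[partition]:
        consumed[partition].append(message)


threads = [threading.Thread(target=consume_partition, args=(p,)) for p in partitions]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each thread owns exactly one partition, the order within a partition is preserved even though the partitions are processed in parallel.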

Page 31: Distributed messaging through Kafka


Kafka Core Concepts - Recap

• The who is who
  • Producers write data to brokers.
  • Consumers read data from brokers.
  • All this is distributed.

• The data
  • Data is stored in topics.
  • Topics are split into partitions, which are replicated.

Page 32: Distributed messaging through Kafka


Monitoring & Testing

Page 33: Distributed messaging through Kafka


Kafka – Monitoring and Testing

• JMX enabled
• System tools
  • Describe
• Quantifind Offset Monitor
• Monitoring DEMO

Page 34: Distributed messaging through Kafka


Empowering Kafka

Page 35: Distributed messaging through Kafka

Apache ZooKeeper

Apache Kafka uses ZooKeeper to detect crashes, implement topic discovery, and maintain production & consumption state for topics.

High-performance coordination service for distributed applications.

SoC – Separates Coordination overhead from Application logic.

Centralized service for naming (registry), configuration management, synchronization, and group membership services.

ZooKeeper is the backbone for HBase, Solr, Facebook messaging apps & many more distributed apps.

Simple, Replicated, Ordered and Fast

Page 36: Distributed messaging through Kafka

ZooKeeper – Internals

Znodes
  Persistent – exists until deleted
  Ephemeral – scoped to the session that created it

Reads are served by any node; writes go through the leader

Data is stored as a byte array

Allows watches and notifications

Ensemble – a group of servers available to serve requests

Leader election is decided by a quorum
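To make persistent znodes, ephemeral znodes and watches concrete, here is a tiny in-memory simulation. The class `TinyZk` and its methods are hypothetical and are not the ZooKeeper client API; the paths are illustrative only. It shows how an ephemeral node that disappears with its session can signal a crashed process to a watcher.

```python
from typing import Callable, Dict, List, Optional


class TinyZk:
    """Hypothetical in-memory stand-in for a znode tree; not the ZooKeeper API."""

    def __init__(self) -> None:
        self.nodes: Dict[str, bytes] = {}                 # path -> data (byte array)
        self.owners: Dict[str, Optional[str]] = {}        # path -> session id, or None
        self.watches: Dict[str, List[Callable[[str], None]]] = {}

    def create(self, path: str, data: bytes, session: Optional[str] = None) -> None:
        """Create a persistent znode, or an ephemeral one when a session is given."""
        self.nodes[path] = data
        self.owners[path] = session

    def watch(self, path: str, callback: Callable[[str], None]) -> None:
        """Register a one-shot callback that fires when the znode is deleted."""
        self.watches.setdefault(path, []).append(callback)

    def close_session(self, session: str) -> None:
        """Delete every ephemeral znode owned by the session and fire its watches."""
        for path in [p for p, owner in self.owners.items() if owner == session]:
            del self.nodes[path]
            del self.owners[path]
            for callback in self.watches.pop(path, []):
                callback(path)


zk = TinyZk()
zk.create("/app/config", b"retention=7d")                    # persistent
zk.create("/brokers/ids/0", b"host0:9092", session="s0")     # ephemeral

events: List[str] = []
zk.watch("/brokers/ids/0", events.append)
zk.close_session("s0")  # simulates the broker's session ending
```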

Page 37: Distributed messaging through Kafka

ZooKeeper – Guarantees

• Follows principles of ATOMIC broadcast

  Sequential Consistency – Updates are applied in order
  Atomicity – Updates either succeed or fail
  Single system image – Same view of service regardless of ZK server
  Reliability – Persistence of updates
  Timeliness – System is guaranteed to be up-to-date within time bound

• In Summary - Zookeeper { Leader Activation + Message delivery }
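The quorum arithmetic behind these guarantees can be worked through in a few lines. The sketch assumes simple majority quorums; the function names are hypothetical.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def is_committed(acks: int, ensemble_size: int) -> bool:
    """Under majority quorums, an update counts as committed once a majority acked it."""
    return acks >= quorum_size(ensemble_size)
```

With these definitions a 4-server ensemble tolerates one failure, the same as a 3-server ensemble, which is why odd ensemble sizes are the usual choice.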

Page 38: Distributed messaging through Kafka


Kafka Performance

Page 39: Distributed messaging through Kafka

Kafka performance – Producer tests (LinkedIn benchmark test)

• HW set-up with 2 Linux nodes
  • Each with 8 2 GHz cores (8 cores/machine ~ 16 GHz processing)
  • 16 GB of RAM, 6 disks with RAID 10 and a 1 Gb network connection.

• Producer test
  • Single producer ~ 10 million msgs, each of 200 bytes
  • Kafka msg batch sizes of 1 and 50. Other MQs: no batching
  • X-axis – msgs sent to broker, Y-axis – producer throughput

• Why is the producer fast
  • No ACK
  • Batching
  • Kafka storage format
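A back-of-the-envelope calculation for the producer test, using the figures on this slide. It assumes the network link is 1 Gbit/s and ignores protocol overhead, so the last number is only an upper bound for illustration, not a measured result.

```python
MESSAGES = 10_000_000               # messages published in the producer test
MESSAGE_BYTES = 200                 # payload size per message
LINK_BITS_PER_SEC = 1_000_000_000   # assumption: the link runs at 1 Gbit/s


def requests_needed(messages: int, batch_size: int) -> int:
    """Number of send requests when messages are grouped into batches (ceiling division)."""
    return -(-messages // batch_size)


total_payload_bytes = MESSAGES * MESSAGE_BYTES
requests_unbatched = requests_needed(MESSAGES, 1)
requests_batch_50 = requests_needed(MESSAGES, 50)

# Upper bound on messages/sec if the link carried nothing but payload bytes.
link_ceiling_msgs_per_sec = LINK_BITS_PER_SEC // 8 // MESSAGE_BYTES
```

Batching 50 messages per request cuts the number of requests from 10 million to 200,000 for the same 2 GB of payload, which is one reason batching shows up in the list above.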

Page 40: Distributed messaging through Kafka

Kafka performance – Consumer tests (LinkedIn benchmark test)

• HW set-up with 2 Linux nodes
  • Each with 8 2 GHz cores (8 cores/machine ~ 16 GHz processing)
  • 16 GB of RAM, 6 disks with RAID 10 and a 1 Gb network connection.

• Consumer test
  • Single consumer retrieves 10 million msgs, each of 200 bytes
  • Each pull request is for 1000 msgs (200 KB)
  • X-axis – msgs consumed from broker, Y-axis – consumer throughput

• Why is the consumer fast
  • No delivery state storage
  • Kafka storage format (less data transmitted)
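The pull sizes in the consumer test can be checked with a short loop. The function `pull_all` is a hypothetical stand-in for a consumer whose only state is its current offset; it moves no real data.

```python
from typing import Tuple

MESSAGES = 10_000_000
MESSAGE_BYTES = 200
MESSAGES_PER_PULL = 1000


def pull_all(total_messages: int, messages_per_pull: int) -> Tuple[int, int]:
    """Advance an offset in fixed-size pulls; return (number of pulls, final offset)."""
    offset = 0
    pulls = 0
    while offset < total_messages:
        offset += min(messages_per_pull, total_messages - offset)
        pulls += 1
    return pulls, offset


bytes_per_pull = MESSAGES_PER_PULL * MESSAGE_BYTES
pulls, final_offset = pull_all(MESSAGES, MESSAGES_PER_PULL)
```

Each pull carries 200,000 bytes of payload, and the whole test needs 10,000 pulls. The only position information involved is the single integer offset held by the consumer.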

Page 41: Distributed messaging through Kafka


Summary, Conclusions & References

Page 42: Distributed messaging through Kafka

Summary – quick Recap

Importance of handling & processing BigData

Scope for introduction in Responsys Architecture

Existing System bottlenecks & shortfalls

Distributed Commit Log

Kafka Messaging

Kafka Internals – ZooKeeper

Performance & feature comparisons – Traditional vs New Age

Page 43: Distributed messaging through Kafka

Conclusion – Open ended

• Limitation is on Data – not on Systems

• No need for complete revamp

• Choice of Right systems at right time is the recipe.

References

1. https://kafka.apache.org/
2. http://zookeeper.apache.org/
3. http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
4. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Page 44: Distributed messaging through Kafka


THANK YOU