New age Distributed Messaging
Kafka & Concepts explored!
Dileep Varma Kalidindi
Nov 2014
05/03/2023 Confidential 2
Who Am I ?
Name: Dileep Varma Kalidindi
Status: Senior Engineer @Responsys (since Apr’14), Circles Team.
Fascination: Problem solving, distributed & BigData-churning systems.
Past: 8+ yrs with VeriSign, Informatica Labs, NTT Data.
Hobbies: Jumping (Water & Air)
What is brewing today ?
Responsys Technology Road Map.
Data off the limits - Handling & Processing BigData
Scope for New Age capabilities (in distributed messaging) – Architecture peek-through
Existing System bottlenecks & shortfalls
Rethinking from fundamentals – Distributed Commit Log
Kafka Messaging – Concept, Architecture, API & Demo
Kafka Internals – ZooKeeper in depth, Atomic broadcast & Quorum
Performance & feature comparisons – Traditional vs New Age
Are we good ?
Data off the limits – Handling larger Data sets
Kafka on Responsys technology Road map - Antonio
Data evolution from Traditional to BigData
Characterized by Volume, Variety, Velocity, Variability, Veracity & Complexity
Volume -> Quantity of data. Storage & processing (Hadoop, NoSQL)
Variety -> Diversity of data sets, OLTP, OLAP (NoSQL, NewSQL)
Velocity -> Speed of data handling in real time (Kafka, Storm, Flume)
Deeper market penetration implicitly transforms Data
Our focus is on Velocity
The need of the hour is systems that can handle it – BigData technologies
BigData Technologies – MindMap view
Identifying Scope – Architecture Peek-in
[Architecture diagram: the Uber Application layer (UI, PUB, WS, CN, BounceIS, LA, JMS, EC, SPAM, ETLAB, Apache, MTA, SMTP, Fileserver) over the database layer (EventDB, CustDB, ReportDB, SysAdmDB, DataWarehouse, AuditDB, UsageDB), plus EMD, CL, PD, ICR, Content, IDDP, Short URL, SUL, DIS, SMS, PGPUSH, SMSL, with the REAL TIME PROCESSING path highlighted]
Is there a problem with my current system? Existing systems (IBM MQ) are good in the traditional sense.
Delivery guarantees are good for emails, but what about events (PubWeb, Bounce, AB)?
Focus on throughput. Existing brokers have limitations.
Scaling and Replication, cost of Cluster maintenance in existing MQ.
Dynamic rebalancing of Brokers, Consumers
Rethink from Fundamentals
LOGS
Logs – fundamental System blocks
• Log (as a foundation):
  Append-only, totally-ordered sequence of records ordered by time.
  Unique, sequential log entry ids (clock-decoupled timestamps); deterministic.
• Logging (as a core process):
  • IS machine-readable logging. Ex: write-ahead logs, commit logs & transaction logs
  • IS NOT application logging (human-readable). Ex: Log4j, slf4j etc.
• Backbone of distributed messaging, databases, NoSQL, key-value stores, replication, Hadoop, version control…
• Logs for data integration, real-time processing & system building.
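As a concrete illustration of the log abstraction above, a minimal Python sketch (not Kafka code) of an append-only, totally ordered record sequence with sequential entry ids:

```python
# Minimal sketch of an append-only log: each record gets a strictly
# sequential entry id, and reads replay records in append order.
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record; return its sequential log entry id."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, entry_id):
        """Replay all records at or after entry_id, in order."""
        return self._records[entry_id:]

log = Log()
assert log.append("user.created") == 0
assert log.append("user.updated") == 1
assert log.read_from(0) == ["user.created", "user.updated"]
```

Everything else in this talk (offsets, replication, replay) builds on this one structure.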
Logs – solving Problems
• Logs are not new in databases! They started with IBM System R
  Physical logging – the values of changed rows; logical logging – the SQL queries
  Log implementations – from ACID to replication (GoldenGate)
• State Machine Replication Principle: two identical, deterministic processes that begin in the same state and receive the same inputs in the same order produce the same outputs and end in the same state
• In distributed systems, logs solve core problems:
  Ordering changes; distributing data
• Processing and replication:
  Active–Passive; Active–Active
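The state-machine replication principle is easy to demo: feed two identical, deterministic processes the same log of inputs in the same order and they end in the same state. A minimal Python sketch (the Counter state machine is a made-up example):

```python
# Sketch of state-machine replication: two identical, deterministic
# processes, same starting state, same ordered inputs -> same end state.
class Counter:
    def __init__(self):
        self.state = 0

    def apply(self, op):
        # Deterministic transition: op is ("add", n) or ("mul", n).
        kind, n = op
        if kind == "add":
            self.state += n
        elif kind == "mul":
            self.state *= n
        return self.state

log = [("add", 5), ("mul", 3), ("add", -2)]
primary, replica = Counter(), Counter()
for op in log:                 # same inputs, same order
    primary.apply(op)
    replica.apply(op)
assert primary.state == replica.state == 13
```

Shipping the log is therefore equivalent to shipping the state: this is exactly why a replicated commit log is enough to keep replicas consistent.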
Logs – driving Architecture
• Log-structured data flow: cache systems, asynchronous production & consumption
• Kafka's log-centric approach:
  Not (just) a database, a log-file collection, or a typical messaging system
• Event-driven architecture:
  Kafka is an event-driven, multi-subscriber system (topics). Example: performing multiple ops on one event job
Logs in ACTION
APACHE KAFKA
Kafka
Introducing Kafka – “Should I wake-up now? ..why?”
Kafka Core Concepts – topics, partitions, replicas, producers, consumers, brokers
Operating Kafka – architecture, deploying, monitoring, P&S tuning
Introducing Kafka
http://kafka.apache.org/
Originated at LinkedIn, open sourced in early 2011
Implemented in Scala, some Java
9 core committers, plus ~20 contributors
Kafka is a distributed, partitioned, replicated commit log service. A uniquely designed pub-sub messaging system.
Designed for:
  High throughput to support high-volume event feeds.
  Real-time processing of these feeds to create new, derived feeds.
  Low-latency delivery to handle traditional messaging use cases.
  Guaranteed fault tolerance.
Kafka in Real business
Kafka is Amazingly fast – How ?
• “Up to 2 million writes/sec on 3 cheap machines”
  • Using 3 producers on 3 different machines, 3x async replication
  • Only 1 producer/machine because the NIC is already saturated
• Sustained throughput as stored data grows
  • Slightly different test config than the 2M writes/sec above.
Kafka is Amazingly fast – Why ?
• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
  • Cf. hardware specs and OS tuning (we cover this later)
• Fast reads:
  • Very efficient to transfer data from the page cache to a network socket
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
• Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they serve data entirely from cache.
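The zero-copy read path can be demonstrated directly: on Linux, sendfile() (which Kafka reaches via Java NIO's FileChannel.transferTo()) asks the kernel to move bytes from the page cache to a destination descriptor without a round trip through user space. A small Python sketch (Linux-only; file names are temporary and illustrative):

```python
import os
import tempfile

# Write a small "log segment", then copy it with os.sendfile(): the
# kernel moves the bytes from the page cache to the destination fd
# without copying them through user-space buffers.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"log segment bytes" * 4)
src.flush()

dst_path = src.name + ".copy"
with open(src.name, "rb") as fin, open(dst_path, "wb") as fout:
    size = os.fstat(fin.fileno()).st_size
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0, size)

assert sent == size
```

In Kafka the destination fd is a consumer's network socket, which is why caught-up consumers generate no disk reads at all.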
Kafka Core Concepts - A first look
• The who is who
  • Producers write data to brokers.
  • Consumers read data from brokers.
  • All of this is distributed.
• The data
  • Data is stored in topics.
  • Topics are split into partitions, which are replicated.
Kafka Concepts - Topics
• Topic: a feed name to which messages are published
• Example: “pubweb.event.2”
Kafka Concepts - Creating a Topic
• Creating a topic
  • CLI:
    $ kafka-topics.sh --zookeeper zookeeper1:2181 --create --topic zerg.hydra \
        --partitions 3 --replication-factor 2 \
        --config x=y
  • API: https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala
  • Auto-create via auto.create.topics.enable = true
• Modifying a topic
  - Add partitions
  - Add configs
  - Remove configs
  - Delete topics
Kafka Concepts - Partitions
• A topic consists of partitions
• Partition: an ordered + immutable sequence of messages that is continually appended to
• The number of partitions per topic is configurable
Kafka Concepts - Partition Offset
• Offset: messages in the partitions are each assigned a unique (per-partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
[Diagram: consumer group C1 reading from partition offsets]
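Offset tracking really is just one integer per (topic, partition). A minimal in-memory Python sketch (the topic name is illustrative, and this is not the real consumer API):

```python
# Sketch of offset tracking: the consumer remembers a single integer
# per (topic, partition); resuming means re-reading from that offset.
partitions = {
    ("pubweb.event", 0): ["m0", "m1", "m2", "m3"],
    ("pubweb.event", 1): ["n0", "n1"],
}
offsets = {key: 0 for key in partitions}   # the consumer's pointers

def poll(key, max_msgs):
    """Read up to max_msgs from one partition and advance its offset."""
    start = offsets[key]
    batch = partitions[key][start:start + max_msgs]
    offsets[key] = start + len(batch)
    return batch

assert poll(("pubweb.event", 0), 2) == ["m0", "m1"]
assert poll(("pubweb.event", 0), 2) == ["m2", "m3"]   # resumes at offset 2
assert offsets[("pubweb.event", 1)] == 0              # partitions are independent
```

Because the broker keeps the data and the consumer keeps only the pointer, rewinding or replaying is as cheap as resetting an integer.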
Kafka Concepts - Partition Replicas
• Replicas: “backups” of a partition
  • They exist solely to prevent data loss.
  • Replicas are never read from and never written to by clients.
• They do NOT help to increase producer or consumer parallelism!
Topics vs Partitions vs Replicas
Kafka Concepts - Topic inspection
• --describe the topic
• Leader: broker id of the currently elected leader broker
• Replica ids = broker ids
• ISR = “in-sync replicas”: replicas that are in sync with the leader
• In this example:
  • Broker 0 is leader for partition 1.
  • Broker 1 is leader for partitions 0 and 2.
  • All replicas are in sync with their respective leader partitions.

$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic zerg.hydra
Topic: zerg2.hydra  PartitionCount: 3  ReplicationFactor: 2  Configs:
  Topic: zerg2.hydra  Partition: 0  Leader: 1  Replicas: 1,0  Isr: 1,0
  Topic: zerg2.hydra  Partition: 1  Leader: 0  Replicas: 0,1  Isr: 0,1
  Topic: zerg2.hydra  Partition: 2  Leader: 1  Replicas: 1,0  Isr: 1,0
Kafka Concepts - Consumers & Producers
Kafka Concepts - Producer
• Code
• Start Producer
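The producer code was shown live and is not in these notes; as a stand-in, here is an in-memory Python sketch of what the producer side does: pick a partition (by key hash, or round-robin when keyless) and append the message there. This is not the real Kafka client API, just the core idea:

```python
import itertools

# In-memory sketch of a producer: route each message to a partition,
# by key hash when keyed, round-robin when keyless.
class Producer:
    def __init__(self, num_partitions):
        self._partitions = [[] for _ in range(num_partitions)]
        self._rr = itertools.cycle(range(num_partitions))

    def send(self, key, value):
        if key is None:
            p = next(self._rr)                     # keyless: spread evenly
        else:
            p = hash(key) % len(self._partitions)  # keyed: stable partition
        self._partitions[p].append(value)
        return p

producer = Producer(3)
p1 = producer.send("user-42", "clicked")
p2 = producer.send("user-42", "bounced")
assert p1 == p2   # the same key always lands on the same partition
```

Keyed routing is what gives Kafka per-key ordering: all events for one key live in one partition, and a partition is totally ordered.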
Kafka Concepts - Consumers
• Code
• Start Consumer
• Multithreaded Consumer for multiple partitions
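Likewise the consumer code itself is not in these notes; a minimal Python sketch of the multithreaded pattern, with one thread per partition (in-memory queues stand in for brokers) since a partition is consumed by at most one thread of a group at a time:

```python
import queue
import threading

# In-memory stand-ins for three partitions of one topic.
partitions = [queue.Queue() for _ in range(3)]
for i, q in enumerate(partitions):
    for n in range(5):
        q.put(f"p{i}-msg{n}")

consumed = [[] for _ in partitions]

def consume(pid):
    """Drain one partition; only this thread touches partition pid."""
    q = partitions[pid]
    while not q.empty():
        consumed[pid].append(q.get())

# One thread per partition -- the unit of consumer parallelism.
threads = [threading.Thread(target=consume, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Order is preserved within each partition, not across partitions.
assert consumed[0] == [f"p0-msg{n}" for n in range(5)]
```

This is why the partition count caps consumer parallelism: a fourth thread here would have nothing to own.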
Kafka Core Concepts - Recap
• The who is who
  • Producers write data to brokers.
  • Consumers read data from brokers.
  • All of this is distributed.
• The data
  • Data is stored in topics.
  • Topics are split into partitions, which are replicated.
Monitoring & Testing
Kafka – Monitoring and Testing
• JMX enabled
• System tools
  • Describe
• Quantified Offset Monitor
• Monitoring DEMO
Empowering Kafka
Apache ZooKeeper
Apache Kafka uses ZooKeeper to detect crashes, implement topic discovery, and maintain production & consumption state for topics.
High-performance coordination service for distributed applications.
SoC – Separates Coordination overhead from Application logic.
Centralized service for naming (registry), configuration management, synchronization, and group membership services.
ZooKeeper is the backbone for HBase, Solr, Facebook messaging apps & many more distributed apps.
Simple, Replicated, Ordered and Fast
ZooKeeper – Internals
Znodes:
  Persistent – exist until deleted
  Ephemeral – session-scoped
Reads are served by all nodes; writes go through the leader
Data is stored as byte arrays
Allows watches and notifications
Ensemble – a group of servers available to serve requests
Quorum-determined leader selection
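The quorum arithmetic behind leader selection is simple: an ensemble of 2f+1 servers needs a strict majority to elect a leader or commit a write, so it tolerates f failures. A tiny Python sketch:

```python
# Quorum math for a ZooKeeper-style ensemble: progress (writes, leader
# election) requires a strict majority, so 2f+1 servers tolerate f failures.
def quorum(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def can_make_progress(ensemble_size, alive):
    """True if enough servers are up to form a quorum."""
    return alive >= quorum(ensemble_size)

assert quorum(3) == 2 and quorum(5) == 3
assert can_make_progress(5, 3)        # a 5-node ensemble survives 2 failures
assert not can_make_progress(5, 2)    # ...but not 3
```

This is also why ensembles use odd sizes: 4 servers tolerate the same single failure as 3, at higher cost.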
ZooKeeper – Guarantees
• Follows principles of ATOMIC broadcast
Sequential consistency – updates are applied in order
Atomicity – updates either succeed or fail
Single system image – same view of the service regardless of the ZK server
Reliability – persistence of updates
Timeliness – the system is guaranteed to be up-to-date within a time bound
• In summary – ZooKeeper = { leader activation + message delivery }
Kafka Performance
Kafka performance – Producer tests (LinkedIn benchmark test)
• HW set-up with 2 Linux nodes
  • Each with 8 × 2 GHz cores (~16 GHz of processing per machine)
  • 16 GB of RAM, 6 disks with RAID 10 and a 1 Gb network connection
• Producer test
  • Single producer sends ~10 million msgs, each of 200 bytes
  • Kafka msg batch sizes of 1 and 50; other MQs: no batching
  • X-axis – msgs sent to broker; Y-axis – producer throughput
• Why is the Producer fast?
  • No ACK
  • Batching
  • Kafka storage format
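The batching effect is easy to quantify: grouping messages amortizes per-request overhead across the whole batch. A minimal Python sketch using the benchmark's batch sizes of 1 and 50:

```python
# Sketch of producer batching: the number of broker round trips drops
# by a factor of the batch size, amortizing per-request overhead.
def batches(messages, batch_size):
    """Group messages into consecutive batches of up to batch_size."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

msgs = [f"m{i}" for i in range(120)]
# batch size 1 -> 120 broker requests; batch size 50 -> only 3
assert len(batches(msgs, 1)) == 120
assert len(batches(msgs, 50)) == 3
```

With per-request costs (syscalls, network round trips, protocol headers) paid once per batch instead of once per message, throughput scales accordingly.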
Kafka performance – Consumer tests (LinkedIn benchmark test)
• HW set-up with 2 Linux nodes
  • Each with 8 × 2 GHz cores (~16 GHz of processing per machine)
  • 16 GB of RAM, 6 disks with RAID 10 and a 1 Gb network connection
• Consumer test
  • Single consumer retrieves 10 million msgs, each of 200 bytes
  • Each pull request fetches 1000 msgs (200 KB)
  • X-axis – msgs consumed from broker; Y-axis – consumer throughput
• Why is the Consumer fast?
  • No delivery-state storage
  • Kafka storage format (less data transmitted)
Summary, Conclusions & References
Summary – quick Recap
Importance of handling & processing BigData
Scope for introduction in Responsys Architecture
Existing System bottlenecks & shortfalls
Distributed Commit Log
Kafka Messaging
Kafka Internals – ZooKeeper
Performance & feature comparisons – Traditional vs New Age
Conclusion – Open ended
• Limitation is on Data – not on Systems
• No need for complete revamp
• Choosing the right systems at the right time is the recipe.
References
1. https://kafka.apache.org/
2. http://zookeeper.apache.org/
3. http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
4. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
THANK YOU