Introduction to Kafka
BY DUCAS FRANCIS
The problem
[Diagram labels: Web, Mobile, API, Job; Security System, Real-time Monitoring, Logging System, Other services]
It’s simple enough at first…
Then it gets a little busy…
And ends up a mess.
The solution
[Diagram labels: Web, Mobile, API, Job; Security System, Real-time Monitoring, Logging System, Other services]
Pub/Sub
Decouple data pipelines using a pub/sub system
Producers → Brokers → Consumers
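The decoupling idea can be sketched with a tiny in-memory broker. This is only an illustration: the `Broker` class, its `publish`/`subscribe` methods and the `web-events` topic name are hypothetical and are not Kafka's API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory pub/sub broker (hypothetical, not Kafka's API)."""

    def __init__(self) -> None:
        # topic name -> callbacks registered by consumers
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        # The producer only knows the topic, never the consumers behind it.
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
monitoring: List[str] = []
logging_system: List[str] = []

# Two independent consumers subscribe to the same feed.
broker.subscribe("web-events", monitoring.append)
broker.subscribe("web-events", logging_system.append)

# One producer publishes once; both consumers receive the message.
broker.publish("web-events", "user 42 logged in")
```

Adding a third consumer means one more `subscribe` call and no change to the producer. Note that this toy pushes messages to consumers, whereas Kafka consumers pull from the brokers.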
Apache Kafka
A UNIFIED, HIGH-THROUGHPUT, LOW-LATENCY PLATFORM FOR HANDLING REAL-TIME DATA FEEDS
A brief history lesson
- Originally developed at LinkedIn, open-sourced in 2011
- Graduated from the Apache Incubator in 2012
- Engineers from LinkedIn formed Confluent in 2014
- Up to version 0.9.0.x, with 0.10 on the horizon
Motivation
- Unified platform for all real-time data feeds
- High throughput for high-volume streams
- Support periodic data loads from offline systems
- Low latency for traditional messaging
- Support partitioned, distributed, real-time processing
- Guarantee fault tolerance
Common use cases
- Messaging
- Website activity tracking
- Metrics
- Log aggregation
- Stream processing
- Event sourcing
- Commit log
Benefits of Kafka
- High throughput
- Low latency
- Load balancing
- Fault tolerant
- Guaranteed delivery
- Secure
Performance comparison
Batch performance comparison
Some terminology
- Topic – a feed of messages
- Producer – publishes messages to a topic
- Consumer – subscribes to topics and processes the feed of messages
- Broker – a server instance that runs as part of a cluster
@apachekafka powers @microsoft…
Libraries
- Python – kafka-python / pykafka
- Go – sarama / go_kafka_client / …
- C/C++ – librdkafka / libkafka / …
- .NET – kafka-net (x2) / rdkafka-dotnet / CSharpClient-for-Kafka
- Node.js – kafka-node / sutoiku/node-kafka / ...
- HTTP – kafka-pixy / kafka-rest
- etc.
Architecture
[Diagram labels: two Producers; a Cluster of three Brokers with Zookeeper; two Consumers; "x3"]
Show me the Kafka!!!
VAGRANT TO THE RESCUE
Anatomy of a topic
- Topics are broken into partitions
- Messages are assigned a sequential ID called an offset
- Data is retained for a configurable period of time
- Number of partitions can be increased after creation, but not decreased
- Partitions are assigned to brokers
Each partition is an ordered, immutable sequence of messages that is continually appended to…a commit log.
Broker
- Kafka service running as part of a cluster
- Receives messages from producers and serves them to consumers
- Coordinated using Zookeeper
  - Need odd number for quorum
- Store messages on the file system
- Replicate messages to/from other brokers
- Answer metadata requests about brokers and topics/partitions
- As of 0.9.0 – coordinate consumers
Replication
- Partitions on a topic should be replicated
- Each partition has 1 leader and 0 or more followers
- An In-Sync Replica (ISR) is one that’s communicating with Zookeeper and not too far behind the leader
- Replication factor can be increased after creation, not decreased
./kafka-topics --create --replication-factor --partitions
./kafka-topics --describe
Producers
- Publishes messages to a topic
- Distributes messages across partitions
  - Round-robin
  - Key hashing
- Send synchronously or asynchronously to the broker that is the leader for the partition
- ACKS = 0 (none), 1 (leader), -1 (all ISRs)
- Synchronous is obviously slower, but more durable
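The two distribution strategies can be sketched as follows. This is an illustration, not the producer's actual partitioner: `make_partitioner` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash chosen for this sketch.

```python
import itertools
import zlib
from typing import Callable, Iterator, Optional


def make_partitioner(num_partitions: int) -> Callable[[Optional[bytes]], int]:
    """Return a function choosing a partition for an optional message key."""
    round_robin: Iterator[int] = itertools.cycle(range(num_partitions))

    def choose(key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread messages evenly over all partitions.
            return next(round_robin)
        # Keyed: the same key always maps to the same partition.
        # crc32 is a stand-in hash chosen for this sketch.
        return zlib.crc32(key) % num_partitions

    return choose


choose = make_partitioner(4)
unkeyed = [choose(None) for _ in range(6)]      # cycles 0, 1, 2, 3, 0, 1
keyed = [choose(b"user-42") for _ in range(3)]  # one partition, three times
```

Key hashing is what makes per-key ordering possible: all messages for one key land in one partition, and ordering is guaranteed within a partition.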
Testing... Testing… 1 2 3
LET’S SEE HOW FAST WE CAN PUSH
Consumers
- Read messages from a topic
- Multiple consumers can read from the same topic
- Manage their offsets
- Messages stay on Kafka after they are consumed
Testing... Testing… 1 2 3
LET’S SEE HOW FAST WE CAN RECEIVE
It’s fast! But why…?
- Efficient protocol based on message sets
  - Batching messages to reduce network latency and small I/O operations
  - Append/chunk messages to increase consumer throughput
- Optimised OS operations
  - pagecache
  - sendfile()
- Broker serves consumers from cache where possible
- End-to-end batch compression
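Why batch compression helps can be shown with a small experiment. This sketch assumes `zlib` as a stand-in codec and newline-delimited JSON as the batch layout; neither is Kafka's wire format, and the events are made up.

```python
import json
import zlib

# Hypothetical activity events; similar messages share a lot of structure.
messages = [
    json.dumps({"user": i, "event": "page_view", "path": "/home"}).encode()
    for i in range(100)
]

# Compress each message on its own.
individual_size = sum(len(zlib.compress(m)) for m in messages)

# Compress the whole batch as one unit (newline-delimited here).
batch = b"\n".join(messages)
batch_size = len(zlib.compress(batch))

# Decompressing the batch restores the original messages.
restored = zlib.decompress(zlib.compress(batch)).split(b"\n")
```

Comparing `batch_size` with `individual_size` shows the effect: redundancy between messages can only be exploited when they are compressed together.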
Load balanced consumers
- Distribute load across instances in a group by allocating partitions
- Handle failure by rebalancing partitions to other instances
- Commit their offsets to Kafka
[Diagram: a Cluster with Broker 1 and Broker 2 holding partitions P0, P1, P2, P3; Consumer Group 1 with C0 and C1; Consumer Group 2 with C2, C3, C4 and C6]
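Partition allocation within a group can be sketched as below. The round-robin strategy and the `assign_partitions` helper are assumptions for illustration; the real assignment strategy depends on the client and its configuration.

```python
from typing import Dict, List


def assign_partitions(partitions: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Spread partitions over the consumers of one group, round-robin."""
    assignment: Dict[str, List[str]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Two consumers in a group: each owns two partitions.
group1 = assign_partitions(partitions, ["C0", "C1"])

# C1 fails: rebalancing hands every partition to the survivor.
after_failure = assign_partitions(partitions, ["C0"])
```

Each partition has exactly one owner inside a group, which is how load is shared without two instances of the same group processing the same message.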
Consumer groups and offsets
[Diagram: a Cluster with Broker 1 and Broker 2 holding partitions P0, P1, P2, P3; Consumer Group 1 with C0 and C1; partition P3 shown as offsets 0 to 10, with markers for C1 read, C1 commit, C0 read and C0 commit]
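The difference between reading and committing can be sketched like this. It is a toy model under stated assumptions: `poll` and `commit` are hypothetical helpers, the dictionary stands in for offsets stored in Kafka, and the group name is made up.

```python
from typing import Dict, List, Tuple

# One partition's messages; the index in the list is the offset.
partition_p3: List[str] = [f"m{i}" for i in range(11)]

# Committed offsets live with the cluster, keyed by (group, partition).
committed: Dict[Tuple[str, str], int] = {}


def poll(group: str, partition: str, log: List[str], max_messages: int) -> List[str]:
    """Read from the group's last committed offset without committing."""
    start = committed.get((group, partition), 0)
    return log[start:start + max_messages]


def commit(group: str, partition: str, next_offset: int) -> None:
    """Record the offset of the next message the group should read."""
    committed[(group, partition)] = next_offset


first = poll("group-1", "P3", partition_p3, 4)   # reads m0..m3
commit("group-1", "P3", 4)
# A restarted or replacement consumer resumes from the committed offset.
second = poll("group-1", "P3", partition_p3, 4)  # reads m4..m7
```

If a consumer crashes after reading but before committing, the next reader starts again from the last committed offset, so those messages are processed again rather than lost.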
Guarantees
Messages sent by a producer to a particular topic’s partition will be appended in the order they are sent
A consumer instance sees messages in the order they are stored in the log
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log
Ordered delivery
Messages are guaranteed to be delivered in order by partition, NOT topic
P0: M1 M3 M5
P1: M2 M4 M6
- M1 before M3 before M5 – YES
- M1 before M2 – NO
- M2 before M4 before M6 – YES
- M2 before M3 – NO
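The same example can be checked in a few lines. This sketch assumes the producer alternates messages between P0 and P1 as in the slide; `ordered_within_partition` is a hypothetical helper.

```python
from typing import Dict, List

# Messages in the order the producer sent them, with their target partition.
sent = [("M1", "P0"), ("M2", "P1"), ("M3", "P0"),
        ("M4", "P1"), ("M5", "P0"), ("M6", "P1")]

partitions: Dict[str, List[str]] = {"P0": [], "P1": []}
for message, partition in sent:
    partitions[partition].append(message)  # append preserves send order


def ordered_within_partition(first: str, second: str) -> bool:
    """True only if both messages share a partition and first precedes second."""
    for log in partitions.values():
        if first in log and second in log:
            return log.index(first) < log.index(second)
    return False  # different partitions: no ordering guarantee
```

A `False` here means "not guaranteed", not "never happens": a consumer may well see M1 before M2, but nothing in Kafka promises it.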
Enough ALT… now .NET
USING RDKAFKA-DOTNET
FIN. THANK YOU
Resources
- http://kafka.apache.org/documentation.html
- http://www.confluent.io/
- https://kafka.apache.org/090/configuration.html
- https://github.com/edenhill/librdkafka
- https://github.com/ah-/rdkafka-dotnet
Log compaction
- Keep the most recent payload for a key
- Use cases
  - Database change subscription
  - Event sourcing
  - Journaling for HA
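The idea of keeping the most recent payload per key can be sketched as follows. This only shows the end result of compaction, not how or when a broker performs it; the `compact` helper and the sample records are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str, str]  # (offset, key, payload)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, preserving offset order."""
    latest: Dict[str, Record] = {}
    for record in log:
        latest[record[1]] = record  # later records overwrite earlier ones
    return sorted(latest.values())


log: List[Record] = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
]
compacted = compact(log)
```

Surviving records keep their original offsets, so the compacted log has gaps (offset 0 is gone here) but still reads in order.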