Apache Kafka · cs237/project2020/Kafka.pdf · 2020-05-11

Apache Kafka

Yinhao He, Jiaqi Xiao, Ananth Gottumukkala

Publish/subscribe messaging pattern

● Publisher: classifies each message without knowing whether any subscribers exist
● Subscriber: subscribes to a class of messages without knowing whether any publishers exist
● Broker: decouples publishers from subscribers (similar to a bulletin board)
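The pattern can be illustrated with a minimal in-memory sketch. The `Broker` class and the topic names here are hypothetical and only model the decoupling; this is not Kafka's API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Hypothetical in-memory broker that routes messages by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # The subscriber registers interest in a topic, not in a publisher.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to whoever is subscribed right now; return the receiver count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


received: List[str] = []
broker = Broker()
broker.subscribe("alerts", received.append)

delivered = broker.publish("alerts", "disk almost full")
# Publishing to a topic nobody reads still succeeds: the publisher never
# needs to know whether subscribers exist.
undelivered = broker.publish("metrics", "cpu=0.7")
```

Neither side holds a reference to the other; both only know the broker and a topic name.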

What is Kafka?

● Open-source publish/subscribe messaging system
● Distributed event log (persisted on disk)
● Hybrid between a messaging system and a database
● High-throughput platform
● Real-time data streams
● Originally developed by LinkedIn; also used by Twitter and Netflix

Kafka structure

Message

● Single unit of data (a byte array)
● Batch
  ○ Collection of messages produced for the same topic and partition
  ○ Batch size is a trade-off between latency and throughput
  ○ Can be compressed
● Additional structure (a schema) can be imposed on the bytes
  ○ E.g. JSON, XML, Avro, or Protobuf
● Message ordering is not guaranteed across multiple partitions
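As a rough illustration of the message and batch ideas above, the sketch below encodes structured payloads as JSON byte arrays and packs several of them into one compressed batch. The length-prefixed layout and the helper names are assumptions made for this example; they are not Kafka's wire format.

```python
import gzip
import json
from typing import Dict, List


def encode_message(payload: Dict[str, object]) -> bytes:
    # To the broker a message is just bytes; JSON is structure the producer
    # and consumer agree on.
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def pack_batch(messages: List[bytes]) -> bytes:
    # Prefix each message with its length (4 bytes, big-endian), then
    # compress the whole batch in one go.
    body = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return gzip.compress(body)


def unpack_batch(blob: bytes) -> List[bytes]:
    body = gzip.decompress(blob)
    messages: List[bytes] = []
    pos = 0
    while pos < len(body):
        size = int.from_bytes(body[pos:pos + 4], "big")
        pos += 4
        messages.append(body[pos:pos + size])
        pos += size
    return messages


messages = [encode_message({"user": i, "event": "click"}) for i in range(3)]
blob = pack_batch(messages)
restored = unpack_batch(blob)
```

Larger batches amortize per-request overhead and compress better, but each message waits longer before it is sent, which is the latency/throughput trade-off noted above.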

Producer & Consumer

Producer

● Creates new messages and sends them to a specific topic

Consumer

● Reads messages
  ○ In order within each partition
● Offset
  ○ Assigned when a message is written to Kafka
  ○ The consumer remembers which offset it has reached in each partition
  ○ Consumed offsets are stored in ZooKeeper or in Kafka itself
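The offset mechanics can be sketched with a toy append-only log. `PartitionLog` and `Consumer` below are hypothetical classes written for illustration, not the Kafka client API; the point is that the offset is assigned on write and the read position is tracked per partition on the consumer side.

```python
from typing import Dict, List, Tuple


class PartitionLog:
    """Hypothetical append-only log; a record's offset is its position."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset assigned at write time

    def read(self, offset: int, max_records: int) -> List[Tuple[int, bytes]]:
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, record) for i, record in enumerate(chunk)]


class Consumer:
    """Hypothetical consumer that remembers its position in each partition."""

    def __init__(self, partitions: Dict[int, PartitionLog]) -> None:
        self._partitions = partitions
        self.positions: Dict[int, int] = {p: 0 for p in partitions}

    def poll(self, partition: int, max_records: int = 10) -> List[bytes]:
        batch = self._partitions[partition].read(self.positions[partition], max_records)
        if batch:
            self.positions[partition] = batch[-1][0] + 1
        return [record for _, record in batch]

    def seek(self, partition: int, offset: int) -> None:
        # Moving the position backwards re-reads records still in the log.
        self.positions[partition] = offset


log = PartitionLog()
offsets = [log.append(m) for m in (b"a", b"b", b"c")]

consumer = Consumer({0: log})
first = consumer.poll(0, max_records=2)
second = consumer.poll(0, max_records=2)
position_after_reads = consumer.positions[0]

consumer.seek(0, 0)
replayed = consumer.poll(0)
```

Because reading does not delete anything, a restarted consumer can resume from its stored offset, and the same records can be read again by seeking back.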

Consumer Group

● Each partition is consumed by only one member of a consumer group
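One way to picture this rule is a simple round-robin assignment. The function below is a simplified sketch with hypothetical names; Kafka's own assignment strategies are more involved.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Give every partition exactly one owner within the group (round-robin)."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


two_members = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
# With more members than partitions, the extra members receive nothing.
five_members = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5"])
```

A consequence of the rule is that the partition count caps the useful size of a group: with four partitions, a fifth consumer stays idle.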

Broker

● A Kafka cluster consists of multiple servers called brokers
● The controller broker is responsible for administrative operations
  ○ Assigns partitions to brokers
  ○ Monitors for broker failures
● Replication provides redundancy of the messages in a partition
  ○ Protects against broker failure
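The controller's two jobs can be sketched as follows, under simplifying assumptions: replicas are placed on consecutive brokers, and after a failure the first surviving replica becomes leader. Both functions are hypothetical and ignore details such as which replicas are fully caught up.

```python
from typing import Dict, List, Optional, Set


def place_replicas(num_partitions: int, brokers: List[int],
                   replication_factor: int) -> Dict[int, List[int]]:
    """Assign each partition to `replication_factor` distinct brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def elect_leaders(placement: Dict[int, List[int]],
                  alive: Set[int]) -> Dict[int, Optional[int]]:
    """Pick the first surviving replica of each partition as its leader."""
    leaders: Dict[int, Optional[int]] = {}
    for partition, replicas in placement.items():
        survivors = [b for b in replicas if b in alive]
        leaders[partition] = survivors[0] if survivors else None
    return leaders


placement = place_replicas(3, brokers=[1, 2, 3], replication_factor=2)
leaders_before = elect_leaders(placement, alive={1, 2, 3})
# Broker 1 fails: partition 0 fails over to its other replica on broker 2.
leaders_after = elect_leaders(placement, alive={2, 3})
```

With a replication factor of 2, every partition still has a leader after any single broker failure.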

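Retention

● Provides durable storage of messages for a configured period
● Limit by time (keep messages for a set duration)
● Limit by size (keep messages until the topic reaches a set number of bytes)
● Individual topics can also configure their own retention settings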
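The two limits can be sketched as a pruning function over a list of records ordered oldest first. `Record` and `apply_retention` are hypothetical names; the sketch works per record for clarity, whereas Kafka removes whole log segments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    timestamp: float  # seconds
    size_bytes: int


def apply_retention(records: List[Record], now: float,
                    max_age_s: Optional[float] = None,
                    max_bytes: Optional[int] = None) -> List[Record]:
    """Return the records that survive the time and/or size limit."""
    kept = list(records)
    if max_age_s is not None:
        kept = [r for r in kept if now - r.timestamp <= max_age_s]
    if max_bytes is not None:
        total = sum(r.size_bytes for r in kept)
        # Drop the oldest records until the log fits within the size limit.
        while kept and total > max_bytes:
            total -= kept.pop(0).size_bytes
    return kept


records = [Record(timestamp=t, size_bytes=100) for t in (0.0, 10.0, 20.0, 30.0)]
by_time = apply_retention(records, now=35.0, max_age_s=20.0)
by_size = apply_retention(records, now=35.0, max_bytes=300)
```

Passing different limits per topic corresponds to the per-topic retention settings mentioned above.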

Reliability Guarantees

● The order of messages within one partition is guaranteed
● Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds
● Consumers can only read committed messages
● At-least-once message delivery semantics
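At-least-once delivery follows from committing the offset only after a record has been processed. The simulation below is a sketch with hypothetical names (`consume_at_least_once`, `crash_before_commit_at`), not client code: a crash between processing and committing causes the record to be handled again after restart.

```python
from typing import Callable, List


def consume_at_least_once(log: List[str], committed: int,
                          handle: Callable[[str], None],
                          crash_before_commit_at: int = -1) -> int:
    """Process records from the committed offset; return the new committed offset."""
    offset = committed
    while offset < len(log):
        handle(log[offset])
        if offset == crash_before_commit_at:
            # Simulated crash: the record was processed but never committed.
            return committed
        committed = offset + 1  # commit only after successful processing
        offset += 1
    return committed


log = ["m0", "m1", "m2"]
seen: List[str] = []

committed = consume_at_least_once(log, 0, seen.append, crash_before_commit_at=1)
committed_after_restart = consume_at_least_once(log, committed, seen.append)
```

After the restart `seen` contains `m1` twice: nothing is lost, but duplicates are possible, so handlers should be idempotent.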

Advantages of Kafka

Deals with Integration Complexity

High Throughput and Fairly Low Latency

Handles Big Data

Many Configuration Options

Data Retention

Multiple Producers/Consumers

Disadvantages of Kafka

Steep Learning Curve

Not Low Enough Latency

Susceptible to Data Loss

● Split-brain
● Partition leader failover

Kafka vs JMS/ActiveMQ

Kafka                                    | JMS/ActiveMQ
Real-Time Data Stream                    | Traditional Messaging
Consumers Pull Messages from Brokers     | Messages Pushed to Consumers
Implements Backpressure                  | Hard to Achieve Backpressure
Data Retention to Disk                   | No Data Retention
Guarantees Message Ordering in Partition | No Ordering Guarantees
Can Rewind and Re-consume Data           | Consumer Does Not Track Offset

Kafka vs Kinesis

Kafka | Kinesis
Requires setting up your own cluster, nodes, replicas, partitions, etc. | AWS manages infrastructure, configuration, etc.
Flexible configuration, but producers (amount of data to send to the broker) and consumers (number of replicas, number of consumers per partition/topic) need tuning | Configuration is less flexible, but AWS ensures availability/durability for 7 days; the number of shards is configured for throughput
Higher maintenance/risk management cost | Pay-as-you-go, priced per number of shards

Thank you
