Introduction to Apache Kafka & LinkedIn Camus

Deep Shah
Software Engineer Intern
Twitter: @dsshah22
LinkedIn: https://www.linkedin.com/in/deepshah22

Copy of Kafka-Camus




A distributed system consists of multiple computers that communicate and coordinate their actions by passing messages. The components interact with each other to achieve a common goal. Apache Zookeeper provides the coordination services such systems commonly need:

● Configuration Management

○ Cluster member nodes bootstrap their configuration from a central source

● Distributed Cluster Management

○ Node Join/Leave

○ Node Status in real time

● Naming Service – e.g. DNS

● Distributed Synchronization – locks, barriers

● Leader election

● Centralized and highly reliable registry
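To make the membership and leader-election bullets concrete, here is a minimal in-memory sketch in Python. It does not talk to Zookeeper or use any client library: the `Registry` class and its methods are hypothetical names, and the rule "the member that joined earliest is the leader" is an assumed election policy chosen only for illustration.

```python
import itertools


class Registry:
    """Toy central registry that tracks live nodes and picks a leader.

    Hypothetical stand-in for a coordination service; nothing here
    contacts a real Zookeeper ensemble.
    """

    def __init__(self):
        self._seq = itertools.count(1)   # monotonically increasing join ids
        self._members = {}               # node name -> join sequence number

    def join(self, node):
        """Register a node and return its join sequence number."""
        self._members[node] = next(self._seq)
        return self._members[node]

    def leave(self, node):
        """Remove a node, for example after a crash or shutdown."""
        self._members.pop(node, None)

    def members(self):
        """Current cluster membership, in join order."""
        return sorted(self._members, key=self._members.get)

    def leader(self):
        """Assumed policy: the earliest-joined live member leads."""
        live = self.members()
        return live[0] if live else None


registry = Registry()
for name in ("node-a", "node-b", "node-c"):
    registry.join(name)

print(registry.leader())    # node-a joined first
registry.leave("node-a")    # the leader goes away
print(registry.leader())    # leadership passes to node-b
```

Keeping membership and election in one registry mirrors the idea of a single reliable source of truth that every node consults.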

Apache Zookeeper



Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

● Broker: A server in the Kafka cluster; the cluster consists of one or more brokers.

● Topics: The categories in which Kafka maintains its feeds of messages.

● Producers: The processes that publish messages to a topic.

● Consumers: The processes that subscribe to topics and fetch the published messages.

● Kafka uses Zookeeper to form and coordinate a cluster of nodes (producer/consumer/broker).
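The four roles above can be sketched with a toy, single-process broker. This is only an illustration of the vocabulary, not Kafka's API: the `Broker` class, its `publish` and `fetch` methods, and the topic name are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy single-node broker: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message and return its position."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def fetch(self, topic, start, max_messages=10):
        """Consumer side: read a range of messages starting at `start`."""
        return self._topics[topic][start:start + max_messages]


broker = Broker()

# Producer: publishes messages to a topic.
broker.publish("page-views", "user=1 url=/home")
broker.publish("page-views", "user=2 url=/jobs")

# Consumer: subscribes to the topic and fetches what was published.
position = 0
batch = broker.fetch("page-views", position)
position += len(batch)
print(batch, position)
```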


Apache Kafka










[Diagram: a Kafka cluster with Broker - 1, Broker - 2, and Broker - 3]

Apache Kafka


● A Kafka cluster consists of one or more servers, called brokers.







● Brokers, the servers that make up the Kafka cluster, receive messages from producers (push) and deliver messages to consumers (pull).



Apache Kafka





● A topic is a category or feed name to which messages are published. Each topic is divided into one or more partitions.

● Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.

● The messages in a partition are each assigned a sequential id number called the offset, which uniquely identifies each message within the partition. These offsets are kept with the messages in Kafka's data folder.

● The consumer controls its position in the log: normally a consumer advances its offset linearly as it reads messages, but it can move the offset and consume messages in any order it likes.

● The partitions of the log are distributed over the servers.

● Each server handles data and requests for a share of the partitions.

● Each partition is replicated across a configurable number of servers for fault tolerance.
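The relationship between a partition, its offsets, and a consumer's position can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions (one partition, no persistence, no replication); `Partition`, `Consumer`, `poll`, and `seek` are hypothetical names, not Kafka client calls.

```python
class Partition:
    """Toy partition: an append-only list whose indexes act as offsets."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        return self._log[offset]

    def end_offset(self):
        """Offset that the next appended message will receive."""
        return len(self._log)


class Consumer:
    """Consumer that owns its own position in one partition."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0

    def poll(self):
        """Return the next message, or None when caught up."""
        if self.offset >= self.partition.end_offset():
            return None
        message = self.partition.read(self.offset)
        self.offset += 1
        return message

    def seek(self, offset):
        """Move the read position; the log itself is not modified."""
        self.offset = offset


partition = Partition()
for text in ("m0", "m1", "m2"):
    partition.append(text)

consumer = Consumer(partition)
first_pass = [consumer.poll(), consumer.poll(), consumer.poll()]
consumer.seek(1)                 # rewind and re-read from offset 1
replayed = consumer.poll()
print(first_pass, replayed)      # ['m0', 'm1', 'm2'] m1
```

Because the position lives in the consumer rather than in the log, rewinding costs nothing on the broker side and other consumers are unaffected.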

Topics and Partitions


Apache Kafka


● Each partition has one server which acts as the leader and zero or more servers which act as followers.

● The leader handles all read and write requests for the partition, while followers passively replicate the leader.

● This replication preserves messages when the leader fails: one of the followers automatically becomes the new leader.

● Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
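The leader and follower roles can be illustrated with a small in-memory model. It is a sketch under loud assumptions: followers copy the leader synchronously on every write, and the first surviving replica is promoted on failure. The `ReplicatedPartition` class is hypothetical and not part of Kafka.

```python
class ReplicatedPartition:
    """Toy replicated partition: one leader, the rest followers."""

    def __init__(self, brokers):
        self.replicas = {broker: [] for broker in brokers}
        self.leader = brokers[0]

    def followers(self):
        return [b for b in self.replicas if b != self.leader]

    def write(self, message):
        """Writes go to the leader; followers then copy the leader's log."""
        self.replicas[self.leader].append(message)
        for follower in self.followers():
            # Assumption: synchronous, full-copy replication for simplicity.
            self.replicas[follower] = list(self.replicas[self.leader])

    def read(self):
        """Reads are served from the leader's copy."""
        return list(self.replicas[self.leader])

    def fail(self, broker):
        """Drop a broker; if it was the leader, promote a follower."""
        del self.replicas[broker]
        if broker == self.leader:
            if not self.replicas:
                raise RuntimeError("no replicas left")
            # Assumed policy: promote the first surviving replica.
            self.leader = next(iter(self.replicas))


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("a")
partition.write("b")
partition.fail("broker-1")                  # the leader goes down
print(partition.leader, partition.read())   # broker-2 ['a', 'b']
```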


Apache Kafka

Leader and Follower























[Diagram: three replicas of Partition - 1]


● Brokers receive messages from producers (push) and deliver messages to consumers (pull).

● Producers

○ Producers publish data to the topics of their choice.

○ The producer is responsible for choosing which message to assign to which partition within the topic.

○ Assignment can be round-robin or based on a simple partition function.

● Consumers

○ Consumers request a range of messages from a broker.

○ Messaging models: queuing and publish-subscribe.

○ Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group.

○ Consumer instances can be in separate processes or on separate machines.
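Two of the ideas above lend themselves to a short Python sketch: how a producer might pick a partition, and how a message reaches one consumer instance per group. All function names are hypothetical. The CRC32-based key hash is an assumed partition function, and rotating instances per message is a simplification (Kafka assigns whole partitions to consumer instances).

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partition ids."""
    cycle = itertools.cycle(range(num_partitions))
    return lambda key=None: next(cycle)


def key_partitioner(num_partitions):
    """Return a function that maps equal keys to the same partition."""
    def choose(key):
        # crc32 is stable across runs, unlike Python's built-in hash().
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return choose


def deliver(message_index, groups):
    """Pick one consumer instance per subscribing group for a message.

    `groups` maps a group name to its list of consumer instances.
    """
    return {
        group: members[message_index % len(members)]
        for group, members in groups.items()
    }


rr = round_robin_partitioner(3)
print([rr() for _ in range(5)])                 # [0, 1, 2, 0, 1]

by_key = key_partitioner(3)
print(by_key("user-42") == by_key("user-42"))   # True

groups = {"analytics": ["a1", "a2"], "archiver": ["b1"]}
print(deliver(0, groups))   # {'analytics': 'a1', 'archiver': 'b1'}
print(deliver(1, groups))   # {'analytics': 'a2', 'archiver': 'b1'}
```

With a single group, instances share the messages, which behaves like a queue. With several groups, every group sees every message, which behaves like publish-subscribe.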

Producers and Consumers


Apache Kafka


Offset Management


● All consumer offset commit requests are sent as produce requests to a special topic named “__offsets” (released Kafka versions name it “__consumer_offsets”), referred to as the “offsets topic” from here on.

● The offset commit messages are partitioned based on the consumer group in the key. As a result, all the messages of a given consumer group end up on a single broker, which allows offset fetch requests to be served without a scatter-gather across several brokers.
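A minimal sketch of why keying by consumer group keeps a group's commits together. The partition count of 50 and the CRC32 hash are assumptions made for illustration, and `offsets_partition` and `commit_message` are hypothetical helpers rather than Kafka functions.

```python
import zlib

OFFSETS_TOPIC_PARTITIONS = 50   # assumed value for illustration


def offsets_partition(group):
    """Choose the offsets-topic partition from the consumer group only."""
    return zlib.crc32(group.encode("utf-8")) % OFFSETS_TOPIC_PARTITIONS


def commit_message(group, topic, partition, offset):
    """Build a (partition, key, value) record for the offsets topic."""
    key = (group, topic, partition)
    return offsets_partition(group), key, offset


commits = [
    commit_message("camus", "page-views", 0, 120),
    commit_message("camus", "page-views", 1, 98),
    commit_message("camus", "clicks", 0, 7),
]
partitions_used = {p for p, _, _ in commits}
print(len(partitions_used))   # 1: every commit of the group shares a partition
```

Since a partition has a single leader, one broker can answer any offset fetch for that group.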

Apache Kafka


● Camus is LinkedIn's Kafka --> HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka.

● A single execution of Camus consists of three stages:

a. The setup stage fetches available topics and partitions from Zookeeper and the latest offsets from the Kafka nodes.

b. The Hadoop job stage allocates topic pulls among a set number of tasks.

c. The cleanup stage reads counts from all the tasks, aggregates the values, and submits the results to Kafka for consumption.




LinkedIn Camus


● The setup stage fetches Kafka broker URLs and topics from Zookeeper (in /brokers/id and /brokers/topics). This data is transient and is gone once the Kafka server is down.

● Topic offsets are stored in HDFS. Camus maintains its own state by storing the offset for each topic in HDFS. This data is persistent.

● Setup stage allocates all topics and partitions among a fixed number of tasks.
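The allocation step can be pictured as dealing pull requests out to a fixed number of tasks. The sketch below uses plain round-robin, which is an assumption: the slides do not say which strategy Camus uses, and the `allocate` function and the sample offsets are made up for illustration.

```python
def allocate(requests, num_tasks):
    """Spread pull requests over a fixed number of tasks, round-robin."""
    tasks = [[] for _ in range(num_tasks)]
    for index, request in enumerate(sorted(requests)):
        tasks[index % num_tasks].append(request)
    return tasks


# (topic, partition, starting offset); made-up values for illustration.
requests = [
    ("clicks", 0, 7),
    ("clicks", 1, 0),
    ("page-views", 0, 120),
    ("page-views", 1, 98),
    ("page-views", 2, 101),
]

for task_id, assigned in enumerate(allocate(requests, 2)):
    print(task_id, assigned)
```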

1. Setup Stage


LinkedIn Camus


I. Pulling the Data

Each Hadoop task uses a list of topic partitions with offsets generated by the setup stage as input. It uses them to initialize Kafka requests and fetch events from Kafka brokers. Each task generates four types of output (by using a custom MultipleOutputFormat): data files, count statistics files, updated offset files, and error files.

II. Committing the Data

Once a task has successfully completed, all topics pulled are committed to their final output directories. If a task doesn't complete successfully, then none of the output is committed. When a task appears to be running slowly, the job tracker schedules a speculative copy of the task on a different node and runs both the main task and the speculative task in parallel. Once one of the tasks completes, the other task is killed.
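The all-or-nothing commit can be sketched with a staging directory: a task writes there first, and its output is moved to the final directory only on success. This is an illustration, not Camus code. Local directories stand in for HDFS, and `run_task` and its `fail` flag are hypothetical.

```python
import os
import shutil
import tempfile


def run_task(staging_root, final_dir, task_id, records, fail=False):
    """Write output to a private work dir; publish it only on success."""
    work_dir = tempfile.mkdtemp(prefix=f"task-{task_id}-", dir=staging_root)
    try:
        path = os.path.join(work_dir, "data.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(records))
        if fail:
            raise RuntimeError(f"task {task_id} failed")
        # Commit: move the finished file into the final directory.
        os.replace(path, os.path.join(final_dir, f"part-{task_id}.txt"))
        return True
    except RuntimeError:
        return False
    finally:
        # Uncommitted output is discarded along with the work dir.
        shutil.rmtree(work_dir, ignore_errors=True)


root = tempfile.mkdtemp()
staging = os.path.join(root, "staging")
final = os.path.join(root, "final")
os.makedirs(staging)
os.makedirs(final)

run_task(staging, final, 0, ["a", "b"])
run_task(staging, final, 1, ["c"], fail=True)
print(sorted(os.listdir(final)))   # ['part-0.txt']
```

The same pattern makes speculative execution safe: whichever copy of a task finishes first commits, and the killed copy leaves nothing in the final directory.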

III. Producing Audit Counts

Successful tasks also write audit counts to HDFS.

IV. Storing the Offsets

Final offsets are written to HDFS and consumed by the subsequent job.

2. Hadoop Job


LinkedIn Camus


● Once the Hadoop job has completed, the main client reads all the written audit counts and aggregates them. The aggregated results are then submitted to Kafka.
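The aggregation itself amounts to summing per-task counts. A minimal sketch follows, assuming each task reports a mapping from topic to event count. The real audit file format is not described in these slides, so the data shape and the `aggregate` function are hypothetical.

```python
from collections import Counter


def aggregate(task_counts):
    """Sum the per-topic event counts reported by each task."""
    totals = Counter()
    for counts in task_counts:
        totals.update(counts)   # Counter.update adds counts together
    return dict(totals)


# Made-up audit counts from two tasks.
task_counts = [
    {"page-views": 120, "clicks": 7},
    {"page-views": 98},
]
print(aggregate(task_counts))   # {'page-views': 218, 'clicks': 7}
```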

3. Job Cleanup


LinkedIn Camus

