SPRINGONE2GX WASHINGTON, DC Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Developing Real-Time Data Pipelines with Apache Kafka Joe Stein @allthingshadoop


Page 1: Developing Real-Time Data Pipelines with Apache Kafka

SPRINGONE2GX WASHINGTON, DC


Developing Real-Time Data Pipelines with Apache Kafka

Joe Stein @allthingshadoop

Page 2: Developing Real-Time Data Pipelines with Apache Kafka


whoami

CEO of Elodina (http://www.elodina.net/), a big data as a service platform built on top of open source software. The Elodina platform enables customers to analyze data streams and programmatically react to the results in real time. We solve today’s data analytics needs by providing the tools and support necessary to utilize open source technologies. As users, contributors and committers, Elodina also provides support for frameworks that run on Mesos, including Apache Kafka, Exhibitor (Zookeeper), Apache Storm, Apache Cassandra and a whole lot more!

Apache Kafka Committer & PMC Member
LinkedIn: http://linkedin.com/in/charmalloc
Twitter: @allthingshadoop

Page 3: Developing Real-Time Data Pipelines with Apache Kafka

Contents

•  Introduction To Kafka
    •  Overview
    •  Topics, Partitions & Segments
    •  Data Durability
    •  Replication
    •  Producers
    •  Consumers
    •  Performance
    •  Integration
    •  Quick Start
    •  Operations
•  Designs
    •  Distributed RPC
        o  Request
        o  Process
        o  Response
    •  Storage & Analytics
        o  Stream
        o  Transform
        o  Analyze
        o  Store
        o  Search

Page 4: Developing Real-Time Data Pipelines with Apache Kafka


Apache Kafka

Page 5: Developing Real-Time Data Pipelines with Apache Kafka

Apache Kafka

Apache Kafka was first open sourced by LinkedIn in 2011

Papers
●  Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
●  Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
●  Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
●  The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

http://kafka.apache.org/

Page 6: Developing Real-Time Data Pipelines with Apache Kafka

How Big Data Usually Starts

Page 7: Developing Real-Time Data Pipelines with Apache Kafka

More Big Data!

Page 8: Developing Real-Time Data Pipelines with Apache Kafka

Ah!

Page 9: Developing Real-Time Data Pipelines with Apache Kafka

eesh

Page 10: Developing Real-Time Data Pipelines with Apache Kafka

Kafka de-couples data pipelines

Page 11: Developing Real-Time Data Pipelines with Apache Kafka

Distributed Replicated Log

Read and write
In real time
As much as you want
As fast as your network can go

Page 12: Developing Real-Time Data Pipelines with Apache Kafka

Topics and Partitions

Page 13: Developing Real-Time Data Pipelines with Apache Kafka

Log Segments

Page 14: Developing Real-Time Data Pipelines with Apache Kafka

Distributed Replicated Log

Page 15: Developing Real-Time Data Pipelines with Apache Kafka

Data Durability

Page 16: Developing Real-Time Data Pipelines with Apache Kafka

Replication

Page 17: Developing Real-Time Data Pipelines with Apache Kafka

Producers

Page 18: Developing Real-Time Data Pipelines with Apache Kafka

Consumers

Page 19: Developing Real-Time Data Pipelines with Apache Kafka

Consumer Failover

Page 20: Developing Real-Time Data Pipelines with Apache Kafka

Producer Performance


https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Page 21: Developing Real-Time Data Pipelines with Apache Kafka


Consumer Performance

http://kafka.apache.org/documentation.html#maximizingefficiency

Page 22: Developing Real-Time Data Pipelines with Apache Kafka

Client Libraries

Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
●  Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
●  Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
●  C - High performance C library with full protocol support
●  Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
●  Clojure - Clojure DSL for the Kafka API
●  JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation

Wire Protocol Developer's Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

Page 23: Developing Real-Time Data Pipelines with Apache Kafka

Spring Integration

Good blog about it
https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data-processing-pipelines-with-spring

Kafka Integration Source
https://github.com/spring-projects/spring-integration-kafka

Spring XD samples
https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source

Page 24: Developing Real-Time Data Pipelines with Apache Kafka

Quick Start

https://kafka.apache.org/documentation.html#quickstart

Download the 0.8.2.2 release and un-tar it.
> tar -xzf kafka_2.10-0.8.2.2.tgz
> cd kafka_2.10-0.8.2.2

(use at least four terminal windows)
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
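What the console producer and consumer do in the quick start can be pictured with a toy in-memory model. The sketch below is not a Kafka client and talks to no broker; the `Topic` and `Consumer` classes and the `from_beginning` parameter are hypothetical names made up for illustration. It assumes only that a partition behaves like an append-only list where a message's index is its offset, and that each consumer keeps its own read position.

```python
class Topic:
    """Toy in-memory stand-in for a topic (hypothetical, not a Kafka API)."""

    def __init__(self, name, partitions=1):
        self.name = name
        # Each partition is an append-only list; a message's index is its offset.
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, message, partition=0):
        """Append a message and return the offset it was written at."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def end_offset(self, partition=0):
        """Offset that the next produced message would receive."""
        return len(self.partitions[partition])


class Consumer:
    """Keeps its own read position per partition, independent of other consumers."""

    def __init__(self, topic, from_beginning=False):
        self.topic = topic
        # Start at offset 0, or at the current end of each partition.
        self.positions = [
            0 if from_beginning else topic.end_offset(p)
            for p in range(len(topic.partitions))
        ]

    def poll(self, partition=0):
        """Return the messages not yet read and advance the read position."""
        log = self.topic.partitions[partition]
        start = self.positions[partition]
        self.positions[partition] = len(log)
        return log[start:]


# Mirror the quick start: one topic "test" with one partition, two messages.
test = Topic("test", partitions=1)
test.produce("This is a message")
test.produce("This is another message")

replay = Consumer(test, from_beginning=True)  # like --from-beginning
latest = Consumer(test)                       # starts at the end of the log

replayed = replay.poll()
tail = latest.poll()
print(replayed)
print(tail)
```

In this model the replaying consumer gets both messages, while the consumer that started at the end sees nothing until a new message is produced. Reading never removes anything from the log, which is why the two consumers do not interfere with each other.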

Page 25: Developing Real-Time Data Pipelines with Apache Kafka

Operationalizing Kafka

https://kafka.apache.org/documentation.html#basic_ops

Basic Kafka Operations

●  Adding and removing topics

●  Modifying topics

●  Graceful shutdown

●  Balancing leadership

●  Checking consumer position

●  Mirroring data between clusters

●  Expanding your cluster

●  Decommissioning brokers

●  Increasing replication factor

Page 26: Developing Real-Time Data Pipelines with Apache Kafka

Running on Mesos

Page 27: Developing Real-Time Data Pipelines with Apache Kafka

Static Partitioning

Page 28: Developing Real-Time Data Pipelines with Apache Kafka

Scaling is manual (even if orchestrated)

Page 29: Developing Real-Time Data Pipelines with Apache Kafka

Static failures require manual intervention

Page 30: Developing Real-Time Data Pipelines with Apache Kafka

Application Elasticity

Page 31: Developing Real-Time Data Pipelines with Apache Kafka

An operating system for your data center

Page 32: Developing Real-Time Data Pipelines with Apache Kafka

Everything goes on Mesos

Page 33: Developing Real-Time Data Pipelines with Apache Kafka

Kafka on Mesos

https://github.com/mesos/kafka

●  smart broker.id assignment.
●  preservation of broker placement (through constraints and/or new features).
●  ability to make configuration changes.
●  rolling restarts (for things like configuration changes).
●  scaling the cluster up and down with automatic, programmatic and manual options.
●  smart partition assignment via constraints vis-à-vis roles, resources and attributes.

Page 34: Developing Real-Time Data Pipelines with Apache Kafka

Kafka on Mesos

Scheduler
●  Provides the operational automation for a Kafka Cluster.
●  Manages the changes to the broker's configuration.
●  Exposes a REST API for the CLI to use or any other client.
●  Runs on Marathon for high availability.

Executor
●  The executor interacts with the Kafka broker as an intermediary to the scheduler.

Page 35: Developing Real-Time Data Pipelines with Apache Kafka

REST API & CLI

●  scheduler - starts the scheduler.
●  add - adds one or more brokers to the cluster.
●  update - changes resources, constraints or broker properties for one or more brokers.
●  remove - takes a broker out of the cluster.
●  start - starts a broker up.
●  stop - either shuts a broker down gracefully or force kills it (./kafka-mesos.sh help stop).
●  rebalance - rebalances a cluster, selecting either the brokers or the topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic.
●  help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}

Page 36: Developing Real-Time Data Pipelines with Apache Kafka

Launch 20 brokers in seconds

Page 37: Developing Real-Time Data Pipelines with Apache Kafka

Kafka 0.9

KIP (Kafka Improvement Proposals)
•  https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

New Consumer
•  https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

Security
•  https://cwiki.apache.org/confluence/display/KAFKA/Security
•  https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888

JIRA
•  https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

Page 38: Developing Real-Time Data Pipelines with Apache Kafka

Distributed RPC

Page 39: Developing Real-Time Data Pipelines with Apache Kafka

Reference Architecture

Page 40: Developing Real-Time Data Pipelines with Apache Kafka

Questions?

http://www.elodina.net
