Real time data processing with Kafka Spark integration

© Copyright 2014 EMC Corporation. All rights reserved.

Real Time Data Streaming

Speakers:

Sumit Gupta, Data Intelligence Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC

Data Engineering at EMC IT

Stack
Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: RabbitMQ, Apache Kafka
Relational Store: Greenplum

A glimpse of what we do

Predictive Maintenance of Exchange Servers - Monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster, processing data volumes of ~100 MB every 2 minutes

User Behavior Analytics for Network Threat Detection - Real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150 MB of data at any point in time

Predictive Maintenance of Exchange Servers

User Behavior Analytics for Network Threat Detection

Apache Kafka

Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs and metrics collections
• Written in Scala
• Does not follow the JMS standard and does not use JMS APIs

Features
• Persistent messaging
• High throughput
• Supports both queue and topic semantics
• Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker)
• and many more…

http://kafka.apache.org/

How it works

Real time transfer
The broker does not push messages to the consumer; the consumer polls messages from the broker.

Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
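
The pull model and the partitioned log can be illustrated with a small in-memory toy. This is only a sketch and not Kafka's implementation or client API: `ToyTopic`, `ToyConsumer`, `append`, and `poll` are hypothetical names, and the byte-sum partitioner is an assumption chosen to keep the example deterministic.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for one topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Pick a partition from the key, then append to the end of that log.
        # A simple byte sum is a deterministic stand-in for a real partitioner.
        partition = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def poll(self, partition, offset, max_messages=10):
        # The broker side never pushes: it only answers
        # "give me messages starting at this offset".
        return self.partitions[partition][offset:offset + max_messages]


class ToyConsumer:
    """Tracks its own read position per partition and pulls when it chooses to."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition):
        messages = self.topic.poll(partition, self.offsets[partition])
        self.offsets[partition] += len(messages)
        return messages


topic = ToyTopic(num_partitions=2)
topic.append("a", "This is a message")        # key "a" maps to partition 1 here
topic.append("a", "This is another message")

consumer = ToyConsumer(topic)
first = consumer.poll(1)    # both messages, in append order
second = consumer.poll(1)   # nothing new since the last poll
```

Because the consumer owns its offset, it can read at its own pace or re-read older messages by resetting that offset, without any extra bookkeeping in the toy broker.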

Kafka Installation

Download
http://kafka.apache.org/downloads.html

Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>

Start Servers

Start the Zookeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties

Pre-requisite: Zookeeper should be up and running.

Now start the Kafka server
> bin/kafka-server-start.sh config/server.properties

Create/List Topics

Create a topic
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

List all topics
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Output: test

Producer

Send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Now type on the console:
This is a message
This is another message

Consumer

Receive some messages
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

Multi-Broker Cluster

Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties

Changes in the config files

config/server-1.properties:
    broker.id=1
    port=9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    port=9094
    log.dir=/tmp/kafka-logs-2
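
The per-broker changes follow a simple pattern: a unique broker id, a distinct port, and a separate log directory for each broker on the same host. As a rough sketch, the same edits could be scripted in Python. The helper names `broker_overrides` and `apply_overrides` and the three-line `base` text are hypothetical and made up for illustration; a real `server.properties` contains more settings.

```python
def broker_overrides(broker_id, base_port=9092, log_root="/tmp/kafka-logs"):
    """Return the three per-broker settings for one extra broker on the same host.

    base_port and log_root are illustrative defaults mirroring the values above.
    """
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dir": "%s-%d" % (log_root, broker_id),
    }


def apply_overrides(base_text, overrides):
    """Rewrite matching key=value lines of a properties text and append missing keys."""
    remaining = dict(overrides)
    out = []
    for line in base_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append("%s=%s" % (key, remaining.pop(key)))
        else:
            out.append(line)
    # Keys that were absent from the base text are appended in sorted order
    for key in sorted(remaining):
        out.append("%s=%s" % (key, remaining[key]))
    return "\n".join(out) + "\n"


# Made-up sample standing in for the copied config file
base = "# sample base config\nbroker.id=0\nport=9092\nlog.dir=/tmp/kafka-logs\n"
server_1 = apply_overrides(base, broker_overrides(1))
server_2 = apply_overrides(base, broker_overrides(2))
```

Writing `server_1` and `server_2` to `config/server-1.properties` and `config/server-2.properties` would then replace the manual edits.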

Start with New Nodes

Start the other nodes with the new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &

Create a new topic with a replication factor of 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Describe the new topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
    Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0

Spark Streaming
Makes it easy to build scalable fault-tolerant streaming applications.

• Ease of Use
• Fault Tolerance
• Combine streaming with batch and interactive queries.

Spark Streaming Programming Model

Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which
- represents a stream of data
- is implemented as a sequence of RDDs
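
The idea of a stream represented as a sequence of batches can be sketched in plain Python, without Spark. This is only an illustration of the concept, not how Spark implements DStreams: `discretize` is a hypothetical helper, the records are made-up `(arrival_time_seconds, value)` pairs, and timestamps are assumed to be non-negative.

```python
def discretize(records, batch_interval):
    """Group (timestamp_seconds, value) records into consecutive fixed-width batches.

    Batch i holds the values with
    i * batch_interval <= timestamp < (i + 1) * batch_interval.
    In this toy, an interval with no records still yields an empty batch.
    """
    if not records:
        return []
    last_index = int(max(t for t, _ in records) // batch_interval)
    batches = [[] for _ in range(last_index + 1)]
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    return batches


# Hypothetical input: arrival time in seconds, then the payload
records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(records, batch_interval=1)

# Each batch can now be processed with ordinary batch logic, one after another
sizes = [len(batch) for batch in batches]
```

Each inner list plays the role of one RDD in the sequence, so any per-batch computation applied to every list in turn behaves like a streaming computation with a fixed batch interval.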

Spark Streaming + Kafka

There are two approaches to receiving data from Kafka in Spark Streaming:

• Receiver-based approach
• Direct approach

# Import StreamingContext and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

# Create a Kafka stream by passing the Zookeeper address, a consumer group id,
# and a map of {topic name: number of partitions to consume}
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"sparkStream": 1})

# lines DStream from the Kafka stream: each record is a (key, message) pair
lines = kvs.map(lambda x: x[1])

# counts DStream from the lines DStream
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

counts.pprint()

ssc.start()
ssc.awaitTermination()
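
To make the `flatMap`, `map`, and `reduceByKey` steps concrete, the following plain-Python sketch computes the same word counts for a single batch of lines, without Spark. The function name `word_count` and the sample batch are assumptions for illustration; Spark distributes this work across a cluster, which the sketch does not attempt to model.

```python
from functools import reduce
from itertools import groupby


def word_count(batch_lines):
    """Plain-Python equivalent of flatMap -> map -> reduceByKey for one batch."""
    # flatMap: one line becomes many words
    words = [word for line in batch_lines for word in line.split(" ")]
    # map: each word becomes a (word, 1) pair
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of equal keys with a + b
    pairs.sort(key=lambda pair: pair[0])
    return {
        key: reduce(lambda a, b: a + b, (count for _, count in group))
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    }


# Sample batch reusing the messages typed into the console producer
batch = ["This is a message", "This is another message"]
counts = word_count(batch)
```

In the streaming job, the same computation runs once per batch interval, and `pprint()` prints the first few `(word, count)` pairs of each batch.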

from pyspark.streaming.kafka import KafkaUtils

# ssc, topic and brokers are assumed to be defined earlier
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the Kafka offset ranges covered by this batch
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)
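
In the direct approach, each batch covers an explicit range of offsets per partition, from `fromOffset` up to but not including `untilOffset`. The sketch below shows how such ranges could be planned from one batch to the next. It is a simplified model rather than Spark's code: the `OffsetRange` namedtuple is a stand-in that only mirrors the field names printed above, and `next_offset_ranges` and the offset values are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same field names as the objects printed above; not Spark's class
OffsetRange = namedtuple("OffsetRange", ["topic", "partition", "fromOffset", "untilOffset"])


def next_offset_ranges(topic, consumed, latest):
    """Plan one batch: per partition, read from the consumed offset up to the latest.

    consumed and latest map partition -> offset; untilOffset is exclusive.
    """
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        ranges.append(OffsetRange(topic, partition, start, latest[partition]))
    return ranges


consumed = {0: 5, 1: 0}   # hypothetical positions after the previous batch
latest = {0: 8, 1: 2}     # hypothetical log-end offsets seen at batch time
ranges = next_offset_ranges("test", consumed, latest)

# After the batch succeeds, the end of each range becomes the next starting point
consumed = {r.partition: r.untilOffset for r in ranges}
message_count = sum(r.untilOffset - r.fromOffset for r in ranges)
```

Keeping the ranges explicit is what lets an application store them alongside its results and resume from a known position after a failure.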

Thank You