Real time data processing with Kafka Spark integration

© Copyright 2014 EMC Corporation. All rights reserved.

Real Time Data Streaming


Speakers:

Sumit Gupta, Data Intelligence Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC


Data Engineering at EMC IT

Stack

Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: RabbitMQ, Apache Kafka
Relational Store: Greenplum

A glimpse of what we do

Predictive Maintenance of Exchange Servers: monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster and processing data volumes of ~100 MB every 2 minutes.

User Behavior Analytics for Network Threat Detection: real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150 MB of data at any point in time.


Predictive Maintenance of Exchange Servers


User Behavior Analytics for Network Threat Detection


Apache Kafka


Overview

An Apache project initially developed at LinkedIn

Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs and metrics collections
• Written in Scala
• Does not follow the JMS standard and does not use JMS APIs

Features
• Persistent messaging
• High throughput
• Supports both queue and topic semantics
• Uses Zookeeper for forming a cluster of nodes (producer/consumer/broker)
• and many more…
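The "queue and topic semantics" feature can be illustrated with a toy in-memory model. In Kafka this is achieved with consumer groups: each message on a topic reaches one consumer within each group, so a single group behaves like a queue and several groups behave like publish-subscribe. The sketch below is not Kafka client code. The function `deliver` and the group and consumer names are hypothetical, and the per-message round-robin assignment is a simplification (Kafka assigns whole partitions to consumers).

```python
from collections import defaultdict
from itertools import cycle


def deliver(messages, groups):
    """Deliver every message to exactly one consumer in each group.

    groups maps a group name to a list of consumer names (hypothetical).
    Round-robin per message is a simplification of Kafka's assignment.
    """
    received = defaultdict(list)
    pickers = {name: cycle(members) for name, members in groups.items()}
    for message in messages:
        for name in groups:
            consumer = next(pickers[name])
            received[(name, consumer)].append(message)
    return dict(received)


# Two consumers share the "billing" group (queue-like load sharing);
# the "audit" group has one consumer and sees every message (topic-like).
groups = {"billing": ["b1", "b2"], "audit": ["a1"]}
result = deliver(["m0", "m1", "m2", "m3"], groups)
```

Within "billing" the messages are split between `b1` and `b2`, while "audit" independently receives all of them.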

http://kafka.apache.org/



How it works


Real-time transfer

The broker does not push messages to the consumer; the consumer polls messages from the broker.
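The pull model can be sketched with two toy classes. This is an illustration only, not the Kafka client API: `Broker`, `Consumer`, `fetch` and `poll` are hypothetical names, and the broker here holds a single in-memory log.

```python
class Broker:
    """Toy broker: an append-only list of messages for a single topic."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)

    def fetch(self, offset, max_messages):
        # The broker only answers fetch requests; it never pushes.
        return self.log[offset:offset + max_messages]


class Consumer:
    """Toy consumer that tracks its own read position (offset)."""

    def __init__(self, broker):
        self.broker = broker
        self.offset = 0

    def poll(self, max_messages=2):
        batch = self.broker.fetch(self.offset, max_messages)
        self.offset += len(batch)  # advance only past what was received
        return batch


broker = Broker()
for text in ["This is a message", "This is another message", "A third message"]:
    broker.append(text)

consumer = Consumer(broker)
first = consumer.poll()
second = consumer.poll()
third = consumer.poll()  # nothing new yet, so the poll returns an empty batch
```

Because the consumer owns its offset, it controls the pace of consumption and can re-read older messages by resetting that offset.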


Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
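A partitioned log can be pictured as a set of append-only lists, one per partition, where each message gets a sequential offset within its partition. The sketch below is a simplified model under stated assumptions: the `Topic` class is hypothetical, and CRC32 is used only as a stable stand-in hash, not as Kafka's actual partitioner.

```python
import zlib


class Topic:
    """Toy topic: one append-only log per partition."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append value to the partition chosen by key; return (partition, offset)."""
        # CRC32 is a stand-in hash chosen because it is stable across runs.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1


topic = Topic("test", num_partitions=3)
p1, o1 = topic.append("server-17", "cpu=0.42")
p2, o2 = topic.append("server-17", "cpu=0.57")
```

Messages with the same key land in the same partition, so their relative order is preserved there; ordering across partitions is not guaranteed.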


Kafka Installation

Download
http://kafka.apache.org/downloads.html

Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>



Start Servers

Start the Zookeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties

Now start the Kafka server (prerequisite: Zookeeper should be up and running)
> bin/kafka-server-start.sh config/server.properties


Create/List Topics

Create a topic
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

List all topics
> bin/kafka-topics.sh --list --zookeeper localhost:2181

Output:
test


Producer

Send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Now type on the console:
This is a message
This is another message


Consumer

Receive some messages
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message


Multi-Broker Cluster

Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties

Changes in the config files

config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1

config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
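The per-broker edits follow a simple pattern: the broker id, the port and the log directory each get a distinct value. As a sketch of how the copy-and-edit step could be scripted, the helpers below derive those three overrides and apply them to properties text. The function names and the sample `base` text are hypothetical; only the three keys and their values come from the slide.

```python
def broker_overrides(broker_id, base_port=9092, log_root="/tmp/kafka-logs"):
    """Settings that must differ per broker, following the slide's pattern."""
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dir": "{}-{}".format(log_root, broker_id),
    }


def apply_overrides(properties_text, overrides):
    """Return properties text with matching keys replaced, missing keys appended."""
    remaining = dict(overrides)
    lines = []
    for line in properties_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if not line.lstrip().startswith("#") and key in remaining:
            lines.append("{}={}".format(key, remaining.pop(key)))
        else:
            lines.append(line)  # comments and unrelated settings stay untouched
    for key, value in remaining.items():
        lines.append("{}={}".format(key, value))
    return "\n".join(lines) + "\n"


# Made-up stand-in for the contents of config/server.properties.
base = "# base config\nbroker.id=0\nport=9092\nlog.dir=/tmp/kafka-logs\n"
server_1 = apply_overrides(base, broker_overrides(1))
server_2 = apply_overrides(base, broker_overrides(2))
```

Writing `server_1` and `server_2` to `config/server-1.properties` and `config/server-2.properties` would reproduce the manual edits listed above.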


Start with New Nodes

Start the other nodes with the new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &

Create a new topic with a replication factor of 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Describe the new topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
    Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
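In the describe output, Leader is the broker that handles reads and writes for the partition, Replicas lists the brokers holding a copy, and Isr is the subset of replicas currently in sync. A small parser makes the per-partition line easier to check programmatically. This is a sketch that assumes the line format shown on the slide (the format may differ in other Kafka versions); both function names are hypothetical.

```python
def parse_partition_line(line):
    """Parse one 'Topic: ... Partition: ... Leader: ...' line into a dict."""
    tokens = line.split()
    # Tokens alternate between 'Label:' and its value.
    raw = dict(zip(tokens[0::2], tokens[1::2]))
    return {
        "topic": raw["Topic:"],
        "partition": int(raw["Partition:"]),
        "leader": int(raw["Leader:"]),
        "replicas": [int(b) for b in raw["Replicas:"].split(",")],
        "isr": [int(b) for b in raw["Isr:"].split(",")],
    }


def out_of_sync(info):
    """Broker ids that hold a replica but are not in the in-sync set."""
    return sorted(set(info["replicas"]) - set(info["isr"]))


line = "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0"
info = parse_partition_line(line)
lagging = out_of_sync(info)
```

For the output shown on the slide every replica is in sync, so `lagging` is empty; a broker that falls behind or goes down would drop out of the Isr list and appear in it.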


Spark Streaming

Makes it easy to build scalable fault-tolerant streaming applications.

• Ease of use
• Fault tolerance
• Combine streaming with batch and interactive queries




Spark Streaming Programming Model

Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which
- represents a stream of data
- is implemented as a sequence of RDDs
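The idea of a discretized stream can be shown without Spark: incoming events are cut into consecutive batches by a fixed batch interval, and each batch plays the role of one RDD in the sequence. The function below is a plain-Python illustration, not Spark code; `discretize` and the sample events are hypothetical.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Returns a list of batches; an interval with no events yields an empty batch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    last = max(batches)
    return [batches.get(index, []) for index in range(last + 1)]


# Timestamps in seconds; a batch interval of 1 second.
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (3.5, "d")]
stream = discretize(events, batch_interval=1)
```

Each element of `stream` corresponds to one interval, and transformations on a DStream are applied batch by batch in the same way.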



Spark Streaming + Kafka

There are two approaches to receiving data from Kafka in Spark Streaming:

• Receiver-based approach
• Direct approach



# Receiver-based approach: import StreamingContext and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

# Create the Kafka stream by passing the Zookeeper address, the consumer
# group id, and a map of topic name to number of partitions to consume
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"sparkStream": 1})

# lines DStream from the Kafka stream: keep the value of each (key, value) pair
lines = kvs.map(lambda x: x[1])

# counts DStream from the lines DStream
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

counts.pprint()

ssc.start()
ssc.awaitTermination()
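To see what the word-count pipeline computes for a single batch, the same steps can be traced in plain Python. This sketch does not use Spark; the sample batch reuses the two console messages from the producer slide, and the `None` keys are an assumption about messages sent without a key.

```python
def word_count(batch):
    """Apply the map / flatMap / map / reduceByKey steps to one batch."""
    lines = [record[1] for record in batch]                    # kvs.map(lambda x: x[1])
    words = [word for line in lines for word in line.split(" ")]  # flatMap
    pairs = [(word, 1) for word in words]                      # map to (word, 1)
    counts = {}
    for word, one in pairs:                                    # reduceByKey(a + b)
        counts[word] = counts.get(word, 0) + one
    return counts


# One batch of (key, value) records as delivered by the Kafka stream.
batch = [(None, "This is a message"), (None, "This is another message")]
counts = word_count(batch)
```

In Spark Streaming the same computation runs once per batch interval, and `pprint()` prints the first few elements of each batch's result.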



# Direct approach
from pyspark.streaming.kafka import KafkaUtils

# topic and brokers are not defined on the slide; they are assumed to hold
# the topic name and the "host:port" broker list
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the Kafka offset ranges of the current batch
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)
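Each offset range describes the slice of one Kafka partition that a batch covers, from `fromOffset` (inclusive) up to `untilOffset` (exclusive). The sketch below uses a namedtuple as a hypothetical stand-in for the object returned by `rdd.offsetRanges()`, with made-up offset values, to show two things an application might derive from the ranges: the batch size and the position to resume from.

```python
from collections import namedtuple

# Stand-in with the same four fields printed in the example above;
# this is not the PySpark class itself.
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")


def messages_in_batch(offset_ranges):
    """Total number of messages covered by the batch."""
    return sum(o.untilOffset - o.fromOffset for o in offset_ranges)


def next_start_offsets(offset_ranges):
    """Where to resume per (topic, partition) once the batch is processed."""
    return {(o.topic, o.partition): o.untilOffset for o in offset_ranges}


# Made-up ranges for a two-partition topic.
ranges = [
    OffsetRange("test", 0, 100, 150),
    OffsetRange("test", 1, 80, 95),
]
total = messages_in_batch(ranges)
resume = next_start_offsets(ranges)
```

Storing `resume` together with the batch's results is one way an application could track its own progress when using the direct approach.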


Thank You
