Stream Processing using Apache Spark and Apache Kafka


Apache Spark Stream Processing with Kafka

Please introduce yourselves using the Q&A window that appears on the right while others join us.

● Session - 3 hours duration (increased from 2:30 hrs due to high demand)
● First Half: Apache Spark introduction & streaming basics
● 10 min break
● Second Half: Hands-on demo using CloudxLab

● Session is being recorded. Recording & presentation will be shared after the session

● Asking questions?
● Everyone except the instructor is muted
● Please ask questions by typing in the Q&A window (requires logging in to Google+)
● The instructor will read out the question before answering
● To get better answers, keep your messages short and avoid chat language

WELCOME TO THE SESSION

WELCOME TO CLOUDxLAB SESSION

A cloud-based lab for students to gain hands-on experience in Big Data technologies

such as Hadoop and Spark

● Learn Through Practice

● Real Environment

● Connect From Anywhere

● Connect From Any Device

● Centralized Data sets

● No Installation

● No Compatibility Issues

● 24x7 Support

TODAY’S AGENDA

I Introduction to Apache Spark
II Introduction to stream processing
III Understanding RDD (Resilient Distributed Datasets)
IV Understanding DStream
V Kafka introduction
VI Understanding the stream processing flow
VII Real-time hands-on using CloudxLab
VIII Questions and answers

About the Instructor

2015 - Founded CloudxLab
2014 - Founded KnowBigData
2012-2014 - Amazon: built high-throughput systems for the Amazon.com site using in-house NoSQL
2011-2012 - InMobi: built a recommender that churns 200 TB
2006-2011 - tBits Global: founded tBits Global; built an enterprise-grade Document Management System
2002-2006 - D.E.Shaw: built big data systems before the term was coined
2002 - IIT Roorkee: finished B.Tech.

Apache Spark

A fast and general engine for large-scale data processing.

● Really fast MapReduce

● 100x faster than Hadoop MapReduce in memory,

● 10x faster on disk.

● Builds on similar paradigms as MapReduce

● Integrated with Hadoop

SPARK STREAMING

An extension of the core Spark API for high-throughput, fault-tolerant processing of live data streams

(Diagram: input sources → Spark Streaming → output)

SPARK STREAMING

Workflow

• Spark Streaming receives live input data streams

• Divides the data into batches

• Spark Engine generates the final stream of results in batches.

Provides a discretized stream or DStream - a continuous stream of data.

SPARK STREAMING - DSTREAM

Internally, a DStream is represented as a continuous series of RDDs.

Each RDD in a DStream contains data from a certain interval.
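A quick way to see that each batch really is an RDD is the foreachRDD operation. This small sketch is not from the slides; it assumes a DStream named lines like the one created in the example that follows.

# Each batch interval yields one RDD; foreachRDD exposes it together
# with the batch time, so it can be inspected like any other RDD
def show_batch(time, rdd):
    print("Batch at %s contains %d records" % (time, rdd.count()))

lines.foreachRDD(show_batch)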

SPARK STREAMING - EXAMPLE

Problem: do the word count every second.
Step 1: Create a connection to the service

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and
# a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port,
# like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
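Note: "local[2]" matters here - the socket receiver occupies one of the local threads, so at least two threads are needed for the received data to actually be processed.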

SPARK STREAMING - EXAMPLE

Step 2: Split each line into words, convert to tuple and then count.

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))

# Do the count
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
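Note that reduceByKey here operates within a single batch: the counts start from zero every second. Carrying totals across batches is what the updateStateByKey operation, covered at the end of this deck, is for.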

Problem: do the word count every second.

SPARK STREAMING - EXAMPLE

Step 3: Print the stream. This is a periodic action, executed once per batch.

# Print the first ten elements of each RDD generated
# in this DStream to the console
wordCounts.pprint()

Problem: do the word count every second.

SPARK STREAMING - EXAMPLE

Step 4: Everything is set up - let's start.

# Start the computation
ssc.start()

# Wait for the computation to terminate
ssc.awaitTermination()
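No data is processed until ssc.start() is called, and once the streaming context has started, no new streaming computations can be added to it.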

Problem: do the word count every second.

SPARK STREAMING - EXAMPLE
Problem: do the word count every second.

spark-submit spark_streaming_ex.py 2>/dev/null

(Also available in HDFS at /data/spark)

nc -l 9999
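Run nc -l 9999 in one terminal first, then the spark-submit command in another; each line typed into the nc terminal is split into words and the counts are printed by the streaming job every second.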

SPARK STREAMING - EXAMPLE
Problem: do the word count every second.

spark-submit spark_streaming_ex.py 2>/dev/null

yes | nc -l 9999

Spark Streaming + Kafka Integration

Apache Kafka
● Publish-subscribe messaging
● A distributed, partitioned, replicated commit log service
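To make the publish-subscribe model concrete, here is a minimal sketch of a producer and consumer in Python. It assumes the kafka-python package (not used in these slides) and a broker at localhost:9092; on the CloudxLab/HDP cluster you would use the broker host and port 6667 shown in the later steps.

from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few messages to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-topic", b"message %d" % i)
producer.flush()

# Consumer: subscribe to the same topic and read from the beginning
consumer = KafkaConsumer("demo-topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.value)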

Spark Streaming + Kafka Integration

Prerequisites
● ZooKeeper
● Kafka
● Spark
● All of the above are installed by Ambari with HDP (CloudxLab)
● Kafka library - you need to download it from Maven

○ also available in /data/spark

Spark Streaming + Kafka Integration

Step 1: Download the Spark Streaming Kafka assembly JAR (see the prerequisites slide). Include the essential imports.

from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import sys

Problem: do the word count every second from kafka

Spark Streaming + Kafka Integration

Step 2: Create the streaming objects

Problem: do the word count every second from kafka

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 1)

# Read the ZooKeeper quorum and topic name from the arguments
zkQuorum, topic = sys.argv[1:]

# Listen to the topic
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
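As a side note (not part of the original demo), Spark 1.x also offers a receiver-less "direct" stream that reads from the Kafka brokers instead of going through ZooKeeper; the broker address below is an assumption to adjust for your cluster.

# Alternative: direct (receiver-less) stream from the Kafka brokers;
# Spark tracks the offsets itself instead of using a ZooKeeper-based receiver
directKvs = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": "localhost:6667"})
# The rest of the pipeline (map, flatMap, reduceByKey) stays the same.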

Spark Streaming + Kafka Integration

Step 3: Create the RDDs by Transformations & Actions

Problem: do the word count every second from kafka

# Read lines from the stream (each Kafka record is a (key, value) pair)
lines = kvs.map(lambda x: x[1])

# Split lines into words, map words to tuples, reduce by key
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Do the print
counts.pprint()

Spark Streaming + Kafka Integration

Step 4: Start the process

Problem: do the word count every second from kafka

ssc.start()
ssc.awaitTermination()

Spark Streaming + Kafka Integration

Step 5: Create the topic

Problem: do the word count every second from kafka

# Login via ssh or the web console
ssh centos@e.cloudxlab.com

# Add the following to the PATH
export PATH=$PATH:/usr/hdp/current/kafka-broker/bin

# Create the topic
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic session2

# Check if created
kafka-topics.sh --list --zookeeper localhost:2181

Spark Streaming + Kafka Integration

Step 6: Create the producer

# Find the IP address of any broker from zookeeper-client using the command: get /brokers/ids/0

kafka-console-producer.sh --broker-list ip-172-31-13-154.ec2.internal:6667 --topic session2

# Test if producing by consuming in another terminal
kafka-console-consumer.sh --zookeeper localhost:2181 --topic session2 --from-beginning

# Produce a lot
yes | kafka-console-producer.sh --broker-list ip-172-31-13-154.ec2.internal:6667 --topic session2

Problem: do the word count every second from kafka

Spark Streaming + Kafka Integration

Step 7: Do the stream processing. Check the graphs in the Spark UI at :4040/

Problem: do the word count every second from kafka

(spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.0.jar kafka_wordcount.py localhost:2181 session2) 2>/dev/null

UpdateStateByKey Operation
Compute aggregations across the whole day

● The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information.
● To use this, you will have to do two steps:
  ○ Define the state - the state can be an arbitrary data type.
  ○ Define the state update function - specify with a function how to update the state using the previous state and the new values from an input stream.
● In every batch, Spark will apply the state update function for all existing keys, regardless of whether they have new data in a batch or not.
● If the update function returns None, the key-value pair will be eliminated.

UpdateStateByKey Operation

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    # Add the new values to the previous running count
    # to get the new count
    return sum(newValues, runningCount)

runningCounts = pairs.updateStateByKey(updateFunction)

Objective: Maintain a running count of each word seen in a text data stream. The running count is the state, and it is an integer.
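For completeness, a minimal end-to-end sketch of the stateful word count (not verbatim from the slides): updateStateByKey requires checkpointing to be enabled, and the checkpoint directory below is just an example path.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StatefulNetworkWordCount")
ssc = StreamingContext(sc, 1)

# updateStateByKey needs a checkpoint directory to store the state
ssc.checkpoint("checkpoint")

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)

# Running totals across all batches, printed every second
runningCounts = pairs.updateStateByKey(updateFunction)
runningCounts.pprint()

ssc.start()
ssc.awaitTermination()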

Feedback

http://bit.ly/1ZQwUAn

Help us improve
