
IoT Austin CUG talk


Page 1: IoT Austin CUG talk

© Cloudera, Inc. All rights reserved.

IoT with Spark Streaming
Anand Iyer, Senior Product Manager

Page 2: IoT Austin CUG talk


Spark Streaming
• The incoming data stream is represented as a DStream (Discretized Stream)
• The stream is broken down into micro-batches
• Each micro-batch is an RDD: process it using RDD operations
• Micro-batches are usually around 0.5 sec in size
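As a rough illustration of the micro-batch model (not Spark itself), a stream can be discretized into fixed-size batches; this sketch batches by item count rather than by time interval, purely for simplicity:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Split an event stream into fixed-size micro-batches, mimicking
    how a DStream discretizes a stream. In Spark, each yielded batch
    would be an RDD processed with RDD operations."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(10)
batches = list(micro_batches(events, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```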

Page 3: IoT Austin CUG talk


Cloudera customer use case examples – Streaming

• Financial Services: online fraud detection
• Retail: online recommender systems; inventory management
• Health: incident prediction (sepsis)
• Ad tech: analysis of ad performance in real time

Page 4: IoT Austin CUG talk


Concrete end-to-end IoT Use Case
Using Spark Streaming with Kafka, HBase & Solr

Page 5: IoT Austin CUG talk


Proactive maintenance and accident prevention in Railways

• Sensor information continuously streams in from railway carriages
• Goal: early detection of damage to rail carriage wheels or to railway tracks
• Proactively fix issues before they become severe
• Prevent derailments, save money and lives
• Based on a real-world use case, modified to fit the talk

Page 6: IoT Austin CUG talk


Locomotive Wheel Axle Sensors

Each sensor reading contains:
- Unique ID
- Locomotive ID
- Speed
- Temperature
- Pressure
- Acoustic signals
- GPS coordinates
- Timestamp
- etc.
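A sensor reading could be modeled as a small record type; the field names and units below are illustrative, since the slide lists attributes rather than an exact schema:

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    # Hypothetical field names; the slide only lists the attribute
    # categories, not a concrete schema.
    reading_id: str
    locomotive_id: str
    speed_kmh: float
    temperature_c: float
    pressure_kpa: float
    acoustic_db: float
    latitude: float
    longitude: float
    timestamp_ms: int

r = SensorReading("r-001", "loco-42", 80.0, 65.5, 101.3, 72.0,
                  30.27, -97.74, 1458000000000)
print(r.locomotive_id)  # loco-42
```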

Page 7: IoT Austin CUG talk


Identify damage to the locomotive axle or wheels
Manifests as a sustained increase in sensor readings such as temperature, pressure, and acoustic noise.
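A minimal sketch of detecting a "sustained increase", assuming a known baseline and requiring several consecutive elevated readings so that a one-off blip does not alert (the thresholds are invented for illustration):

```python
def sustained_increase(readings, baseline, threshold, min_consecutive):
    """Return True if the last `min_consecutive` readings all exceed
    baseline + threshold -- a sustained rise, not a one-off blip."""
    if len(readings) < min_consecutive:
        return False
    return all(r > baseline + threshold for r in readings[-min_consecutive:])

temps = [60, 61, 75, 76, 77, 78]
print(sustained_increase(temps, baseline=60, threshold=10, min_consecutive=4))  # True
```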

Page 8: IoT Austin CUG talk


Identify damage on railway tracks
Manifests as a sudden spike in sensor readings for pressure or acoustic noise.
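A spike, by contrast, can be flagged when the latest reading deviates sharply from its recent history; one simple approach (not from the deck) is a standard-deviation test with an arbitrary factor of 3:

```python
def is_spike(readings, factor=3.0):
    """Flag the latest reading as a spike if it deviates from the mean
    of the prior readings by more than `factor` standard deviations."""
    *history, latest = readings
    if len(history) < 2:
        return False
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = var ** 0.5
    return std > 0 and abs(latest - mean) > factor * std

pressures = [100, 101, 99, 100, 102, 140]
print(is_spike(pressures))  # True
```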

Page 9: IoT Austin CUG talk


Real-Time Detection of Locomotive Wheel Damage

[Diagram: sensor events flow in from Kafka]

- Enrich incoming events with relevant metadata:
  - Locomotive information from the locomotive ID: type, weight, cargo, etc.
  - Sensor information from the sensor ID: precise location, type, etc.
  - GPS coordinates mapped to location characteristics, such as the gradient of the track
  - HBase is recommended as the metadata store; use the HBase-Spark module to fetch data

- Apply application logic to determine whether sensor readings indicate damage:
  - Simple rule-based logic
  - A complex predictive machine learning model

Page 10: IoT Austin CUG talk


Real-Time Detection of Locomotive Wheel Damage

[Architecture diagram: Kafka → Spark Streaming → Kafka and HDFS]

Kafka output sink: https://github.com/harishreedharan/spark-streaming-kafka-output

Page 11: IoT Austin CUG talk


Real-Time Detection of Locomotive Wheel Damage

- When an alert is thrown, a technician will need to diagnose the event

- This requires visualizing sensor data as a time series:
  - Over arbitrary windows of time
  - Compared with values from prior trips
  - Software for visualization: http://grafana.org/

- The technician can take appropriate action based on the analysis:
  - Send the rail carriage for maintenance
  - Stop the train immediately to prevent an accident

Visualize Time-Series Sensor Data

Page 12: IoT Austin CUG talk


Data Store for Time-Series Data

Ideal solution: Kudu
- Time-series data entails sequential scans for writes and reads, interspersed with random seeks

Until Kudu is GA:
- Use HBase and model tables for time-series data
- OpenTSDB:
  - Built on top of HBase
  - Uses an HBase table schema optimized for time-series data
  - Simple HTTP API
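The flavor of such a schema can be sketched as a row key that groups one metric for one sensor into hourly time buckets, so that readings for the same hour sort contiguously in HBase. OpenTSDB's real schema packs metric and tag UIDs into binary keys, so this textual form is only an illustration:

```python
def row_key(metric, sensor_id, timestamp_ms, bucket_ms=3_600_000):
    """Build a time-series row key: metric + hour bucket + tag.
    Readings for one sensor within one hour share a key prefix,
    which keeps scans over a time range sequential."""
    bucket = timestamp_ms - (timestamp_ms % bucket_ms)
    return f"{metric}:{bucket}:{sensor_id}"

k = row_key("wheel.temperature", "s-7", 1_458_000_123_456)
print(k)  # wheel.temperature:1458000000000:s-7
```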

Page 13: IoT Austin CUG talk


Real-Time Detection of Locomotive Wheel Damage

[Architecture diagram: Kafka → Spark Streaming → Kafka and HDFS]

Page 14: IoT Austin CUG talk


Detecting damage to rail tracks

• Track damage manifests as a sharp spike in sensor readings (pressure, acoustic noise)
• Multiple sensors will demonstrate the same spike at the same location (GPS coordinates)
• Multiple sensors from multiple trains will give similar readings at the same location

How to detect?
• Index each sensor reading in Solr, such that readings can be queried by GPS coordinates
• When a "spike" is observed, and the corresponding alert event is fired, trigger a search

Page 15: IoT Austin CUG talk


Detecting damage to rail tracks
• Index each sensor reading with the Morphlines library
  • Embed the call to Morphlines in your Spark Streaming application
  • Values can be kept in the index for a specified period of time, such as a month; Solr can automatically purge old documents from the index

• When a "spike" is observed, and the corresponding alert event is fired, trigger a search (manually or programmatically)
  • Search for sensor readings at the same GPS coordinates as the latest spike
  • Filter out irrelevant readings (e.g. readings on the left track, if the spike was observed on the right track)
  • Sort results by time, latest to oldest

• If the majority of recent readings show a "spike", that is indicative of track damage
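The final majority check can be sketched as follows, assuming the Solr query has already returned the filtered, time-sorted readings for one location (threshold and data are invented):

```python
def track_damage_likely(readings, threshold, min_fraction=0.5):
    """Given recent readings at one GPS location, report probable
    track damage if more than `min_fraction` of them exceed the
    spike threshold."""
    if not readings:
        return False
    spikes = sum(1 for r in readings if r > threshold)
    return spikes / len(readings) > min_fraction

recent = [141, 138, 99, 140, 137]   # pressure readings at one location
print(track_damage_likely(recent, threshold=120))  # True
```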

Page 16: IoT Austin CUG talk


Final Architecture

[Architecture diagram: Kafka → Spark Streaming → Kafka and HDFS, with HBase-Spark/REST for metadata lookups and Morphlines for Solr indexing]

Page 17: IoT Austin CUG talk


Noteworthy Streaming Constructs

Page 18: IoT Austin CUG talk


Sliding Window Operations

Define operations on data within a sliding window.
Window parameters:
- window length
- sliding interval

Example usages:
- Compute counts of items in the latest window of time, such as occurrences of exceptions in a log or trending hashtags in a tweet stream
- Join two streams by matching keys within the same window

Note: provide adequate memory to hold a window's worth of data
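The windowed-count example can be simulated in plain Python. In Spark, window length and sliding interval are multiples of the micro-batch interval; here both are expressed in item counts purely to keep the sketch self-contained:

```python
from collections import deque, Counter

def windowed_counts(stream, window_len, slide):
    """Emit a Counter every `slide` items, covering the last
    `window_len` items -- a count-based analogue of Spark's
    time-based sliding window."""
    window = deque(maxlen=window_len)
    for i, item in enumerate(stream, 1):
        window.append(item)
        if i % slide == 0:
            yield Counter(window)

hashtags = ["#iot", "#spark", "#iot", "#kafka", "#iot", "#spark"]
for counts in windowed_counts(hashtags, window_len=4, slide=2):
    print(counts.most_common(1))
```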

Page 19: IoT Austin CUG talk


Maintain and update arbitrary state
updateStateByKey(...)
• Define the initial state
• Provide a state update function
• Continuously update with new information

Examples: • Running count of words seen in text stream• Per user session state from activity stream

Note: requires periodic checkpointing to fault-tolerant storage.
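The semantics can be illustrated with a pure-Python fold of each micro-batch into a per-key state, here a running word count as in the example above (this mimics the behavior of updateStateByKey, not its API):

```python
def update_state_by_key(state, batch):
    """Fold each (key, count) pair from the new micro-batch into the
    persistent per-key state, returning the updated state."""
    new_state = dict(state)
    for key, count in batch:
        new_state[key] = new_state.get(key, 0) + count
    return new_state

state = {}  # initial state
for batch in [[("error", 1), ("warn", 2)], [("error", 3)]]:
    state = update_state_by_key(state, batch)
print(state)  # {'error': 4, 'warn': 2}
```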

Page 20: IoT Austin CUG talk


Lessons from Production

Page 21: IoT Austin CUG talk


Use the Kafka Direct Connector whenever possible
• Better efficiency and performance than receiver-based connectors
• Automatic back-pressure: steady performance

[Diagrams: direct connector (Spark driver and executors read from Kafka in parallel) vs. receiver-based connector (dedicated receivers on executors ingest the stream for the other executors)]

Page 22: IoT Austin CUG talk


The challenge with Checkpoints

• Spark checkpoints are Java-serialized
• Upgradeability can be an issue: upgrading the version of Spark or of your application can make checkpointed data unreadable

But long-running applications need updates and upgrades!

Page 23: IoT Austin CUG talk


Upgrades with Checkpoints

• Most often, all you need to pick up is some previous state: maybe an RDD, some "state" (updateStateByKey), or the last processed Kafka offsets

• The solution: disable Spark checkpoints

• Use foreachRDD to persist state yourself, to HDFS, in a format your application can understand
  • E.g. Avro, Protobuf, Parquet…

• For updateStateByKey, generate the new state, then persist it
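A minimal sketch of persisting state yourself: the slide suggests Avro, Protobuf, or Parquet on HDFS, while this sketch uses plain JSON on local disk to stay dependency-free. Write-then-rename keeps a crash from leaving a torn file behind:

```python
import json
import os
import tempfile

def persist_state(state, path):
    """Atomically write state as JSON: write to a temp file, then
    rename over the target so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def restore_state(path):
    """Reload previously persisted state, e.g. on application restart."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "stream_state.json")
persist_state({"error": 4, "kafka_offsets": {"topic-0": 1234}}, path)
print(restore_state(path)["kafka_offsets"]["topic-0"])  # 1234
```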

Page 24: IoT Austin CUG talk


Upcoming improvements to updateStateByKey(…)

• Time-out: Automatically delete data after a preset number of micro-batches

• Efficient Updates: Only update a subset of the keys

• Callback to persist state during graceful shutdown

Page 25: IoT Austin CUG talk


Exactly-Once Semantics

What is it?
Given a stream of incoming data, any operator is applied exactly once to each item.

Why is it important?
It prevents erroneous processing of the data stream, e.g. double counting in aggregations or throwing redundant alerts.

Spark Streaming provides exactly-once semantics for data transformations.
However, output operations provide at-least-once semantics!

Page 26: IoT Austin CUG talk


Exactly Once Semantics with Spark Streaming & Kafka

• Associate a "key" with each value written to the external store, which can be used for de-duplication

• This key needs to be unique for a given micro-batch

• The Kafka Direct Connector provides the following for each record, which will be the same for a given micro-batch: Kafka partition + start offset + end offset
• Check out org.apache.spark.streaming.kafka.OffsetRanges and org.apache.spark.streaming.kafka.HasOffsetRanges
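The de-duplication idea can be sketched without Kafka at all: build a key from the offset range a micro-batch covers, and make the output write idempotent under that key, so a replayed batch overwrites rather than duplicates its output (store and names here are illustrative):

```python
def dedupe_key(topic, partition, start_offset, end_offset):
    """Build the per-micro-batch key from an offset range; every
    record in the same partition slice of a micro-batch shares it."""
    return f"{topic}:{partition}:{start_offset}:{end_offset}"

store = {}  # stands in for an external store supporting upserts

def idempotent_write(key, rows):
    # Writing under the batch key makes replays overwrite, not append.
    store[key] = rows

k = dedupe_key("sensors", 0, 1000, 1250)
idempotent_write(k, ["alert-1"])
idempotent_write(k, ["alert-1"])   # replay after a failure: no duplicate
print(len(store))  # 1
```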

Page 27: IoT Austin CUG talk


Thank You