Spark Streaming the Industrial IoT Washington DC Area Spark Interactive Jim Haughwout, Chief Architect & VP of Software May 24, 2016

Spark Streaming the Industrial IoT


Page 1: Spark Streaming the Industrial IoT

Spark Streaming the Industrial IoT
Washington DC Area Spark Interactive
Jim Haughwout, Chief Architect & VP of Software
May 24, 2016

Page 2: Spark Streaming the Industrial IoT

© 2016 Savi Technology • May 1, 2023 • Page 2

Today’s Talk

• Discuss challenges of streaming in general with tips on doing this with Spark

• Special focus: IoT’s complexities of immediately tying together physical and data world

• Our talks are in three parts:
- Part I: Top-level POV of using Spark Streaming for Industrial IoT (Jim)
- Part II: Spark Streaming and Expert Systems – Spark + Drools (James)
- Part III: Overcoming Deficiencies in Streams (Anderson of MetiStream)

Page 3: Spark Streaming the Industrial IoT

About Us

Page 4: Spark Streaming the Industrial IoT


Savi Technology

• Sensor analytics solutions for Industrial IoT

• Focus areas are risk and performance

• Customers are Fortune-1000 and government

• Real-time visibility using complex event processing and machine learning algorithms

• Strategic insights using batch analytics

• Hardware Engineers, Data Engineers, Software Engineers, and Data Scientists

• HQ in Alexandria; offices across world

Some examples of what we do…
[Figure labels: HARDWARE, APPLICATIONS, SERVICES, ANALYTICS]

Page 5: Spark Streaming the Industrial IoT


Our version of Google Now: Parking -> Stationary

Page 6: Spark Streaming the Industrial IoT


Progressive streaming analysis of IoT data: Rules + ML

Times in UTC

Page 7: Spark Streaming the Industrial IoT


Alerting with predictive analytics: Commercial ETA

• 22 hours out, we predicted the driver would be late (giving advance notice)

• That prediction was within 5 minutes of the actual arrival (on a 68-hour trip)

Times in America/New_York

Page 8: Spark Streaming the Industrial IoT


Batch discovery and prescriptive analytics: reducing theft

Third largest transport firm had 2x the median suspect issues

Page 9: Spark Streaming the Industrial IoT

Use of Spark @ Savi

Page 10: Spark Streaming the Industrial IoT


We have fully embraced Apache Spark

Spark is the core of our tech stack:

• Using Spark for batch processing since Spark 1.0, for streaming since Spark 1.2.1
- We use discretized streams (DStreams); our fastest batch interval is 1 second

• 24x7 production operation, with full monitoring and high levels of test coverage

• Supporting Fortune-500 customers, managing billions of dollars of stuff in near real-time

• Fully-automated CI & CD with SOC II certification

• We launch new Spark software several times every week—Push-button with no visible downtime to customers

• Gives us enormous scale and cost advantages vs. traditional enterprise technologies

• Uptime in the last 12 months has been 100% (knock on wood)
- 13 months ago we had a brief outage due to a DNS outage in AWS US-West-2

Page 11: Spark Streaming the Industrial IoT


Spark is at the core of our “hybrid” Lambda architecture

[Figure: hybrid Lambda architecture diagram. Labels shown:
- Layers: Integration Layer, Speed Layer, Batch Layer, Serving Layer
- Data sources: Sensor Readers, Sensor Meshes, Mobile Apps, Enterprise Data, Open Data, Partner Data
- Integration protocols: AMQP, CoAP, FTP, HTTP, MQTT, SOAP, TCP, UDP, XMPP
- Sensor and network links: RS-232, USB, pRFID, aRFID, Bluetooth, ZigBee, 802.11, 6LoWPAN, GSM, GPRS, 3G, 4G/LTE, SATCOM
- Processing: Savi IoT Adapter, Sensor Agnostic CEP, Domain Specific CEP, Batch Processing, Modeling, Machine Learning
- Storage and serving: Immutable Data Store, Data Serving Layer, REST APIs, Notifications, Savi Apps, Customer Export, In-house Analytic Tools]

Page 12: Spark Streaming the Industrial IoT


The Details: Tech stack distributions and versions

Data:
- Jetty 9.3.9
- Kafka 0.8.2 via CDH 5.3.3
- Spark (1.4.1 -> 1.6.1)
- Scikit-learn 0.15.2
- Cassandra 2.1.8 via DSE 4.7
- GlusterFS 3.7
- PostgreSQL 9.3.3 with PostGIS
- Hadoop 2.5.0, Hive 0.13.1, Hue 3.7.0, ZooKeeper 3.4.5 (all via CDH 5.3.3)
- Parquet-format 2.2.0 and Parquet-mr 1.6.0 (both via Spark)
- Gobblin 0.7.0
- Drools 6.3.0

Applications:
- Nginx, Bootstrap, D3.js, AmCharts, Flot, WildFly, Flask, Shibboleth
- PostgreSQL, DSE Cassandra, DSE Solr
- Also mobile on iOS, Android

Tools:
- GitHub (GitHub Flow), Ansible, Docker, Jenkins, Maven, Bower
- Slack, Fluentd, Graylog, Sentry
- Jupyter (PySpark, Folium, Pandas, Matplotlib, Scikit-learn, etc.)

We program in Scala 2.10, Java 8, Python 2.7, HTML5, LESS.css, and JavaScript.
We are hosted in AWS but are not using any AWS-specific solutions (e.g., EMR).

Page 13: Spark Streaming the Industrial IoT


Why we chose Spark

• We started on Apache Storm and MapReduce (we use a Lambda architecture)

• Moved to 100% Spark over the last 18 months (finished last Summer)

• Spark is NOT the best at everything

• However, it is advancing quickly

• We are an analytics company: Spark provides a single unified framework
- Speed layer and Batch Layer
- Use by Engineering and Data Science
- Product apps and ad-hoc analytics

• Ultimately this gives us better agility and cost (development + operations)

For more on our journey see: http://bit.do/savi-spark

Page 14: Spark Streaming the Industrial IoT

Spark Streaming @ Savi
Tips and lessons streaming data 24x7

Page 15: Spark Streaming the Industrial IoT


Spark Streaming is a different animal

20 seconds

Page 16: Spark Streaming the Industrial IoT


Time is much more precious (and important) in the streaming world
- Seconds vs. minutes or hours
- Down-time or interruption is immediately visible to end users; in IoT this can lead to missing key events
- Need to avoid breakdown in the stream due to surges or failures, both of which are more common in IoT

Streaming resource utilization is different than batch
- CPU is rarely the limiting factor
- Memory is less of a limitation than is typical for Spark
- I/O is a much more common limiting factor

Some tips and lessons learned managing these differences…

Spark Streaming is a different animal

Page 17: Spark Streaming the Industrial IoT


Tip 1: Leverage Kafka
- Faster than HDFS, more durable than in-memory
- Supports parallel, independent consumption from multiple processing streams
- Supports FIFO within partitions

Tip 2: DAG of DAGs (DAG of Streaming Apps and Kafka topics)
- Break down the process graph, even near real-time, into critical and non-critical paths
- Route non-critical processing to separate streams, with their own persisted queues
- Do the same for interactions with lower-durability sources and targets

Tip 3: (Caveat to Tip 2) Avoid over-complicating your DAG
- Every time you re-queue, you create opportunities to get data out of order
- Instead rely on at-least-once processing and add no-more-than-once protection to non-idempotent processing

Tips to defensively architect Spark Streaming
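A minimal sketch of Tip 3, assuming each message carries a unique transaction ID. At-least-once delivery may replay a message, so the non-idempotent side effect is guarded by a record of IDs already applied. The names `OnceGuard` and `sent` are hypothetical, and the in-memory set stands in for whatever durable store a real deployment would use.

```python
class OnceGuard:
    """Wraps a non-idempotent action so replays of the same ID are skipped."""

    def __init__(self, action):
        self._action = action
        self._seen = set()  # assumption: a durable store in production

    def apply(self, txn_id, payload):
        if txn_id in self._seen:
            return False  # replayed message: do not repeat the side effect
        self._action(payload)
        # Mark as done only after the action succeeds, so a failure can be retried.
        self._seen.add(txn_id)
        return True


sent = []
guard = OnceGuard(sent.append)  # hypothetical side effect, e.g. sending an alert

# At-least-once delivery: "t1" arrives twice after a replay.
for txn_id, payload in [("t1", "late ETA"), ("t2", "door open"), ("t1", "late ETA")]:
    guard.apply(txn_id, payload)
```

The upstream path stays simple (plain at-least-once), and only the step that cannot safely run twice pays for the extra lookup.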

Page 18: Spark Streaming the Industrial IoT


Tip 4: Offload bad data to non-blocking paths
- Bad data will happen
- Design your apps to offload it to non-blocking paths (vs. failing); this keeps the stream alive

Tip 5: (Caveat to Tip 4) Wind down if infrastructure fails
- Running a streaming process on broken infrastructure will create many issues
- Instead wind down (and alert) and allow Kafka to help you recover
- Wind-down and restart will often “clear up” network or memory bottlenecks

Tip 6: Preserve data lineage (and immutability)
- Preserve the full data lineage of each stage of processing; it will save you when dealing with real-world issues
- Keep everything, even failures; this allows you to replay data for analysis and recovery (you will need it)

Tips to defensively architect Spark Streaming (cont.)
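A minimal sketch of Tip 4, assuming JSON messages with `device` and `temp_c` fields (both hypothetical). Records that fail to parse are set aside with the raw input and the error type instead of raising, so one bad message does not stop the batch. In a real pipeline the `bad` list would presumably go to its own Kafka topic or store so it can be replayed later.

```python
import json


def parse_reading(raw):
    """Hypothetical parser: expects JSON with 'device' and a numeric 'temp_c'."""
    record = json.loads(raw)
    return {"device": record["device"], "temp_c": float(record["temp_c"])}


def process_batch(raw_messages):
    """Split a batch into parsed records and preserved failures."""
    good, bad = [], []
    for raw in raw_messages:
        try:
            good.append(parse_reading(raw))
        except (ValueError, KeyError, TypeError) as err:
            # Keep the raw input and the failure reason for lineage and replay.
            bad.append({"raw": raw, "error": type(err).__name__})
    return good, bad


batch = ['{"device": "a1", "temp_c": 21.5}', "not json", '{"device": "a2"}']
good, bad = process_batch(batch)
```

Only data errors are caught here. Infrastructure failures (Tip 5) should still propagate so the app can wind down.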

Page 19: Spark Streaming the Industrial IoT


Tip 1: Over-subscribe your cores
- Minimum core count needed is NsourceTopics + 2
- For efficiency, over-subscribe your cores
- High multiples are fine

Tip 2: Use broadcast variables to persist shared ephemeral rules

Tip 3: Limit Kafka topics per App
- Counter-intuitive for defensive programming
- Avoids starvation due to imbalanced loads

Tip 4: Avoid the shuffle
- Shuffle is tough on I/O; with streaming it is worse
- Instead rely on Kafka partitioning
- However, Kafka offset partitioning is still a work-in-progress

Tips: performance tuning Spark Streaming

Page 20: Spark Streaming the Industrial IoT

Streaming Real-world Industrial IoT Data:
“It’s very different than the canonical Twitter stream analysis teaching example”

Page 21: Spark Streaming the Industrial IoT


All the normal “dirty data” issues plus

Streaming means you have to handle much of this in near-real time

IoT + Spark Streaming = Physical + Data (in near real-time)

Page 22: Spark Streaming the Industrial IoT


The “IoT Menagerie”
- Millions of source IPs: whitelisting is impossible
- Many transport protocols and standards: HTTP, FTP, CoAP, MQTT, TCP, UDP, X12, GPRS

Several tools are available to ingest IoT transactions into your platform
- Some even ingest directly to Kafka for processing by Spark, Storm and Flink

However, not everything is a simple transaction; most is not

The “obvious” solution, increasing MAX KAFKA SIZE, does not work:
- Bottlenecking and serialization issues
- Ultimately you will not be able to increase it enough

Lessons Learned: Use hybrid ingestion
- Append critical metadata immediately at the point of ingestion
- Includes transaction ID and digital signature
- Split metadata from payload for complex and large data types
- Keeps memory low and is fully scalable

Challenge 1: Ingesting IoT data
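A minimal sketch of hybrid ingestion under stated assumptions: the large payload is written out of band and only a small metadata envelope travels through the queue. The function `ingest`, the envelope field names, and the dict used as a blob store are all hypothetical, and a SHA-256 content digest stands in for the digital signature (a real signature would also involve a key).

```python
import hashlib
import uuid


def ingest(payload, source, blob_store):
    """Store the payload out of band and return a small metadata envelope."""
    digest = hashlib.sha256(payload).hexdigest()
    blob_store[digest] = payload  # assumption: a dict stands in for object storage
    return {
        "txn_id": str(uuid.uuid4()),  # transaction ID assigned at ingestion
        "source": source,
        "size_bytes": len(payload),
        "sha256": digest,             # content digest, standing in for a signature
        "payload_ref": digest,        # where downstream stages can fetch the payload
    }


store = {}
envelope = ingest(b"\x00" * 5_000_000, "camera-17", store)
```

The envelope stays a few hundred bytes regardless of payload size, which is why the queue's maximum message size no longer has to grow with the largest media transaction.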

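[Figure: streaming data types by share of transactions (xacs) and share of data. Types shown: micro-batches, simple transactions, loggers, sensor constellations, MIME media transactions. Shares shown: 30% of xacs / 10% of data; 20% of xacs / 35% of data; 5% of xacs / 10% of data; 45% of xacs / 15% of data; MIME media transactions: <1% of xacs / 30% of data.]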

Page 23: Spark Streaming the Industrial IoT


Challenge 2: Handling stream interruptions and surges

• Massive increase in stream interruptions (vs. normal server flows)
- Loss of power
- Movement in and out of coverage
- Bad OTA updates (can cause false DDoS events)

• Often undetected by anyone but Spark

• Overcoming these
- Monitor and alarm on anomalous values
- Tune your fetch rates to avoid overwhelming I/O
- Our hope: new Spark back-pressure (still in beta)

Page 24: Spark Streaming the Industrial IoT


Transmission, authentication, and formatting errors are much more frequent in IoT
- Ever had a cellphone call dropped or received a duplicate text?
- Data is rarely self-describing
- Firmware configuration management issues
- Standards non-compliance

Duplication is much more common (and complex) than in traditional Lambda
- Duplicate data can “hide” in unique wrappers
- Duplicate data can be obscured by transaction IDs
- Duplicates can arrive beyond viably sustainable window durations

Lessons Learned:
- Accept everything, even authentication errors
- Capture the entire lineage of processing (metadata and payload)
- Route failures away from the DAG, but preserve them to replay and recover
- Map data to its base atomic unit, THEN digitally sign and de-duplicate the data

Challenge 3: Cleansing and transforming IoT data

[Figure: a unique transaction set made up of three parts, each under its own unique header: duplicate facts (from a prior set), unique facts (to this set), and incomplete facts (in this set).]
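A minimal sketch of the last lesson, assuming a transaction is a header plus a list of facts (a hypothetical shape). Each fact is reduced to a canonical form and hashed on its own, so a duplicate fact is caught even when it arrives inside a new wrapper with a new transaction ID. The SHA-256 digest stands in for the digital signature, and the in-memory `seen` set stands in for a persistent signature store.

```python
import hashlib
import json


def fact_signature(fact):
    """Hash a canonical serialization so key order does not change the result."""
    canonical = json.dumps(fact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_facts(transaction, seen):
    """Return only facts not seen before; the unique header is ignored."""
    fresh = []
    for fact in transaction["facts"]:
        sig = fact_signature(fact)
        if sig not in seen:
            seen.add(sig)
            fresh.append(fact)
    return fresh


seen = set()
first = {"header": {"txn_id": "A"},
         "facts": [{"tag": "t1", "ts": 100, "temp": 4.0}]}
# New wrapper, new ID: one repeated fact (keys reordered) plus one new fact.
second = {"header": {"txn_id": "B"},
          "facts": [{"temp": 4.0, "ts": 100, "tag": "t1"},
                    {"tag": "t1", "ts": 160, "temp": 4.5}]}
out1 = new_facts(first, seen)
out2 = new_facts(second, seen)
```

Because the check is a signature lookup rather than a time window, it still works when the duplicate shows up hours or days later.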

Page 25: Spark Streaming the Industrial IoT


Transformation and Cleansing example: “Simple” raw IoT data

$690300SR86506702256878020160321155058-16060a34ST-663-000p00105300090008000030b74a67db433310266000047000a67fb4333102660000330009cd9b43331026600002500010g81-00077.09254000038.8064970066.70000020160321155142000010006.926480003.01000020e21000000000000000100e21000000000000000200e2200000000000000055246a8

Page 26: Spark Streaming the Industrial IoT


Transformation and Cleansing: Canonical format for analytics

Turning machine data into cleaned, self-describing, agnostic data that can be readily used for analytics and machine learning

Sensor Message Universal Read Format

Page 27: Spark Streaming the Industrial IoT


Streaming data is many, many small files
- 100s or 1000s per second

Adding them to HDFS creates the small file problem
- Many files (NameNode swamping)
- Much smaller than the HDFS block size (inefficient)

Delaying too long makes batch analysis stale
- Kafka does not support complex queries

Lots of back and forth on this; current best practice:
- Organize streams by volume and type into Kafka topics
- Batch extract by topic based on volume AND time
- Ultimately convert to parquet-format for batch analytics

Challenge 4: The small files problem
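A minimal sketch of extracting "based on volume AND time", assuming one batcher per topic. A batch is flushed when it reaches a byte threshold or when its oldest message exceeds an age limit, whichever comes first. The class `TopicBatcher` and its thresholds are hypothetical; in this simplified version the age check only runs when a message arrives, and time is passed in explicitly to keep the example deterministic.

```python
class TopicBatcher:
    """Buffers messages for one topic and flushes on size or age."""

    def __init__(self, max_bytes, max_age_s):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self._buffer = []
        self._size = 0
        self._opened_at = None

    def add(self, message, now):
        """Buffer a message; return the flushed batch, or None if still filling."""
        if self._opened_at is None:
            self._opened_at = now  # the batch opens with its first message
        self._buffer.append(message)
        self._size += len(message)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_age_s
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        batch, self._buffer = self._buffer, []
        self._size = 0
        self._opened_at = None
        return batch


batcher = TopicBatcher(max_bytes=10, max_age_s=60)
r1 = batcher.add(b"aaaa", now=0)      # 4 bytes buffered, keep filling
r2 = batcher.add(b"bbbbbbb", now=5)   # 11 bytes total: volume trigger
r3 = batcher.add(b"c", now=100)       # a new batch opens at t=100
r4 = batcher.add(b"d", now=161)       # batch is 61 s old: time trigger
```

High-volume topics flush on size and produce files near the target size, while quiet topics flush on age so batch analysis does not go stale.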

And now, the hardest challenge: Streaming CEP…

Page 28: Spark Streaming the Industrial IoT


Challenge 5: CEP processing messy physical realities

Spark Streaming needs to make decisions quick enough to matter…

In the physical world, real-time gets stale very quickly

Page 29: Spark Streaming the Industrial IoT


Sometimes, data just gets lost (or significantly delayed)


When streaming the IoT, the time lag of information is ever-present

Page 30: Spark Streaming the Industrial IoT


Also, people have been known to “contradict” sensors
- Sometimes legitimate
- Sometimes mistaken
- Sometimes malicious

People will argue that the sensors (or rules) are wrong

Page 31: Spark Streaming the Industrial IoT


Finally, once I alert you to something, I cannot undo it

Human memory is not a batch layer: it’s hard to forget Type I errors

Page 32: Spark Streaming the Industrial IoT


Prioritize Type I bias vs. Type II based on context

Windowing can be helpful, but not always
- Data can be delayed hours or days (windowing is not cost-effective)

Use self-healing rule sets (and algorithms)
- Immutable journal data models for state management
- Keep track of multiple time dimensions: latest, most recent
- Keep track of multiple signal dimensions: detected, reported

Use the batch layer to assist with self-healing
- Re-order on review
- Auto-resolve based on new data

Add human signals (to build trust)
- Do not hide corrections; make them clear
- Show full time lineage
- Allow humans to re-order to understand the effects of outages and delays

CEP in IoT: (Timeliness + Good Enough) > (Late + Perfect)
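A minimal sketch of an immutable journal with two time dimensions, under stated assumptions. Here "latest" is read as the event most recently received and "most recent" as the event with the newest event time; that reading, the `Event` fields, and the `Journal` class are all hypothetical. Nothing is overwritten, so a delayed reading is recorded without undoing newer state.

```python
from collections import namedtuple

Event = namedtuple("Event", "event_time received_time state")


class Journal:
    """Append-only journal; current state is derived, never overwritten."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def latest_received(self):
        """The event that arrived last, whatever time it describes."""
        return max(self._events, key=lambda e: e.received_time)

    def current_state(self):
        """State from the newest event time, so late arrivals cannot override it."""
        return max(self._events, key=lambda e: e.event_time).state

    def history(self):
        """Full lineage, re-ordered by when things actually happened."""
        return sorted(self._events, key=lambda e: e.event_time)


journal = Journal()
journal.append(Event(event_time=100, received_time=101, state="moving"))
journal.append(Event(event_time=300, received_time=302, state="parked"))
# Delayed reading: it happened at t=200 but only arrived at t=900.
journal.append(Event(event_time=200, received_time=900, state="moving"))
```

A rule can alert on `current_state()` immediately and still show the corrected ordering from `history()` later, which matches the idea of making corrections visible rather than hiding them.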

James will dive into this deeper…

Page 33: Spark Streaming the Industrial IoT


Some challenges to overcome streaming IoT

Once you overcome these, and share insights with customers, the real fun begins. There is a lot you can do with Spark

Questions, ideas, comments: [email protected]

Starting to open source some tools at: https://github.com/sensoranalytics/

Visit us at 3601 Eisenhower Avenue

Thank you!

Page 34: Spark Streaming the Industrial IoT