Harnessing Data within Hadoop in the Connected Brewery: Kafka, Spark Streaming, and Kudu
Jason Hubbard [email protected]
Cloudera
© Cloudera, Inc. All rights reserved.

IoT Connected Brewery

Page 2: IoT Connected Brewery

Internet of Things (IoT)

IoT Markets - 2020
• $1.7 trillion in value
• 20% annual growth
• 30 billion things
• 250 million connected vehicles

Source: IDC & Gartner estimates

Page 3: IoT Connected Brewery

IoT Will Drive An Explosion of Data…

• Data expected to explode to 44 ZB by 2020 (44 trillion GB!)
• 80% of data will be unstructured

Source: IDC

Page 4: IoT Connected Brewery

Value is maximized when data is combined with other sources

Value of Data is multiplied when you combine and correlate it with other data from relevant sources

40%: improvement in value that can be unlocked by combining data from multiple IoT applications and sources

Interoperability would significantly improve performance by combining sensor data from different machines and systems to provide decision makers with an integrated view of performance

Source: McKinsey Global Institute analysis

Page 5: IoT Connected Brewery

The IoT Ecosystem

Diagram labels: Consumer, Industrial, Sensors/Things, IoT Gateway, Data Center, Cloud, Data Analytics

Sensors/Things
• To grow by 50X
• Drop in prices by 70% in last 5 years

Data Characteristics
• Unstructured
• Intermittent
• Volume & Variety

Gateway
• Data Routing
• Edge-Processing
• Edge-Storage

Data Storage, Processing & Analytics

IoT Data Characteristics
• More processing in the cloud
• Analytics on the cloud

IoT Data Characteristics
• Distributed Data Processing
• Cloud & On-Premise

IoT Data Analytics
• Key to Value Creation
• Combine data from multiple sources & types
• Drive business insights

Page 6: IoT Connected Brewery

IoT Attributes

• Low-powered devices, possibly battery powered
• Highly distributed
• Gateway/controller, possibly mesh network
• Compact messages

Page 7: IoT Connected Brewery

IoT Challenges

• Multiple protocols (Z-Wave, Zigbee, Thread, etc.)
• Distributed, low power may mean data coming from multiple locations
• Devices may power off to save battery or move away from the controller, so late data must be handled
• Calibration between devices may be limited
• Very fast and bursty traffic
• Low-bandwidth last mile

Page 8: IoT Connected Brewery

Use Cases

• Yes, contrived
• But a good excuse to:
  • Brew beer
  • Buy more sensors and microprocessors
• Sorry, wife

Page 9: IoT Connected Brewery

Use Case - Calibration

• Sensors need to be calibrated continually
• Calibration takes resources and downtime
• Instead, use historical raw data
• Calibrate on known values
• For temperature sensors, use the boiling point and triple point
• Temperature sensor is typically linear between these points
• Fit a curve instead
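
The linear case on this slide can be sketched as a two-point calibration. This is a minimal illustration, not the method used in the demo: the function name, the raw readings, and the reference values (0.01 °C for the triple point of water, roughly 100 °C for boiling at standard pressure) are assumptions chosen for the example.

```python
def two_point_calibration(raw_low, raw_high, ref_low, ref_high):
    """Return a function that maps raw readings to calibrated values.

    Assumes the sensor response is linear between the two reference points.
    """
    if raw_high == raw_low:
        raise ValueError("the two raw reference readings must differ")
    gain = (ref_high - ref_low) / (raw_high - raw_low)
    offset = ref_low - gain * raw_low

    def calibrate(raw):
        return gain * raw + offset

    return calibrate


# Hypothetical raw readings taken at the two known reference temperatures.
calibrate = two_point_calibration(
    raw_low=1.2,     # sensor reading at the triple point
    raw_high=98.7,   # sensor reading in boiling water
    ref_low=0.01,    # triple point of water, in degrees C
    ref_high=100.0,  # approximate boiling point at standard pressure
)

print(round(calibrate(50.0), 2))
```

With more than two known points in the historical data, the same idea extends to fitting a curve through all of them instead of a straight line through two.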

Page 10: IoT Connected Brewery

Use Case - Optimize Models

• Kalman filter estimates a variable in the presence of noise
  • Need to know the accuracy of the sensor
  • Usually published by the manufacturer, but generalized
  • Accuracy can degrade over time
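
To show why the sensor accuracy matters, here is a minimal sketch of a scalar Kalman filter that tracks a slowly changing value such as a fermenter temperature. It assumes a constant-state model; `meas_var` stands in for the sensor accuracy (variance of the measurement noise), and all names and numbers are illustrative rather than taken from the talk.

```python
def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Filter a sequence of noisy scalar measurements.

    meas_var:    variance of the sensor noise (the sensor's accuracy)
    process_var: how much the true value may drift between samples
    x0, p0:      initial estimate and its variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is modelled as constant, so only uncertainty grows.
        p = p + process_var
        # Update: the gain weighs the new measurement against the prediction.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy temperature readings in degrees C.
readings = [19.8, 20.4, 20.1, 19.6, 20.3, 20.0, 19.9, 20.2]
smoothed = kalman_1d(readings, meas_var=0.09, process_var=1e-4, x0=readings[0])
print([round(v, 2) for v in smoothed])
```

If `meas_var` is set too low, the filter trusts each noisy reading too much; if the sensor has degraded and the value is now too optimistic, estimates get noisier, which is the motivation for re-estimating it from historical data.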

• PID controller
  • 3 parameters control performance
  • Parameters differ for each application
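
The three parameters are the proportional, integral, and derivative gains. The sketch below is a textbook discrete PID loop, assuming a fixed setpoint and a toy plant model; the class name, gains, and the 66 °C target are hypothetical and would need tuning for a real application.

```python
class PID:
    """Minimal discrete PID controller with gains kp, ki, kd."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measurement, dt):
        """Return the control output for one time step of length dt."""
        if dt <= 0:
            raise ValueError("dt must be positive")
        error = self.setpoint - measurement
        self._integral += error * dt
        # No derivative term on the first sample: there is no previous error yet.
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


# Toy usage: drive a temperature toward a hypothetical 66 degree C target.
pid = PID(kp=2.0, ki=0.1, kd=0.5, setpoint=66.0)
temp = 20.0
for _ in range(5):
    power = pid.update(temp, dt=1.0)
    temp += 0.05 * power  # toy plant response, not a real kettle model
    print(round(temp, 2))
```

Because the best gains depend on the plant being controlled, historical sensor data from each application is what makes tuning them per application possible.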

Page 11: IoT Connected Brewery

Use Case - Predictive Maintenance

• No, not just for heavy machinery
• Sensors fail too
• Can save money by not replacing too early
• More importantly, schedule downtime
• Better model with more data – same sensors, same application, many factories

Page 12: IoT Connected Brewery

Technologies

• Apache Kafka
  • Messaging framework: scalable, fault tolerant
  • Pub/sub
  • Retains data
• Apache Spark
  • General-purpose distributed processing framework
  • Multiple components, including Streaming
  • Streaming continually processes data
• Apache Kudu

Page 13: IoT Connected Brewery

Kudu for IoT

Why it matters

Page 14: IoT Connected Brewery

Kudu use cases

Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes

• Machine data analytics
  • Examples: IoT, connected cars, network threat detection
  • Workload: inserts, scans, lookups
• Time series
  • Examples: streaming market data, fraud detection / prevention, risk monitoring
  • Workload: inserts, updates, scans, lookups
• Online reporting
  • Example: operational data store (ODS)
  • Workload: inserts, updates, scans, lookups

Page 15: IoT Connected Brewery

How would we build the Analytics System Today?

• HDFS excels at:
  • Full table scans
  • Ad-hoc analytics

Diagram labels: Sensors, Kafka / Pub-sub, Events, Ingest, Today’s Partition, Yesterday’s Partition, Historic Data, Analyst

Ingest steps:
1. Have we accumulated enough data?
2. Flush into HDFS

Page 16: IoT Connected Brewery

Handling Late Arriving Data

Diagram labels: HDFS (Storage) with partitions /cars/01-13/, /cars/01-14/, /cars/01-15/; real-time writes

Late-arriving device: “I’m back! I’ll upload yesterday’s data!” (data from 1-13)
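
The difficulty here is that partitions are keyed by event date, so data from a device that reconnects later belongs in a partition that has already been written. The sketch below only illustrates that routing problem in plain Python; the `/cars` path layout follows the slide, while the function names, event fields, and the year are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def partition_path(event_time, root="/cars"):
    """Build the date-based partition path for an event's own timestamp."""
    return f"{root}/{event_time:%m-%d}/"


def route_events(events):
    """Group events by the partition their event time belongs to."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_path(event["event_time"])].append(event)
    return dict(partitions)


# Events received on 01-15; the second one was buffered on the device since 01-13.
received = [
    {"car": "A", "event_time": datetime(2016, 1, 15, 9, 0), "speed": 61},
    {"car": "B", "event_time": datetime(2016, 1, 13, 23, 50), "speed": 48},
]

for path, rows in sorted(route_events(received).items()):
    print(path, len(rows))
```

On immutable files, the row for /cars/01-13/ forces a rewrite or a later compaction of that partition, whereas a store that supports random writes can accept it as an ordinary insert.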

Page 17: IoT Connected Brewery

Hybrid big data analytics pipeline
Before Kudu

Diagram labels: Sensors, Kafka / Pub-sub, Events, HBase, Consumer, HDFS (Storage), Random Reads, Analytics, Analyst

• Snapshot & convert to Parquet
• Compact late-arriving data

Page 18: IoT Connected Brewery

Hybrid big data analytics pipeline
After Kudu

Diagram labels: Sensors, Kafka / Pub-sub, Events, Kudu, Consumer, Random Reads, Analytics, Analyst

Kudu supports a simultaneous combination of sequential and random reads and writes

Page 19: IoT Connected Brewery

What Kudu is *NOT*

• Not a SQL interface itself
  • It’s just the storage layer
• Not an application that runs on HDFS
  • It’s an alternative, native Hadoop storage engine
• Not a replacement for HDFS or HBase
  • Select the right storage for the right use case

Page 20: IoT Connected Brewery

Kudu Trade-Offs (vs HBase)

• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, Bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important

Page 21: IoT Connected Brewery

Demo