Reliable Data Ingestion in Big Data/IoT

Guido Schmutz (@gschmutz)


Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz

Our company.

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on … and … technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

OPERATION

Locations: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich

With over 600 specialists and IT experts in your region.

14 Trivadis branches and more than 600 employees

200 Service Level Agreements

Over 4,000 training participants

Research and development budget: CHF 5.0 million

Financially self-supporting and sustainably profitable

Experience from more than 1,900 projects per year at over 800 customers

Technology on its own won't help you. You need to know how to use it properly.

Introduction

Big Data Definition (4 Vs)

+ Time to action? Big Data + Real-Time = Stream Processing

Characteristics of Big Data: its Volume, Velocity and Variety in combination

Ever increasing volume and velocity - Internet of Things (IoT) Wave

Internet of Things (IoT): Enabling communication between devices, people & processes to exchange useful information & knowledge that create value for humans

Term was first proposed by Kevin Ashton in 1999

Source: The Economist

Source: Ericsson, June 2016

What is Data Ingestion?

Acquiring data as it is produced from Data Source(s)

Transforming into a consumable form

Delivering the transformed data to the consuming system(s)

The challenge: Doing this continuously and at scale across a wide variety of sources and consuming systems

Ingress and egress are two other terms referring to data movement into and out of a system
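The three steps named on this slide (acquire, transform, deliver) can be sketched as a small generator pipeline. This is only an illustration under stated assumptions: the function names, the `sensor_id,temperature` line format, and the in-memory list standing in for the consuming system are all hypothetical and not taken from the talk.

```python
from typing import Iterable, Iterator


def acquire(source: Iterable[str]) -> Iterator[str]:
    """Acquire raw records from a data source as they are produced."""
    for line in source:
        yield line.strip()


def transform(raw_records: Iterable[str]) -> Iterator[dict]:
    """Turn 'sensor_id,temperature' lines into a consumable dict form."""
    for raw in raw_records:
        sensor_id, temperature = raw.split(",")
        yield {"sensor_id": sensor_id, "temperature": float(temperature)}


def deliver(records: Iterable[dict], sink: list) -> int:
    """Deliver transformed records to the consuming system (here: a list)."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


# Hypothetical input; in practice this would be a socket, file, or queue.
raw_source = ["s1,21.5\n", "s2,19.0\n"]
sink: list = []
delivered = deliver(transform(acquire(raw_source)), sink)
```

Because each stage is a generator, records flow through one at a time, which is the shape a continuous ingestion pipeline needs; doing this at scale and across many source types is where the tools later in the deck come in.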


Lambda Architecture for Big Data

[Figure: Lambda architecture. Data sources: Location, Social, Clickstream, Sensor Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps, Weather Data. Ingestion: Event Hub, SQL Import. Batch Analytics: Hadoop Cluster with Distributed Filesystem, Parallel Processing, NoSQL. Streaming Analytics: Stream Analytics, NoSQL, Reference / Models. Consumers: SQL, Search, Dashboard, BI Tools, Enterprise Data Warehouse, Online & Mobile Apps.]

Continuous Ingestion - DataFlow Pipelines

[Figure: the Lambda architecture diagram extended with ingestion pipelines in three stages: Integrate, Sanitize / Normalize, Deliver. Ingestion components shown: IoT Sensor, IoT GW, MQTT Broker, REST, DB Source, CDC, CDC GW, Log CDC, Connect, Native, File Source, Log, Social, Dataflow, Dataflow GW, Messaging GW, Queue, Topic, Event Hub, Big Data Log, Stream Processing.]

DataFlow Pipeline

• Flow-based "programming"
• Ingest data from various sources
• Extract – Transform – Load
• High-throughput, straight-through data flows
• Data lineage
• Batch or stream processing
• Visual coding with flow editor
• Event Stream Processing (ESP) but not Complex Event Processing (CEP)

Source: Confluent

Continuous Ingestion: Integrating data sources

• SQL Polling
• Change Data Capture (CDC)
• File Stream (File Tailing)
• File Stream (Appender)
• Sensor Stream
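Of the integration styles listed, SQL polling is the simplest to show in a few lines. The sketch below assumes a hypothetical `events` table with a monotonically increasing `id` column and uses an in-memory SQLite database purely for illustration; the table, column, and function names are not from the talk.

```python
import sqlite3
from typing import List, Tuple


def poll_new_rows(conn: sqlite3.Connection, last_id: int) -> Tuple[List[tuple], int]:
    """Return rows with id greater than last_id, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


# Hypothetical source table, created here only so the example is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

first_batch, watermark = poll_new_rows(conn, last_id=0)

# A row written after the first poll is picked up by the next poll only.
conn.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
second_batch, watermark = poll_new_rows(conn, watermark)
```

A polling query of this kind only sees inserted rows; updates and deletes to already-read rows go unnoticed, which is one reason log-based Change Data Capture appears as a separate option in the list.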


Ingestion with/without Transformation?


Zero Transformation
• No transformation, plain ingest, no schema validation
• Keep the original format: Text, CSV, …
• Allows storing data that may have errors in the schema

Format Transformation
• Preferred name: Format Translation
• Simply change the format
• Change format from Text to Avro
• Does schema validation

Enrichment Transformation
• Add new data to the message
• Do not change existing values
• Convert a value from one system to another and add it to the message

Value Transformation
• Replaces values in the message
• Convert a value from one system to another and change the value in-place
• Destroys the raw data!
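The four options can be compared side by side on a single record. This is a minimal sketch under assumptions: the record layout (`sensor id, value, unit`) is hypothetical, and JSON stands in for Avro as the target format because Avro would require a third-party library.

```python
import copy
import json

raw_line = "s1,70.0,F"  # hypothetical CSV record: sensor id, value, unit


def zero_transformation(line: str) -> str:
    """Plain ingest: keep the original text, no schema validation."""
    return line


def format_translation(line: str) -> str:
    """Change only the format (CSV to JSON) and validate the field count."""
    sensor_id, value, unit = line.split(",")  # ValueError on a wrong field count
    return json.dumps({"sensor_id": sensor_id, "value": float(value), "unit": unit})


def enrichment_transformation(record: dict) -> dict:
    """Add a converted value; existing values stay untouched."""
    enriched = copy.deepcopy(record)
    enriched["value_celsius"] = round((record["value"] - 32) * 5 / 9, 2)
    return enriched


def value_transformation(record: dict) -> dict:
    """Replace the value in place; the raw reading is gone from the message."""
    replaced = copy.deepcopy(record)
    replaced["value"] = round((record["value"] - 32) * 5 / 9, 2)
    replaced["unit"] = "C"
    return replaced


record = json.loads(format_translation(raw_line))
enriched = enrichment_transformation(record)
replaced = value_transformation(record)
```

After enrichment the message still carries the original Fahrenheit reading next to the converted one, whereas after value transformation only the Celsius value remains, which is the "destroys the raw data" point from the slide.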


Challenges

Why is Data Ingestion Difficult?

Infrastructure Drift
Physical and logical infrastructure changes rapidly
Key challenges: Infrastructure Automation, Edge Deployment

Structure Drift
Data structures and formats evolve and change unexpectedly
Key challenges: Consumption Readiness, Corruption and Loss

Semantic Drift
Data semantics change with evolving applications
Key challenges: Timely Intervention, System Consistency

Source: StreamSets
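Structure drift in particular lends itself to a small check at ingestion time. The sketch below is an assumption-laden illustration, not a StreamSets feature: the expected schema, field names, and the wording of the findings are all hypothetical.

```python
from typing import List

# Hypothetical expected schema for incoming records.
EXPECTED_FIELDS = {"sensor_id": str, "temperature": float}


def detect_structure_drift(record: dict) -> List[str]:
    """Return a list of human-readable drift findings for one record."""
    findings = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            findings.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            findings.append(f"type changed: {name}")
    for name in record:
        if name not in EXPECTED_FIELDS:
            findings.append(f"unexpected field: {name}")
    return findings


ok = detect_structure_drift({"sensor_id": "s1", "temperature": 21.5})
# A renamed field shows up as one missing and one unexpected field.
drifted = detect_structure_drift({"sensor_id": "s1", "temp": "21.5"})
```

Reporting findings instead of raising an exception lets a pipeline route drifted records aside for inspection rather than dropping them, which relates to the "Corruption and Loss" challenge above.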

Challenges for Ingesting Sensor Data


Multitude of sensors

Real-Time Streaming

Multiple Firmware versions

Bad Data from damaged sensors

Regulatory Constraints

Data Quality

Source: Cloudera

Key Elements of Data Ingestion


Idempotence

Batching (Bulk)

Data Transformation

Compression

Availability and Recoverability

Reliable Data Transfer and Data Validation

Resource Consumption

Performance

Monitoring
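Two of these elements, batching and idempotence, work together when a delivery has to be retried. The following sketch assumes every record carries a unique `event_id` (a hypothetical field) and uses an in-memory dictionary in place of a real target store.

```python
from typing import Iterable, Iterator


def batches(records: Iterable[dict], size: int) -> Iterator[list]:
    """Group records into lists of at most `size` items (bulk writes)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


class IdempotentSink:
    """Stores records by key so that redelivering a batch has no extra effect."""

    def __init__(self) -> None:
        self.rows: dict = {}

    def write_batch(self, batch: list) -> None:
        for record in batch:
            self.rows[record["event_id"]] = record  # upsert keyed on event_id


events = [{"event_id": i, "value": i * 10} for i in range(5)]
sink = IdempotentSink()
all_batches = list(batches(events, size=2))
for batch in all_batches:
    sink.write_batch(batch)
sink.write_batch(all_batches[0])  # simulate a retry after a lost acknowledgement
```

Because the write is keyed on `event_id`, the retried batch overwrites identical rows instead of creating duplicates, so an at-least-once transport still yields each record once in the sink.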


Implementing Event Hub – Apache Kafka

How to implement an Event Hub? Apache Kafka to the rescue

• Distributed publish-subscribe messaging system

• Designed for processing of high-volume, real time activity stream data (logs, metrics, social media, …)

• Stateless (passive) architecture, offset-based consumption

• Provides Topics, but does not implement JMS standard

• Initially developed at LinkedIn, now part of Apache

• Peak load on a single cluster: 2 million messages/sec, 4.7 Gigabits/sec inbound, 15 Gigabits/sec outbound
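The "stateless (passive) architecture, offset-based consumption" bullet can be illustrated with a toy in-memory model. This is not the Kafka client API: the `TopicLog` and `Consumer` classes are hypothetical, and real Kafka adds partitions, replication, and committed offsets on top of the same basic idea.

```python
class TopicLog:
    """Append-only log; in this toy model the log keeps no per-consumer state."""

    def __init__(self) -> None:
        self._messages: list = []

    def append(self, message: str) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int) -> list:
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own position in the log."""

    def __init__(self, log: TopicLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_messages: int = 10) -> list:
        messages = self.log.read(self.offset, max_messages)
        self.offset += len(messages)
        return messages


log = TopicLog()
for msg in ["m0", "m1", "m2"]:
    log.append(msg)

fast, slow = Consumer(log), Consumer(log)
fast_first = fast.poll()
slow_first = slow.poll(max_messages=1)
slow.offset = 0  # rewinding the offset replays messages
slow_replay = slow.poll()
```

Since reading does not remove anything from the log, consumers can proceed at different speeds and replay earlier messages by resetting their offset, which is the main difference from a queue that deletes messages on delivery.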

[Figure: producers publish to a Kafka cluster; consumers read from it]

Implementing Data Flow

Apache Flume

distributed data collection service

gets flows of data (like logs) from their source

aggregates and moves them to where they have to be processed

Sources: files, syslog, avro, …

Sinks: HDFS files, HBase, …

Source: Flume Documentation

Apache Sqoop


• Sqoop exchanges data between an RDBMS and Hadoop

• It can import all tables, single table, or a portion of a table into HDFS

• Does this very efficiently via a Map-only MapReduce job

• Result is a directory in HDFS containing comma-delimited text

• Sqoop can also export data from HDFS back to the database

$ sqoop import --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers

Oracle GoldenGate

• Provides a low-impact change data capture solution for Oracle and non-Oracle RDBMS

• Non-intrusive

• Low-Latency

• Open, modular Architecture

• Supports heterogeneous systems

• Oracle GoldenGate for Big Data provides Hadoop and Kafka Support

Apache Kafka Connect

• A tool for scalably and reliably streaming data between Apache Kafka and other data systems

• Is not an ETL framework

• Pre-built connectors available for data sources and data sinks
  • JDBC (Source)
  • Oracle GoldenGate (Source)
  • MQTT (Source)
  • HDFS (Sink)
  • Elasticsearch (Sink)
  • MongoDB (Sink)
  • Cassandra (Source & Sink)

Source: Confluent

Apache NiFi & MiNiFi

• Originated at NSA as NiagaraFiles
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data Provenance
• Web-based user interface

• Apache MiNiFi focuses on the collection of data at the source of its creation

StreamSets Data Collector

• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
  • Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by ‘civilians’
• Relatively new - first public release September 2015
• So far, vast majority of commits are from StreamSets staff

Other Alternatives


• Spring Cloud Data Flow

• Node-RED

• Project Flogo

• Oracle Streaming Analytics

• Spark Streaming

• …


What about existing Integration Platforms?

Oracle’s Service Bus as a consumer of Kafka

[Figure: Service Bus 12c consuming from Kafka. Event side: Sensor / IoT, Database with DB CDC, Stream Processing, Kafka. Service Bus: Proxy Service (Kafka), Pipeline, Routing, Business Services (Cloud, REST, WSDL). Connected systems: Cloud Apps via Cloud API, Mobile Apps, Web Apps, Backend Apps (REST, WSDL).]

Oracle’s Service Bus as a producer to Kafka

[Figure: Service Bus 12c producing to Kafka. Service Bus: Proxy Services (REST, SOAP), Pipeline, Routing, Business Services (Cloud, REST, Kafka). Connected systems: Cloud Apps via Cloud API, Mobile Apps, Sensor / IoT, Web Apps, Backend Apps (REST), SOA / BPM, Kafka.]

Hybrid Integration Platforms (HIP) needed


Source: Gartner

Trivadis @ DOAG 2016

Booth: 3rd Floor, next to the escalator
Know-how, T-Shirts, Contest and Trivadis Power to go
We look forward to your visit
Because with Trivadis you always win!