BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Reliable Data Ingestion in Big Data/IoT
Guido Schmutz
@gschmutz
Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-author of different books
Consultant, Trainer, Software Architect for Java, SOA & Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Our company.
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services, focusing on … technologies, in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:
OPERATION – Trivadis Services takes over the ongoing operation of your IT systems.
COPENHAGEN
MUNICH
LAUSANNE
BERN
ZURICH
BRUGG
GENEVA
HAMBURG
DÜSSELDORF
FRANKFURT
STUTTGART
FREIBURG
BASEL
VIENNA
With over 600 specialists and IT experts in your region.
14 Trivadis branches and more than 600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget: CHF 5.0 million
Financially self-supporting and sustainably profitable
Experience from more than 1,900 projects per year at over 800 customers
Technology on its own won't help you. You need to know how to use it properly.
Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? – Big Data + Real-Time = Stream Processing
Ever increasing volume and velocity – the Internet of Things (IoT) wave
Internet of Things (IoT): enabling communication between devices, people & processes to exchange useful information & knowledge that create value for humans
Term was first proposed by Kevin Ashton in 1999
Sources: The Economist; Ericsson, June 2016
What is Data Ingestion?
Acquiring data as it is produced from Data Source(s)
Transforming into a consumable form
Delivering the transformed data to the consuming system(s)
The challenge: Doing this continuously and at scale across a wide variety of sources and consuming systems
Ingress and egress are two other terms referring to data movement into and out of a system
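The three steps above can be sketched as a minimal pipeline. The function names and the sample records below are illustrative, not from the talk:

```python
import json

def acquire():
    """Step 1: acquire raw records as they are produced by the data source."""
    # In practice this would tail a file, poll a database, or subscribe
    # to a sensor feed; here we yield hard-coded CSV lines.
    yield "sensor-1,23.5"
    yield "sensor-2,19.0"

def transform(raw_line):
    """Step 2: transform the raw record into a consumable form."""
    sensor_id, value = raw_line.split(",")
    return {"sensor": sensor_id, "temperature": float(value)}

def deliver(record, sink):
    """Step 3: deliver the transformed record to the consuming system."""
    sink.append(json.dumps(record))

sink = []  # stand-in for HDFS, Kafka, a NoSQL store, ...
for line in acquire():
    deliver(transform(line), sink)

assert json.loads(sink[0]) == {"sensor": "sensor-1", "temperature": 23.5}
```

The challenge noted above is precisely that this loop has to run continuously, at scale, across many such sources and sinks at once.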
Lambda Architecture for Big Data

[Diagram: data sources – Location, Social, Clickstream, Sensor Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps, Weather Data, SQL Import – feed an Event Hub in front of a Hadoop cluster. A Batch Analytics layer (Distributed Filesystem, Parallel Processing, NoSQL) and a Streaming Analytics layer (Stream Analytics, NoSQL, Reference / Models) serve SQL, Search, Dashboards, BI Tools, an Enterprise Data Warehouse, and Online & Mobile Apps.]
[The same Lambda Architecture diagram, annotated with the three ingestion stages: Integrate → Sanitize / Normalize → Deliver.]
Continuous Ingestion – DataFlow Pipelines

[Diagram: IoT sensors reach the Event Hub natively, via an IoT gateway, or via an MQTT broker and a dataflow gateway; database sources are integrated through log-based CDC via a CDC gateway (Connect); file sources (logs) are tailed by a dataflow gateway; social and REST sources feed a messaging gateway. Topics and queues on the Event Hub feed stream processing and Big Data storage.]
DataFlow Pipeline
• Flow-based "programming"
• Ingest data from various sources
• Extract – Transform – Load
• High-throughput, straight-through data flows
• Data lineage
• Batch- or stream-processing
• Visual coding with flow editor
• Event Stream Processing (ESP), but not Complex Event Processing (CEP)
Source: Confluent
Continuous Ingestion – Integrating data sources
SQL Polling
Change Data Capture (CDC)
File Stream (File Tailing)
File Stream (Appender)
Sensor Stream
Ingestion with/without Transformation?
Zero Transformation
• No transformation, plain ingest, no schema validation
• Keep the original format – Text, CSV, …
• Allows storing data that may have errors in the schema

Format Transformation
• Prefer the name "Format Translation"
• Simply changes the format, e.g. from Text to Avro
• Does schema validation

Enrichment Transformation
• Adds new data to the message
• Does not change existing values
• Converts a value from one system to another and adds it to the message

Value Transformation
• Replaces values in the message
• Converts a value from one system to another and changes the value in-place
• Destroys the raw data!
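The four styles can be contrasted on a single record. The field names and the lookup table below are invented for illustration:

```python
import copy

raw = {"country": "CH", "amount": "42.50"}  # hypothetical source record

# Zero transformation: ingest the record untouched, no schema check.
zero = raw

# Format transformation: same content, different representation
# (string -> float here, standing in for Text -> Avro).
fmt = {"country": raw["country"], "amount": float(raw["amount"])}

# Enrichment transformation: add a derived field, keep existing values.
iso_names = {"CH": "Switzerland"}  # illustrative lookup table
enriched = copy.deepcopy(raw)
enriched["country_name"] = iso_names[raw["country"]]

# Value transformation: convert in place -- the raw value is lost.
valued = copy.deepcopy(raw)
valued["country"] = iso_names[valued["country"]]

assert enriched["country"] == "CH"          # raw value preserved
assert valued["country"] == "Switzerland"   # raw value destroyed
```

Note how only the value transformation makes the original value unrecoverable, which is why it is flagged above as destroying the raw data.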
Why is Data Ingestion Difficult?
Infrastructure Drift: physical and logical infrastructure changes rapidly.
Key challenges: Infrastructure Automation, Edge Deployment.

Structure Drift: data structures and formats evolve and change unexpectedly.
Key challenges: Consumption Readiness, Corruption and Loss.

Semantic Drift: data semantics change with evolving applications.
Key challenges: Timely Intervention, System Consistency.
Source: StreamSets
Challenges for Ingesting Sensor Data
Multitude of sensors
Real-Time Streaming
Multiple Firmware versions
Bad Data from damaged sensors
Regulatory Constraints
Data Quality
Source: Cloudera
Key Elements of Data Ingestion
Idempotence
Batching (Bulk)
Data Transformation
Compression
Availability and Recoverability
Reliable Data Transfer and Data Validation
Resource Consumption
Performance
Monitoring
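Idempotence, the first element in the list, is commonly achieved by deduplicating on a stable message key, so that a message redelivered after a retry is ingested only once. The sketch below is a toy in-memory version of that idea; in a real system the seen-keys set would be a key-value store or an upsert into the target:

```python
class IdempotentSink:
    """Accepts each message key at most once, so redeliveries are no-ops."""

    def __init__(self):
        self.seen = set()    # in practice: durable state, not process memory
        self.records = []

    def ingest(self, key, payload):
        if key in self.seen:
            return False     # duplicate delivery -> ignore
        self.seen.add(key)
        self.records.append(payload)
        return True

sink = IdempotentSink()
assert sink.ingest("msg-1", {"temp": 21.0}) is True
assert sink.ingest("msg-1", {"temp": 21.0}) is False  # redelivered after a retry
assert len(sink.records) == 1
```

With this property, an at-least-once transport (the common case for reliable ingestion) behaves like exactly-once from the consumer's point of view.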
How to implement an Event Hub? Apache Kafka to the rescue
• Distributed publish-subscribe messaging system
• Designed for processing of high-volume, real-time activity stream data (logs, metrics, social media, …)
• Stateless (passive) architecture, offset-based consumption
• Provides topics, but does not implement the JMS standard
• Initially developed at LinkedIn, now an Apache project
• Peak load on a single cluster: 2 million messages/sec, 4.7 Gigabit/sec inbound, 15 Gigabit/sec outbound
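Kafka's stateless, offset-based consumption means the broker only maintains an append-only log per topic partition, while each consumer tracks its own read position. The toy model below illustrates that idea in plain Python; it is not the Kafka API, just the consumption model:

```python
class PartitionLog:
    """Toy model of one Kafka topic partition: an append-only message log."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        """Producer side: append and return the new message's offset."""
        self.messages.append(msg)
        return len(self.messages) - 1

    def read(self, offset):
        """Consumer side: the broker stays passive, the consumer brings its offset."""
        return self.messages[offset:]

log = PartitionLog()
log.append("temp=21.0")
log.append("temp=21.3")

# Each consumer owns its offset, so consumers read independently, and a
# restarted consumer simply resumes from its last committed offset.
assert log.read(0) == ["temp=21.0", "temp=21.3"]
assert log.read(1) == ["temp=21.3"]
```

Because the log is retained regardless of who has read it, the same event hub can serve both the batch and the streaming layer of the Lambda architecture shown earlier.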
[Diagram: multiple producers publish to the Kafka cluster; multiple consumers read from it.]
Apache Flume
• Distributed data collection service
• Gets flows of data (like logs) from their source
• Aggregates them to where they have to be processed
• Sources: files, syslog, Avro, …
• Sinks: HDFS files, HBase, …
Source: Flume Documentation
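A minimal Flume agent wiring a syslog source to an HDFS sink might be configured as follows. The agent, source, channel, and sink names and the HDFS path are illustrative; see the Flume documentation for the full property set:

```properties
# One agent with one source, one channel, one sink (names are examples)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Syslog-over-TCP source feeding the channel
agent1.sources.src1.type = syslogtcp
agent1.sources.src1.host = 0.0.0.0
agent1.sources.src1.port = 5140
agent1.sources.src1.channels = ch1

# In-memory channel buffering events between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink writing events into date-partitioned directories
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
```

The memory channel trades durability for speed; a file channel would survive an agent crash at the cost of throughput.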
Apache Sqoop
• Sqoop exchanges data between an RDBMS and Hadoop
• It can import all tables, a single table, or a portion of a table into HDFS
• Does this very efficiently via a Map-only MapReduce job
• Result is a directory in HDFS containing comma-delimited text
• Sqoop can also export data from HDFS back to the database

$ sqoop import --connect jdbc:mysql://localhost/company \
    --username twheeler --password bigsecret \
    --warehouse-dir /mydata \
    --table customers
Oracle GoldenGate
• Provides a low-impact change data capture solution for Oracle and non-Oracle RDBMS
• Non-intrusive
• Low-Latency
• Open, modular Architecture
• Supports heterogeneous systems
• Oracle GoldenGate for Big Data provides Hadoop and Kafka Support
Apache Kafka Connect
• A tool for scalably and reliably streaming data between Apache Kafka and other data systems
• It is not an ETL framework
• Pre-built connectors available for data sources and data sinks:
  • JDBC (Source)
  • Oracle GoldenGate (Source)
  • MQTT (Source)
  • HDFS (Sink)
  • Elasticsearch (Sink)
  • MongoDB (Sink)
  • Cassandra (Source & Sink)
Source: Confluent
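A connector such as the JDBC source is typically configured with a small JSON document submitted to the Connect REST API. The connector name, connection details, and column below are placeholders (reusing the Sqoop example's database); the property names follow Confluent's JDBC source connector:

```json
{
  "name": "mysql-customers-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/company",
    "connection.user": "twheeler",
    "connection.password": "bigsecret",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "customers",
    "topic.prefix": "mysql-"
  }
}
```

With `mode` set to `incrementing`, the connector polls the table and publishes only rows whose `id` is higher than the last one it saw, giving continuous rather than one-shot ingestion.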
Apache NiFi & MiNiFi
• Originated at NSA as "Niagarafiles"
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data provenance
• Web-based user interface
• Apache MiNiFi focuses on the collection of data at the source of its creation
StreamSets Data Collector
• Founded by ex-Cloudera and ex-Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
  • Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by 'civilians'
• Relatively new – first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
Other Alternatives
• Spring Cloud Data Flow
• Node-RED
• Project Flogo
• Oracle Streaming Analytics
• Spark Streaming
• …
Oracle’s Service Bus as a consumer of Kafka
[Diagram: Service Bus 12c proxy services consume from Kafka topics fed by sensors/IoT, database CDC, and stream processing; pipelines route the messages to business services (Cloud, REST, WSDL, Kafka) that call cloud apps, mobile apps, web apps, and backend apps.]
Oracle’s Service Bus as a producer to Kafka
[Diagram: cloud, mobile, web, and backend apps call Service Bus 12c proxy services (REST, SOAP); pipelines route the messages to a Kafka business service that produces to Kafka, from where backend apps and SOA/BPM consume.]