Harnessing Data-in-Motion with Hortonworks DataFlow
Introduction to HDF 2.0
Haimo Liu, Product Manager
Aldrin Piri, Technical Staff
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
HDF 2.0: Flow Management
– NiFi basics
– NiFi use cases
– NiFi demos
HDF 2.0: Streaming Analytics
Simplistic View of Enterprise Data Flow
Data Flow
Acquire Data
Process and Analyze Data
Store Data
Interacting with different business partners and customers
Realistic View of Enterprise Data Flow
• Visual Command and Control: for agile and immediate creation, configuration, and control of dataflows
• Data Lineage (Provenance): ensures trust of your data
• Data Prioritization: because not all data is of equal importance
• Data Buffering/Back-Pressure: since not all senders/receivers/connections work perfectly all the time
• Control Latency vs Throughput: adapt to different situations with different requirements
• Secure Control Plane/Data Plane: security of data and data access
• Scale out Clustering: scalability
• Extensibility: ecosystem flexibility and growth
Apache NiFi: Designed for 8 challenges of global enterprise dataflow
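To make the prioritization and back-pressure challenges concrete, here is a minimal Python sketch of a bounded, prioritized queue sitting between two processing steps. It is only an illustration under stated assumptions, not NiFi's implementation: the `Connection` class, the `max_queued` threshold, and the "lower number is more important" convention are all hypothetical names and choices made for this example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Connection:
    """Hypothetical bounded, prioritized queue between two processors."""

    max_queued: int  # assumed back-pressure threshold (number of queued items)
    _heap: list = field(default_factory=list, init=False)
    _seq: count = field(default_factory=count, init=False)

    def is_full(self) -> bool:
        return len(self._heap) >= self.max_queued

    def offer(self, item, priority: int) -> bool:
        """Enqueue an item unless back-pressure is engaged.

        Lower priority numbers are treated as more important (an assumption).
        Returns False when the queue is full, so the sender must wait or buffer.
        """
        if self.is_full():
            return False
        # The sequence number keeps insertion order among equal priorities.
        heapq.heappush(self._heap, (priority, next(self._seq), item))
        return True

    def poll(self):
        """Return the most important queued item, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


conn = Connection(max_queued=2)
accepted = [
    conn.offer("routine reading", priority=5),
    conn.offer("urgent alert", priority=1),
    conn.offer("another reading", priority=5),  # refused: queue is full
]
first_out = conn.poll()
```

In this sketch the third offer is refused because the queue has reached its threshold, and the urgent item leaves the queue first even though it arrived second.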
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
Use Cases for Apache NiFi
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
Terminology
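The three terms can be illustrated with a small Python sketch. This is a toy model rather than NiFi's Java API: the `FlowFile` dataclass, the `add_size_attribute` function standing in for a processor, and the `deque` standing in for a connection are all hypothetical names chosen for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Unit of data: content plus key/value attributes."""

    content: bytes
    attributes: dict = field(default_factory=dict)


def add_size_attribute(flow_file: FlowFile) -> FlowFile:
    """Toy 'processor': reads the content and records its size as an attribute."""
    updated = dict(flow_file.attributes, fileSize=str(len(flow_file.content)))
    return FlowFile(flow_file.content, updated)


# Toy 'connection': a queue that links an upstream step to the processor.
connection = deque()
connection.append(FlowFile(b"Hello world", {"filename": "example.txt", "path": "./"}))

result = add_size_attribute(connection.popleft())
```

After the processor runs, `result` carries the same content as the queued FlowFile plus a `fileSize` attribute derived from it.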
HTTP Data
Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
Content-Type: text/html
Content:
Hello world XXXXXXXXXXXXXXXXXXXXXXXXXXXX

FlowFile
Header:
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
Content:
0101010101110101010101010101 (Binary)
Analogy: FlowFiles are like HTTP Data
1. Drag and drop processors to build a flow
2. Start, stop, and configure components in real time
3. View errors and corresponding error messages
4. View statistics and health of data flow
5. Create templates of common processors & connections
Create, Run, View, Start, Stop, Change, Fix Dataflows in Real-Time
Apache NiFi Demo: Tail Logs, Route on Content, Buffer in Kafka, Deliver to HDFS
What is Data Provenance and Why is it Important?
[Figure: data lineage traced from BEGIN to END]
IT and Cloud Operators
• Understand traceability, lineage
• Enable recovery and replay
Compliance Regulations
• Provide an audit trail
• Remediation capabilities
Provenance Enables Easy Access and Traceability of Changes
Need Fine-Grained Security and Compliance?
Security
• Secured authentication
• Enterprise authorization services – entitlements change often
• Encrypted content, encrypted communications
• People and systems with different roles require different access levels
• Tagged/classified data
Repositories - Pass by reference
Repositories – Copy on Write
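The two repository ideas named on these slides, pass by reference and copy on write, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not NiFi's actual repository code: `ContentRepository`, `FlowFileRecord`, `claim_id`, and `transform` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, replace


class ContentRepository:
    """Hypothetical append-only store: content is written once, never edited."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        claim_id = self._next_id
        self._claims[claim_id] = data
        self._next_id += 1
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


@dataclass(frozen=True)
class FlowFileRecord:
    """Holds a pointer (claim_id) to the content, not the content itself."""

    claim_id: int


def transform(repo: ContentRepository, record: FlowFileRecord, fn) -> FlowFileRecord:
    """Copy on write: store the modified bytes as a new claim and repoint."""
    new_claim = repo.write(fn(repo.read(record.claim_id)))
    return replace(record, claim_id=new_claim)


repo = ContentRepository()
original = FlowFileRecord(repo.write(b"hello"))

# Pass by reference: the clone shares the same stored content.
clone = replace(original)

# Copy on write: changing the content creates a new claim; the old one is untouched.
upper = transform(repo, clone, bytes.upper)
```

In this model, passing a record between steps moves only a small pointer, and the original bytes stay available after a modification, which is what makes replaying earlier content possible.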
Agenda
HDF 2.0 Flow Management
HDF 2.0 Platform Evolution
– Product offering
– Example use case
Hortonworks DataFlow Manages Data in Motion
[Figure: data in motion flows from Sources through Regional Infrastructure to Core Infrastructure. Sources end: constrained, high-latency, localized context. Core end: hybrid (cloud / on-premises), low-latency, global context.]
DataFlow Management and Stream Processing
[Figure: Sources, Regional Infrastructure, and Core Infrastructure. Sources end: constrained, high-latency, localized context. Core end: hybrid (cloud / on-premises), low-latency, global context.]
Edge Intelligence with Apache MiNiFi
Key Features
• Guaranteed delivery
• Data buffering
‒ Backpressure
‒ Pressure release
• Prioritized queuing
• Flow specific QoS
‒ Latency vs. throughput
‒ Loss tolerance
• Data provenance
• Recovery / recording a rolling log of fine-grained history
• Designed for extension
Different from Apache NiFi
• Design and Deploy
• Warm re-deploys
NiFi vs. MiNiFi Java Agent
MiNiFi: NiFi Framework, Components
NiFi: NiFi Framework, User Interface, Components
Example: Company X provides alerting services when a user's resting heart rate is higher than a threshold
Real-Time Insights Require DataFlow Mgmt and Stream Processing
[Figure: Company X Cloud Instances 1, 2, and 3 each acquire data. Data is acquired across the cloud instances, then parsed, filtered, validated, enriched, and routed to the Core Data Center, where Analytics/Pattern Match feeds a Data Store, Alerts, and Dashboards/Visualization. Legend: Flow Management, Stream Processing.]
Data in Motion Needs Dataflow Management and Stream Processing
Flow Management (NiFi, MiNiFi)
• Acquire data from the cloud instances of various wearable devices
• Move data from customer cloud instances to the on-premises instance
• Perform intelligent routing and filtering of data. The routing and filtering rules will often change at run-time.
• Deliver the data to various downstream systems. New downstream apps will keep appearing, and the data should be fed to each one when it comes online.
• Parse the device data into a standardized format that downstream systems can understand
• Enrich the data with contextual information, including patient/customer info (age, sex, etc.)
Stream Processing (Storm, Kafka)
• Recognize the pattern when the resting heart rate exceeds a certain threshold (the insight), and then create an alert/notification.
• Run an outlier detection model on the incoming heart rate stream. If the score is above a certain threshold, alert on the heart rate.
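The threshold rule in this use case (alert when the resting heart rate exceeds a certain value) can be sketched in plain Python. In the deck this logic belongs to the stream processing tier (Storm, Kafka); the code below only illustrates the rule itself and does not use those systems. The `Reading` type, the `alerts` function, the alert message format, and the threshold of 100 bpm are all assumptions made for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """Hypothetical parsed and enriched device reading."""

    user_id: str
    resting_bpm: int


def alerts(readings: Iterable[Reading], threshold_bpm: int) -> Iterator[str]:
    """Yield one alert message per reading above the threshold.

    Works on any iterable, so it can consume an unbounded stream lazily.
    """
    for reading in readings:
        if reading.resting_bpm > threshold_bpm:
            yield f"ALERT user={reading.user_id} resting_bpm={reading.resting_bpm}"


sample = [Reading("u1", 62), Reading("u2", 118), Reading("u1", 64)]
fired = list(alerts(sample, threshold_bpm=100))  # 100 is an assumed threshold
```

Because `alerts` is a generator, the same function could be fed a live stream of readings instead of the small sample list used here.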
Use Cases for Data in Motion
When Only DataFlow Management Is Required
Use Cases for Data-in-Motion Using DataFlow Mgmt
• Data Ingestion
• Edge Intelligence
• First Mile Problem
• Physical Data Movement
• Simple event processing such as Route, Filter, Enrich, Transform, etc.
When Both DataFlow Management and Stream Processing Are Required
Use Cases for Data-in-Motion Using DataFlow Mgmt and Stream Processing
• Flow Management to deliver data for Stream Processing
• PLUS: Complex pattern matching on unbounded streams of data.
HDF 2.0: Data-in-Motion Platform
[Figure: HDF 2.0 platform. Data in motion: IoT data sources, MiNiFi agents, and NiFi instances (flow management); NiFi, Kafka, Storm, and others (flow management + stream processing). Data at rest: AWS, Azure, Google Cloud, Hadoop. Enterprise services: Ambari, Ranger, other services.]
New Stream Processing Features HDF 2.0
Categories: Developer Productivity, Enterprise Readiness, Operational Simplicity
• New Storm Connectors
• Storm-Kafka Spout using new client APIs
• Storm Distributed Log Search
• Storm Dynamic Worker Profiling
• Kafka Grafana Integration
• Storm Grafana Integration
• Improved Nimbus HA
• Storm Automatic Back Pressure
• Storm Distributed Cache
• Storm Windowing and State Management
• Storm Performance Improvements
• Improved Kafka SASL
• Storm Topology Event Inspector
• Storm Resource Aware Scheduling
• Storm Dynamic Log Levels
• Pacemaker Storm Daemon
• Kafka Rack Awareness
For More Info: https://community.hortonworks.com/
Hortonworks Community Connection: Data Ingestion and Streaming