Upload
aldrin-piri
View
183
Download
1
Embed Size (px)
Citation preview
HDF 3.0 Deep DiveAldrin Piri
© Hortonworks Inc. 2011 – 2016. All Rights Reserved2
AgendaQuick Overview of HDF & Apache NiFi
What’s new in HDF 3.0?
Latest Efforts in the Apache NiFi Community
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDF Data-In-Motion PlatformHDF provides the flow management, stream processing, and enterprise services
needed to collect, curate, analyze and act on data-in-motion across the data center and cloud.
FlowManagement
Dataacquisi*onanddelivery
Simpletransforma*onanddatarou*ng
Simpleeventprocessing
Endtoendprovenance
Edgeintelligence&bi-direc*onalcommunica*on
StreamProcessing
Scalabledatabrokerforstreamingapps
Scaleoutcomplextransforma6on
StreamAnaly2csPa8ernMatching
Prescrip6ve&Predic6veStreamAnaly6cs
ComplexEventProcessing
Con6nuousInsight
EnterpriseServices
Provisioning,Management,Monitoring,Security,Audit,Compliance,Governance,Mul: -tenancy
Java
Agent
C++
Agent
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Empower users to manage the collection and flow of data
© Hortonworks Inc. 2011 – 2016. All Rights Reserved5
The Problem at Hand
Producers A.K.A ThingsAnything
AND Everything
Internet!
Consumers• User• Storage• System• …More Things
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Moving data effectively is hard
Standards: http://xkcd.com/927/
© Hortonworks Inc. 2011 – 2016. All Rights Reserved7
Apache NiFi: A Primer
Key Features and Principles
• Guaranteed delivery
• Data buffering - Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
© Hortonworks Inc. 2011 – 2016. All Rights Reserved8
NiFi is based on Flow Based Programming (FBP)
FBP Term NiFi Term Description
Information Packet
FlowFile Each object moving through the system.
Black Box FlowFile Processor
Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer
Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler Flow Controller
Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet Process Group
A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi & Data Agnosticism
NiFi is data agnostic!
But, NiFi was designed understanding that users
can care about specifics and provides tooling
to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
Be conservative in what you do, be liberal in what you accept from others“
© Hortonworks Inc. 2011 – 2016. All Rights Reserved10
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved12
Apache NiFi - MiNiFi
Let me get the key parts of NiFi close to where data begins
Bidirectional data transfer
Greater illuminate journey with provenance
NiFi lives in the data center. Give it an enterprise server or a cluster of them.
MiNiFi lives as close to where data is born and is a guest on that device or system
© Hortonworks Inc. 2011 – 2016. All Rights Reserved13
Connecting the Drops
SOURCESREGIONAL
INFRASTRUCTURECORE
INFRASTRUCTURE
© Hortonworks Inc. 2011 – 2016. All Rights Reserved14
Managing data flow for a courier service
Physical Store
Gateway Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark / Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Client Libraries
Client Libraries
MiNiFi
MiNiFi
NiFi NiFi NiFi NiFi NiFi NiFi
Client Libraries
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
What’s new in HDF 3.0?
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Record Based Processing Mechanism
Why?– Improve operational efficiency – Intuitive and flexible filtering/routing strategies powered by ‘QueryRecord’– Simpler dataflow design
What?– Introduce ‘record’ based operation model– ‘RecordReader’ and ‘RecordWriter’ controller services– A series of processors supporting the reader/writer processing mechanism
• Plugin record reader to de-serialize bytes to record objects• Plugin record writer to serialize record objects to bytes• Enable operations against in-memory record objects
How?
Record Processing
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Record Reader CS
Record Processing
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Record Reader/Writer Schema Access Strategy
Record Processing
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Record Based Processing
Data Source
Centralized Schema
Repository
Lookup Schema
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Schema Registry? What Value Does it Provide?
What is Schema Registry?
• A shared repository of schemas that allows applications to flexibly interact with each other - in order to save or retrieve schemas for the data they need to access
What Value does Schema Registry Provide?
1. Data Governance
• Provide reusable schema (centralized registry)
• Define relationship between schemas (version management)
• Enable generic format conversion, and generic routing (schema validation)
2. Operational Efficiency
To avoid attaching schema to every piece of data (centralized registry)
Consumers and producers can evolve at different rates (version management)
Data quality (schema validation)
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Schema Registry Concepts
Schema Group
• Group Name
• Schema Group - A logical grouping/container for similar type of schemas or based any criteria that the customer has from managing the schemas
Schema Metadata 1
• Schema Name
• Schema Type
• Description
• Compatibility Policy
• Serializers
• Deserializers
• Schema Meta - Metadata associated with a named
schema.
Schema Version 1
• Schema Version
• Schema Text
• Schema Version- The actual versioned schema associated a schema meta definition
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Schema Registry
Schema Registry Component Architecture
SR Web Server
Schema Registry
Web App
REST API
Pluggable Storage
Schema Metadata Storage
Serializer/Deserializer Jar Storage
Supported Meta Stores
mySQL In-Memory
Supported Jar Storage
Local File System
HDFS
Schema Registry Client
Java Client
Integrations
NiFi Processors Kafka Se/DeSerializersSAM
Processors
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Integration of Schema Registry with HDF
NiFi Processors for Schema Registry (HDF 2.2+)
Fetch Schema
Serialize/Deserialize with Schema
SAM processors for Schema Registry (HDF 2.2, Available in Early Access Bundle)
Lookup Schema of a Kafka Topic
Atlas integration with Schema (Upcoming)
Just like Atlas pull schema info from Hive MetaStore, Atlas can now capture schema, format and semantic metadata from events in HDF via the registry. Coming in HDF 2.3
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
‘QueryRecord’ Processor
Record Processing
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
‘QueryRecord’ Use Case
Record Processing
Use Case Background
– Oil & Gas: log data collection from remote drilling station
– expensive BGAN satellite network connectivity (64KB)
Problem Statement
– 1000 sensors, costly to send ALL data ALL THE TIME
– Collect all sensor data only if: upstream pump is off, BUT you have an elevated pipe pressure
Solution
– Define routing strategy using ’QueryRecord’
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Demo
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Component versioning
Component Version
Why?– Foundational work to enable extension registry
– Foundational work to enhance flow migration experience
What?– Support multiple versions of the same NAR in a single NIFI instance
– E.g. Hadoop NAR version A: Apache Hadoop client lib; Hadoop NAR version B: proprietary Hadoop client lib
How?
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Component versioning
Component Version
© Hortonworks Inc. 2011 – 2016. All Rights Reserved29
Latest Efforts in the Apache NiFi Community
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Addressing common asks
How can I … How do I ... What about ...
Version my flows?
Drive CI/CD processes?
Migrate flows between environments?
Provision distributions of NiFi with a set of components?
Make reference datasets/extensions available to the entirety of my data
flow?
Certify / Audit / Sign-off on flows as compliant per regulations?
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Capturing the essence of a flow in your organization
The n-dimensions of data flow
Consider a flowfile to be a singular event at a given juncture in its processing
A flow is the directed graph of processing at a given point in time
With each component’s:
Configuration
Version
Referenced Assets
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Introducing
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Registry is an enabler
SDLC
Manage variables, sensitive properties for environments
Extension Registry
Association/tagging of data with the flow that created it
Enhanced Command and Control of MiNiFi instances
© Hortonworks Inc. 2011 – 2016. All Rights Reserved40
Apache NiFi - MiNiFi: Centralized Command & Control (C2)
Provide flow updates, information and assets to instances where they live
Act as a gateway to/from network enclaves
Provide a user interface/experience for design & deploy and monitoring
Extend the reach of user experience and operations
© Hortonworks Inc. 2011 – 2016. All Rights Reserved41
Docker Containerization
First Apache community released version in 1.2.0
https://hub.docker.com/r/apache/nifi
Provide Docker images for all components of the NiFi ecosystem
Further supplement operations
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Evolution of Apache NiFi
Our core substrate for data flow is NiFi & MiNiFi
Command and Control facilitates operations and management of components
Registry for common tasks with disparate resources across the NiFi ecosystem
© Hortonworks Inc. 2011 – 2016. All Rights Reserved43
Why the Apache NiFi Ecosystem?
Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes
Provide components and a platform with common tooling and extensions that are commonly needed but be flexible for extension in all aspects– Allow organizations to integrate with their existing infrastructure
Empower folks managing your infrastructure to make changes and reason about issues that are occurring– Data Provenance to show context and data’s journey
– User Interface/Experience a key component
© Hortonworks Inc. 2011 – 2016. All Rights Reserved44
Thank You