44
HDF 3.0 Deep Dive Aldrin Piri

Future of Data New Jersey - HDF 3.0 Deep Dive

Embed Size (px)

Citation preview

Page 1: Future of Data New Jersey - HDF 3.0 Deep Dive

HDF 3.0 Deep DiveAldrin Piri

Page 2: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved2

AgendaQuick Overview of HDF & Apache NiFi

What’s new in HDF 3.0?

Latest Efforts in the Apache NiFi Community

Page 3: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

HDF Data-In-Motion PlatformHDF provides the flow management, stream processing, and enterprise services

needed to collect, curate, analyze and act on data-in-motion across the data center and cloud.

FlowManagement

Dataacquisi*onanddelivery

Simpletransforma*onanddatarou*ng

Simpleeventprocessing

Endtoendprovenance

Edgeintelligence&bi-direc*onalcommunica*on

StreamProcessing

Scalabledatabrokerforstreamingapps

Scaleoutcomplextransforma6on

StreamAnaly2csPa8ernMatching

Prescrip6ve&Predic6veStreamAnaly6cs

ComplexEventProcessing

Con6nuousInsight

EnterpriseServices

Provisioning,Management,Monitoring,Security,Audit,Compliance,Governance,Mul: -tenancy

Java

Agent

C++

Agent

Page 4: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Empower users to manage the collection and flow of data

Page 5: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved5

The Problem at Hand

Producers A.K.A ThingsAnything

AND Everything

Internet!

Consumers• User• Storage• System• …More Things

Page 6: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Moving data effectively is hard

Standards: http://xkcd.com/927/

Page 7: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved7

Apache NiFi: A Primer

Key Features and Principles

• Guaranteed delivery

• Data buffering - Backpressure

- Pressure release

• Prioritized queuing

• Flow specific QoS- Latency vs. throughput

- Loss tolerance

• Data provenance

• Recovery/recording a rolling log of fine-grained history

• Visual command and control

• Flow templates

• Pluggable/multi-role security

• Designed for extension

• Clustering

Page 8: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved8

NiFi is based on Flow Based Programming (FBP)

FBP Term NiFi Term Description

Information Packet

FlowFile Each object moving through the system.

Black Box FlowFile Processor

Performs the work, doing some combination of data routing, transformation, or mediation between systems.

Bounded Buffer

Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates.

Scheduler Flow Controller

Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.

Subnet Process Group

A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.

Page 9: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

NiFi & Data Agnosticism

NiFi is data agnostic!

But, NiFi was designed understanding that users

can care about specifics and provides tooling

to interact with specific formats, protocols, etc.

ISO 8601 - http://xkcd.com/1179/

Robustness principle

Be conservative in what you do, be liberal in what you accept from others“

Page 10: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved10

Page 11: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 12: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved12

Apache NiFi - MiNiFi

Let me get the key parts of NiFi close to where data begins

Bidirectional data transfer

Greater illuminate journey with provenance

NiFi lives in the data center. Give it an enterprise server or a cluster of them.

MiNiFi lives as close to where data is born and is a guest on that device or system

Page 13: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved13

Connecting the Drops

SOURCESREGIONAL

INFRASTRUCTURECORE

INFRASTRUCTURE

Page 14: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved14

Managing data flow for a courier service

Physical Store

Gateway Server

Mobile Devices

Registers

Server Cluster

Distribution Center

Kafka

Core Data Center at HQ

Server Cluster

Others

Storm / Spark / Flink / Apex

Kafka

Storm / Spark / Flink / Apex

On Delivery Routes

Trucks Deliverers

Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/

Deliverer: Rigo Peter, https://thenounproject.com/rigo/

Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/

Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Client Libraries

Client Libraries

MiNiFi

MiNiFi

NiFi NiFi NiFi NiFi NiFi NiFi

Client Libraries

Page 15: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

What’s new in HDF 3.0?

Page 16: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Record Based Processing Mechanism

Why?– Improve operational efficiency – Intuitive and flexible filtering/routing strategies powered by ‘QueryRecord’– Simpler dataflow design

What?– Introduce ‘record’ based operation model– ‘RecordReader’ and ‘RecordWriter’ controller services– A series of processors supporting the reader/writer processing mechanism

• Plugin record reader to de-serialize bytes to record objects• Plugin record writer to serialize record objects to bytes• Enable operations against in-memory record objects

How?

Record Processing

Page 17: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Record Reader CS

Record Processing

Page 18: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Record Reader/Writer Schema Access Strategy

Record Processing

Page 19: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

NiFi Record Based Processing

Data Source

Centralized Schema

Repository

Lookup Schema

Page 20: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

What is Schema Registry? What Value Does it Provide?

What is Schema Registry?

• A shared repository of schemas that allows applications to flexibly interact with each other - in order to save or retrieve schemas for the data they need to access

What Value does Schema Registry Provide?

1. Data Governance

• Provide reusable schema (centralized registry)

• Define relationship between schemas (version management)

• Enable generic format conversion, and generic routing (schema validation)

2. Operational Efficiency

To avoid attaching schema to every piece of data (centralized registry)

Consumers and producers can evolve at different rates (version management)

Data quality (schema validation)

Page 21: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Schema Registry Concepts

Schema Group

• Group Name

• Schema Group - A logical grouping/container for similar type of schemas or based any criteria that the customer has from managing the schemas

Schema Metadata 1

• Schema Name

• Schema Type

• Description

• Compatibility Policy

• Serializers

• Deserializers

• Schema Meta - Metadata associated with a named

schema.

Schema Version 1

• Schema Version

• Schema Text

• Schema Version- The actual versioned schema associated a schema meta definition

Page 22: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Schema Registry

Schema Registry Component Architecture

SR Web Server

Schema Registry

Web App

REST API

Pluggable Storage

Schema Metadata Storage

Serializer/Deserializer Jar Storage

Supported Meta Stores

mySQL In-Memory

Supported Jar Storage

Local File System

HDFS

Schema Registry Client

Java Client

Integrations

NiFi Processors Kafka Se/DeSerializersSAM

Processors

Page 23: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Integration of Schema Registry with HDF

NiFi Processors for Schema Registry (HDF 2.2+)

Fetch Schema

Serialize/Deserialize with Schema

SAM processors for Schema Registry (HDF 2.2, Available in Early Access Bundle)

Lookup Schema of a Kafka Topic

Atlas integration with Schema (Upcoming)

Just like Atlas pull schema info from Hive MetaStore, Atlas can now capture schema, format and semantic metadata from events in HDF via the registry. Coming in HDF 2.3

Page 24: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

‘QueryRecord’ Processor

Record Processing

Page 25: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

‘QueryRecord’ Use Case

Record Processing

Use Case Background

– Oil & Gas: log data collection from remote drilling station

– expensive BGAN satellite network connectivity (64KB)

Problem Statement

– 1000 sensors, costly to send ALL data ALL THE TIME

– Collect all sensor data only if: upstream pump is off, BUT you have an elevated pipe pressure

Solution

– Define routing strategy using ’QueryRecord’

Page 26: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Demo

Page 27: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Component versioning

Component Version

Why?– Foundational work to enable extension registry

– Foundational work to enhance flow migration experience

What?– Support multiple versions of the same NAR in a single NIFI instance

– E.g. Hadoop NAR version A: Apache Hadoop client lib; Hadoop NAR version B: proprietary Hadoop client lib

How?

Page 28: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Component versioning

Component Version

Page 29: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved29

Latest Efforts in the Apache NiFi Community

Page 30: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Addressing common asks

How can I … How do I ... What about ...

Version my flows?

Drive CI/CD processes?

Migrate flows between environments?

Provision distributions of NiFi with a set of components?

Make reference datasets/extensions available to the entirety of my data

flow?

Certify / Audit / Sign-off on flows as compliant per regulations?

Page 31: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Capturing the essence of a flow in your organization

The n-dimensions of data flow

Consider a flowfile to be a singular event at a given juncture in its processing

A flow is the directed graph of processing at a given point in time

With each component’s:

Configuration

Version

Referenced Assets

Page 32: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Introducing

Page 33: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 34: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 35: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 36: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 37: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 38: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Page 39: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Registry is an enabler

SDLC

Manage variables, sensitive properties for environments

Extension Registry

Association/tagging of data with the flow that created it

Enhanced Command and Control of MiNiFi instances

Page 40: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved40

Apache NiFi - MiNiFi: Centralized Command & Control (C2)

Provide flow updates, information and assets to instances where they live

Act as a gateway to/from network enclaves

Provide a user interface/experience for design & deploy and monitoring

Extend the reach of user experience and operations

Page 41: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved41

Docker Containerization

First Apache community released version in 1.2.0

https://hub.docker.com/r/apache/nifi

Provide Docker images for all components of the NiFi ecosystem

Further supplement operations

Page 42: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

The Evolution of Apache NiFi

Our core substrate for data flow is NiFi & MiNiFi

Command and Control facilitates operations and management of components

Registry for common tasks with disparate resources across the NiFi ecosystem

Page 43: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved43

Why the Apache NiFi Ecosystem?

Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes

Provide components and a platform with common tooling and extensions that are commonly needed but be flexible for extension in all aspects– Allow organizations to integrate with their existing infrastructure

Empower folks managing your infrastructure to make changes and reason about issues that are occurring– Data Provenance to show context and data’s journey

– User Interface/Experience a key component

Page 44: Future of Data New Jersey - HDF 3.0 Deep Dive

© Hortonworks Inc. 2011 – 2016. All Rights Reserved44

Thank You