• Developed at NSA for over eight years
• Donated to the Apache Software Foundation Nov 2014
• Undergoing incubation
• Three ASF releases to date • 0.1.0 out last night!
The problem space: Enterprise Dataflow
Automate the flow of data from any source
…to systems which extract meaning and insight
…and to those that store and make it available for users
The challenges we faced
• Transport / Messaging was not enough
• Needed to understand the big picture
• Needed the ability to make *immediate* changes
• Must maintain chain of custody for data
• Rigorous security and compliance requirements
Why were transport and messaging not enough?
• Access to data exceeded the resources available to transport it
• Decoupling systems is about more than connectivity
• Message sizes ranged from bytes to gigabytes
• Not all data is created equal
• Needed precise security controls • SSL and topic-level authorization were insufficient
Apache NiFi Foundational Concepts
1. The basic building blocks
2. Real-time Command and Control
3. The Power of Provenance
Flow File
• Header • UUID • Name • Size • Entry Time
• Attributes Map [[Key | Value]]
• Content
• Types • Events • Objects • Files • Messages • Media
• Formats • JSON • Avro • Text • Mp4 • Proprietary
• Sizes • Bytes to GBs
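The Flow File layout above (header fields, attributes map, content) can be sketched as a small data structure. This is an illustrative model only, assuming Python; the class name, field names, and the `with_attributes` helper are hypothetical and are not NiFi's actual API.

```python
import time
from dataclasses import dataclass, field, replace
from typing import Dict
from uuid import uuid4


@dataclass(frozen=True)
class FlowFile:
    """Illustrative stand-in for a flow file; not NiFi's real class."""

    content: bytes = b""                                   # CONTENT: opaque payload
    attributes: Dict[str, str] = field(default_factory=dict)  # Attributes Map
    name: str = "unnamed"                                  # header: Name
    uuid: str = field(default_factory=lambda: str(uuid4()))   # header: UUID
    entry_time: float = field(default_factory=time.time)   # header: Entry Time

    @property
    def size(self) -> int:
        """Header: Size, derived from the content length in bytes."""
        return len(self.content)

    def with_attributes(self, **extra: str) -> "FlowFile":
        """Return a copy with extra attributes; identity and content are kept."""
        return replace(self, attributes={**self.attributes, **extra})
```

Because the object is frozen, adding an attribute yields a new `FlowFile` that shares the same UUID, entry time, and content, which mirrors the idea that the payload can be any type, format, or size while the metadata travels alongside it.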
Flow File Processor
• Routing • Context • Content
• Transformation • Enrich • Obfuscate • Filter • Convert • Analyze • Split • Aggregate
• Mediation • Push / Pull • …
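As a rough illustration of the routing and transformation roles listed above, the sketch below routes on context (attributes) and obfuscates content. It is a minimal Python sketch under stated assumptions: the tuple-based flow file, the function names, and the "unmatched" relationship label are all hypothetical and do not reflect NiFi's processor API.

```python
from typing import Callable, Dict, Tuple

# Stand-in for a flow file: (attributes, content). An assumption for this sketch.
FlowFile = Tuple[Dict[str, str], bytes]
Predicate = Callable[[Dict[str, str]], bool]


def route_on_attribute(flowfile: FlowFile, rules: Dict[str, Predicate]) -> str:
    """Routing by context: return the first relationship whose predicate matches."""
    attributes, _ = flowfile
    for relationship, predicate in rules.items():
        if predicate(attributes):
            return relationship
    return "unmatched"


def obfuscate(flowfile: FlowFile, secret: bytes) -> FlowFile:
    """Transformation: mask every occurrence of `secret` in the content."""
    attributes, content = flowfile
    return attributes, content.replace(secret, b"*" * len(secret))
```

A flow would chain such steps: for example, route JSON events to one path, then obfuscate sensitive values before the data is pushed to a downstream system.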
2. Real-time command and control
• Tighten the feedback loop • Changes have consequences (good or bad) • And you see them as they occur
• Continuous improvement • Compare real-time vs. historical statistics • View data provenance • View content at any stage
• Intuitive user experience • Visual programming • Logical flow graph
3. The Power of Provenance – Chain of custody for data
• Latency optimization • Intra-process • Inter-process • End-to-end
• Compliance • Prove handling • Assess impact
• Understanding • Step through time • View content • View context
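To make the chain-of-custody idea concrete, here is a minimal Python sketch of an event log that can replay what happened to one piece of data and measure its end-to-end latency. The class names, the event-type labels, and the component names are assumptions chosen for illustration; this is not NiFi's provenance repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_id: str
    event_type: str   # e.g. "RECEIVE", "ROUTE", "SEND" (labels assumed for this sketch)
    component: str    # which step in the flow handled the data
    timestamp: float  # seconds; any monotonic clock would do


class ProvenanceLog:
    """Append-only record of what happened to each piece of data."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []

    def record(self, event: ProvenanceEvent) -> None:
        self._events.append(event)

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """Step through time: all events for one flow file, oldest first."""
        matching = [e for e in self._events if e.flowfile_id == flowfile_id]
        return sorted(matching, key=lambda e: e.timestamp)

    def latency(self, flowfile_id: str) -> Optional[float]:
        """End-to-end latency: time between the first and last recorded event."""
        events = self.lineage(flowfile_id)
        if not events:
            return None
        return events[-1].timestamp - events[0].timestamp
```

The same log serves both purposes named on the slide: the ordered lineage supports compliance ("prove handling"), and the timestamps support latency analysis.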
How fast is it and why?
• Flow File Repo • Write-ahead log
• Content Repo • Add more partitions • Input/output streams • Copy on write • Pass by reference
• Allow tradeoffs of latency vs. throughput
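The copy-on-write and pass-by-reference ideas can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `ContentRepo` and `FlowFileRecord` names, the integer claim ids, and the in-memory dictionary are all hypothetical and are not NiFi's repository implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class ContentRepo:
    """Toy content store: stored content is never modified in place."""

    def __init__(self) -> None:
        self._claims: Dict[int, bytes] = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        """Store data under a fresh claim id and return that id."""
        claim_id = self._next_id
        self._next_id += 1
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


@dataclass(frozen=True)
class FlowFileRecord:
    attributes: Dict[str, str]
    claim_id: int  # reference to the content, not the content itself


def update_attributes(record: FlowFileRecord, **extra: str) -> FlowFileRecord:
    """Pass by reference: metadata changes reuse the same content claim."""
    return FlowFileRecord({**record.attributes, **extra}, record.claim_id)


def transform_content(
    repo: ContentRepo, record: FlowFileRecord, fn: Callable[[bytes], bytes]
) -> FlowFileRecord:
    """Copy on write: changed content goes to a new claim; the old one is kept."""
    new_claim = repo.write(fn(repo.read(record.claim_id)))
    return FlowFileRecord(record.attributes, new_claim)
```

Routing or tagging a large payload therefore costs only a small metadata update, while the original content stays available for later inspection after a transformation.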
How does it deal with security?
- User to System and System to System - Authentication (2-Way SSL)
- Authorization (pluggable)
- Authorize a specific piece of data to a specific system
- Data provenance - Prove you have done the right thing - Recover when you have not
How can I monitor this at runtime?
• Web UI
• Push API • Reporting Tasks (Ganglia, Graphite, etc…)
• Pull API • REST API
What are the points of extension?
• Flow File Processors
• Advanced UI
• Flow File Prioritizer
• Reporting Tasks
• Controller Services
• Build clients against our REST API
Status and direction for NiFi
• Existing Strengths • Efficient use of each node (100s of MB/s per node, 100Ks of transactions/s per node) • Simple / effective scaling model • Runtime command and control • Data provenance
• Roadmap Highlights • Distributed durability of data (maybe Kafka-backed queues) • High-availability cluster manager • Live / rolling upgrades • Provenance query language / reporting • A complete user experience enabled by provenance