Hydrator: Code-free Data Pipelines for Hadoop, Spark, and HBase

Jonathan Gray, CEO @ Cask

Big Data Day LA - July 9th, 2016

cask.co

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.

Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Page 1: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Page 2: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


About Me


Page 3: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Hadoop Enables New Apps and Patterns



Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume

Batch and Streaming ETL: Code-free self-service creation and management of pipelines

SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs

Data as a Service: Easily expose generic or custom REST APIs on any data

360° Customer View: Integrate data from any source and expose through queries and APIs

Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs

Time Series Analysis: Store, process and serve massive volumes of time-series data

Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events

Recommendation Engines: Build models in batch using historical data and serve them in realtime

Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data

NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time

Internet of Things: Ingestion, storage and processing of events that is highly-available, scalable and consistent



Page 4: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Web Analytics and Reporting Use Case

Transform web log data from S3 to a Hadoop cluster every hour for backup, then perform analytics and enable realtime reporting of metrics such as the number of successful and failed responses, most popular pages, etc.

The Challenges:

✦Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts

✦Not enough people with expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka), or a general lack of expertise

✦Hard to debug and validate, resulting in frequent failures in the production environment

✦Difficult to integrate into SQL / BI reporting solutions for business users

✦As use cases advance into Data Science, Machine Learning, and Predictive Analytics, you need to include scientists and advanced ML programmers

Page 5: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


The Many Faces of Hadoop



Advanced Programming

Focused on App Logic

Data Scientist

Basic Dev & Complex Analytics

Focused on Data & Algorithms

IT Pro / Ops

Configuring & Monitoring

Focused on Infrastructure & SLAs

LOB / Product

Decision Making & Driving Revenue

Focused on Apps & Insights

Challenge: The tools are missing to connect these users and take apps from prototype to production

Page 6: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Enter Cask

Key Customers and Partners

Named a Gartner Cool Vendor 2016

Founded in 2011 by early Hadoop engineers from Facebook and Yahoo!

Page 7: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Introducing the Data Application Platform


Deployment Models: On-premises, Hybrid, Cloud

Governance Operations

Pre-packaged Integrations


Core Application and Data Integration

Role-based User Experience: Developer, Data Scientist, IT/Ops

Page 8: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Introducing the Cask Data App Platform


Open Source, Integrated Framework for Building and Running Data Applications on Hadoop and Spark

• Supports all major Hadoop distros

• Integrates the latest Big Data technologies

• 100% open source and highly extensible

Page 9: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


What’s in CDAP?

A self-service, re-configurable, code-free framework to build, run and operate real-time or batch data pipelines in the cloud or on-premises.

A self-service tool for tracking the flow of data in and out of the data lake. Track, index and search technical, business and operational metadata of applications and pipelines.

An integration platform that integrates and abstracts the underlying Hadoop technologies. Build data analytics solutions in the cloud or on-premises.

The platform is powerful and versatile enough for you to build, publish and manage operational self-service analytics applications.

Your Apps

Page 10: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


A self-service, code-free framework to build, run and operate data pipelines on Apache Hadoop and Spark

Built for Production on CDAP

Rich Drag-and-Drop User Interface

Open Source & Highly Extensible

Page 11: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


INGEST: any data from any source in real-time and batch

BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop

EGRESS: any data to any destination in real-time and batch

Hydrator Data Pipelines provide the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations and aggregations on it, writing it to one or more data sinks, and making it available for applications and analytics.

Page 12: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Stack of Data Enablers

Page 13: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Hydrator Studio

✦Drag-and-drop GUI for visual Data Pipeline creation

✦Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases

✦Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.

✦Hadoop-native and Hadoop Distro agnostic

Page 14: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Hydrator Data Pipeline

✦Captures Metadata, Audit, Lineage info, discovered and visualized using Cask Tracker

✦Notifications, scheduling, and monitoring with centralized metrics and log collection for ease of operability

✦Simple Java API to build your own source, transforms, sinks with class loading isolation

✦JavaScript and Python transforms

✦ Include arbitrary Spark jobs

Page 15: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Out of the box Integrations

✦ Elastic, SFTP, Cassandra, Kafka, RDBMS, EDW and many more sources and sinks

✦ Parse/Encode/Hash, Distinct/Group By, Custom JavaScript/Python Transforms

Page 16: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Custom Plugins

✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API

Page 17: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Pipeline Implementation

Logical Pipeline

Physical Workflow

MR/Spark Executions



✦Planner converts logical pipeline to a physical execution plan

✦Optimizes and bundles functions into one or more MR/Spark jobs

✦CDAP is the runtime environment where all the components of the data pipeline are executed

✦CDAP provides centralized log and metrics collection, transaction, lineage and audit information
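The planning step described on this slide can be illustrated with a small sketch. This is not Hydrator's planner: it is a toy model, assuming a linear pipeline in which every aggregate stage forces a job boundary. The stage names, stage kinds, and the functions `topological_order` and `plan_jobs` are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical logical pipeline: stage name -> stage kind.
# Names and kinds are illustrative only, not Hydrator identifiers.
STAGES = {
    "s3_source": "source",
    "parse_log": "transform",
    "count_by_ip": "aggregate",
    "format_row": "transform",
    "hdfs_sink": "sink",
}
CONNECTIONS = [
    ("s3_source", "parse_log"),
    ("parse_log", "count_by_ip"),
    ("count_by_ip", "format_row"),
    ("format_row", "hdfs_sink"),
]


def topological_order(stages, connections):
    """Return stage names in dependency order (Kahn's algorithm)."""
    indegree = {name: 0 for name in stages}
    downstream = defaultdict(list)
    for src, dst in connections:
        downstream[src].append(dst)
        indegree[dst] += 1
    ready = deque(sorted(name for name, deg in indegree.items() if deg == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(stages):
        raise ValueError("pipeline contains a cycle")
    return order


def plan_jobs(stages, connections):
    """Bundle stages into jobs, closing a job after each aggregate stage.

    Assumption: an aggregate needs a shuffle, so later stages start a new job.
    """
    jobs, current = [], []
    for name in topological_order(stages, connections):
        current.append(name)
        if stages[name] == "aggregate":
            jobs.append(current)
            current = []
    if current:
        jobs.append(current)
    return jobs


physical_plan = plan_jobs(STAGES, CONNECTIONS)
```

Here `physical_plan` groups the five logical stages into two jobs, split after the aggregation. A real planner would also need to handle branching pipelines and engine-specific constraints, which this sketch ignores.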

Page 18: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Pipeline Implementation

Page 19: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


3.5 - Latest Features

Security: Authentication and Authorization

Support for fine-grained, role-based authorization of entities in CDAP. Integration with Sentry and Ranger.

Security: Impersonation and Encryption

Ability to run CDAP and CDAP Apps as specified users, and ability to encrypt/decrypt sensitive configuration.

Tracker: Data Usage Analytics

Learn how datasets are being used and which top applications are accessing them.

Metadata Taxonomy

Support for annotating business metadata based on a business-specified taxonomy.

Hydrator: Spark Streaming

Build and run Hydrator real-time pipelines using Spark Streaming.

Hydrator: Preview Mode

Ability to preview pipelines with real or injected data before deploying (Standalone).

Hydrator: Join & Action

Capability to join multiple streams (inner & outer), and ability to configure actions that run binaries on designated nodes.

Hydrator: Plugins

Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML.

Page 20: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Hydrator Roadmap

✦Join across multiple data sources (CDAP-5588)

✦Live Debug/Preview of pipelines in build mode

✦Macro substitutions for configuration/properties

✦Custom Actions anywhere in pipeline

✦Spark streaming support for real-time pipelines
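To make the "macro substitutions for configuration/properties" item concrete, here is a minimal sketch of the general idea: placeholders in stage properties are replaced with runtime arguments before the pipeline runs. The `${name}` placeholder syntax, the function `resolve_macros`, and the property names and values are assumptions made for illustration; the slides do not specify how Hydrator implements this.

```python
import re

# Assumed placeholder syntax for this sketch: ${name}
_MACRO = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.-]*)\}")


def resolve_macros(properties, arguments):
    """Return a copy of `properties` with every ${name} replaced from `arguments`.

    Raises KeyError if a placeholder has no matching runtime argument.
    """
    def replace(match):
        key = match.group(1)
        if key not in arguments:
            raise KeyError(f"no runtime argument for macro '{key}'")
        return str(arguments[key])

    return {name: _MACRO.sub(replace, value) for name, value in properties.items()}


# Hypothetical stage configuration and runtime arguments.
stage_properties = {"path": "s3n://${bucket}/logs/${date}", "format": "text"}
resolved = resolve_macros(
    stage_properties, {"bucket": "example-bucket", "date": "2016-07-09"}
)
```

The point of the design is that one pipeline definition can be reused across runs, for example an hourly ingest in which only the date or path changes.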

Page 21: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Use case mapping

• Build operational analytics applications

• Micro-service Enablement

• Self-Service Data Analytics / Data Science

• Data-As-A-Service

• Empower developers to easily build solutions on Hadoop

• Abstract technologies, future proof

• Ingestion, Transformation, Blending (complex joins) and Lookup.

• Machine Learning, Aggregation and Reporting

• Realtime and Batch data pipelines

• DW Offloading (Netezza, Teradata, etc)

• Painless and Fast Ingest into Impala operationalized

• Data Ingestion from varied sources

• Easy way to catalog application and pipeline level metadata

• Search across technical, business and operational metadata

• Track Lineage and Provenance

• Track across non-Hadoop integrations

• Usage Analytics of cluster data

• Data Quality Measure

• Integration with other MDM systems including Navigator

Page 22: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016


Demo Example: Load Log Files from S3 to HDFS and perform aggregations/analysis

• Start with web access logs stored in Amazon S3

• Store the raw logs into HDFS Avro Files

• Parse the access log lines into individual fields

• Calculate the total number of requests by IP and status code

• Find the IPs that received the most successful status codes and the most error codes

Sample Web access log (Combined Log Format):

- - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"

Fields: IP Address, Timestamp, Http Method, URI, Http Status, Response Size, URI, Client Info
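The demo performs these steps with drag-and-drop plugins rather than code. As a rough illustration of what the parse and aggregate stages compute, here is a standalone Python sketch. It is not Hydrator code: the regular expression, the function names, and the sample lines are assumptions, and the IP addresses are placeholders from the 192.0.2.0/24 documentation range because the IP in the slide's sample line is not shown.

```python
import re
from collections import Counter

# Combined Log Format: ip, identity, user, [timestamp], "request",
# status, size, "referrer", "client".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)


def parse_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def count_by_ip_and_status(lines):
    """Count requests per (ip, status) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["ip"], record["status"])] += 1
    return counts


def top_ip(counts, status_filter):
    """Return (ip, total) for the IP with the most requests passing status_filter."""
    per_ip = Counter()
    for (ip, status), n in counts.items():
        if status_filter(status):
            per_ip[ip] += n
    return per_ip.most_common(1)[0] if per_ip else None


# Placeholder sample data; IPs are illustrative, not taken from the slides.
SAMPLE_LINES = [
    '192.0.2.1 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0"',
    '192.0.2.1 - - [08/Feb/2015:04:36:41 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [08/Feb/2015:04:36:42 +0000] "GET /missing HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]

counts = count_by_ip_and_status(SAMPLE_LINES)
most_successful = top_ip(counts, lambda s: 200 <= s < 300)
most_errors = top_ip(counts, lambda s: s >= 400)
```

In a pipeline, the parse step corresponds to a transform stage and the counting to a group-by aggregation.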

Page 23: Cask Hydrator: Code Free Data Pipelines - Big Data Day LA 2016



Jonathan Gray @jgrayla

Download CDAP w/ Hydrator: http://cask.co/downloads/