Upload
eric-kavanagh
View
178
Download
1
Embed Size (px)
Citation preview
Grab some coffee and enjoy the pre-show banter
before the top of the
hour! !
The Briefing Room
Time's Up! Getting Value from Big Data Now
u Reveal the essential characteristics of enterprise software, good and bad
u Provide a forum for detailed analysis of today’s innovative technologies
u Give vendors a chance to explain their product to savvy analysts
u Allow audience members to pose serious questions... and get answers!
Mission
Big Integration
u Old infrastructure lacking
u New pipes are needed
u Well begun is half done!
CASK
u CASK offers a unified integration platform for big data applications and data lakes
u Its CDAP architecture provides data containers, program containers and application containers for data and applications on Hadoop
u CASK also offers Hydrator for building and managing data pipelines and data lakes, and Tracker for data lake governance
Guest
Jonathan Gray Jonathan Gray, Founder & CEO of Cask, is an entrepreneur and software engineer with a background in startups, open source and all things data. Prior to founding Cask, Jonathan was a software engineer at Facebook where he drove HBase engineering efforts, including Facebook Messages and several other large-scale projects from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase and is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in Electrical and Computer Engineering and Business Administration from Carnegie Mellon University.
Big Data on Tap
cask.co November 1, 2016
The Briefing Room
Jonathan GrayFounder & CEO
cask.co
Hadoop Enables New Applications and Architectures
2
ENTERPRISE DATA LAKES BIG DATA ANALYTICS PRODUCTION DATA APPS
Batch and Realtime Data Ingestion
Any type of data from anytype of source in any volume
Batch and Streaming ETLCode-free self-service creationand management of pipelines
SQL Exploration andData Science
All data is automaticallyaccessible via SQL and client SDKs
Data as a ServiceEasily expose generic or
custom REST APIs on any data
360o Customer ViewIntegrate data from any source
and expose through queries and APIs
Realtime DashboardsPerform realtime OLAP
aggregations and serve them through REST APIs
Time Series AnalysisStore, process and serve massive
volumes of time-series data
Realtime Log AnalyticsIngestion and processing of high-throughput streaming
log events
Recommendation EnginesBuild models in batch using
historical data and serve them in realtime
Anomaly Detection SystemsProcess streaming events and predictably compare them in
realtime to historical data
NRT Event MonitoringReliably monitor large streams of data and perform defined actions
within a specified time
Internet of ThingsIngestion, storage and processing of events that is highly-available,
scalable and consistent
ENTERPRISE DATA LAKES BIG DATA ANALYTICS PRODUCTION DATA APPS
Batch and Realtime Data Ingestion
Any type of data from anytype of source in any volume
Batch and Streaming ETLCode-free self-service creationand management of pipelines
SQL Exploration andData Science
All data is automaticallyaccessible via SQL and client SDKs
Data as a ServiceEasily expose generic or
custom REST APIs on any data
360o Customer ViewIntegrate data from any source
and expose through queries and APIs
Realtime DashboardsPerform realtime OLAP
aggregations and serve them through REST APIs
Time Series AnalysisStore, process and serve massive
volumes of time-series data
Realtime Log AnalyticsIngestion and processing of high-throughput streaming
log events
Recommendation EnginesBuild models in batch using
historical data and serve them in realtime
Anomaly Detection SystemsProcess streaming events and predictably compare them in
realtime to historical data
NRT Event MonitoringReliably monitor large streams of data and perform defined actions
within a specified time
Internet of ThingsIngestion, storage and processing of events that is highly-available,
scalable and consistent
ENTERPRISE DATA LAKES BIG DATA ANALYTICS PRODUCTION DATA APPS
Batch and Realtime Data Ingestion
Any type of data from anytype of source in any volume
Batch and Streaming ETLCode-free self-service creationand management of pipelines
SQL Exploration andData Science
All data is automaticallyaccessible via SQL and client SDKs
Data as a ServiceEasily expose generic or
custom REST APIs on any data
360o Customer ViewIntegrate data from any source
and expose through queries and APIs
Realtime DashboardsPerform realtime OLAP
aggregations and serve them through REST APIs
Time Series AnalysisStore, process and serve massive
volumes of time-series data
Realtime Log AnalyticsIngestion and processing of high-throughput streaming
log events
Recommendation EnginesBuild models in batch using
historical data and serve them in realtime
Anomaly Detection SystemsProcess streaming events and predictably compare them in
realtime to historical data
NRT Event MonitoringReliably monitor large streams of data and perform defined actions
within a specified time
Internet of ThingsIngestion, storage and processing of events that is highly-available,
scalable and consistent
Data Applications Drive Meaningful Business Value
cask.co3
But Getting Value from Big Data is Hard
Too much focus on infrastructure and integration, rather than applications and analytics
Divergence of distributions and technologies
Integration silos created by narrow point solutions
Proliferation of projects, services and APIs
Complexity of technologies and new user learning curve
cask.co4
Without a consistent set of tools, IT will not be an effective data enabler for the business
Developer
Architecture & Programming
Focused on Apps & Solutions
Ops
Configuring & Monitoring
Focused on Infrastructure & SLA’s
LOB / Product
Driving Revenue & Decision Making
Focused on Products & Insights
Data Scientist
Scripting & Machine Learning
Focused on Data & Algorithms
And There Are Many Faces of Hadoop
cask.co5
Enter Cask
AT&T, Cloudera and Ericsson Strategic Investors
3.5 Cask Data Application Platform, Cask Hydrator and Cask Tracker
Latest Release
AT&T, Ericsson, Lotame, Salesforce, Cloudera, Hortonworks, MapR, Microsoft, IBM, Tableau…
Key Customers & Partners
By early Hadoop engineers from Facebook and Yahoo!
Founded in 2011
Andreessen Horowitz, Safeguard, Battery Venture and Ignition Partners
Raised $37+ Million
Featuring Cask Market, the “big data app store”
NEW: CDAP 4 Preview
A Container Architecture that puts Big Data on Tap
Why “Cask” ?
cask.co6
Convergence of Big Data Apps and Data Integration
The Evolution of the Cask Platform
Big Data Apps + Data Integration
• Data ingest
• Data pipelines
• Workflows and metadata
“WebLogic Meets Informatica”
CDAPv3
Big Data App Server
• Abstractions & integrations
• Metrics & logs
• Debugging environment
“WebLogic for Hadoop”
CDAPv2
Unified Integration for Big Data
• Security & governance
• Self-service environment
• Enterprise integrations
“Unified Big Data Integration”
CDAPv4
cask.co
Introducing Cask Data Application Platform (CDAP)
7
First Unified Integration Platform for Big Data
Platform for distributed apps, bringing together application management with data integration
• 100% open source and built for extensibility
• Supports all major Hadoop distributions and clouds
• Integrates the latest open source big data technologies
Data Lake FraudDetection
RecommendationEngine
Sensor DataAnalytics
Customer360
Modern DataIntegration
DistributedApplicationFramework
Self-ServiceUser Experience
Enterprise-gradeSecurity &Governance
cask.co8
• Real-time and Batch • Reliable and Scalable • Simple and Self-Service
Modern Data Integration
EXPLOREfor analytics and
data science
PROCESSfor ETL and
machine learning
SERVEany data to any
destination
INGESTany data from
any source
cask.co9
Distributed Application Framework
DEVELOPrapidly build applications
TESTpowerful test and
CI framework
DEPLOYrun any apps in
any environment
SCALEhorizontally scale
apps and data
• Real-time and Batch • Memory, Local, Distributed • Analytics and Applications
cask.co10
Security and Governance
CAPTUREstore all metadata about your data
DISCOVEReasily locate any
of your data
TRACKevery audit plus lineage graphs
ANALYZEunderstand usage patterns of data
AUTHENTICATE AUTHORIZEENCRYPT
cask.co11
A data discovery tool to explore metadata and usageA code-free framework to build and run data pipelines
Self-Service User Experience
Drag & drop graphical interface
Create, debug,
deploy and manage
Separation of logic and execution
environment
Native to Hadoop & Spark —
scales out
Rich app-level
metadata
Track lineage and
audits
Analyze usage of datasets
MDM integration framework
cask.co
The CDAP Architecture
12
Applications
ProgramsMapReduce SparkTigon Workflow Service Worker
Metadata
DatasetsTable Avro ParquetTimeseries OLAP CubeGeospatial ObjectStore
Metadata
Metadata
• Application Container Architecture
• Reusable Programming Abstractions
• Global User and Machine Metadata
• Highly Extensible Plugin Architecture
cask.co13
Single framework for building and running data apps and data lakes on Hadoop and Spark
Rapid Development
• Standardization, deep integrations, tools and docs
• Separation of app logic from data logic and integration logic
• Conceptual integrity within applications and consistency across environments
Production Operations & Governance
• Simplified packaging, deployment and monitoring of apps on Hadoop
• Enhanced security and governance with centralized metrics and logs
• Tracking and exploration of metadata, data provenance, audit trails and usage analytics
CDAP Enables the Full Big Data Application Lifecycle
reduces time to develop and deploy big data apps by 80% reduces time to insights and accelerates business value removes barriers to innovation and future-proofs your apps
cask.co14
Customer Success Stories
Customer Situation
Lack of existing Hadoop expertise and frustration with hand-coding
and scripting tools
Cask Hydrator for rapid creation of data pipelines and Cask Tracker for
data discovery
POC in 2 daysProduction in 2 months
Cask Solution
Small team and significant technical challenges limit pace of development and solution scale
CDAP for real-time ingestion and consistent processing with
production operations support
Development in 1 monthProduction in 3 months
Hundreds of UsersThousands of Pipelines
Multiple teams and technologies with widely varied skillsets and incompatible design choices
CDAP for data lake management and orchestration, tightly
integrated into existing systems
Health Insurance Provideroffloading clinical / immunization
reporting from Netezza
Leading SaaS Platformtaking new real-time, massive
scale products to market
Large Telco Enterprisebuilding a centralized, secured,
multi-tenant Data Lake
cask.co15
Cask was Named a Gartner Cool Vendor 2016
Cask was Certified a Great Place to Work 2016
“ … for the rest of us who lack the technological chips or patience to make it all work, there’s good news: it will soon get easier, thanks to the work done by the big data pioneers, as well as vendors like Cask …”(Alex Woodie, Managing Editor, Datanami)
Awards and Accolades
“ … “Cask has tilted the playing field, earning a massive unfair advantage over proprietary point products for data integration and ingest …”(Nik Rouda, Senior Analyst, Enterprise Strategy Group)
“ … “CDAP is a big win for us … the amount of code we needed to write was minimal with CDAP, and it was much easier and faster than we ever expected …”(Jia-Long Wu, Data Architect, Lotame)
cask.co16
NEW: CDAP 4 — Big Data Apps on Tap!
Available for download now!Release of CDAP 4 Preview
“Big Data App Store”
Cask MarketInteractive Data Preparation
Cask Wrangler
Interactive Wizards for Common Tasks
Resource CenterRewrite based on React
Reimagined CDAP UI
cask.co17
The “App Store for Big Data”Cask Market
•Goal: Time to value in minutes w/ no existing experience•Application and Library Ecosystem with pre-built Hadoop solutions, reusable templates, and third-party plugins•Available from anywhere inside the CDAP UI with a click
• Initially, everything in the Cask Market has been bootstrapped by Cask based on ongoing work across our customers, is 100% open source and available on GitHub•Eventually, developers and ISVs will be able to showcase and market their own applications and libraries (ex: Graylog)
Cask Market includes Interactive, Guided Wizards for Configuring Pre-Built Templates
NEW: CDAP 4 — Big Data Apps on Tap!
cask.co18
Building Data Pipelines on Hadoop with Cask Hydrator
Data Lake Webinar
Introduction to Cask Hydrator
CDAP - Containers on Hadoop
CDAP Extensions - Cask Hydrator and Cask Tracker
ESG Solution Spotlight
CDAP Technical Concepts (video)
Cask / Cloudera Solution Brief
Cask Resources
cask.co
● CDAP provides the first unified integration platform for big data
● Cask Hydrator and Cask Tracker are visual extensions of CDAP for self-service access
● CDAP empowers enterprise IT to deliver faster time to value for Hadoop and Spark, from prototype to production
● Cask Market is a “big data app store” available in CDAP 4 with pre-built apps, pipelines, plugins
● CDAP is 100% open source, highly extensible, enterprise-ready, and commercially supported
Big Data on Tap
Summary
Perceptions & Questions
Analyst: Robin Bloor
Big Data Foundations?
Robin Bloor, PhD
Neither Hadoop Nor Spark Is a Solution
However, both are useful and increasingly versatile components for
Big Data applications
The Evolution of the Little Elephant
u Hortonworks: Apache pure play. No apparent vision.
u Cloudera: Some proprietary components (Cloudera Manager, Impala, Cloudera Search). Vision is corporate data hub(?)
u MapR: Also some proprietary components (MapR-FS, MapR Streams, MapR-DB)
u And then there’s the cloud.
The Ship of Fools
Until Hadoop’s direction is controlled by a single “captain” we may have to
tolerate the ship of fools
The “Big Data Hype Cycle” Is Misleading
u Big Data is an ecosystem, not a technology – which distorts this graph
u Some analytics applications have experienced “absurd acceleration”
u Hadoop is, in many instances, a laggard - Spark too
u Nevertheless, we seem to be exiting “the trough”
Data lake or Governance Hub?
The System Management Issue
MobileDevicesDesktopsServers
IoT
TheCloud
Archive
DataStores
DataAssaying
DataCapture
Real-TimeStreaming?
DataMgt
DataServing
The Prospecting Domain
AppsData
Life CycleMgt
StagingArea
(Hadoop?)
SystemManagement
The Fundamental Issue
Big Data does not really have a foundation. Neither, imho, does the
Data Lake.
Luckily, there are third parties…
u Regarding Hadoop, do you have any “preferred components?”
u How do you stay current with the various distros? Backward compatibility? Can a customer upgrade at will?
u How does your technology impact performance (if at all)?
u Do you provide a consultancy service?
u Which companies/services do you regard as competitive?
u Do you have any specific partners?
u What does an implementation look like?
THANK YOU for your
ATTENTION!
Some images provided courtesy of Wikimedia Commons