Building The Enterprise Data Lake
Today’s Presenters
Mark Madsen Industry Analyst
Third Nature @markmadsen
Craig Stewart Sr. Dir.
Product Management SnapLogic
@01Badger
Erin Curtis Sr. Dir.
Product Marketing
SnapLogic @erncrts
Building the Enterprise Data Lake: Considerations before you jump in
December, 2015 Mark Madsen www.ThirdNature.net @markmadsen1
© Third Nature, Inc.
So we shifted to data publishing
Industrialized data delivery for self-service access.
Events and sensors are a relatively new data source
Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.
These sorts of things slow user requests down
Conclusion: any methodology built on the premise that you must know and model all the data first is untenable
Analytics embiggens data volume problems
Many of the processing problems are O(n²) or worse, so moderate data can be a problem for scale-up platforms
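A rough illustration of the scaling point, not taken from the talk: the sketch below assumes an all-pairs comparison as a stand-in for a typical O(n²) analytic step. The function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Each 10x increase in records costs roughly 100x more comparisons.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pairwise_comparisons(n):>13,} comparisons")
```

Under this assumption, growing the input tenfold multiplies the work by about a hundred, which is why data that looks moderate by storage standards can still overwhelm a single scale-up machine.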
Old market says: There’s nothing wrong with what you have, just keep buying new products from us
Views of the lake
Is the business vs. supports the business? Application vs. infrastructure?
Schema
In the DW world both data and processing are bounded
No consideration for feedback loops and change
Processing only happens here
Carefully controlled access here
Nobody here creates new information
Sources few and well understood
Complex DI is controlled by IT
Schemas are few and designed
Tools are authorized, few in number and kind
One way flow
This is a monolithic, layered architecture
In the big data world flow is unbounded and continuous
Feedback loops allowed
End-of-analysis dataset may be start of a BI dataset
Continuous data integration and delivery
Files are back as both input and storage
Minimal barrier of / control on collection
Areas of provisioned data
Any shape in, rectangles out
This needs a distributed service architecture
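One way to read "any shape in, rectangles out" is sketched below. It is only an illustration under stated assumptions: the sample events, the dotted column naming, and the helper names `flatten` and `to_rectangle` are all hypothetical and do not come from the talk.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, using dotted column names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rectangle(records: list[dict[str, Any]]) -> tuple[list[str], list[list[Any]]]:
    """Return (columns, rows); fields missing from a record become None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({column for r in flat_records for column in r})
    rows = [[r.get(column) for column in columns] for r in flat_records]
    return columns, rows


# Two events with different shapes (hypothetical sample data).
events = [
    {"device": {"id": "a1", "type": "sensor"}, "temp": 21.5},
    {"device": {"id": "b2"}, "humidity": 40},
]
columns, rows = to_rectangle(events)
```

The collection side accepts whatever shape arrives; the rectangular view is computed at delivery time, so a new field in the source only adds a column rather than breaking ingestion.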
Deconstructing data environments
There are three things happening in a data warehouse:
▪ Data acquisition
▪ Data management
▪ Data delivery
Isolate them from one another, allow read-write use, and you are on the path.
Data Warehouse
Data lake subsystems / components
The acquisition component allows any data to be collected at any latency. The management component allows some data to be standardized and integrated. The access component provides access at any latency and via any means an application chooses. Processing can be done to any data at any time from any area.
Data Acquisition Collect & Store
Incremental
Batch
One-time copy
Real time
Data Lake Platform Services
Data Management Process & Integrate
Data Access Deliver & Use
Data storage
In reality, you are building three systems, not one. Avoid the monolith.
Data lake functions depend on platform services
Base Platform Services
Data Movement Metadata Data Persistence
Workflow Management
Processing Engines Dataflow Services
Data Curation Data Access Services
Data Acquisition Collect & Store
Data Management Process & Integrate
Data Access Deliver & Use
Platform services needed
Decouple the Data Architecture
The core of the data lake isn’t a database or HDFS, it’s the data architecture that the tools implement. We need a data architecture that is not limiting:
▪ Deals with change easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from streaming to one-time bulk
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
Data architecture is required by the services, and vice versa
Raw data in an immutable storage area
Standardized or enhanced data
Common or usage-specific data
Transient data
Data Acquisition Collect & Store
Platform Services
Data Access Deliver & Use
Data Management Process & Integrate
The data areas map (mostly) to functional areas of the lake
Collection can’t be limited by database scale and latency. Immutability, persistence and concurrency are required.
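A minimal sketch of what immutable collection could look like, assuming plain files as the storage layer. The class name `RawStore`, the `.raw` suffix, and the content-addressed naming are illustrative choices, not something the talk prescribes.

```python
import hashlib
import tempfile
from pathlib import Path


class RawStore:
    """Write-once store: each payload gets its own file and is never rewritten."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, payload: bytes) -> Path:
        # Content-addressed name: identical payloads map to the same file.
        digest = hashlib.sha256(payload).hexdigest()
        path = self.root / f"{digest}.raw"
        try:
            # Mode "xb" refuses to open an existing file, so nothing is overwritten.
            with open(path, "xb") as handle:
                handle.write(payload)
        except FileExistsError:
            pass  # Already collected; the stored original stays as it is.
        return path

    def get(self, path: Path) -> bytes:
        return path.read_bytes()


store = RawStore(Path(tempfile.mkdtemp()) / "raw")
first = store.put(b'{"sensor": "a1", "temp": 21.5}')
second = store.put(b'{"sensor": "a1", "temp": 21.5}')  # duplicate delivery
```

Exclusive creation means two writers cannot both create the same file, and a repeated delivery is a no-op. A production system would also need to handle partially written files and durability, which this sketch leaves out.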
Incremental
Collect
Batch
One-time copy
Real time
Manage & Integrate Process, Deliver, Use
Stages, not layers
Some tools require specific repositories or models. Others can reach in to get what they need. Do not enforce a single access point or model.
The geography has been redefined
The box IT created:
• not any data, rigidly typed data
• not any form, tabular rows and columns of typed data
• not any latency, persist what the DB can keep up with
• not any process, only queries
The digital world was diminished to only what’s inside the box until we forgot the box was there.
Layered data architecture
The DW assumed a single flat model of data, DB in the center. The data lake enables new ways to organize data:
▪ Raw – straight from the source
▪ Enhanced – cleaned, standardized
▪ Integrated – modeled, augmented, ~semi-persistent
▪ Derived – analytic output, pattern based sets, ephemeral
Implies a new technology architecture and data modeling approaches.
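The data areas listed above can be sketched as a chain of functions that never modify their input. This is only an illustration: the sample readings, field names, and the functions `enhance` and `derive` are hypothetical, and the integrated area is omitted for brevity.

```python
from statistics import mean

# Raw: kept exactly as received (hypothetical sample readings).
RAW = (
    {"sensor": "A1 ", "temp_f": "69.8"},
    {"sensor": "a1", "temp_f": "71.6"},
    {"sensor": "B2", "temp_f": "68.0"},
)


def enhance(raw):
    """Enhanced: cleaned, standardized copies; the raw records are untouched."""
    return [
        {
            "sensor": r["sensor"].strip().lower(),
            "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1),
        }
        for r in raw
    ]


def derive(enhanced):
    """Derived: analytic output that can be discarded and recomputed."""
    by_sensor = {}
    for row in enhanced:
        by_sensor.setdefault(row["sensor"], []).append(row["temp_c"])
    return {sensor: round(mean(temps), 1) for sensor, temps in by_sensor.items()}


enhanced = enhance(RAW)
derived = derive(enhanced)
```

Because each area is computed from the one before it, a change in cleaning rules or analysis only requires rerunning the later steps; the raw area remains the source of record.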
The data lake enables evolutionary design for data
Evolutionary design is required because data needs change. You need a system not for stability (we have that in the DW) but for evolution and change: the data lake.
Data Acquisition Collect & Store
Incremental
Batch
One-time copy
Real time
Data Lake Platform Services
Data Management Process & Integrate
Data Access Deliver & Use
Data storage
You can’t build this all at once. You need to grow it over time.
Away from “one throat to choke”, back to best of breed
Tight coupling leads to efficient reuse and standardization, and to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.
Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.
Questions?
“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand
CC Image Attributions
Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
About the Presenter
Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.
Anything apps | APIs | things | data
Anytime batch | streaming | real-time
Anywhere on premises | in the cloud
SnapLogic helps enterprises connect data and applications faster
Modern Architecture: Hybrid and Elastic
Streams: No data is stored/cached
Secure: 100% standards-based
Elastic: Scales out & handles data and app integration use cases
Metadata
Data Databases On Prem
Apps
Big Data
Cloud Apps and Data
Cloud-Based Designer, Manager, Dashboard
Cloudplex
Groundplex
Hadooplex Sparkplex
Firewall
Data Acquisition
On Prem Apps and Data
Data Access
Data Management
Data Lake
Add information and improve data
Spark Python Scala Java
R Pig
Collect and integrate data from multiple sources
HDFS
AWS S3
MS Azure Blob
• ERP • CRM • RDBMS
Cloud Apps and Data
• CRM • HCM • Social
IoT Data
• Sensors • Wearables • Devices
Lakeshore Data Mart
• MS Azure • AWS Redshift • …
BI / Analytics
• Tableau • MS PowerBI / Azure • AWS QuickSight
Organize and prepare data for visualization
HDFS
AWS S3
MS Azure Blob
Hive
Batch
Streaming
Schedule and manage: Oozie, Ambari
Kafka, Sqoop, Flume
Real-time
Ingest Prepare Deliver
Impala, HiveSQL, SparkSQL
Data Acquisition
On Prem Apps and Data
Data Access
Data Management
The Modern Data Lake Powered by SnapLogic
• ERP • CRM • RDBMS
Cloud Apps and Data
• CRM • HCM • Social
IoT Data
• Sensors • Wearables • Devices
Lakeshore Data Mart
• MS Azure • AWS Redshift • …
BI / Analytics
• Tableau • MS PowerBI / Azure • AWS QuickSight
Batch
Streaming
Schedule and manage: SnapLogic SnapLogic Pipelines
Real-time
Ingest Prepare Deliver
SnapLogic Pipelines
Sort, Aggregate, Join, Merge, Transform
SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines
Collect and integrate data from multiple sources
SnapLogic pipelines with standard mode execution
Organize and prepare data for visualization
SnapLogic pipelines with standard mode execution
Thank You
Watch SnapLogic in action:
video/snaplogic.com
Contact us: [email protected]
Follow us on Twitter:
@SnapLogic