35
Building the Enterprise Data Lake Considerations before you jump in December, 2015 Mark Madsen www.ThirdNature.net @markmadsen1

Building the Enterprise Data Lake: A look at architecture

Embed Size (px)

Citation preview

Building the Enterprise Data Lake Considerations before you jump in

December, 2015 Mark Madsen www.ThirdNature.net @markmadsen1

What This Session Isn’t

SQL..

.

SQL!

SQL?

SQL

The craft model of information delivery does not scale

© Third Nature, Inc.

So we shifted to data publishing

Industrialized data delivery for self-service access.

Events and sensors are a relatively new data source

Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.

There’s lots of other new data involved

© Third Nature, Inc.

You can store this data in an RDBMS, but…

These sorts of things slow user requests down

Conclusion: any methodology built on the premise that you must know and model all the data first is untenable

© Third Nature, Inc.

Analytics embiggens data volume problems

Many of the processing problems are O(n2) or worse, so moderate data can be a problem for scale-up platforms

© Third Nature, Inc.

Old market says: There’s nothing wrong with what you have, just keep buying new products from us

The emerging big data market has an answer…

© Third Nature, Inc.

The data lake

© Third Nature, Inc.

Views of the lake Is the business vs supports the business?

Application vs infrastructure?

© Third Nature, Inc.

The naïve idea of a data lake leads to predictable results

© Third Nature, Inc. You can’t install Hadoop and hope it solves all the problems

Big data no 2

Slide 16

The answer isn’t just technology, it’s architecture

Sc

he

ma

In the DW world both data and processing are bounded

No consideration for feedback loops and change

Processing only

happens here

Carefully

controlled

access

here

Nobo

dy h

ere

cre

ate

s

new

info

rmatio

n

Sources few and

well understood

Complex DI

is controlled

by IT

Schemas are few

and designed

Tools are authorized,

few in number and

kind

One way flow

This is a monolithic, layered architecture

© Third Nature, Inc.

In the big data world flow is unbounded and continuous

Feedback

loops allowed

End-of-analysis

dataset may be

start of a BI dataset

Continuous data

integration and delivery

Files are back as both

input and storage

Minimal

barrier of /

control on

collection

Areas of

provisioned

data

Any shape in,

rectangles out

This needs a distributed service architecture

© Third Nature, Inc.

Deconstructing data environments

There are three things happening in a data warehouse:

▪ Data acquisition

▪ Data management

▪ Data delivery

Isolate them from one another, allow read-write use, and you are on the path.

Data

Warehouse

Data lake subsystems / components

The acquisition component allows any data to be collected at any latency. The management component allows some data to be standardized and integrated. The access component provides access at any latency and via any means an application chooses. Processing can be done to any data at any time from any area.

Data Acquisition Collect & Store

Incremental

Batch

One-time copy

Real time

Data Lake Platform Services

Data Management Process & Integrate

Data Access Deliver & Use

Data storage

In reality, you are building three systems, not one. Avoid the monolith.

© Third Nature, Inc.

Data lake functions depend on platform services

Base Platform Services

Data Movement Metadata Data Persistence

Workflow

Management

Processing Engines Dataflow Services

Data Curation Data Access

Services

Data Acquisition Collect & Store

Data Management Process & Integrate

Data Access Deliver & Use

Platfo

rm services n

eeded

DATA ARCHITECTURE

We’re so focused on the light switch that we’re not talking about the light

© Third Nature, Inc.

Decouple the Data Architecture

The core of the data lake isn’t a database or HDFS, it’s the data architecture that the tools implement.

We need a data architecture that is not limiting:

▪ Deals with change easily and at scale

▪ Does not enforce requirements and models up front

▪ Does not limit the format or structure of data

▪ Assumes the range of data latencies in and out, from streaming to one-time bulk

© Third Nature, Inc.

Food supply chain: an analogy for data

Multiple contexts of use, differing quality levels

You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.

© Third Nature, Inc.

Data architecture is required by the services, and vice versa

Raw data in an immutable

storage area

Standardized or

enhanced data

Common or

usage-

specific data

Transient data

Dat

a A

cqu

isit

ion

C

olle

ct &

Sto

re

Platform Services

Data A

ccess D

eliver & U

se

Data Management Process & Integrate

© Third Nature, Inc.

The data areas map (mostly) to functional areas of the lake

Collection can’t be limited by database scale and latency. Immutability, persistence and concurrency are required.

Incremental

Collect

Batch

One-time copy

Real time

Manage & Integrate Process, Deliver, Use

© Third Nature, Inc.

Stages, not layers

Some tools require specific repositories or models. Others can reach in to get what they need. Do not enforce a single access point or model.

© Third Nature, Inc.

The geography has been redefined

The box IT created:

• not any data, rigidly typed data

• not any form, tabular rows and columns of typed data

• not any latency, persist what the DB can keep up with

• not any process, only queries

The digital world was diminished to only what’s inside the box until we forgot the box was there.

© Third Nature, Inc.

Layered data architecture

The DW assumed a single flat model of data, DB in the center.

The data lake enables new ways to organize data:

▪ Raw – straight from the source

▪ Enhanced –cleaned, standardized

▪ Integrated – modeled, augmented, ~semi-persistent

▪ Derived – analytic output, pattern based sets, ephemeral

Implies a new technology architecture and data modeling approaches.

© Third Nature, Inc.

The data lake enables evolutionary design for data

Evolutionary design is required because data needs change. You need a system not for stability – we have that in the DW - but for evolution and change, the data lake.

Data Acquisition Collect & Store

Incremental

Batch

One-time copy

Real time

Data Lake Platform Services

Data Management Process & Integrate

Data Access Deliver & Use

Data storage

You can’t build this all at once. You need to grow it over time.

© Third Nature, Inc.

Away from “one throat to choke”, back to best of breed

Tight coupling leads to efficient reuse and standardization, and to slow changes.

In a rapidly evolving market componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling.

Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.

Questions? “When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand

© Third Nature, Inc.

CC Image Attributions

Thanks to the people who supplied the creative commons licensed images used in this presentation: donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/ glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721

© Third Nature, Inc.

About the Presenter

Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

About Third Nature

Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you‘re at the right place.

Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.