Organising the Data Lake - Information Management in a Big Data World

Organising The Data Lake

- Information Management In A Big Data World

Mike Ferguson

Managing Director

Intelligent Business Strategies

Hadoop Summit

Dublin, April 2016

Copyright © Intelligent Business Strategies 1992-2016

About Mike Ferguson

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 34 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited (the inventors of the Relational Model), a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.

www.intelligentbusiness.biz

[email protected]

Twitter: @mikeferguson1

Tel/Fax (+44)1625 520700


Topics

The data integration complexity

The siloed approach to managing and governing data

A new inclusive approach to governing and managing data

Introducing the data reservoir and data refinery

How does a data reservoir and data refinery work?

Mapping new data and insights into your shared business vocabulary

The mission critical importance of an information catalog in a distributed data landscape

Integrating data reservoirs and data refineries into your existing environment


The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads

[Figure: analytical platforms and the workloads they serve. Platforms shown: streaming data; Hadoop data store; NoSQL DBMS (e.g. graph DB); EDW (DW & marts); DW appliance / analytical RDBMS; MDM (Prod, Asset, Cust; CRUD). Workloads shown: real-time stream processing & decision management; investigative analysis and data refinery; data mining and model development; graph analysis; advanced analytics on multi-structured data; advanced analytics on structured data; traditional query, reporting & analysis.]

Big Data workloads result in multiple platforms now being needed for analytical processing


Data Integration Today Has Become Much More Complex

- Popular Data Integration Paths Between Platforms

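[Figure: data integration paths between platforms. Sources shown: ERP, CRM, SCM and operational systems (transaction data); XML/JSON, social, web logs and web data; cloud data may also be part of it. Platforms shown: EDW, data marts, DW appliance, analytical DBMS, MDM system (Prod, Asset, Cust; CRUD), graph DBMS and NoSQL DBs (column family DB, document DB). The paths carry transactions and insights.]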


Issues: Siloed Analytics - Different Tools To Manage And Integrate Data For Each Type Of Analytical And MDM Store

[Figure: five silos, each with its own data management tools and its own analytical tools or applications: EDW and marts fed with structured data from CRM, ERP and SCM; a multi-structured data store; a DW appliance for advanced analytics on structured data from CRM, ERP and SCM; a NoSQL DB (e.g. graph DB) holding multi-structured and structured data; MDM (Prod, Asset, Cust; CRUD) for master data management, fed from CRM, ERP and SCM.]


Issues: Data Deluge - Data Is Arriving Faster Than We Can Consume It

[Figure: a data filter sitting in front of enterprise systems.]


With Thousands Of Data Sources, IT And Business Need To Work Together, As IT Will Likely Become A Bottleneck

[Figure: IT builds every DQ/DI job between the data sources (OLTP systems, web logs, web, open data, IoT, machine data, social & web) and the targets (MDM, DW / data warehousing, Big Data, cloud, data virtualisation).]

Bottleneck? Should IT be expected to do everything?

Can business analysts & Data Scientists help?


Issues: Have You Got Self-Service Data Integration Causing Chaos In The Enterprise?

[Figure: data from SCM, CRM and ERP systems, social, web logs, web and cloud reaches the DW and marts through ETL/DQ jobs built by IT. In parallel, self-service BI tools with ETL and data scientists working in sandboxes on HDFS (with SQL on Hadoop) run their own ETL to produce new insights.]


Problems With The Current Approach

Project oriented siloed approach to DI/DQ with limited collaboration

Cost of data integration is too high

Slow speed of development

Multiple DI/DQ technologies and techniques being used that are not integrated

Lots of re-invention rather than re-use

Fractured metadata across multiple tools or no metadata at all in some cases

Risk of duplicate inconsistent DI/DQ rules for same data

Metadata lineage is unavailable in many places especially with hand-coded Big Data DI/DQ applications

Multiple skill sets fractured across different projects

Repetition of our mistakes, e.g. Big Data preparation

[Figure: separate DQ/DI jobs feeding EDW, MDM (Prod, Asset, Cust; CRUD), cloud, data virtualisation and self-service tools.]

Self-service


There has to be a better, more governed way to fuel productivity and agility without causing data inconsistency and chaos

[Figure: separate DQ/DI jobs feeding EDW, MDM (Prod, Asset, Cust; CRUD), cloud, data virtualisation and self-service tools.]

Tools are available but are not well integrated

Also the whole collaborative, metadata and information catalog piece is incomplete

IT IS NOT ENOUGH – THE WHOLE THING HAS TO BE CO-ORDINATED


We Are All In The Same Boat!

– Everyone For Themselves Is Not An Option

[Figure: IT data architect, data scientist, IT developer and business analyst in the same boat.]

Information Management – Introducing The Data Lake (Reservoir)


What Is A Data Reservoir? - A Collaborative, Governed Environment Aimed At Rapidly Producing Information

[Figure: communities of IT data architects, IT developers, data scientists, domain experts and business analysts.]

Need to work together for competitive advantage


Chaos Is NOT An Option – Business Alignment Of Information Being Produced Is Critical To Success

[Figure: Big Data, DW and MDM projects aligned to strategic objectives that come from the business strategy.]

• What problem are you trying to solve?

• What data do you need?

• What kind(s) of analytic workload are needed?

We need co-ordinated “info producer” projects in a managed environment


Key Capabilities In A Managed Data Reservoir - 1

Data collection

• Automated discovery of the structure and formatting

• Data structure inferred by machine learning

• Automated cataloging, infinite storage and processing

Data classification

• Determines how data should be governed

• Support is needed for different types of classification schemes, e.g.

– Retention: Unclassified, Temporary, Project Lifetime, Managed period, Permanent

– Confidentiality: Unclassified, Internal use, Business confidential, Supplier confidential, Sensitive (PII), Sensitive (Financial), Sensitive (Operations), Restricted (Trade secret)

– Confidence: Unclassified, Raw (original), Obsolete, Archived, Trusted

– Business Value: Unclassified, Unimportant, Marginal, Important, Critical, Catastrophic
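
To make "classification determines how data should be governed" concrete, here is a minimal Python sketch. The enum values come from the schemes listed on this slide (Business Value is left out for brevity); the `Classification` record, the `governance_actions` function and the action names are hypothetical illustrations, not part of any product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Retention(Enum):
    UNCLASSIFIED = "Unclassified"
    TEMPORARY = "Temporary"
    PROJECT_LIFETIME = "Project Lifetime"
    MANAGED_PERIOD = "Managed period"
    PERMANENT = "Permanent"


class Confidentiality(Enum):
    UNCLASSIFIED = "Unclassified"
    INTERNAL_USE = "Internal use"
    BUSINESS_CONFIDENTIAL = "Business confidential"
    SUPPLIER_CONFIDENTIAL = "Supplier confidential"
    SENSITIVE_PII = "Sensitive (PII)"
    SENSITIVE_FINANCIAL = "Sensitive (Financial)"
    SENSITIVE_OPERATIONS = "Sensitive (Operations)"
    RESTRICTED = "Restricted (Trade secret)"


class Confidence(Enum):
    UNCLASSIFIED = "Unclassified"
    RAW = "Raw (original)"
    OBSOLETE = "Obsolete"
    ARCHIVED = "Archived"
    TRUSTED = "Trusted"


@dataclass(frozen=True)
class Classification:
    """Classification attached to a data set in the catalog (hypothetical record)."""
    retention: Retention = Retention.UNCLASSIFIED
    confidentiality: Confidentiality = Confidentiality.UNCLASSIFIED
    confidence: Confidence = Confidence.UNCLASSIFIED


def governance_actions(c: Classification) -> List[str]:
    """Derive governance actions from the classification (illustrative rules only)."""
    actions: List[str] = []
    if c.confidentiality.name.startswith("SENSITIVE"):
        actions.append("mask-on-read")
    if c.confidentiality is Confidentiality.RESTRICTED:
        actions.append("restrict-access")
    if c.retention is Retention.TEMPORARY:
        actions.append("purge-after-use")
    if c.confidence is not Confidence.TRUSTED:
        actions.append("block-publication")
    return actions


web_logs = Classification(
    Retention.TEMPORARY, Confidentiality.SENSITIVE_PII, Confidence.RAW
)
print(governance_actions(web_logs))
# → ['mask-on-read', 'purge-after-use', 'block-publication']
```

The point of the sketch is that the rules key off the classification, not off the individual data set, so newly ingested data is governed as soon as it is classified.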


Key Capabilities In A Managed Data Reservoir - 2

Collaborative data governance

• Data quality

• Data trustworthiness (confidence)

• Data protection

– Data privacy, access authorisation, lifecycle management

• Compliance

Data refinery

• Systematically clean and refine data through various stages

• Manual and guided data preparation

• “Sandbox” analyse data to produce high value insights

Data as a Service (DaaS)

• Published high value insights available for consumption

• Search for and discover trusted insights, subscribe to receive it

Data consumption

• Provision refined, trusted commonly understood data into any tool or application


A Data Reservoir Is An Organised Collection Of Raw, In-Progress And Trusted Data (Multiple Data Stores)

[Figure: the data reservoir as a collection of stores, ranging from raw, untrusted data under some governance to refined, trusted and integrated data under strong governance. Sources shown: IoT feeds, XML/JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services, NoSQL. Stores shown: staging areas, ODS, ECM, cloud object storage, Hive tables, archived DW data, filtered sensor data, text/image/video, search indexes, in-progress data, RDM (code sets), MDM (Prod, Asset, Cust), DW, data marts and published trusted data, with data virtualisation services on top.]

Data Reservoir (not a data store but a collection of stores)

Data sources and ingested reservoir data are all known to the catalog (Info Catalog)


Raw Data Is Being Collected In Multiple Places Across The Enterprise – We Need To Know What’s Happening!

[Figure: data arriving by replication, streaming, batch load and archive.]

We need to avoid unconnected silos

But we HAVE TO know what is being collected and filtered and where that is happening

Also who is doing it, for what business purpose?


If Multiple Collection Points Exist Then Something Has To Catalog What Data Is Available, Its Status And Where It Is

All data entering a reservoir needs to be catalogued and organised

You need to know what data is available across the enterprise, where it came from, what state it is in, whether you should trust it and whether you can order it

Information Catalogue


A Distributed Data Reservoir Requires Information Management Software To Work Across Multiple Data Stores

Enterprise Information Management (Catalog, DQ, ETL, Security, Privacy…)

The Data Reservoir is distributed but it should be managed and function as if it were centralised

Key requirements

Define once, execute anywhere

Centralised metadata

Distributed execution of policies associated with data quality, ETL, security and lifecycle management across the landscape (multiple execution engines)
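
One way to picture "define once, execute anywhere" is a rule held in central metadata that is compiled for whichever execution engine sits next to the data. The Python sketch below is illustrative only: `QualityRule`, `compile_for_sql` and `compile_for_python` are hypothetical names, and real EIM suites generate far richer jobs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class QualityRule:
    """A data quality rule defined once in central metadata (hypothetical)."""
    name: str
    column: str
    check: str  # only "not_null" is supported in this sketch


def compile_for_sql(rule: QualityRule, table: str) -> str:
    """Render the rule as a query for a SQL-speaking engine."""
    if rule.check != "not_null":
        raise ValueError(f"unsupported check: {rule.check}")
    return f"SELECT * FROM {table} WHERE {rule.column} IS NOT NULL"


def compile_for_python(rule: QualityRule) -> Callable[[List[Row]], List[Row]]:
    """Render the same rule as an in-memory filter for a Python engine."""
    if rule.check != "not_null":
        raise ValueError(f"unsupported check: {rule.check}")

    def run(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(rule.column) is not None]

    return run


# One definition, two execution targets.
rule = QualityRule("customer_id_present", "customer_id", "not_null")
sql = compile_for_sql(rule, "raw_orders")
clean = compile_for_python(rule)([{"customer_id": 1}, {"customer_id": None}])
print(sql)    # → SELECT * FROM raw_orders WHERE customer_id IS NOT NULL
print(clean)  # → [{'customer_id': 1}]
```

Because both targets are generated from the same definition, the rule cannot drift between platforms, which addresses the risk of duplicate, inconsistent DI/DQ rules.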


A Distributed Data Reservoir Requires Management And Governance As If It Was Centralised

[Figure: data arriving by replication, streaming, batch load and archive.]

The data in the reservoir is distributed but the reservoir is managed and operated as if it were centralised


Information Production Is A Process That Involves Refining And Integrating Data

[Figure: reservoir storage holds raw data, in-progress data and trusted data; refining raw data into trusted data yields high value information and/or insights available for consumption.]

Collaboration is needed to perform many tasks in producing information, e.g. selecting & transforming data


The Information Production Process Works Across Zones In The Reservoir – Zones Created By Tagging Files

[Figure: the data reservoir, with data reservoir management and an Info Catalog. Sources (IoT, RDBMS, office docs, social, cloud, clickstream, web logs, XML/JSON, web services, NoSQL, files, DW data, streams) land in a data ingestion zone (transient data) and a raw data zone. ETL, data prep and DQ take place in the refinery zone (prepare & analyse data; in-progress data), which has sandboxes for exploratory analysis. Also shown: a trusted data zone (master and reference data, DW archive), a refined data & insights zone and a data marketplace.]
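
Because zones are created by tagging files, a zone can be modelled as a label in the catalog, not a physical location. The Python sketch below assumes a hypothetical in-memory `InfoCatalog`; the zone names follow the slide, while the paths and method names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Zone names taken from the slide; the ordering is illustrative.
ZONES = ("ingestion", "raw", "refinery", "trusted", "refined")


@dataclass
class CatalogEntry:
    path: str
    zone: str = "ingestion"
    tags: Set[str] = field(default_factory=set)


class InfoCatalog:
    """Hypothetical catalog in which a zone is just a tag on an entry."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, path: str, *tags: str) -> CatalogEntry:
        entry = CatalogEntry(path=path, tags=set(tags))
        self._entries[path] = entry
        return entry

    def move_to_zone(self, path: str, zone: str) -> None:
        if zone not in ZONES:
            raise ValueError(f"unknown zone: {zone}")
        # Retag only: the file itself stays where it is.
        self._entries[path].zone = zone

    def zone_contents(self, zone: str) -> List[str]:
        return sorted(p for p, e in self._entries.items() if e.zone == zone)


catalog = InfoCatalog()
catalog.register("/landing/clickstream/2016-04-13.json", "clickstream")
catalog.register("/landing/crm/customers.csv", "customer")
catalog.move_to_zone("/landing/crm/customers.csv", "raw")
print(catalog.zone_contents("raw"))  # → ['/landing/crm/customers.csv']
```

Moving data between zones is then a metadata update, which suits data that is too large to copy.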


Organising Data In A Reservoir – The Catalog Knows About Data Sources Plus Data In All Zones And Sandboxes

[Figure: the reservoir zone diagram (data sources, data ingestion zone, raw data zone, refinery zone with sandboxes, trusted data zone, refined data & insights zone, data marketplace) with the Info Catalog covering the data sources and every zone and sandbox.]



Operating A Data Reservoir – The Information Production Process Is A Production Line That Spans Reservoir Zones

[Figure: the reservoir zone diagram annotated with the production line steps listed below.]

Nominate new data

Classify sensitivity, quality, retention

Tag data (what’s it mean?)

Assign governance policies based on classification

Collaborate about processing

Track data freshness

Rate its value (★★★★)

Map to shared business vocabulary

Analyse, consume

Reservoir operations are controlled via the catalog and workflow processes


Operating A Data Reservoir – Workflows Are Everywhere And Are Components Of An Information Production Process

[Figure: the reservoir zone diagram overlaid with workflows: ingest, stream, movement, refinery, analytical, governance, publish and provision workflows.]


Trends – Data And Analytical Workflow (Pipeline) Products Requiring No Programming Are Emerging Everywhere

[Screenshots: Talend, Alteryx, Microsoft Azure Data Factory, Hortonworks Dataflow (NiFi), Dell Statistica]

Who is using what tools?

Any reinvention?


Operating A Data Reservoir – All Workflows Should Be Approved And Registered In The Information Catalog

[Figure: the reservoir zone diagram overlaid with ingest, stream, movement, refinery, analytical, governance, publish and provision workflows, plus three virtual views.]

Convert self-service data integration (SSDI) workflows to data virtualisation views to minimise re-invention and enforce governance


Data Strategy Requirements – We Need To Enable Information Producers And Information Consumers

Need to make use of

• A business glossary and information catalog

• Re-usable services to manage and process data

• Collaboration and social computing to manage, process and rate data

• Role-based data management tools aimed at IT AND business

[Figure: information producers (data scientists, IT professionals) use clean & integrate services to turn raw data into trusted data and publish it to the information catalog; information consumers (business analysts) search, find, shop, order and consume it in a BI tool or application, like a “corporate iTunes” for data.]


A ‘Production Line’ Publish And Subscribe Approach Is Used To Accelerate Information And Insight Production

[Figure: a chain of publish and subscribe stages, each publishing to a catalog that the next stage subscribes to and consumes from. Data sources are crawled, discovered and profiled, and the discovered data is published to the info catalog. Data is acquired, prepared (clean, transform, filter) and integrated, and published to the info catalog as trusted data as a service and trusted, integrated data as a service. Analysis (e.g. scoring) publishes new predictive analytic pipelines (as a service) to the analytics catalog. Visualise, decide, act and other uses (e.g. embedding in analytic applications) publish new prescriptive analytic pipelines and new analytic applications to the solutions catalog.]
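
The publish and subscribe production line can be sketched in a few lines of Python. This is an assumption-laden illustration, not a description of any catalog product: `InfoCatalog`, its topics ("trusted-data", "analytics") and the `score_customers` step are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Handler = Callable[[str, object], None]


class InfoCatalog:
    """Hypothetical catalog: producers publish, downstream steps subscribe."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: Dict[Tuple[str, str], object] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, name: str, payload: object) -> None:
        self.published[(topic, name)] = payload
        # Notify a snapshot of the subscribers so handlers may publish too.
        for handler in list(self._subscribers[topic]):
            handler(name, payload)


catalog = InfoCatalog()


def score_customers(name: str, rows: object) -> None:
    """Analytical step: consumes trusted data, publishes a new insight."""
    insight = {"source": name, "row_count": len(rows)}  # stand-in for a real score
    catalog.publish("analytics", f"{name}.score", insight)


catalog.subscribe("trusted-data", score_customers)
catalog.publish("trusted-data", "customers", [{"id": 1}, {"id": 2}])
print(catalog.published[("analytics", "customers.score")])
# → {'source': 'customers', 'row_count': 2}
```

Each stage only needs to know the catalog, not the stages before or after it, which is what lets new producers and consumers join the line without rework.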


Cataloging, Automated Discovery And Collaboration Are All Needed When Data Is Ingested

[Figure: the reservoir zone diagram with the Info Catalog applied as data is ingested, through to analyse and consume.]

Automated relationship discovery, data profiling, and document clustering

Catalog, tag and describe data/files (what’s it about?)

Collaborative appraisal

Descriptive metadata is critical to keeping things organised


Governance In A Data Reservoir Is Controlled By Classification And Metadata In The Information Catalog

Classifications drive the governance

[Figure: governance metadata model. Entities shown: Policy, Information Governance Rule, Governance Rule, Classification, Business Attribute, Physical Data Description, Process, Metrics, Governance Action, Engine, Operational Log, IT Landscape, and Data store / Document / File / API. Relationship labels shown: governs, implemented by, classified by, mapped to, actioned by, assessed by, measures, describes, deployed to, accesses, feeds, logs activity.]

Source: IBM


IBM Are Creating ‘Governance Aware’ Runtimes To Verify And Enforce Policies In A Data Reservoir

Source: IBM

They access the information catalog to determine what to do at run time


We Need A Data Refinery To Process, Clean And Analyse Data To Produce Consumable High Value Insight

[Figure: candidate places to refine data, in the cloud and on-premises: DW, analytical RDBMS, ETL server, data virtualisation server.]

A data refinery should be able to choose where to best refine data to produce the information needed


A Key Requirement In A Distributed Data Reservoir Is Centralised Development, Distributed Execution

[Figure: the data reservoir (not a data store but a collection of stores) and its Info Catalog, with an EIM tool suite (profiling, cleansing, ELT) that offers an IT user interface and a self-service UI, and an execution engine alongside each store (staging areas, ODS, DW, data marts, MDM, RDM, ECM, Hive tables, cloud object storage and others), with data virtualisation services on top.]


If A Data Reservoir Is Distributed With Data Too Big To Move Then Processing Needs To Go To The Data

[Figure: tasks pushed to execution engines next to on-premises storage, the DW staging area and cloud storage.]

Not centralised, not distributed, but federated


Options For Refining Data

IT developed ETL processing using EIM tool suites

Self-service data integration

Multi-role EIM tool suites

• Can be used by both IT AND business users

Data virtualisation server

A combination of the above


Scaling ETL Transformations For In-Hadoop ELT Processing

Data Cleansing and Integration Tool: Extract, Parse, Clean, Transform, Analyse, Load Insights

Option 1

ETL tool generates HQL or converts generated SQL to HQL

Option 2

ETL tool generates Pig (the compiler converts every transform to a MapReduce job) or JAQL

Option 3

ETL tool generates 3GL MR or Spark code

Option 4 – Other

Native massively parallel transformation and integration bypassing any Hadoop execution engine

E.g. Talend, IBM BigIntegrate, Informatica
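
Option 1 (the tool generates HQL so the transformation runs inside Hadoop) can be illustrated with a toy generator in Python. This is a sketch under stated assumptions: the `Transform` specification and `to_hql` function are hypothetical and far simpler than what a commercial ETL tool emits; only the `INSERT OVERWRITE TABLE ... SELECT` statement form is standard HiveQL.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Transform:
    """Declarative transform as an ETL tool might hold it (hypothetical)."""
    source: str
    target: str
    columns: Dict[str, str]  # target column -> source expression
    where: Optional[str] = None


def to_hql(t: Transform) -> str:
    """Generate a HiveQL statement so the work is pushed down into Hadoop."""
    select = ", ".join(f"{expr} AS {col}" for col, expr in t.columns.items())
    hql = f"INSERT OVERWRITE TABLE {t.target} SELECT {select} FROM {t.source}"
    if t.where:
        hql += f" WHERE {t.where}"
    return hql


clean_customers = Transform(
    source="raw_customers",
    target="clean_customers",
    columns={"customer_id": "id", "email": "lower(trim(email))"},
    where="id IS NOT NULL",
)
print(to_hql(clean_customers))
# → INSERT OVERWRITE TABLE clean_customers SELECT id AS customer_id, lower(trim(email)) AS email FROM raw_customers WHERE id IS NOT NULL
```

The same declarative specification could equally be rendered as Pig or Spark code (Options 2 and 3), which is why the choice of generated language can be treated as a scaling decision and not a redesign.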


Self-Service Data Integration Tool Vendors

Actian Dataflow

Alteryx

ClearStory Data

Datameer

IBM DataWorks

Informatica Rev

Paxata

SAS Data Loader for Hadoop

Tamr

Trifacta

[Figure: two product scopes across the pipeline data → data integration → analyse (e.g. score) → visualise → decide → act / embed. Some tools cover data preparation, integration, analysis & visualisation; others cover data preparation and integration only.]


Some Data Management Vendors Are Trying To Cover All Roles And Integrate With Other Vendors, e.g. Informatica

[Figure: an EIM tool suite with a data governance/management console over shared metadata. Services shown: data & metadata relationship discovery; data quality profiling & monitoring; data modeling; data cleansing & matching; data integration; business glossary / info catalog; data privacy & lifecycle management; data audit & protection. Tools shown: Informatica Catalog & Live Data Map, Analyst tool, Informatica Rev (self-service), cloud DI. Roles shown: IT data architect, data scientist, business analyst.]


Some Vendors Are Opening Up Their Service Oriented Data Management Platforms To IT AND Business Users

[Figure: a service-oriented data management platform on an enterprise service bus (ESB), with shared metadata and workflow. Information services shown: data & metadata relationship discovery; data quality profiling & monitoring; data modeling; data cleansing & matching; data integration (virtualisation & ETL); business glossary / info catalog; data privacy & lifecycle management; data audit & protection. The services are reached from applications and self-service tools by the business user, IT developer and IT data architect, and serve MDM, data warehousing, Big Data, data virtualisation and cloud.]

Role-based UIs to the same data management platform

Workflow


Alternatively Interoperability Is Needed Across Tools To Use Data Preparation Jobs Developed By Different Users

[Figure: metadata interoperability between an EIM tool suite (relationship discovery, data quality profiling & monitoring, data modeling, cleansing & matching, data integration, business glossary / info catalog, privacy & lifecycle management, audit & protection services; used by the IT data architect, data scientist and business analyst) and other tools: stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (e.g. PowerQuery); cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).]


Metadata Management In A Data Reservoir

- EIM Platform Information Catalog And Apache Atlas

[Figure: the information catalog of the EIM tool suite exchanges metadata with stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (e.g. PowerQuery, used by the business analyst) and cloud DI (Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev), and with Apache Atlas and its graph store.]


Metadata Management In A Data Reservoir

- Stand-Alone Information Catalog And Apache Atlas

[Figure: a stand-alone information catalog exchanges metadata with the EIM tool suite, stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (e.g. PowerQuery, used by the business analyst) and cloud DI (Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev), and with Apache Atlas and its graph store.]


New Trusted Data Produced By Refining Un-Modelled Data Should Be Defined In A Business Glossary

[Figure: inside the corporate firewall, the data refinery and its sandbox turn raw (untrusted) data into in-progress data and then refined (trusted, fit for use) data, which is defined in the business glossary and exposed through data virtualisation.]

Could implement the shared business vocabulary (SBV) in a data virtualisation server


The Critical Importance Of An Information Catalog

– We MUST Be Able To Answer This Question

Business user: What information exists about……….?

Where is that likely to be documented? An Information Catalogue


The Information Catalog

- What Else Do I Want To Know?

Can I search for information? (faceted search via your SBV)

Does the data exist?

Is the data trusted? (what is the rating)

Is the data sensitive? (what is the rating)

Is it of high business value? (what is the rating)

Can I order it?

Can I specify where to deliver it to and in what format?

Can I see where it is used and who owns it?

Information Catalogue
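
The questions above map naturally onto a faceted search over catalog entries. The Python sketch below is a minimal illustration with made-up data: the `CatalogEntry` fields, the 1 to 5 trust scale and the `search` helper are all assumptions and do not reflect the API of any catalog product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    business_term: str   # term from the shared business vocabulary (SBV)
    trust_rating: int    # assumed scale: 1 (low) to 5 (high)
    sensitive: bool
    business_value: str
    owner: str


def search(entries: List[CatalogEntry], min_trust: int = 0, **facets: object) -> List[CatalogEntry]:
    """Return entries that match every facet exactly and meet the trust threshold."""
    return [
        e for e in entries
        if e.trust_rating >= min_trust
        and all(getattr(e, key) == value for key, value in facets.items())
    ]


entries = [
    CatalogEntry("crm_customers_raw", "Customer", 2, True, "Important", "sales-ops"),
    CatalogEntry("customer_360", "Customer", 5, True, "Critical", "data-office"),
    CatalogEntry("web_sessions", "Web Visit", 4, False, "Marginal", "digital"),
]

# "What trusted information exists about customers?"
hits = search(entries, min_trust=4, business_term="Customer")
print([e.name for e in hits])  # → ['customer_360']
```

Each facet corresponds to one of the questions on the slide: existence, trust, sensitivity, business value and ownership.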


Information Catalog Example - Waterline Data


Faceted Navigation Used In E-Commerce (e.g. Amazon) Is About To Get A Much Bigger Role In Data Management

Select the products you want

Add it to your cart


Ordered Parcel Delivery – The Same Thing Will Happen To Provision Ordered Data

Ordered data


Virtual Information Provisioning Needs Policy Awareness At Runtime To Create Virtual Views That Enforce Governance

[Figure: “finished-goods” refined data in the data reservoir, all of it described by the SBV, is exposed through data virtualisation. An information provisioning service applies a security policy to return either a virtual full data set (all data permitted to be seen) or a virtual data subset (some data not permitted to be seen). A second information provisioning service applies a compliance policy: a virtual full data set where all data is provisioned inside the country, or a virtual data subset where some data is not allowed to be provisioned outside the country.]
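
A policy-aware provisioning service can be sketched as a function that decides, per request, which columns the virtual view may expose. Everything named in this Python example is hypothetical (the column names, the roles, the `HOME_COUNTRY` value and the `virtual_view` function); it only illustrates the security and compliance cases described on this slide.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical policy metadata that would normally come from the catalog.
HOME_COUNTRY = "IE"
SENSITIVE_COLUMNS = frozenset({"salary"})         # security policy
EXPORT_RESTRICTED = frozenset({"national_id"})    # compliance policy
PRIVILEGED_ROLES = frozenset({"hr_analyst"})


@dataclass(frozen=True)
class Request:
    role: str
    country: str  # where the data would be delivered


def virtual_view(rows: List[Row], request: Request) -> List[Row]:
    """Build the view at request time, hiding columns the policies forbid."""
    hidden = set()
    if request.role not in PRIVILEGED_ROLES:
        hidden |= SENSITIVE_COLUMNS
    if request.country != HOME_COUNTRY:
        hidden |= EXPORT_RESTRICTED
    return [{k: v for k, v in row.items() if k not in hidden} for row in rows]


refined = [{"name": "Ann", "salary": 50000, "national_id": "X1"}]

full = virtual_view(refined, Request(role="hr_analyst", country="IE"))
subset = virtual_view(refined, Request(role="marketing", country="US"))
print(full)    # → [{'name': 'Ann', 'salary': 50000, 'national_id': 'X1'}]
print(subset)  # → [{'name': 'Ann'}]
```

The underlying refined data is never copied or altered; only the view differs per consumer, which is what allows governance to be enforced at runtime.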


Conclusions

The challenge is now to manage data in the entire analytical ecosystem

Invest in new skills and training needed in this environment

Data needs to be organised in a data reservoir to prevent chaos

Hadoop is becoming a platform to accelerate cleansing and ETL processing to conduct exploratory analytics

Multiple options exist to allow IT and business users to clean and integrate data in preparation for analysis

• Data integration vendors have added functionality to support Hadoop

• Self-service data cleansing and integration tools also exist

The ideal solution is a single platform that supports IT and business user self-service data integration

An information catalog is critical for end-to-end data governance

• Understanding what data is available (descriptive metadata)

• Understanding how it was transformed (metadata lineage)

Data virtualisation is needed to see across multiple data reservoirs

Start small and build out incrementally – don’t just load data and hope



Thank You!
