Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Central Hub: Defining the Data Lake

Page 1: The Central Hub: Defining the Data Lake

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: The Central Hub: Defining the Data Lake

The Data Lake Survival Guide Exploratory Webcast | October 26, 2016

SPONSORED BY

Page 3: The Central Hub: Defining the Data Lake

Presenting

Robin Bloor, Chief Analyst, The Bloor Group (@robinbloor)

Host: Eric Kavanagh, CEO, The Bloor Group (@eric_kavanagh)

Dez Blanchfield, Data Scientist, The Bloor Group (@dez_blanchfield)

Page 4: The Central Hub: Defining the Data Lake

Findings Webcast January 12, 2017

Data Lake Survival Guide

Roundtable Webcast December 8, 2016

Exploratory Webcast October 26, 2016

Page 5: The Central Hub: Defining the Data Lake

Data Lake Survival

Robin Bloor, PhD

Page 6: The Central Hub: Defining the Data Lake

The Sequence of Topics…

1  Disturbance in the Force

2  What is a Data Lake, exactly?

3  Streams and Events

Page 7: The Central Hub: Defining the Data Lake

1

Disturbance in the Force

Page 8: The Central Hub: Defining the Data Lake

The Generic Dimensions of IT

- All IT involves (only) 4 components:
  - Users
  - Software
  - Data
  - Hardware
- They all relate to each other
- Change any one of these and the other three components have to adjust
- Aggregate these and you get a process
- Time will impose change anyway
- We can also consider:
  - Staff
  - Business Processes
  - Business Information
  - Facility
- And also:
  - People
  - Information
  - Human Activity
  - Civilization (Stuff)

[Figure: Four Fundamental (IT) Factors. Labels: Hardware, Users, Software, Data; Staff, Business Process, Business Information, Facility; People, Human Activity, All Information, Civilization; TIME.]

Page 9: The Central Hub: Defining the Data Lake

The Technology Layers

- The buying impulse descends through the stack
- The impact of technology change rises up the stack
- This ensures the eventual “legacification” of all technology

[Figure: The Technology Layers. The buying impulse goes down; technology change rises up.]

Page 10: The Central Hub: Defining the Data Lake

Disruption in the Technology Layers

- Disruption (as innovation) can happen in any layer
- Where it occurs it will impact all layers above it
- And it may also impact the layers below it (but less quickly)
- There is no such thing as future-proof, but some technologies definitely live longer

[Figure: The Technology Layers. The buying impulse goes down; technology change rises up.]

Page 11: The Central Hub: Defining the Data Lake

Tech Revolutions

- Mainframe Computer (Batch architecture)
- On-line Interaction (Centralized architecture)
- PC (Client Server)
- Internet (Multi-tier architecture)
- Mobile (Service Oriented architecture)
- Internet of Things (Event Driven Architecture)

Note that all of these disruptive changes were driven by hardware innovation.

[Figure labels: Cloud; Centralized Computer Systems; PC Based Systems; Integrated Systems]

- Centralized Computer Systems: limited process power, terminals only, few applications, no external data sources
- PC Based Systems: moderate process power, PCs, spreadsheets & email, many applications, few external data sources
- Integrated Systems: extensive process power, PCs & apps, analytics capability, wealth of applications, many external data sources

Page 12: The Central Hub: Defining the Data Lake

Parallelism: The Imp Out of the Bottle

- Multicore chips enabled parallelism
- It has changed the whole performance equation
- It enabled Big Data
- Big Data is really Big Processing

Page 13: The Central Hub: Defining the Data Lake

The Impact of Parallelism

We used to see a 10x performance improvement every 6 years; now we see 1000x (and that’s just an approximation).
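To put the two figures on the same footing, the sketch below converts each to a constant yearly multiplier. It assumes the 1000x figure refers to the same 6-year window as the 10x figure, which the slide does not state outright; `annual_factor` is a name made up for this illustration.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


old_rate = annual_factor(10, 6)    # 10x every 6 years
new_rate = annual_factor(1000, 6)  # 1000x, ASSUMING the same 6-year window

print(f"old: about {old_rate:.2f}x per year")  # about 1.47x per year
print(f"new: about {new_rate:.2f}x per year")  # about 3.16x per year
```

Under that assumption, yearly improvement goes from under 1.5x to over 3x.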

Page 14: The Central Hub: Defining the Data Lake

Hardware Factors

- CPUs, GPUs & FPGAs
- Cross breeding
- SoCs
- 3D XPoint and PCM (and memristor?)
- SSDs & parallel access
- Parallel hardware architectures

Performance is accelerating and costs continue to fall.

Page 15: The Central Hub: Defining the Data Lake

The Perfect Storm (Software)

- The triumph of Open Source as a business model
- The dominance of Apache
  - Hadoop, the platform for data
  - Spark, for speed
  - Kafka, for connectivity
- The triumph of the cloud and its dominance
- Little data is also big data
- Cost challenges

Page 16: The Central Hub: Defining the Data Lake

Then the Data Lake evaporated into the Cloud

2

What is a Data Lake?

Page 17: The Central Hub: Defining the Data Lake

Everything in flux

- Hardware (network, storage, servers)
- Data Sources
- Data Staging
- Data Volumes
- Data Flow
- Data Governance
- Data Usage
- Data Structures
- Schema definition
- Ingest Speeds
- Data Workloads

Page 18: The Central Hub: Defining the Data Lake

Hadoop Applications

Page 19: The Central Hub: Defining the Data Lake

The Scale Out Applications

- Data Ingest & Staging
- Data Governance
- Software development platform
- Analytics environment
- Database/Data Warehouse
- Data Archiving
- Video rendering & other niche apps

The Data Lake involves just the first two and does not necessarily involve Hadoop

Page 20: The Central Hub: Defining the Data Lake

Data Lake, Refinery, Hub, in Overview

Think Logical, Implement Physical

Page 21: The Central Hub: Defining the Data Lake

The Data Lake Analytics Picture

[Figure labels: Data Sources; Data Streams; ETL; Staging Area (Hadoop); Data Warehouse or other location; Analytics; Service Mgt; Life Cycle Mgt; MetaData Discovery; MDM; MetaData Mgt; Data Cleansing; Data Lineage; ROUND UP; WRANGLING]

Page 22: The Central Hub: Defining the Data Lake

How Data Gets to be Wrong

- Accidentally born wrong
- Deliberately born wrong
- Defective sensor/data source
- Murdered (truncated, overwritten)
- Corrupted in flight (rare)
- Corrupted by bad code (surely not!)
- Corrupted by bad DBA

Page 23: The Central Hub: Defining the Data Lake

Data Governance

If data governance was important before Big Data (and it was), it is far more important in the era of Data Lakes.

Page 24: The Central Hub: Defining the Data Lake

What Needs To Be Governed

Page 25: The Central Hub: Defining the Data Lake

Data Governance

- Data Flows and Data Storage
- Security & Access
- Data cleansing and transformation
- Data meaning
- Data provenance and lineage
- Data archive and disposal
- Availability and performance

Page 26: The Central Hub: Defining the Data Lake

Analytics Is a Process Not an Activity

- Data Analytics is a multi-disciplinary end-to-end process
- Until recently it was a walled garden. But the walls were torn down by:
  - Data availability
  - Scalable technology
  - Open source tools
- It is now becoming an integrated process

Data Governance is a process, not an activity!!

Page 27: The Central Hub: Defining the Data Lake

The Global Map and Data Options

- Move the data to the processing
- Move the processing to the data
- Move the processing and the data
- Shard

All network nodes can be data creators, data stores and processing points.
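As a rough illustration of the "Shard" option, the sketch below assigns each record key to a node using a stable hash, so that any node can work out where a given record lives. The node names, record keys and the `shard_for` helper are hypothetical; real systems often use consistent hashing or range partitioning instead of this simple modulo scheme.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer, then fold into the shard range.
    return int.from_bytes(digest[:8], "big") % num_shards


nodes = ["node-a", "node-b", "node-c"]               # hypothetical node names
records = ["sensor-17", "sensor-18", "order-9001"]   # hypothetical record keys

# Each key always lands on the same node, whichever node computes the mapping.
placement = {key: nodes[shard_for(key, len(nodes))] for key in records}
print(placement)
```

A modulo scheme like this reshuffles most keys when the number of nodes changes, which is one reason production systems tend to prefer consistent hashing.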

Page 28: The Central Hub: Defining the Data Lake

Logical Data Lakes

Soon we will be speaking of a logical data lake and multiple physical data lakes.

Page 29: The Central Hub: Defining the Data Lake

3

Events and Streams

Page 30: The Central Hub: Defining the Data Lake

Big Data, Event Data – The Data of Everything

WHAT IS BIG DATA?

- Business data
- Traditional data
- Log file data
- Operational data
- Mobile data
- Location data
- Social network data
- Public data
- Commercial databases
- Streaming data
- Internet of Things

Page 31: The Central Hub: Defining the Data Lake

Events: Atoms and Molecules

The ATOM of data has become the EVENT

A TRANSACTION is a MOLECULE of ATOMIC EVENTS
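One way to picture the atom and molecule idea is a transaction built from several smaller events. The sketch below models a funds transfer as two events; the `Event` and `Transaction` types, their fields and the balance check are all assumptions made for illustration, not something the slides define.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """An atomic event: one change to one account (hypothetical fields)."""
    account: str
    amount_cents: int  # signed change to the account balance


@dataclass(frozen=True)
class Transaction:
    """A 'molecule': a group of atomic events that belong together."""
    events: Tuple[Event, ...]

    def is_balanced(self) -> bool:
        # In this sketch a transfer is valid when its events net to zero.
        return sum(e.amount_cents for e in self.events) == 0


transfer = Transaction(events=(Event("alice", -500), Event("bob", 500)))
print(transfer.is_balanced())  # True
```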

Page 32: The Central Hub: Defining the Data Lake

It’s Become an Event-Based World

Page 33: The Central Hub: Defining the Data Lake

Events

Think of events as drops of water. They can live in streams, and they can also live in data pools and data lakes.

Page 34: The Central Hub: Defining the Data Lake

Two Data Flows

Page 35: The Central Hub: Defining the Data Lake

The Traffic Cop (Events)

Page 36: The Central Hub: Defining the Data Lake

Event Types

- Instantiation Event
- A State Report
- A Trigger Event
- A Correction Event

We also need to consider:

- Data Refinement
- Aggregations
- Homogeneous Collections
- Derived Data
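The four event types above could be modelled as an enumeration, with current state derived by replaying an event log. The semantics in this sketch are assumptions, since the slide only names the types: an instantiation creates an entity, a state report or correction overwrites its value, and a trigger leaves stored state unchanged. The entity name and values are made up.

```python
from enum import Enum
from typing import Dict, Iterable, Optional, Tuple


class EventType(Enum):
    INSTANTIATION = "instantiation"
    STATE_REPORT = "state_report"
    TRIGGER = "trigger"
    CORRECTION = "correction"


# (event type, entity name, optional reported value) -- hypothetical layout
EventTuple = Tuple[EventType, str, Optional[float]]


def current_state(events: Iterable[EventTuple]) -> Dict[str, Optional[float]]:
    """Replay events in order and return the latest value per entity."""
    state: Dict[str, Optional[float]] = {}
    for kind, entity, value in events:
        if kind is EventType.INSTANTIATION:
            state[entity] = value
        elif kind in (EventType.STATE_REPORT, EventType.CORRECTION):
            if entity in state:  # ignore reports for unknown entities
                state[entity] = value
        # TRIGGER events request an action; here they do not change state.
    return state


log = [
    (EventType.INSTANTIATION, "pump-1", None),
    (EventType.STATE_REPORT, "pump-1", 71.0),
    (EventType.TRIGGER, "pump-1", None),
    (EventType.CORRECTION, "pump-1", 17.0),  # fixes a mistyped reading
]
print(current_state(log))  # {'pump-1': 17.0}
```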

Page 37: The Central Hub: Defining the Data Lake

Event Based IoT Architecture

- The pulse and the threshold alert
- Some of this involves distributed processing
- There are known apps and unknown apps, so analytical exploration needs to be enabled
- Only aggregations will migrate

[Figure labels: Sensors, controllers, CPUs; Source Proc.; Depot Proc.; Central Proc.; Depot; Central Hub; Data]
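A minimal sketch of source-side processing in this kind of architecture: raw readings stay local, a threshold alert is raised immediately, and only an aggregate is passed up toward the depot or central hub. The threshold value, the summary fields and the `process_at_source` name are assumptions for illustration, and the "pulse" (heartbeat) is not modelled here.

```python
from statistics import mean
from typing import Dict, List, Tuple

THRESHOLD = 80.0  # hypothetical alert threshold


def process_at_source(readings: List[float]) -> Tuple[List[str], Dict[str, float]]:
    """Return (alerts, summary) for a non-empty batch of local readings."""
    # Threshold alert: flag any reading above the limit right away.
    alerts = [f"threshold exceeded: {r}" for r in readings if r > THRESHOLD]
    # Only this aggregate migrates upstream; the raw readings stay put.
    summary = {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }
    return alerts, summary


alerts, summary = process_at_source([70.0, 75.0, 90.0, 65.0])
print(alerts)   # ['threshold exceeded: 90.0']
print(summary)  # {'count': 4, 'min': 65.0, 'max': 90.0, 'mean': 75.0}
```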

Page 38: The Central Hub: Defining the Data Lake

Events and Event Data

- Time
- Geographic location
- Virtual/logical location
- Source device
- Device ID
- Actors
- Ownership/Provenance
- Values
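The attributes listed above map fairly naturally onto a record type. The sketch below is one possible shape; the field names, types and sample values are assumptions, not a schema taken from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class EventRecord:
    time: datetime
    geographic_location: Optional[Tuple[float, float]]  # (latitude, longitude)
    logical_location: Optional[str]   # virtual/logical location
    source_device: str
    device_id: str
    actors: Tuple[str, ...]
    provenance: str                   # ownership/provenance
    values: Dict[str, Any] = field(default_factory=dict)


# All sample values below are invented for illustration.
reading = EventRecord(
    time=datetime(2016, 10, 26, 15, 0, tzinfo=timezone.utc),
    geographic_location=(51.5, -0.12),
    logical_location="plant-1/line-3",
    source_device="temperature sensor",
    device_id="sensor-0042",
    actors=("controller-7",),
    provenance="plant-1 operations",
    values={"temperature_c": 21.5},
)
print(reading.device_id, reading.values["temperature_c"])
```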

Page 39: The Central Hub: Defining the Data Lake

Spark, Storm, Flink & Kafka

- Spark has dethroned Hadoop as a platform and has momentum, both for micro-batch and streaming
- Storm provides batch and streaming (event processing) capabilities concurrently via the lambda architecture
- Flink was purpose built for streaming
- Kafka is the pipe
- Lambda and Zeta Architectures…
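For readers unfamiliar with the lambda architecture mentioned above, the toy sketch below shows its general shape in plain Python: a batch layer recomputes a view from the full history, a speed layer covers events that arrived since the last batch run, and a query merges the two. It uses no Spark, Storm, Flink or Kafka APIs, the function names and sample events are invented, and it does not attempt to illustrate the Zeta architecture.

```python
from collections import Counter
from typing import Iterable


def batch_view(master_dataset: Iterable[str]) -> Counter:
    """Batch layer: recompute counts from the full historical event set."""
    return Counter(master_dataset)


def speed_view(recent_events: Iterable[str]) -> Counter:
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(recent_events)


def query(batch: Counter, speed: Counter, key: str) -> int:
    """Serving step: merge both views so the answer includes recent events."""
    return batch[key] + speed[key]


historical = ["login", "login", "purchase"]  # already covered by the batch run
recent = ["login", "purchase", "purchase"]   # arrived since the batch run

total_logins = query(batch_view(historical), speed_view(recent), "login")
print(total_logins)  # 3
```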

Page 40: The Central Hub: Defining the Data Lake

In Summary

1  Disturbance in the Force

2  What is a Data Lake, exactly?

3  Streams and Events

Page 41: The Central Hub: Defining the Data Lake
Page 42: The Central Hub: Defining the Data Lake

Questions?

Page 43: The Central Hub: Defining the Data Lake

THANK YOU!

FIND OUT MORE at InsideAnalysis.com