Building the Enterprise Data Lake: Important Considerations Before You Jump In, December 8, 2015

Building The Enterprise Data Lake

Important Considerations Before You Jump In

December 8, 2015

Today’s Presenters

Mark Madsen, Industry Analyst, Third Nature, @markmadsen

Craig Stewart, Sr. Dir. Product Management, SnapLogic, @01Badger

Erin Curtis, Sr. Dir. Product Marketing, SnapLogic, @erncrts

Building the Enterprise Data Lake: Considerations before you jump in

December 2015, Mark Madsen, www.ThirdNature.net, @markmadsen1

What This Session Isn’t

SQL...

SQL!

SQL?

SQL

The craft model of information delivery does not scale

© Third Nature, Inc.

So we shifted to data publishing

Industrialized data delivery for self-service access.

Events and sensors are a relatively new data source

Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.

There’s lots of other new data involved

You can store this data in an RDBMS, but…

These sorts of things slow user requests down

Conclusion: any methodology built on the premise that you must know and model all the data first is untenable

Analytics embiggens data volume problems

Many of the processing problems are O(n^2) or worse, so even moderate data volumes can be a problem for scale-up platforms
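To make the O(n^2) point concrete, here is a minimal sketch that counts the work in an all-pairs comparison, a common shape for matching and similarity problems. The function names and the duplicate-matching rule are illustrative assumptions, not something taken from the slides.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def naive_duplicate_pairs(records: list[str]) -> list[tuple[str, str]]:
    """Compare every record with every other record (quadratic work)."""
    return [(a, b) for a, b in combinations(records, 2) if a.lower() == b.lower()]


if __name__ == "__main__":
    # Ten times more records means roughly a hundred times more comparisons.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} records -> {pairwise_comparisons(n):>13,} comparisons")
```

A tenfold increase in records produces roughly a hundredfold increase in comparisons, which is why data that looks moderate in storage terms can still overwhelm a single scale-up server.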

Old market says: There’s nothing wrong with what you have, just keep buying new products from us

The emerging big data market has an answer…

The data lake

Views of the lake

Is the business vs. supports the business? Application vs. infrastructure?

The naïve idea of a data lake leads to predictable results

You can’t install Hadoop and hope it solves all the problems

The answer isn’t just technology, it’s architecture

Schema

In the DW world both data and processing are bounded

No consideration for feedback loops and change

Processing only happens here

Carefully controlled access here

Nobody here creates new information

Sources few and well understood

Complex DI is controlled by IT

Schemas are few and designed

Tools are authorized, few in number and kind

One way flow

This is a monolithic, layered architecture

In the big data world flow is unbounded and continuous

Feedback loops allowed

End-of-analysis dataset may be start of a BI dataset

Continuous data integration and delivery

Files are back as both input and storage

Minimal barrier of / control on collection

Areas of provisioned data

Any shape in, rectangles out

This needs a distributed service architecture
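The "any shape in, rectangles out" idea can be illustrated with a small sketch: nested, irregular records are accepted as they arrive and are flattened into rows and columns only when a consumer asks for them. The helper names `flatten` and `to_rectangle` and the sample events are hypothetical, chosen only for illustration.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rectangle(records: list[dict[str, Any]]) -> tuple[list[str], list[list[Any]]]:
    """Return (columns, rows); fields missing from a record become None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({name for r in flat_records for name in r})
    rows = [[r.get(col) for col in columns] for r in flat_records]
    return columns, rows


# Two events with different shapes, as a sensor feed might produce.
events = [
    {"device": {"id": "d1", "type": "thermostat"}, "temp": 21.5},
    {"device": {"id": "d2"}, "humidity": 40},
]
columns, rows = to_rectangle(events)
```

The rectangle is computed at read time, so a new field in the source adds a column without requiring the collection step to change.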

Deconstructing data environments

There are three things happening in a data warehouse:
▪ Data acquisition
▪ Data management
▪ Data delivery
Isolate them from one another, allow read-write use, and you are on the path.

Data Warehouse

Data lake subsystems / components

The acquisition component allows any data to be collected at any latency. The management component allows some data to be standardized and integrated. The access component provides access at any latency and via any means an application chooses. Processing can be done to any data at any time from any area.

Data Acquisition: Collect & Store

Incremental

Batch

One-time copy

Real time

Data Lake Platform Services

Data Management: Process & Integrate

Data Access: Deliver & Use

Data storage

In reality, you are building three systems, not one. Avoid the monolith.
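One way to picture "three systems, not one" is as three small components that share only a storage service and know nothing about each other. The sketch below is an assumption-laden illustration: the class names, the in-memory `Storage` class, and the area names "raw" and "standardized" are all hypothetical stand-ins, not a design taken from the presentation.

```python
from collections import defaultdict
from typing import Callable


class Storage:
    """Shared storage service: named areas holding lists of records."""

    def __init__(self) -> None:
        self._areas: dict[str, list[dict]] = defaultdict(list)

    def append(self, area: str, record: dict) -> None:
        self._areas[area].append(dict(record))  # store a copy

    def read(self, area: str) -> list[dict]:
        return [dict(r) for r in self._areas[area]]  # hand out copies


class Acquisition:
    """Collect & store: accepts any record and performs no modeling."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def collect(self, record: dict) -> None:
        self.storage.append("raw", record)


class Management:
    """Process & integrate: standardizes raw data into a separate area."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def standardize(self, transform: Callable[[dict], dict]) -> None:
        # Sketch only: reprocesses the whole raw area on every call.
        for record in self.storage.read("raw"):
            self.storage.append("standardized", transform(record))


class Access:
    """Deliver & use: reads from any area, not only the managed one."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def query(self, area: str, predicate: Callable[[dict], bool]) -> list[dict]:
        return [r for r in self.storage.read(area) if predicate(r)]


storage = Storage()
Acquisition(storage).collect({"Temp": "21.5"})
Management(storage).standardize(lambda r: {"temp": float(r["Temp"])})
access = Access(storage)
```

Because each component depends only on the storage interface, any one of them could be replaced or scaled separately, and `Access` can still reach the raw area when the standardized form is not what a user needs.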

Data lake functions depend on platform services

Base Platform Services

Data Movement, Metadata, Data Persistence

Workflow Management

Processing Engines, Dataflow Services

Data Curation, Data Access Services

Data Acquisition: Collect & Store

Data Management: Process & Integrate

Data Access: Deliver & Use

Platform services needed

DATA ARCHITECTURE

We’re so focused on the light switch that we’re not talking about the light

Decouple the Data Architecture

The core of the data lake isn’t a database or HDFS, it’s the data architecture that the tools implement. We need a data architecture that is not limiting:
▪ Deals with change easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from streaming to one-time bulk

Food supply chain: an analogy for data

Multiple contexts of use, differing quality levels

You need to keep the original because, just like baking, you can’t unmake dough once it’s mixed.
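Keeping the original could look like the following minimal sketch of a write-once raw area on a local file system. The function name `store_raw` and the directory layout (files named by their SHA-256 digest) are assumptions made for illustration; a real lake would more likely sit on HDFS or an object store, and this sketch does not guard against partially written files.

```python
import hashlib
from pathlib import Path


def store_raw(root: Path, payload: bytes) -> Path:
    """Write payload once under its SHA-256 digest; never overwrite."""
    digest = hashlib.sha256(payload).hexdigest()
    path = root / digest[:2] / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "xb" fails if the file already exists, so stored data is never replaced.
        with open(path, "xb") as handle:
            handle.write(payload)
    except FileExistsError:
        pass  # identical content was already stored
    return path
```

Downstream cleaning and integration steps would read from this area and write their results elsewhere, so any derived data set can be rebuilt from the untouched source.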

Data architecture is required by the services, and vice versa

Raw data in an immutable storage area

Standardized or enhanced data

Common or usage-specific data

Transient data

Data Acquisition: Collect & Store

Platform Services

Data Access: Deliver & Use

Data Management: Process & Integrate

The data areas map (mostly) to functional areas of the lake

Collection can’t be limited by database scale and latency. Immutability, persistence and concurrency are required.

Incremental

Collect

Batch

One-time copy

Real time

Manage & Integrate

Process, Deliver, Use

Stages, not layers

Some tools require specific repositories or models. Others can reach in to get what they need. Do not enforce a single access point or model.

The geography has been redefined

The box IT created:
• not any data, rigidly typed data
• not any form, tabular rows and columns of typed data
• not any latency, persist what the DB can keep up with
• not any process, only queries
The digital world was diminished to only what’s inside the box until we forgot the box was there.

Layered data architecture

The DW assumed a single flat model of data, DB in the center. The data lake enables new ways to organize data:
▪ Raw – straight from the source
▪ Enhanced – cleaned, standardized
▪ Integrated – modeled, augmented, ~semi-persistent
▪ Derived – analytic output, pattern based sets, ephemeral

Implies a new technology architecture and data modeling approaches.
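As a rough illustration of the four data areas, the sketch below moves a few sensor readings from raw to enhanced to integrated to derived. The sample readings, the `sensor_sites` reference table, and the function names are hypothetical; they show only how each area adds something while the raw records stay unchanged.

```python
from statistics import mean

# Raw: kept exactly as received from the (hypothetical) source.
raw = [
    {"sensor": " T-1 ", "temp_f": "68.0"},
    {"sensor": "t-1", "temp_f": "71.6"},
    {"sensor": "T-2", "temp_f": "50.0"},
]

# Hypothetical reference data used to augment readings in the integrated area.
sensor_sites = {"T-1": "warehouse", "T-2": "loading dock"}


def enhance(record: dict) -> dict:
    """Enhanced: cleaned and standardized (trimmed IDs, numeric Celsius)."""
    temp_c = (float(record["temp_f"]) - 32) * 5 / 9
    return {"sensor": record["sensor"].strip().upper(), "temp_c": round(temp_c, 1)}


def integrate(record: dict) -> dict:
    """Integrated: modeled and augmented with reference data."""
    return {**record, "site": sensor_sites.get(record["sensor"], "unknown")}


def derive(records: list[dict]) -> dict[str, float]:
    """Derived: analytic output, here the mean temperature per site."""
    by_site: dict[str, list[float]] = {}
    for record in records:
        by_site.setdefault(record["site"], []).append(record["temp_c"])
    return {site: round(mean(temps), 1) for site, temps in by_site.items()}


enhanced = [enhance(r) for r in raw]
integrated = [integrate(r) for r in enhanced]
derived = derive(integrated)
```

Each step returns new records instead of editing its input, so the derived set is ephemeral in the sense the slide describes: it can be dropped and recomputed from the raw area at any time.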

The data lake enables evolutionary design for data

Evolutionary design is required because data needs change. You need a system not for stability – we have that in the DW – but for evolution and change: the data lake.

Data Acquisition: Collect & Store

Incremental

Batch

One-time copy

Real time

Data Lake Platform Services

Data Management: Process & Integrate

Data Access: Deliver & Use

Data storage

You can’t build this all at once. You need to grow it over time.

Away from “one throat to choke”, back to best of breed

Tight coupling leads to efficient reuse and standardization, and to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling. Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.

Questions?

“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand

CC Image Attributions

Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721

About the Presenter

Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

About Third Nature

Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.

Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.

About SnapLogic

Anything: apps | APIs | things | data

Anytime: batch | streaming | real-time

Anywhere: on premises | in the cloud

SnapLogic helps enterprises connect data and applications faster

Modern Architecture: Hybrid and Elastic

Streams: No data is stored/cached
Secure: 100% standards-based
Elastic: Scales out & handles data and app integration use cases

Metadata

Data

Databases

On Prem Apps

Big Data

Cloud Apps and Data

Cloud-Based Designer, Manager, Dashboard

Cloudplex

Groundplex

Hadooplex Sparkplex

Firewall

Data Lake

Data Acquisition

Collect and integrate data from multiple sources

HDFS, AWS S3, MS Azure Blob

Data Management

Add information and improve data

Spark, Python, Scala, Java, R, Pig

Data Access

Organize and prepare data for visualization

HDFS, AWS S3, MS Azure Blob, Hive

On Prem Apps and Data

• ERP • CRM • RDBMS

Cloud Apps and Data

• CRM • HCM • Social

IoT Data

• Sensors • Wearables • Devices

Lakeshore Data Mart

• MS Azure • AWS Redshift • …

BI / Analytics

• Tableau • MS PowerBI / Azure • AWS QuickSight

Batch

Streaming

Real-time

Schedule and manage: Oozie, Ambari

Kafka, Sqoop, Flume

Ingest, Prepare, Deliver

Impala, HiveSQL, SparkSQL

The Modern Data Lake Powered by SnapLogic

Data Acquisition

Collect and integrate data from multiple sources

SnapLogic pipelines with standard mode execution

Data Management

Sort, Aggregate, Join, Merge, Transform

SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines

Data Access

Organize and prepare data for visualization

SnapLogic pipelines with standard mode execution

On Prem Apps and Data

• ERP • CRM • RDBMS

Cloud Apps and Data

• CRM • HCM • Social

IoT Data

• Sensors • Wearables • Devices

Lakeshore Data Mart

• MS Azure • AWS Redshift • …

BI / Analytics

• Tableau • MS PowerBI / Azure • AWS QuickSight

Batch

Streaming

Real-time

Schedule and manage: SnapLogic

SnapLogic Pipelines

Ingest, Prepare, Deliver

SnapLogic Pipelines

Thank You

Watch SnapLogic in action:

video/snaplogic.com

Contact us: [email protected]

Follow us on Twitter: @SnapLogic