BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID Paul Evans principal architect daystrom technology group [email protected] san jose 2014 ceph days

Ceph Days 2014 Paul Evans Slide Deck


DESCRIPTION

Ceph Days held in October 2014 at Brocade headquarters in Silicon Valley.


Page 1: Ceph Days 2014 Paul Evans Slide Deck

BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID

Paul Evans principal architect

daystrom technology group [email protected]

san jose 2014

ceph days

Page 2

Why build a data grid (or data lake)?

…because we have a data FLOOD in process

Page 3

indeed, we love data…

we’re good at generating more and more, but…

( we never seem to throw any of it out )

too FAST

too many VARIANTS

too MUCH

Page 4

IS THE ANSWER TO ALL OF THIS… “WE NEED LESS DATA!”

are you crazy? we live to store things!

we just need better tools… (and more storage)

Page 5

DATA AUTOMATION

Workflow Automation

Wildly-Scalable Storage

Data Lake Data Grid

STACK

Page 6

DATA LAKE

“a storage repository that holds a vast amount of raw data in its native format until it is needed”

Page 7

DATA LAKE - ORIGINS

First use credited to James Dixon, CTO at Pentaho, circa 2010

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…”

“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Page 8

DATA LAKE - EXPLAINED

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
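
To make the flat layout above concrete, here is a minimal in-memory sketch. It assumes nothing about any real product: the `FlatLake` class, its `put`/`query`/`get` methods, and the tag names are all hypothetical and chosen only for illustration. Each item is stored raw, given a unique identifier, and tagged; a "business question" becomes a tag query that returns a smaller set of items to analyze.

```python
import uuid


class FlatLake:
    """Toy data lake (hypothetical): a flat map of id -> (raw bytes, tags)."""

    def __init__(self):
        self._objects = {}

    def put(self, raw, **tags):
        """Store raw data unchanged and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            oid
            for oid, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        """Return the raw bytes exactly as they were stored."""
        return self._objects[object_id][0]


lake = FlatLake()
a = lake.put(b"ts,temp\n1,20.5\n", source="sensor", fmt="csv")
b = lake.put(b'{"user": "x"}', source="weblog", fmt="json")
c = lake.put(b"ts,temp\n2,21.0\n", source="sensor", fmt="csv")

# The question "what did the sensors report?" narrows the lake to two items.
sensor_ids = lake.query(source="sensor")
```

There are no folders anywhere in the sketch: the identifier and the tags are the only way back to an item, so an untagged item can be fetched by id but never found by a query.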

Page 9

DATA LAKE - WHY ???

?

Page 10

DATA LAKE CHARACTER

Unwashed Data: schema-on-read from RAW source

Flexible Processing: batch, interactive, online, search

MetaData Dependent: tag it or lose it

Common Access: hdfs-centric toolset

…in other words: this is not a glass-house Data Mart
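
A small sketch of what "schema-on-read" can look like in practice. The sample records, the `read_with_schema` helper, and the schema layout are assumptions made up for this example; the point is only that the raw lines are stored untouched and a schema is applied when somebody reads them.

```python
import csv
import io

# Raw records kept exactly as they arrived ("unwashed"), including a bad row.
RAW_RECORDS = [
    "2014-10-01,sensor-a,20.5",
    "2014-10-01,sensor-b,not-a-number",
    "2014-10-02,sensor-a,21.0",
]


def read_with_schema(raw_lines, schema):
    """Apply (name, converter) pairs to each raw line at read time.

    Rows whose values do not fit the schema are skipped, not rewritten.
    """
    for row in csv.reader(io.StringIO("\n".join(raw_lines))):
        try:
            yield {name: conv(value) for (name, conv), value in zip(schema, row)}
        except ValueError:
            continue


# Two readers, two schemas, one unchanged raw source.
temperature_schema = [("day", str), ("sensor", str), ("temp_c", float)]
inventory_schema = [("day", str), ("sensor", str), ("payload", str)]

readings = list(read_with_schema(RAW_RECORDS, temperature_schema))
inventory = list(read_with_schema(RAW_RECORDS, inventory_schema))
```

The temperature reader silently drops the row it cannot parse, while the inventory reader still sees all three rows, because the cleanup decision lives in the reader and not in the store.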

Page 11

A REFERENCE ‘LAKE’ ARCHITECTURE

OPERATIONS | SECURITY | DATA ACCESS | GOVERNANCE | INTEGRATION

DATA MANAGEMENT

Page 12

A CEPHALOPOD IN THE LAKE?

If this is important… / Use this…

Hadoop-native: HDFS

Locality-aware: HDFS

Distributed Name Svc: Ceph

Native Erasure Coding: Ceph

20% Faster *: Ceph

* on Terasort benchmark over IB, Mar 2014

Page 13

(LAKE) DREDGERS

technology group

Page 14

DATA GRID

“the unifying layer to how content and data are stored, protected, located and accessed”

Page 15

DATA GRID - ORIGINS

The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.

Page 16

DATA GRID - EXPLAINED

Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport.

Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
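
One way to picture the 'micro-service' idea is as small policy functions registered against events and run by a control plane. The sketch below is a single-process toy under stated assumptions: `DataObject`, `ControlPlane`, the event name `"put"`, the 1 MB threshold, and both rules are hypothetical, and nothing here is distributed or taken from a real grid product.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Hypothetical grid entry: a logical path, a size, and metadata."""
    path: str
    size_bytes: int
    metadata: dict = field(default_factory=dict)


def require_owner(obj):
    """Governance rule: every object carries an owner tag."""
    obj.metadata.setdefault("owner", "unassigned")


def tag_large_files(obj):
    """Placement rule: mark objects over 1 MB for a colder tier."""
    if obj.size_bytes > 1_000_000:
        obj.metadata["tier"] = "cold"


class ControlPlane:
    """Runs every rule registered for an event, in registration order."""

    def __init__(self):
        self._rules = {}

    def on(self, event, rule):
        self._rules.setdefault(event, []).append(rule)

    def fire(self, event, obj):
        for rule in self._rules.get(event, []):
            rule(obj)
        return obj


plane = ControlPlane()
plane.on("put", require_owner)
plane.on("put", tag_large_files)

big = plane.fire("put", DataObject("/grid/run42.dat", 5_000_000))
small = plane.fire("put", DataObject("/grid/notes.txt", 1_200, {"owner": "paul"}))
```

Because the rules see only the object and its metadata, the same policy applies no matter which storage medium sits underneath, which is the property the slide attributes to a grid.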

Page 17

DATA GRID - WHY ???

Page 18

DATA GRID - ATTRIBUTES

Data Virtualization: common presentation of all content

Universe-size Namespace: for files, objects & metadata

Automation of Data Operations: distributed, scalable

Policy Mgmt/Reporting: data valuation & action triggers

Page 19

CEPH MEETS GRID

implemented: CephFS & RBD, Ceph libRADOS

[Diagram: DATA GRID unified namespace. Tier labels: HiSpeed Tier, Cold Storage, Archive, Remote Cloud. Link labels: Direct Link, LIBRADOS + Ceph (shown twice), RBD.]

Page 20

GRID IRON ALL-STARS

technology group

(Dan Bedard: [email protected])

Page 21

TIME 2 SUMMARIZE…

We are in the midst of a Data Explosion

We need robust, expandable, yet simple solutions to store data

We also need effective, de-centralized ways to care for the data

Page 22

DATA AUTOMATION

STACK

Workflow Automation

Wildly-Scalable Storage

Ceph+

the SMART approach

Data Lake Data Grid

Page 23

thank you!

Paul Evans principal architect

[email protected]

technology group

san jose ceph days