Freeing Data Storage from Cages

Alekh Jindal, Jorge Quiané, Jens Dittrich

Saarland University, Germany; QCRI, Qatar

WWHow! Freeing Data Storage from Cages

Alekh Jindal*, Jorge-Arnulfo Quiané-Ruiz†,⌥, Jens Dittrich*

* Information Systems Group, Saarland University, http://infosys.cs.uni-saarland.de
† QCRI, Qatar Foundation, http://www.qcri.qa

Abstract. Efficient data storage is a key component of data managing systems to achieve good performance. However, currently data storage is either heavily constrained by static decisions (e.g. fixed data stores in DBMSs) or left to be tuned and configured by users (e.g. manual data backup in File Systems). In this paper, we take a holistic view of data storage and envision a virtual storage layer. Our virtual storage layer provides a unified storage framework for several use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTION

To better understand the problem, let us first see some of the typical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data Storage

We start from the simplest use case of a personal user using a File System Storage, right up to an enterprise user using Cloud Storage.

Use Case 1: File System. Consider a researcher, Alice, having two laptops (one personal, one from her university). Alice stores different types of files at different locations. For example, she stores movies on her personal laptop and grant proposals on her university laptop. Alice also maintains regular copies of her data on external devices, in case one of her laptops crashes. For instance, she maintains copies of grant proposals on hard-drives and recent conference talks on USB sticks. Finally, Alice also changes the data format (e.g. compressed) of her files in order to save storage space.

Use Case 2: RAID. Consider a University IT department using RAID storage servers to store its data. The RAID system automatically stores parity information on different disks and stripes them to allow for one or more disk failures. The server admin does not need to worry about the reliability of data. She just needs to choose the RAID level for the first time. However, changing the RAID level is often a complex task in practice.

Use Case 3: Relational DBMS. Consider a car manufacturer using an RDBMS for managing its inventory and sales. The manufacturer provides the schema definition as well as the data to store. The RDBMS takes care of physical data storage, recovery, and data layouts (e.g. row store in IBM DB2). For an expert DBA, the RDBMS provides several additional data tuning knobs such as replication, indexing, partitioning, and storage locations.

Use Case 4: Cloud. Consider a web analytics start-up company using Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However, it does not need to worry about the data placement. The Cloud Storage automatically replicates and distributes data over several storage locations in order to guarantee the availability of their data. Furthermore, the start-up company does not care about adding new storage locations: the Cloud Storage does so automatically. Additionally, the company may choose to store the data in multiple data layouts [10] or indexes [5].

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2013. 6th Biennial Conference on Innovative Data Systems Research (CIDR '13), January 6-9, 2013, Asilomar, California, USA.

All of these use-cases are very different, right? No! We argue that they are all facets of the same single data storage problem. We observe that in each of the four Use Cases, the users are implicitly or explicitly answering the following three key questions: (1) What to store? (i.e. which parts to store from a dataset or file and how many times), (2) Where to store? (i.e. on which devices or locations), and (3) How to store? (i.e. in which layout or format). Table 1 categorizes these storage decisions in the four Use Cases and labels them according to whether they use fixed, flexible & manual, or flexible & automatic storage decisions.

What to store? The second column of Table 1 shows the what decisions in the four Use Cases. In Use Case 1 (UC1 for short), the file system requires Alice to make manual choices of which master copies as well as which backup copies of data to store. In UC2, RAID automatically creates data replicas or parity, depending on the RAID level. In UC3, an RDBMS typically stores the data as well as the recovery log. A DBA has further control to create indexes and materialized views to speed up query execution. Similarly, Fractured Mirrors makes a fixed decision of storing exactly two copies of the data. Finally, in UC4, Cloud Storage is flexible to create one or more data replicas for availability.

Where to store? The third column of Table 1 shows the where decisions in the four Use Cases. In UC1, Alice has to manually decide where to store each type of data, e.g. movies on her personal laptop and grant proposals on her university laptop. Going further, RAID (UC2) makes fixed decisions (based on the RAID level) to store data on a set of storage devices, which can be added or removed manually. Similarly, an RDBMS (UC3) allows users to define table spaces, typically residing on different storage locations, for their application. However, Fractured Mirrors stores the data on two (fixed) different machines. Cloud Storage (UC4), on the other hand, is fully flexible on where to store the data. A cloud provider can change data storage locations and even provision additional physical machines automatically.

How to store? The fourth column of Table 1 shows the how decisions in the four Use Cases. Again, users in UC1 must manually












Data Storage is Easy!

[Figure: raw data, a data management system, and queries.]

Is performance as expected?

Is your Data Storage Efficient?

What to store?

[Figure: TPC-H Lineitem stored as copy 1, copy 2, copy 3.]

Where to store?

[Figure: TPC-H Lineitem.]

How to store?

[Figure: TPC-H Lineitem.]
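The three questions above can be read as three views over one storage plan. The following is a minimal sketch of that reading, not the authors' design: the class names `Copy` and `StoragePlan`, the locations (`node-a`, ...), and the layouts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str      # what: one stored copy of the dataset
    location: str  # where: device or host (hypothetical values below)
    layout: str    # how: layout or format (hypothetical values below)


@dataclass(frozen=True)
class StoragePlan:
    dataset: str
    copies: tuple

    def what(self):
        """Which copies of the dataset are kept."""
        return [c.name for c in self.copies]

    def where(self):
        """Where each copy is placed."""
        return {c.name: c.location for c in self.copies}

    def how(self):
        """In which layout each copy is stored."""
        return {c.name: c.layout for c in self.copies}


plan = StoragePlan(
    dataset="TPC-H Lineitem",
    copies=(
        Copy("copy 1", "node-a", "row"),
        Copy("copy 2", "node-b", "column"),
        Copy("copy 3", "node-c", "compressed"),
    ),
)

print(plan.what())
print(plan.where())
print(plan.how())
```

Keeping all three answers on the same record makes it explicit that a decision about one copy (for example its layout) is independent of the decisions made for the other copies.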

What, Where, and How in our everyday life

[Figure: ppt and pdf copies of files.]

However

[Figure: What, Where, How. TPC-H Lineitem inside a data management system (DBMS-X) whose store engine uses a row layout, shown next to pdf and ppt files.]

WWHow! (What, Where, How)

WWHow! Vision
1. Flexible on What, Where, and How
2. Declarative Data Storage Language
3. Data Storage Optimizer

[Figure: WWHow! Layer. Labels: Data Management System, WWHow! Language (DSL), Logical Data View, Physical Data View, Physical Storage Interface.]

What can we do?

I want to store 3 replicas of the weblogs on the cloud with a different index on each replica

STORE '/webApp/logs/UserVisits.log'
WHAT * AS rep-1, * AS rep-2, * AS rep-3
WHERE ec2-007-23-compute-1-amazonaws.com
HOW idx(url) FOR rep-1, idx(sourceIP) FOR rep-2, idx(visitDate) FOR rep-3;
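To make the clause structure of the statement above concrete, here is a minimal sketch that splits such a statement into its path and its WHAT, WHERE, and HOW parts. The grammar is an assumption inferred from this single slide example (straight quotes around the path, comma-separated items, a trailing semicolon); the function name `parse_store` is hypothetical and this is not the WWHow! language interpreter.

```python
import re


def parse_store(statement: str) -> dict:
    """Split a STORE statement into its path and WHAT/WHERE/HOW items."""
    text = statement.strip().rstrip(";").strip()
    match = re.match(r"STORE\s+'([^']+)'\s*(.*)$", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("expected: STORE '<path>' ...")
    parsed = {"path": match.group(1)}
    # re.split with a capturing group keeps the keywords:
    # [prefix, keyword, body, keyword, body, ...]
    parts = re.split(r"\b(WHAT|WHERE|HOW)\b", match.group(2))
    for keyword, body in zip(parts[1::2], parts[2::2]):
        items = [item.strip() for item in body.split(",")]
        parsed[keyword.lower()] = [item for item in items if item]
    return parsed


EXAMPLE = """
STORE '/webApp/logs/UserVisits.log'
WHAT * AS rep-1, * AS rep-2, * AS rep-3
WHERE ec2-007-23-compute-1-amazonaws.com
HOW idx(url) FOR rep-1, idx(sourceIP) FOR rep-2, idx(visitDate) FOR rep-3;
"""

parsed = parse_store(EXAMPLE)
for clause, value in parsed.items():
    print(clause, "->", value)
```

A clause that is left out of the statement simply does not appear in the result, which is one way to represent a decision that is handed over to an optimizer.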


I want to store 3 replicas of the weblogs with a different index on each replica and with high privacy

STORE '/webApp/logs/UserVisits.log'
WHAT * AS rep-1, * AS rep-2, * AS rep-3
HOW idx(url) FOR rep-1, idx(sourceIP) FOR rep-2, idx(visitDate) FOR rep-3
CONSTRAINT Privacy = 'high';

Job for the WWHow! data storage optimizer

I want to store the weblogs with high privacy and, if possible, with high availability

STORE '/webApp/logs/UserVisits.log'
PREFERENCE Availability = 'high'
CONSTRAINT Privacy = 'high'

Job for the WWHow! data storage optimizer
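The difference between CONSTRAINT and PREFERENCE in the statement above can be sketched as a filter followed by a ranking. This is an illustration under stated assumptions, not the WWHow! data storage optimizer: the candidate plans, their property levels, the three-level scale, and the reading of `= 'high'` as "at least high" are all hypothetical.

```python
from dataclasses import dataclass

# Assumed ordinal scale for property levels (hypothetical).
LEVELS = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Candidate:
    name: str
    privacy: str
    availability: str


def level(candidate, prop):
    """Numeric level of a property such as 'Privacy' on a candidate."""
    return LEVELS[getattr(candidate, prop.lower())]


def choose(candidates, constraints, preferences):
    """Constraints are hard filters; preferences only rank what remains."""
    feasible = [
        c for c in candidates
        if all(level(c, p) >= LEVELS[v] for p, v in constraints.items())
    ]
    if not feasible:
        raise ValueError("no candidate satisfies the constraints")

    def score(c):
        # Credit for a preference is capped at the requested level.
        return sum(min(level(c, p), LEVELS[v]) for p, v in preferences.items())

    return max(feasible, key=score)


candidates = [
    Candidate("public cloud, 3 replicas", privacy="low", availability="high"),
    Candidate("private cluster, 1 replica", privacy="high", availability="low"),
    Candidate("private cluster, 3 replicas", privacy="high", availability="high"),
]

best = choose(candidates, {"Privacy": "high"}, {"Availability": "high"})
fallback = choose(candidates[:2], {"Privacy": "high"}, {"Availability": "high"})
print(best.name)
print(fallback.name)
```

In the second call the highly available candidate violates the privacy constraint, so the sketch falls back to the private single-replica plan: the preference is dropped, the constraint is not.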

Euh! What about existing technology?

Where do we go?

[Figure: TPC-H Lineitem in a data management system whose store engine uses a row layout; then TPC-H Lineitem in the WWHow! Layer in row layout and in column layout, with data management systems DMS-X and DMS-Y.]
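The slides contrast a row layout with a column layout for TPC-H Lineitem. As a reminder of what that choice means, the sketch below converts a few records between the two layouts. The column names are taken from the TPC-H Lineitem schema; the values and the helper names `to_columns` and `to_rows` are hypothetical, and real systems store these layouts on disk rather than in Python containers.

```python
def to_columns(rows):
    """Row layout (list of records) -> column layout (one list per attribute)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns):
    """Column layout -> row layout; assumes all columns have equal length."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical sample values for three Lineitem attributes.
lineitem_rows = [
    {"l_orderkey": 1, "l_quantity": 17, "l_extendedprice": 21168.23},
    {"l_orderkey": 1, "l_quantity": 36, "l_extendedprice": 45983.16},
    {"l_orderkey": 2, "l_quantity": 8, "l_extendedprice": 13309.60},
]

lineitem_columns = to_columns(lineitem_rows)

# A query touching one attribute reads a single list in the column layout.
total_quantity = sum(lineitem_columns["l_quantity"])
print(total_quantity)
```

Both layouts hold the same logical data, which is what lets a storage layer switch between them without changing what the data management system sees.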

How to proceed?

[Figure: WWHow! Layer architecture, built up step by step. Labels: Data Management System, API (Logical View), WWHow! Language, WWHow! Language Interpreter, WWHow! Controller (storage themes), WWHow! Optimizer (What, Where, How), Data Storage Optimizer, Data Access Optimizer, Logical Data View, Physical Data View.]