71
Freeing Data Storage from Cages Alekh Jindal Jorge Quiané Jens Dittrich Saarland University, Germany QCRI, Qatar _ _ F F F

Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Freeing Data Storage from Cages

Alekh Jindal Jorge Quiané Jens Dittrich

Saarland University, GermanyQCRI, Qatar

WWHow! Freeing Data Storage from Cages

Alekh Jindal

FJorge-Arnulfo Quiané-Ruiz

_,⌥ Jens Dittrich

F

FInformation Systems Group, Saarland University

http://infosys.cs.uni-saarland.de

_QCRI, Qatar Foundation

http://www.qcri.qa

Abstract. E�cient data storage is a key component of data manag-ing systems to achieve good performance. However, currently datastorage is either heavily constrained by static decisions (e.g. fixeddata stores in DBMSs) or left to be tuned and configured by users(e.g. manual data backup in File Systems). In this paper, we takea holistic view of data storage and envision a virtual storage layer.Our virtual storage layer provides a unified storage framework forseveral use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTIONTo better understand the problem, let us first see some of the typ-

ical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data StorageWe start from the simplest use case of a personal user using a File

System Storage, right up to an enterprise user using Cloud Storage.Use Case 1: File System. Consider a researcher, Alice, having twolaptops (one personal, one from her university). Alice stores di↵er-ent types of files at di↵erent locations. For example, she storesmovies on her personal laptop and grant proposals on her univer-sity laptop. Alice also maintains regular copies of her data on ex-ternal devices, in case one of her laptop crashes. For instance, shemaintains copies of grant proposals on hard-drives and recent con-ference talks on USB sticks. Finally, Alice also changes the dataformat (e.g. compressed) of her files in order to save storage space.Use Case 2: RAID. Consider a University IT department usingRAID storage servers to store its data. The RAID system automat-ically stores parity information on di↵erent disks and stripes themto allow for one or more disk failures. The server admin does notneed to worry about the reliability of data. She just needs to choosethe RAID level for the first time. However, changing the RAIDlevel is often a complex task in practice.Use Case 3: Relational DBMS. Consider a car manufacturer us-ing an RDBMS for managing its inventory and sales. The manufac-turer provides the schema definition as well as the data to store. TheRDBMS takes care of physical data storage, recovery, and data lay-outs (e.g. row store in IBM DB2). For an expert DBA, the RDBMS

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2013.6th Biennial Conference on Innovative Data Systems Research (CIDR ’13)January 6-9, 2013, Asilomar, California, USA.

provides several additional data tunings knobs such as replication,indexing, partitioning, and storage locations.Use Case 4: Cloud. Consider a web analytics start-up companyusing Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However,it does not need to worry about the data placement. The CloudStorage automatically replicates and distributes data over severalstorage locations in order to guarantee the availability of their data.Furthermore, the start-up company does not care about adding newstorage locations: the Cloud Storage does so automatically. Addi-tionally, the company may chose to store the data in multiple datalayouts [10] or indexes [5].

All of these use-cases are very di↵erent, right? No! We arguethat they are all facets of the same single data storage problem.We observe that in each of the four Use Cases, the users are im-plicitly or explicitly answering the following three key questions:(1) What to store? (i.e. which parts to store from a dataset or fileand how many times), (2) Where to store? (i.e. on which devices orlocations), and (3) How to store? (i.e. in which layout or format).Table 1 categorizes these storage decisions in the four Use Casesand labels them according to whether they use fixed ( ), flexible &manual ( ), or flexible & automatic ( ) storage decisions.What to store? The second column of Table 1 shows the what de-cisions in the four Use Cases. In Use Case 1 (UC1 for short), thefile system requires Alice to make manual choices of which mas-ter copies as well as the backup copies of data to store. In UC2,RAID automatically creates data replicas or parity, depending onthe RAID level. In UC3, an RDBMS typically stores the data aswell as the recovery log. A DBA has further control to create in-dexes and materialized views to speed up query execution. Simi-larly, Fractured Mirrors makes a fixed decision of storing exactlytwo copies of the data. Finally, in UC4, Cloud Storage is flexibleto create one or more data replicas for availability.Where to store? The third column of Table 1 shows the wheredecisions in the four Use Cases. In UC1, Alice has to manuallydecide where to store each type of data, e.g. movies on her per-sonal laptop and grant proposals on her university laptop. Goingfurther, RAID (UC2) makes fixed decisions (based on the RAIDlevel) to store data on a set of storage devices, which can be addedor removed manually. Similarly, an RDBMS (UC3) allows usersto define table spaces, typically residing on di↵erent storage loca-tions, for their application. However, Fractured Mirrors stores thedata on two (fixed) di↵erent machines. Cloud Storage (UC4), onthe other hand, is fully flexible on where to store the data. A cloudprovider can change data storage locations and even provision ad-ditional physical machines automatically.How to store? The fourth column of Table 1 shows the how deci-sions in the four Use Cases. Again, users in UC1 must manually

WWHow! Freeing Data Storage from Cages

Alekh Jindal

FJorge-Arnulfo Quiané-Ruiz

_,⌥ Jens Dittrich

F

FInformation Systems Group, Saarland University

http://infosys.cs.uni-saarland.de

_QCRI, Qatar Foundation

http://www.qcri.qa

Abstract. E�cient data storage is a key component of data manag-ing systems to achieve good performance. However, currently datastorage is either heavily constrained by static decisions (e.g. fixeddata stores in DBMSs) or left to be tuned and configured by users(e.g. manual data backup in File Systems). In this paper, we takea holistic view of data storage and envision a virtual storage layer.Our virtual storage layer provides a unified storage framework forseveral use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTIONTo better understand the problem, let us first see some of the typ-

ical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data StorageWe start from the simplest use case of a personal user using a File

System Storage, right up to an enterprise user using Cloud Storage.Use Case 1: File System. Consider a researcher, Alice, having twolaptops (one personal, one from her university). Alice stores di↵er-ent types of files at di↵erent locations. For example, she storesmovies on her personal laptop and grant proposals on her univer-sity laptop. Alice also maintains regular copies of her data on ex-ternal devices, in case one of her laptop crashes. For instance, shemaintains copies of grant proposals on hard-drives and recent con-ference talks on USB sticks. Finally, Alice also changes the dataformat (e.g. compressed) of her files in order to save storage space.Use Case 2: RAID. Consider a University IT department usingRAID storage servers to store its data. The RAID system automat-ically stores parity information on di↵erent disks and stripes themto allow for one or more disk failures. The server admin does notneed to worry about the reliability of data. She just needs to choosethe RAID level for the first time. However, changing the RAIDlevel is often a complex task in practice.Use Case 3: Relational DBMS. Consider a car manufacturer us-ing an RDBMS for managing its inventory and sales. The manufac-turer provides the schema definition as well as the data to store. TheRDBMS takes care of physical data storage, recovery, and data lay-outs (e.g. row store in IBM DB2). For an expert DBA, the RDBMS

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2013.6th Biennial Conference on Innovative Data Systems Research (CIDR ’13)January 6-9, 2013, Asilomar, California, USA.

provides several additional data tunings knobs such as replication,indexing, partitioning, and storage locations.Use Case 4: Cloud. Consider a web analytics start-up companyusing Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However,it does not need to worry about the data placement. The CloudStorage automatically replicates and distributes data over severalstorage locations in order to guarantee the availability of their data.Furthermore, the start-up company does not care about adding newstorage locations: the Cloud Storage does so automatically. Addi-tionally, the company may chose to store the data in multiple datalayouts [10] or indexes [5].

All of these use-cases are very di↵erent, right? No! We arguethat they are all facets of the same single data storage problem.We observe that in each of the four Use Cases, the users are im-plicitly or explicitly answering the following three key questions:(1) What to store? (i.e. which parts to store from a dataset or fileand how many times), (2) Where to store? (i.e. on which devices orlocations), and (3) How to store? (i.e. in which layout or format).Table 1 categorizes these storage decisions in the four Use Casesand labels them according to whether they use fixed ( ), flexible &manual ( ), or flexible & automatic ( ) storage decisions.What to store? The second column of Table 1 shows the what de-cisions in the four Use Cases. In Use Case 1 (UC1 for short), thefile system requires Alice to make manual choices of which mas-ter copies as well as the backup copies of data to store. In UC2,RAID automatically creates data replicas or parity, depending onthe RAID level. In UC3, an RDBMS typically stores the data aswell as the recovery log. A DBA has further control to create in-dexes and materialized views to speed up query execution. Simi-larly, Fractured Mirrors makes a fixed decision of storing exactlytwo copies of the data. Finally, in UC4, Cloud Storage is flexibleto create one or more data replicas for availability.Where to store? The third column of Table 1 shows the wheredecisions in the four Use Cases. In UC1, Alice has to manuallydecide where to store each type of data, e.g. movies on her per-sonal laptop and grant proposals on her university laptop. Goingfurther, RAID (UC2) makes fixed decisions (based on the RAIDlevel) to store data on a set of storage devices, which can be addedor removed manually. Similarly, an RDBMS (UC3) allows usersto define table spaces, typically residing on di↵erent storage loca-tions, for their application. However, Fractured Mirrors stores thedata on two (fixed) di↵erent machines. Cloud Storage (UC4), onthe other hand, is fully flexible on where to store the data. A cloudprovider can change data storage locations and even provision ad-ditional physical machines automatically.How to store? The fourth column of Table 1 shows the how deci-sions in the four Use Cases. Again, users in UC1 must manually

WWHow! Freeing Data Storage from Cages

Alekh Jindal

FJorge-Arnulfo Quiané-Ruiz

_,⌥ Jens Dittrich

F

FInformation Systems Group, Saarland University

http://infosys.cs.uni-saarland.de

_QCRI, Qatar Foundation

http://www.qcri.qa

Abstract. E�cient data storage is a key component of data manag-ing systems to achieve good performance. However, currently datastorage is either heavily constrained by static decisions (e.g. fixeddata stores in DBMSs) or left to be tuned and configured by users(e.g. manual data backup in File Systems). In this paper, we takea holistic view of data storage and envision a virtual storage layer.Our virtual storage layer provides a unified storage framework forseveral use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTIONTo better understand the problem, let us first see some of the typ-

ical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data StorageWe start from the simplest use case of a personal user using a File

System Storage, right up to an enterprise user using Cloud Storage.Use Case 1: File System. Consider a researcher, Alice, having twolaptops (one personal, one from her university). Alice stores di↵er-ent types of files at di↵erent locations. For example, she storesmovies on her personal laptop and grant proposals on her univer-sity laptop. Alice also maintains regular copies of her data on ex-ternal devices, in case one of her laptop crashes. For instance, shemaintains copies of grant proposals on hard-drives and recent con-ference talks on USB sticks. Finally, Alice also changes the dataformat (e.g. compressed) of her files in order to save storage space.Use Case 2: RAID. Consider a University IT department usingRAID storage servers to store its data. The RAID system automat-ically stores parity information on di↵erent disks and stripes themto allow for one or more disk failures. The server admin does notneed to worry about the reliability of data. She just needs to choosethe RAID level for the first time. However, changing the RAIDlevel is often a complex task in practice.Use Case 3: Relational DBMS. Consider a car manufacturer us-ing an RDBMS for managing its inventory and sales. The manufac-turer provides the schema definition as well as the data to store. TheRDBMS takes care of physical data storage, recovery, and data lay-outs (e.g. row store in IBM DB2). For an expert DBA, the RDBMS

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2013.6th Biennial Conference on Innovative Data Systems Research (CIDR ’13)January 6-9, 2013, Asilomar, California, USA.

provides several additional data tunings knobs such as replication,indexing, partitioning, and storage locations.Use Case 4: Cloud. Consider a web analytics start-up companyusing Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However,it does not need to worry about the data placement. The CloudStorage automatically replicates and distributes data over severalstorage locations in order to guarantee the availability of their data.Furthermore, the start-up company does not care about adding newstorage locations: the Cloud Storage does so automatically. Addi-tionally, the company may chose to store the data in multiple datalayouts [10] or indexes [5].

All of these use-cases are very di↵erent, right? No! We arguethat they are all facets of the same single data storage problem.We observe that in each of the four Use Cases, the users are im-plicitly or explicitly answering the following three key questions:(1) What to store? (i.e. which parts to store from a dataset or fileand how many times), (2) Where to store? (i.e. on which devices orlocations), and (3) How to store? (i.e. in which layout or format).Table 1 categorizes these storage decisions in the four Use Casesand labels them according to whether they use fixed ( ), flexible &manual ( ), or flexible & automatic ( ) storage decisions.What to store? The second column of Table 1 shows the what de-cisions in the four Use Cases. In Use Case 1 (UC1 for short), thefile system requires Alice to make manual choices of which mas-ter copies as well as the backup copies of data to store. In UC2,RAID automatically creates data replicas or parity, depending onthe RAID level. In UC3, an RDBMS typically stores the data aswell as the recovery log. A DBA has further control to create in-dexes and materialized views to speed up query execution. Simi-larly, Fractured Mirrors makes a fixed decision of storing exactlytwo copies of the data. Finally, in UC4, Cloud Storage is flexibleto create one or more data replicas for availability.Where to store? The third column of Table 1 shows the wheredecisions in the four Use Cases. In UC1, Alice has to manuallydecide where to store each type of data, e.g. movies on her per-sonal laptop and grant proposals on her university laptop. Goingfurther, RAID (UC2) makes fixed decisions (based on the RAIDlevel) to store data on a set of storage devices, which can be addedor removed manually. Similarly, an RDBMS (UC3) allows usersto define table spaces, typically residing on di↵erent storage loca-tions, for their application. However, Fractured Mirrors stores thedata on two (fixed) di↵erent machines. Cloud Storage (UC4), onthe other hand, is fully flexible on where to store the data. A cloudprovider can change data storage locations and even provision ad-ditional physical machines automatically.How to store? The fourth column of Table 1 shows the how deci-sions in the four Use Cases. Again, users in UC1 must manually

WWHow! Freeing Data Storage from Cages

Alekh Jindal

FJorge-Arnulfo Quiané-Ruiz

_,⌥ Jens Dittrich

F

FInformation Systems Group, Saarland University

http://infosys.cs.uni-saarland.de

_QCRI, Qatar Foundation

http://www.qcri.qa

Abstract. E�cient data storage is a key component of data manag-ing systems to achieve good performance. However, currently datastorage is either heavily constrained by static decisions (e.g. fixeddata stores in DBMSs) or left to be tuned and configured by users(e.g. manual data backup in File Systems). In this paper, we takea holistic view of data storage and envision a virtual storage layer.Our virtual storage layer provides a unified storage framework forseveral use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTIONTo better understand the problem, let us first see some of the typ-

ical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data StorageWe start from the simplest use case of a personal user using a File

System Storage, right up to an enterprise user using Cloud Storage.Use Case 1: File System. Consider a researcher, Alice, having twolaptops (one personal, one from her university). Alice stores di↵er-ent types of files at di↵erent locations. For example, she storesmovies on her personal laptop and grant proposals on her univer-sity laptop. Alice also maintains regular copies of her data on ex-ternal devices, in case one of her laptop crashes. For instance, shemaintains copies of grant proposals on hard-drives and recent con-ference talks on USB sticks. Finally, Alice also changes the dataformat (e.g. compressed) of her files in order to save storage space.Use Case 2: RAID. Consider a University IT department usingRAID storage servers to store its data. The RAID system automat-ically stores parity information on di↵erent disks and stripes themto allow for one or more disk failures. The server admin does notneed to worry about the reliability of data. She just needs to choosethe RAID level for the first time. However, changing the RAIDlevel is often a complex task in practice.Use Case 3: Relational DBMS. Consider a car manufacturer us-ing an RDBMS for managing its inventory and sales. The manufac-turer provides the schema definition as well as the data to store. TheRDBMS takes care of physical data storage, recovery, and data lay-outs (e.g. row store in IBM DB2). For an expert DBA, the RDBMS

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2013.6th Biennial Conference on Innovative Data Systems Research (CIDR ’13)January 6-9, 2013, Asilomar, California, USA.

provides several additional data tunings knobs such as replication,indexing, partitioning, and storage locations.Use Case 4: Cloud. Consider a web analytics start-up companyusing Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However,it does not need to worry about the data placement. The CloudStorage automatically replicates and distributes data over severalstorage locations in order to guarantee the availability of their data.Furthermore, the start-up company does not care about adding newstorage locations: the Cloud Storage does so automatically. Addi-tionally, the company may chose to store the data in multiple datalayouts [10] or indexes [5].

All of these use-cases are very di↵erent, right? No! We arguethat they are all facets of the same single data storage problem.We observe that in each of the four Use Cases, the users are im-plicitly or explicitly answering the following three key questions:(1) What to store? (i.e. which parts to store from a dataset or fileand how many times), (2) Where to store? (i.e. on which devices orlocations), and (3) How to store? (i.e. in which layout or format).Table 1 categorizes these storage decisions in the four Use Casesand labels them according to whether they use fixed ( ), flexible &manual ( ), or flexible & automatic ( ) storage decisions.What to store? The second column of Table 1 shows the what de-cisions in the four Use Cases. In Use Case 1 (UC1 for short), thefile system requires Alice to make manual choices of which mas-ter copies as well as the backup copies of data to store. In UC2,RAID automatically creates data replicas or parity, depending onthe RAID level. In UC3, an RDBMS typically stores the data aswell as the recovery log. A DBA has further control to create in-dexes and materialized views to speed up query execution. Simi-larly, Fractured Mirrors makes a fixed decision of storing exactlytwo copies of the data. Finally, in UC4, Cloud Storage is flexibleto create one or more data replicas for availability.Where to store? The third column of Table 1 shows the wheredecisions in the four Use Cases. In UC1, Alice has to manuallydecide where to store each type of data, e.g. movies on her per-sonal laptop and grant proposals on her university laptop. Goingfurther, RAID (UC2) makes fixed decisions (based on the RAIDlevel) to store data on a set of storage devices, which can be addedor removed manually. Similarly, an RDBMS (UC3) allows usersto define table spaces, typically residing on di↵erent storage loca-tions, for their application. However, Fractured Mirrors stores thedata on two (fixed) di↵erent machines. Cloud Storage (UC4), onthe other hand, is fully flexible on where to store the data. A cloudprovider can change data storage locations and even provision ad-ditional physical machines automatically.How to store? The fourth column of Table 1 shows the how deci-sions in the four Use Cases. Again, users in UC1 must manually

WWHow! Freeing Data Storage from Cages

Alekh Jindal

FJorge-Arnulfo Quiané-Ruiz

_,⌥ Jens Dittrich

F

FInformation Systems Group, Saarland University

http://infosys.cs.uni-saarland.de

_QCRI, Qatar Foundation

http://www.qcri.qa

Abstract. E�cient data storage is a key component of data manag-ing systems to achieve good performance. However, currently datastorage is either heavily constrained by static decisions (e.g. fixeddata stores in DBMSs) or left to be tuned and configured by users(e.g. manual data backup in File Systems). In this paper, we takea holistic view of data storage and envision a virtual storage layer.Our virtual storage layer provides a unified storage framework forseveral use-cases including personal, enterprise, and cloud storage.

1. INTRODUCTIONTo better understand the problem, let us first see some of the typ-

ical data storage scenarios and analyze what they have in common.

1.1 Current Approaches to Data StorageWe start from the simplest use case of a personal user using a File

System Storage, right up to an enterprise user using Cloud Storage.Use Case 1: File System. Consider a researcher, Alice, having twolaptops (one personal, one from her university). Alice stores di↵er-ent types of files at di↵erent locations. For example, she storesmovies on her personal laptop and grant proposals on her univer-sity laptop. Alice also maintains regular copies of her data on ex-ternal devices, in case one of her laptop crashes. For instance, shemaintains copies of grant proposals on hard-drives and recent con-ference talks on USB sticks. Finally, Alice also changes the dataformat (e.g. compressed) of her files in order to save storage space.Use Case 2: RAID. Consider a University IT department usingRAID storage servers to store its data. The RAID system automat-ically stores parity information on di↵erent disks and stripes themto allow for one or more disk failures. The server admin does notneed to worry about the reliability of data. She just needs to choosethe RAID level for the first time. However, changing the RAIDlevel is often a complex task in practice.Use Case 3: Relational DBMS. Consider a car manufacturer us-ing an RDBMS for managing its inventory and sales. The manufac-turer provides the schema definition as well as the data to store. TheRDBMS takes care of physical data storage, recovery, and data lay-outs (e.g. row store in IBM DB2). For an expert DBA, the RDBMS

⌥Work done at Saarland University.

This article is published under a Creative Commons Attribution License(http://creativecommons.org/licenses/by/3.0/), which permits distributionand reproduction in any medium as well allowing derivative works, pro-vided that you attribute the original work to the author(s) and CIDR 2013.6th Biennial Conference on Innovative Data Systems Research (CIDR ’13)January 6-9, 2013, Asilomar, California, USA.

provides several additional data tunings knobs such as replication,indexing, partitioning, and storage locations.Use Case 4: Cloud. Consider a web analytics start-up companyusing Cloud storage to store large volumes of web logs. The start-up company needs to pick a cloud provider for its data. However,it does not need to worry about the data placement. The CloudStorage automatically replicates and distributes data over severalstorage locations in order to guarantee the availability of their data.Furthermore, the start-up company does not care about adding newstorage locations: the Cloud Storage does so automatically. Addi-tionally, the company may chose to store the data in multiple datalayouts [10] or indexes [5].

All of these use-cases are very di↵erent, right? No! We arguethat they are all facets of the same single data storage problem.We observe that in each of the four Use Cases, the users are im-plicitly or explicitly answering the following three key questions:(1) What to store? (i.e. which parts to store from a dataset or fileand how many times), (2) Where to store? (i.e. on which devices orlocations), and (3) How to store? (i.e. in which layout or format).Table 1 categorizes these storage decisions in the four Use Casesand labels them according to whether they use fixed ( ), flexible &manual ( ), or flexible & automatic ( ) storage decisions.What to store? The second column of Table 1 shows the what de-cisions in the four Use Cases. In Use Case 1 (UC1 for short), thefile system requires Alice to make manual choices of which mas-ter copies as well as the backup copies of data to store. In UC2,RAID automatically creates data replicas or parity, depending onthe RAID level. In UC3, an RDBMS typically stores the data aswell as the recovery log. A DBA has further control to create in-dexes and materialized views to speed up query execution. Simi-larly, Fractured Mirrors makes a fixed decision of storing exactlytwo copies of the data. Finally, in UC4, Cloud Storage is flexibleto create one or more data replicas for availability.Where to store? The third column of Table 1 shows the wheredecisions in the four Use Cases. In UC1, Alice has to manuallydecide where to store each type of data, e.g. movies on her per-sonal laptop and grant proposals on her university laptop. Goingfurther, RAID (UC2) makes fixed decisions (based on the RAIDlevel) to store data on a set of storage devices, which can be addedor removed manually. Similarly, an RDBMS (UC3) allows usersto define table spaces, typically residing on di↵erent storage loca-tions, for their application. However, Fractured Mirrors stores thedata on two (fixed) di↵erent machines. Cloud Storage (UC4), onthe other hand, is fully flexible on where to store the data. A cloudprovider can change data storage locations and even provision ad-ditional physical machines automatically.How to store? The fourth column of Table 1 shows the how deci-sions in the four Use Cases. Again, users in UC1 must manually

Page 2: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Storage is Easy!

Page 3: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Storage is Easy!

Page 4: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Raw data

Page 5: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Management

System

Raw data

Page 6: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Management

System

???

?

Queries

Raw data

Page 7: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Management

System

Is performance as expected?

???

?

Page 8: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Is your Data Storage Efficient?

Page 9: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed
Page 10: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

What to store?

TPC-H Lineitem

copy 1 copy 2 copy 3

Page 11: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

TPC-H Lineitem

Where to store?

?

Page 12: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

TPC-H Lineitem

How to store?

?a b+

Page 13: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

What, Where, and Howin our everyday life

Page 14: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

ppt

Page 15: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

ppt

ppt

Page 16: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

ppt

pdfppt

Page 17: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

pdfppt

Page 18: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

pdfppt

Page 19: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

pdfppt

Page 20: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

ppt

pdfppt

Page 21: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

ppt

pdfppt

pdf

Page 22: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

However

Page 23: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WhatWhereHow

Page 24: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

TPC-HLineitem

Data Management

System

Page 25: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Store Engine

TPC-HLineitem

Data Management

System

Page 26: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Store Engine

DBMS-X

TPC-HLineitem

Page 27: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Store Engine

DBMS-X

TPC-HLineitem

row

layo

ut

Page 28: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WhatWhere

How

Page 29: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

pdf

ppt

pdfppt

Page 30: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

pdf

ppt

pdfppt

Page 31: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed
Page 32: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow!

Page 33: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow!hat

here

WWHow

Page 34: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! VisionFlexible on What, Where, and How1

Page 35: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! VisionFlexible on What, Where, and How1Declarative Data Storage Language2

Page 36: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! VisionFlexible on What, Where, and How1Declarative Data Storage Language2Data Storage Optimizer3

Page 37: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

DSL DSL

Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ew

WWHow! Language

Physical Storage Interface

Data Management

System

WWHow! Layer

Page 38: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow!hat

WWHow

can we do?

Page 39: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store 3 replicas of the weblogs on the cloud with a different index each replica

Page 40: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’

I want to store 3 replicas of the weblogs on the cloud with a different index each replica

Page 41: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3

I want to store 3 replicas of the weblogs on the cloud with a different index each replica

Page 42: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3WHERE ec2-007-23-compute-1-amazonaws.com

I want to store 3 replicas of the weblogs on the cloud with a different index each replica

Page 43: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3WHERE ec2-007-23-compute-1-amazonaws.comHOW idx(url) FOR rep-1, idx(sourceIP) FOR rep-2, idx(visitDate) FOR rep-3;

I want to store 3 replicas of the weblogs on the cloud with a different index each replica

Page 44: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store 3 replicas of the weblogs with a different index each replica and

with high privacy

Page 45: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3HOW idx(url) FOR rep-1,

idx(sourceIP) FOR rep-2,idx(visitDate) FOR rep-3;

I want to store 3 replicas of the weblogs with a different index each replica and

with high privacy

Page 46: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3HOW idx(url) FOR rep-1,

idx(sourceIP) FOR rep-2,idx(visitDate) FOR rep-3;

CONSTRAINT Privacy = ‘high’

I want to store 3 replicas of the weblogs with a different index each replica and

with high privacy

Page 47: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

STORE ‘/webApp/logs/UserVisits.log’WHAT * AS rep-1, * AS rep-2, * AS rep-3HOW idx(url) FOR rep-1,

idx(sourceIP) FOR rep-2,idx(visitDate) FOR rep-3;

CONSTRAINT Privacy = ‘high’

I want to store 3 replicas of the weblogs with a different index each replica and

with high privacy

job for the WWhow! data storage optimizer

Page 48: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store the weblogs with high privacy and if

possible with high availability

Page 49: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store the weblogs with high privacy and if

possible with high availability

STORE ‘/webApp/logs/UserVisits.log’

Page 50: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store the weblogs with high privacy and if

possible with high availability

STORE ‘/webApp/logs/UserVisits.log’PREFERENCE Availability = ‘high’

Page 51: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store the weblogs with high privacy and if

possible with high availability

STORE ‘/webApp/logs/UserVisits.log’PREFERENCE Availability = ‘high’ CONSTRAINT Privacy = ‘high’

Page 52: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

I want to store the weblogs with high privacy and if

possible with high availability

STORE ‘/webApp/logs/UserVisits.log’PREFERENCE Availability = ‘high’ CONSTRAINT Privacy = ‘high’

job for the WWhow! data storage optimizer

Page 53: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Euh!

Page 54: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

What about existing

technology?

Page 55: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow!here

WWHow

do we go?

Page 56: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

Data Management System

TPC-HLineitem

Store Enginero

w la

yout

Page 57: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! Layer

row

layo

ut

Data Management System

TPC-HLineitem

Page 58: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! Layer

colu

mn

layo

ut

Data Management System

TPC-HLineitem

Page 59: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! Layer

colu

mn

layo

ut

Data Management Systems

TPC-HLineitem

DMS-X DMS-Y

Page 60: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow!WWHowto proceed?

Page 61: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! Language

Physical Storage Interface

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

Page 62: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

WWHow! Language

Physical Storage Interface

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

Page 63: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

Page 64: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

Page 65: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

Page 66: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

Page 67: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

Page 68: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

WWHow! OptimizerWhat

Where How

Data Storage Optimizer

Page 69: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

WWHow! OptimizerWhat

Where How

Data Storage Optimizer

Page 70: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

WWHow! OptimizerWhat

Where How

Data Storage Optimizer

Data Access Optimizer

Page 71: Freeing Data Storage from Cages - da.qcri.orgda.qcri.org/jquiane/jorgequianeFiles/presentations/talkCIDR13.pdf · storage is either heavily constrained by static decisions (e.g. fixed

API (Logical View)WWHow! Language

Physical Storage Interface

WWHow! Language

WWHow!Logi

cal

Dat

a Vi

ewPh

ysic

al

Dat

a Vi

ewW

WH

ow! L

ayer

Data Management

System

WWHow! LanguageInterpreter

WWHow! ControllerStorage themes

WWHow! OptimizerWhat

Where How

Data Storage Optimizer

Data Access Optimizer