
Web Data Indexing in the Cloud: Efficiency and Cost Reductions

Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu
OAK team, Inria Saclay and LRI, Université Paris-Sud

[email protected]

ABSTRACT
An increasing part of the world's data is either shared through the Web or directly produced through and for Web platforms, in particular using structured formats like XML or JSON. Cloud platforms are interesting candidates to handle large data repositories, due to their elastic scaling properties. Popular commercial clouds provide a variety of subsystems and primitives for storing data in specific formats (files, key-value pairs etc.) as well as dedicated subsystems for running and coordinating execution within the cloud.

We propose an architecture for warehousing large-scale Web data, in particular XML, in a commercial cloud platform, specifically Amazon Web Services. Since cloud users bear monetary costs directly connected to their consumption of cloud resources, we focus on indexing content in the cloud. We study the applicability of several indexing strategies, and show that they not only reduce query evaluation time but also, importantly, the monetary costs associated with the exploitation of the cloud-based warehouse. Our architecture can be easily adapted to similar cloud-based complex data warehousing settings, carrying over the benefits of access path selection in the cloud.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Concurrency, Distributed databases, Query processing

General Terms
Design, Performance, Economics, Experimentation

Keywords
Cloud Computing, Web Data Management, Query Processing, Monetary Cost

1. INTRODUCTION
An increasing part of the world's interesting data is either shared through the Web, or directly produced through and for Web platforms, using formats like XML or, more recently, JSON. Data-rich Web sites such as product catalogs, social media sites, RSS and tweets, blogs or online publications exemplify this trend. Today, many organizations recognize the value of this trove of Web data and the need for scalable platforms to manage it.

Simultaneously, the recent development of cloud computing environments has strongly impacted research and development in distributed software platforms. From a business perspective, cloud-based platforms release the application owner from the burden of administering the hardware, by providing resilience to failures as well as elastic scaling up and down of resources according to demand. From a (data management) research perspective, the cloud provides a distributed, shared-nothing infrastructure for data storage and processing.

Over the last few years, big IT companies such as Amazon, Google or Microsoft have started providing an increasing number of cloud services built on top of their infrastructure. Using these commercial cloud platforms, organizations and individuals can take advantage of an already deployed infrastructure and build their applications on top of it. An important feature of such platforms is their elasticity, i.e., the ability to allocate more (or less) computing power, storage, or other services, as the application demands grow or shrink. Cloud services are rented out based on specific service-level agreements (SLAs) characterizing their performance, reliability etc.

Although the services offered by public clouds vary, they all provide some form of scalable, durable, highly available store for files, or (equivalently) binary large objects. Cloud platforms also provide virtual machines (typically called instances) which are started and shut down as needed, and on which one can deploy code to be run. This gives a basic roadmap for warehousing large volumes of Web data in the cloud in a Software-as-a-Service (SaaS) mode: to store the data, load it into the cloud-based file store; to process a query, deploy some instances, have them read data from the file store, compute the query results and return them.

Clearly, the performance (response time) of this processing is important; however, so are the monetary costs associated with this scenario, that is, the costs to load, store, and process the data for query answering. The costs billed by the cloud provider are, in turn, related to the total effort (or total work), in other terms, the total consumption of all the cloud resources entailed by storage and query processing. In particular, when the warehouse is large and query evaluation involves all (or a large share of) the data, this leads to high costs for (i) reading the data from the file store and (ii) processing the query on the data.


In this work, we investigate the usage of content indexing as a tool to both improve query performance and reduce the total costs of exploiting a warehouse of Web data within the cloud. We focus on tree-structured data, and in particular XML, due to the large adoption of this and other tree-shaped formats such as JSON, and we consider the particular example of the Amazon Web Services (AWS, in short) platform, among the most widely adopted and also the target of previous research works [6, 23]. Since there is a strong similarity among commercial cloud platforms, our results could easily carry over to other platforms (we briefly discuss this in Section 3). The contributions of this work are:

• A generic architecture for large-scale warehousing of complex semistructured data (in particular, tree-shaped data) in the cloud. Our focus is on building and exploiting an index to simultaneously speed up processing and reduce cloud resource consumption (and thus, warehouse operating costs);

• A concrete implemented platform following this architecture, demonstrating its practical interest and validating the claimed benefits of our indexes through experiments on a 40 GB dataset. We show that indexing can reduce processing time by up to two orders of magnitude and costs by one order of magnitude; moreover, index creation costs amortize very quickly as more queries are run.

Our work is among the few to focus on cloud-based indexing for complex data; a preliminary version appeared in [8], and an integrated system based on the work described here was demonstrated in [4]. Among other previous works, ours can be seen as a continuation of [6, 19], which also exploited commercial clouds for fine-granularity data management, but (i) for relational data, (ii) with a stronger focus on transactions, and (iii) prior to the very efficient key-value stores we used to build indexes reducing both costs and query evaluation times.

The remainder of this paper is organized as follows. Section 2 discusses related works. Section 3 describes our architecture. Section 4 presents our query language, while Section 5 focuses on cloud-based indexing strategies. Section 6 details our system implementation on top of AWS, while Section 7 introduces its associated monetary cost model. Section 8 provides our experimental evaluation results. We then conclude and outline future directions.

2. RELATED WORK
To the best of our knowledge, distributed XML indexing and query processing directly based on commercial cloud services has not been attempted elsewhere, with the exception of our already mentioned preliminary work [8]. That work presented index-based querying strategies together with a first implementation on top of AWS, and focused on techniques to overcome the limitations of Amazon SimpleDB (http://aws.amazon.com/simpledb/) for managing indexes. This paper features several important novelties. First, it presents an implementation which relies on Amazon's newer key-value store, DynamoDB, in order to ensure better performance in managing indexes and, quite importantly, more predictable monetary cost estimation. We provide a performance comparison between both services in Section 8. Second, it presents a proper monetary cost model, which remains valid in the context of several alternative commercial cloud services. Third, it introduces multiple optimizations in the indexing strategies. Finally, it reports on extensive experiments over a large dataset, with a particular focus on performance in terms of monetary cost.

Alternative approaches which may attain the same global goal of managing XML in the cloud include commercial database products, such as Oracle Database, IBM DB2 and Microsoft SQL Server. These products have included XML storage and query processing capabilities in their relational databases over the last ten years, and have since ported their servers to cloud-based architectures [5]. Differently from our framework, such systems have many functionalities beyond those of XML stores, and require non-negligible administration efforts, since they are characterized by complex architectures.

Recently, a Hadoop-based architecture for processing multiple twig patterns on a very large XML document has been proposed [10]. This system adopts static document partitioning: the input document is statically partitioned into several blocks before query processing, and some path information is added to the blocks to avoid loss of structural information. Also, the system deals with a fragment of XPath 1.0, and assumes the query workload to be known in advance, so that a particular query index can be built and initially sent to mappers to start XML matching. Differently, in our system we do not adopt document partitioning, the query workload is dynamic (indexes depend only on the data) and, importantly, we propose a model for monetary cost estimation.

Another related approach aims at leveraging large-scale distributed infrastructures (e.g., clouds) by intra-query parallelism, as in [11], enabling parallelism in the processing of each query by exploiting multiple machines. Differently, in our work we consider the evaluation of one query as an atomic (inseparable) unit of processing, and focus on the horizontal scaling of the overall indexing and query processing pipeline distributed over the cloud.

The broader problem of distributed XML data management has been previously addressed from many other angles. For instance, issues related to XML data management in clusters were largely studied within the Xyleme project [1]. Further, XML data management in P2P systems has been a topic of extensive research since the early 2000s [18].

Finally, some recent works related to cloud services have put special emphasis on the economic side of the cloud. For instance, the cost of multiple architectures for transaction processing on top of different commercial cloud providers is studied in [19]. In this case, the authors focus on read and update workloads rather than XML processing, as in our study. On the other hand, some works have proposed models for determining the optimal price [14] or have studied cost amortization [15] for data-based cloud services. However, in these works, the monetary costs are studied from the perspective of the cloud providers, rather than from the user perspective, as in our work.

3. ARCHITECTURE
We now describe our proposed architecture for Web data warehousing. To build a scalable, inexpensive data store, we store documents as files within Amazon's Simple Storage Service (S3, in short). To host the index, we have different requirements: fine-grained access and fast look-up. Thus, we rely on Amazon's efficient DynamoDB key-value store for storing and exploiting the index. Within AWS, instances can be deployed through the Elastic Compute Cloud service (EC2, in short). We deploy EC2 instances to (i) extract index entries from the loaded data and send them to the DynamoDB index, and (ii) evaluate queries on those subsets of the database resulting from the query-driven index look-ups. Finally, we rely on Amazon Simple Queue Service (SQS, in short) asynchronous queues to provide reliable communication between the different modules of the application. Figure 1 depicts this architecture.

Figure 1: Architecture overview.

User interaction with the system involves the following steps. When a document arrives (1), the front end of our application stores it in the file storage service (2). Then, the front end module creates a message containing the reference to the document and inserts it into the loader request queue (3). Such messages are retrieved by any of the virtual machines running our indexing module (4). When a message is retrieved, the application loads the document referenced by the message from the file store (5) and creates the index data that is finally inserted into the index store (6).

When a query arrives (7), a message containing the query is created and inserted into the query request queue (8). Such messages are retrieved by any of the virtual instances running our query processor module (9). The index data is then extracted from the index store (10). Index storage services provide an API with rather simple functionalities typical of key-value stores, such as get and put requests. Thus, any other processing steps needed on the data retrieved from the index are performed by a standard XML querying engine (11), providing value and structural joins, selections, projections etc.

Provider              File store                       Key-value store                      Computing                        Queues
Amazon Web Services   Amazon Simple Storage Service    Amazon DynamoDB, Amazon SimpleDB     Amazon Elastic Compute Cloud     Amazon Simple Queue Service
Google Cloud          Google Cloud Storage             Google High Replication Datastore    Google Compute Engine            Google Task Queues
Windows Azure         Windows Azure BLOB Storage       Windows Azure Tables                 Windows Azure Virtual Machines   Windows Azure Queues

Table 1: Component services from major commercial cloud platforms.

Figure 2: Sample queries q1-q5 (tree patterns over paintings, painters and museums, featuring val and cont annotations, equality, containment and range predicates, and a value join).

After the document references have been extracted from the index, the local query evaluator receives this information (12) and the XML documents cited are retrieved from the file store (13). Our framework includes "standard" XML query evaluation, i.e., the capability of evaluating a given query q over a given document d. This is done by means of a single-site XML processor, which one can choose freely. Thus, the virtual instance runs the query processor over the documents and extracts the results for the query.

Finally, we write the obtained results to the file store (14), create a message with the reference to those results and insert it into the query response queue (15). When the message is read by the front end (16), the results are retrieved from the file store (17) and returned (18).

Scalability, parallelism and fault-tolerance. The architecture described above exploits the elastic scaling of the cloud, for instance by increasing and decreasing the number of virtual machines running each module. The synchronization through message queues among modules provides inter-machine parallelism, whereas intra-machine parallelism is supported by multi-threading our code. We have also taken advantage of the primitives provided by the message queues to make our code resilient to virtual instance crashes: if an instance fails to renew its lease on the message which had caused a task to start, the message becomes available again and another virtual instance will take over the job.
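To make this recovery mechanism concrete, here is a minimal sketch of such a worker loop; the TaskQueue and Message types are our own abstraction of the queue service primitives (receive under a lease, renew, delete), not the actual SQS API.

interface Message { String body(); }

interface TaskQueue {
    Message receive();           // leases a message: it becomes invisible to others
    void renewLease(Message m);  // must be called periodically while working
    void delete(Message m);      // acknowledge: the task is done
}

final class Worker implements Runnable {
    private final TaskQueue queue;
    Worker(TaskQueue queue) { this.queue = queue; }

    public void run() {
        while (true) {
            Message m = queue.receive();
            if (m == null) continue;            // nothing to do yet
            try {
                process(m.body());              // index a document, or answer a query
                queue.delete(m);                // only delete after success
            } catch (RuntimeException crash) {
                // Do nothing: the lease expires, the message reappears in the
                // queue, and another instance takes over the job.
            }
        }
    }

    private void process(String task) { /* application-specific work */ }
}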

Applicability to other cloud platforms. While we have instantiated the above architecture based on AWS, it can easily be ported to other well-known commercial clouds, since their service ranges are quite similar. Table 1 shows the services available in the Google and Microsoft cloud platforms corresponding to the ones we use from AWS.


4. QUERY LANGUAGE
We consider queries formulated in an expressive fragment of XQuery, amounting to value joins over tree patterns. For illustration, Figure 2 depicts some queries in a graphical notation which is easier to read. The translation to XQuery syntax is quite straightforward and we omit it; examples, as well as the formal pattern semantics, can be found in [21].

In a tree pattern, each node is labeled either with an XML element name or an attribute name. By convention, attribute names are prefixed with the @ sign. Parent-child relationships are represented by single lines, while ancestor-descendant relationships are encoded by double lines.

Each node corresponding to an XML element may be annotated with zero or more of the labels val and cont, denoting, respectively, that the string value of the element (obtained by concatenating all its text descendants) or the full XML subtree rooted at this node is needed. We support these different levels of information for the following reasons. The value of a node is used in the XQuery specification to determine whether two nodes are equal (e.g., whether the name of a painter is the same as the name of a book writer). The content is the natural granularity of XML query results (full XML subtrees are returned by the evaluation of an XPath query). A tree pattern node corresponding to an XML attribute may be annotated with the label val, denoting that the string value of the attribute is returned. Further, any node may also be annotated with one of the following predicates on its value:

• An equality predicate of the form =c, where c is some constant, imposing that its value is equal to c.

• A containment predicate of the form contains(c), imposing that its value contains the word c.

• A range predicate of the form [a ≤ val ≤ b], where a and b are constants such that a < b, imposing that its value lies inside that range.

Finally, a dashed line connecting two nodes joins two tree patterns, requiring that the value of the respective nodes be the same on both sides.

For instance, in Figure 2, q1 returns the pair (painting name, painter name) for each painting, q2 returns the descriptions of paintings from 1854, while q3 returns the last name of painters having authored a painting whose name includes the word Lion. Query q4 returns the name of the painting(s) by Manet created between 1854 and 1865, and finally query q5 returns the name of the museum(s) exposing paintings by Delacroix.
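As a purely illustrative rendering of this pattern language in code, the following sketch models pattern nodes; all class and field names are our own invention, not those of the actual system.

import java.util.ArrayList;
import java.util.List;

// One possible in-memory representation of a tree pattern node.
final class PatternNode {
    enum Edge { CHILD, DESCENDANT }              // single vs. double line
    enum Predicate { NONE, EQUALS, CONTAINS, RANGE }

    final String label;                          // element name, or "@name" for attributes
    boolean returnsVal, returnsCont;             // the val / cont annotations
    Predicate predicate = Predicate.NONE;
    String constant;                             // c, for =c and contains(c)
    String low, high;                            // bounds, for [a <= val <= b]
    final List<PatternNode> children = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();  // edge type towards each child

    PatternNode(String label) { this.label = label; }

    void addChild(PatternNode child, Edge edge) {
        children.add(child);
        edges.add(edge);
    }
}

For instance, q2 of Figure 2 corresponds, roughly, to a painting root with a description node annotated cont and a year node carrying the predicate =1854.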

5. INDEXING STRATEGIES
Many indexing schemes have been devised in the literature, e.g., [13, 17, 22]. In this section, we explain how we adapted a set of relatively simple XML indexing strategies, previously used in other distributed environments [2, 12], to our setting, where the index is built within a key-value store.

Notations. Before describing the different indexing strategies, we introduce some useful notations.

We denote by URI(d) the Uniform Resource Identifier (or URI, in short) of an XML document d. For a given node n ∈ d, we denote by inPath(n) the label path going from the root of d to the node n, and by id(n) the node identifier (or ID, in short). We rely on simple (pre, post, depth) identifiers used, e.g., in [3] and many follow-up works. Such IDs allow deciding that node n1 is an ancestor of node n2 by testing whether n1.pre < n2.pre and n2.post < n1.post. If this is the case, moreover, n1 is the parent of n2 iff n1.depth + 1 = n2.depth.

"delacroix.xml":
<painting id="1854-1">
  <name>The Lion Hunt</name>
  <painter><name><first>Eugene</first><last>Delacroix</last></name></painter>
</painting>

"manet.xml":
<painting id="1863-1">
  <name>Olympia</name>
  <painter><name><first>Edouard</first><last>Manet</last></name></painter>
</painting>

Figure 3: Sample XML documents.
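To make the ID scheme concrete, here is a minimal sketch (our illustration, not the paper's code) of the (pre, post, depth) identifiers with the two structural tests just described.

// Structural identifiers: pre-order rank, post-order rank, and depth.
final class NodeId {
    final int pre, post, depth;

    NodeId(int pre, int post, int depth) {
        this.pre = pre; this.post = post; this.depth = depth;
    }

    // An ancestor precedes its descendants in pre-order and follows them in post-order.
    boolean isAncestorOf(NodeId n2) {
        return this.pre < n2.pre && n2.post < this.post;
    }

    // A parent is an ancestor exactly one level up.
    boolean isParentOf(NodeId n2) {
        return isAncestorOf(n2) && this.depth + 1 == n2.depth;
    }
}

For example, in "manet.xml" of Figure 3, the name element with ID (3, 3, 2) is the parent of the word 'Olympia' with ID (4, 2, 3).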

For a given node n ∈ d, the function key(n) computes a string key based on which n's information is indexed. Let e, a and w be three constant string tokens, and ‖ denote string concatenation. We define key(n) as:

  key(n) = e ‖ n.label                             if n is an XML element
           a ‖ n.name  and  a ‖ n.name ‖ n.val     if n is an XML attribute
           w ‖ n.val                               if n is a word

Observe that we extract two keys from an attribute node: one reflecting the attribute name, and another which also reflects its value; these help speed up specific kinds of queries, as we will see. To simplify, we omit the ‖ symbol when this does not lead to confusion.
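A direct transcription of this definition, as a sketch with our own type and method names (the single space standing for the omitted ‖ between an attribute name and its value is also our rendering choice):

import java.util.Arrays;
import java.util.List;

final class Keys {
    enum Kind { ELEMENT, ATTRIBUTE, WORD }

    // Returns the index key(s) of a node: one key, except two for attributes.
    static List<String> keysOf(Kind kind, String labelOrName, String value) {
        switch (kind) {
            case ELEMENT:
                return Arrays.asList("e" + labelOrName);
            case ATTRIBUTE:   // both the name-only and the name-plus-value key
                return Arrays.asList("a" + labelOrName, "a" + labelOrName + " " + value);
            case WORD:
                return Arrays.asList("w" + value);
            default:
                throw new AssertionError();
        }
    }
}

For the @id attribute of "manet.xml", this yields the two keys aid and aid 1863-1, matching the sample tuples shown below.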

Indexing strategies. Given a document d, an indexing strategy I is a function that returns a set of tuples of the form (k, (a, v+)+)+. In other words, I(d) represents the set of keys k that must be added to the index store to reflect the new document d, as well as the attributes to be stored under the respective key. Each attribute consists of a name a and a set of values v.

Table 2 depicts the proposed indexing strategies, which are explained in detail in the following.

5.1 Strategy LU (Label-URI)
Index. For each node n ∈ d, strategy LU associates to the key key(n) the pair (URI(d), ε), where ε denotes the null string. For example, applied to the documents depicted in Figure 3, LU produces among others these tuples:

key          attribute name     attribute values
ename        "delacroix.xml"    ε
ename        "manet.xml"        ε
aid          "delacroix.xml"    ε
aid          "manet.xml"        ε
aid 1863-1   "manet.xml"        ε
wOlympia     "manet.xml"        ε

Name    Indexing function
LU      ILU(d) = {(key(n), (URI(d), ε)) | n ∈ d}
LUP     ILUP(d) = {(key(n), (URI(d), {inPath1(n), inPath2(n), ..., inPathy(n)})) | n ∈ d}
LUI     ILUI(d) = {(key(n), (URI(d), id1(n)‖id2(n)‖...‖idz(n))) | n ∈ d}
2LUPI   I2LUPI(d) = {[(key(n), (URI(d), {inPath1(n), inPath2(n), ..., inPathy(n)})),
                      (key(n), (URI(d), id1(n)‖id2(n)‖...‖idz(n)))] | n ∈ d}

Table 2: Indexing strategies.

Look-up. Given a query q expressed in the language described in Section 4, the look-up task consists of finding, by exploiting the index, and as precisely as possible, those parts of the document warehouse that may lead to query answers.

Index look-up based on the LU strategy is quite simple: all node names, attribute and element string values are extracted from the query and the respective look-ups are performed. The URI sets thus obtained are then intersected. The query is evaluated on those documents whose URIs are part of the intersection.
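A schematic of this look-up, under the assumption (ours) that the LU index is visible to the query processor as a map from key to the set of URIs stored under it:

import java.util.*;

final class LuLookup {
    // Intersects the URI sets associated with every key extracted from the query.
    static Set<String> lookup(Map<String, Set<String>> index, List<String> queryKeys) {
        Set<String> candidates = null;
        for (String key : queryKeys) {
            Set<String> uris = index.getOrDefault(key, Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(uris);
            else candidates.retainAll(uris);     // set intersection
            if (candidates.isEmpty()) break;     // no document can match
        }
        return candidates == null ? Collections.emptySet() : candidates;
    }
}

On the Figure 3 documents, looking up ename and wOlympia intersects {"delacroix.xml", "manet.xml"} with {"manet.xml"}, leaving only "manet.xml".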

5.2 Strategy LUP (Label-URI-Path)
Index. For each node n ∈ d, LUP associates to key(n) the attribute (URI(d), {inPath1(n), ..., inPathy(n)}). On the "delacroix.xml" and "manet.xml" documents shown in Figure 3, extracted tuples include:

key          attribute name     attribute values
ename        "delacroix.xml"    /epainting/ename, /epainting/epainter/ename
ename        "manet.xml"        /epainting/ename, /epainting/epainter/ename
aid          "delacroix.xml"    /epainting/aid
aid          "manet.xml"        /epainting/aid
aid 1863-1   "manet.xml"        /epainting/aid 1863-1
wOlympia     "manet.xml"        /epainting/ename/wOlympia

Look-up. The look-up strategy consists of finding, for each root-to-leaf path appearing in the query q, all documents having a data path that matches the query path. Here, a root-to-leaf query path is obtained simply by traversing the query tree and recording node keys and edge types. For instance, for the query q1 in Figure 2, the paths are //epainting/ename and //epainting//epainter/ename. To find the URIs of all documents matching a given query path

(/|//)a1(/|//)a2 . . . (/|//)aj

we look up in the LUP index all paths associated to key(aj), and then filter them, keeping only those matching the query path.
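One simple way to implement this filtering step, sketched below under the assumption (ours) that labels contain no regular-expression metacharacters, is to compile the query path into a regular expression over the stored label paths:

final class LupFilter {
    // Does a stored label path such as "/epainting/epainter/ename" match a query
    // path in which '/' is parent-child and '//' is ancestor-descendant?
    static boolean matches(String queryPath, String dataPath) {
        // '//' may skip any number of intermediate steps; '/' skips none.
        String regex = queryPath.replace("//", "(/[^/]+)*/");
        return dataPath.matches(regex);
    }
}

For instance, //epainting//epainter/ename compiles to (/[^/]+)*/epainting(/[^/]+)*/epainter/ename, which accepts /epainting/epainter/ename but rejects /epainting/ename.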

5.3 Strategy LUI (Label-URI-ID)
Index. The idea of this strategy is to concatenate the structural identifiers of a given node in a document, already sorted by their pre component, and store them in a single attribute value. We propose this implementation because the structural XML joins used to identify the relevant documents need sorted inputs: thus, by keeping the identifiers ordered, we reduce the use of expensive sort operators after the look-up.

To each key key(n), strategy LUI associates the pair

(URI(d), id1(n)‖id2(n)‖ . . . ‖idz(n))

such that id1(n) < id2(n) < . . . < idz(n). For instance, from the documents "delacroix.xml" and "manet.xml", some extracted tuples are:

key          attribute name     attribute values
ename        "delacroix.xml"    (3, 3, 2)(6, 8, 3)
ename        "manet.xml"        (3, 3, 2)(6, 8, 3)
aid          "delacroix.xml"    (2, 1, 2)
aid          "manet.xml"        (2, 1, 2)
aid 1863-1   "manet.xml"        (2, 1, 2)
wOlympia     "manet.xml"        (4, 2, 3)

Look-up. Index look-up based on LUI starts by searching the index for all the query labels. For instance, for the query q2 in Figure 2, the look-ups are epainting, edescription, eyear and w1854. Then, the input to the Holistic Twig Join [7] must only be sorted by URI (recall that the structural identifiers within any given document are already sorted).

5.4 Strategy 2LUPI (Label-URI-Path, Label-URI-ID)
Index. This strategy materializes the two previously introduced indexes: LUP and LUI. Sample index tuples resulting from the documents in Figure 3 are shown in Figure 4.

Look-up. 2LUPI exploits, first, LUP to obtain the set of documents containing matches for the query paths, and second, LUI to retrieve the IDs of the relevant nodes.

For instance, given the query q2 in Figure 2, we extract the URIs of the documents matching //epainting//edescription and //epainting/eyear/w1854. The URI sets are intersected, and we obtain a relation which we denote R1(URI). This is reminiscent of the LUP look-up. A second look-up identifies the structural identifiers of the XML nodes whose labels appear in the query, together with the URIs of their documents. This is reminiscent of the LUI look-up. We denote these relations by R2^a1, R2^a2, ..., R2^aj, assuming the query node labels and values are a1, a2, ..., aj. Then:

• We compute S2^ai = R2^ai ⋉URI R1(URI) for each ai, 1 ≤ i ≤ j. In other words, we use R1(URI) to reduce the R2 relations.

• We evaluate a holistic twig join over S2^a1, S2^a2, ..., S2^aj to obtain the URIs of the documents with query matches. The tuples for each input of the holistic join are obtained by sorting the attributes by their name and then breaking down their values into individual IDs.

Figure 5 outlines this process; a1, a2, ..., aj are the labels extracted from the query, while qp1, qp2, ..., qpm are the root-to-leaf paths extracted from the query.

It follows from the above explanation that 2LUPI returns the same URIs as LUI. The reduction phase serves as pre-filtering, to improve performance.
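The reduction step itself is a plain semijoin; a minimal sketch (ours), with each R2 relation represented as a map from URI to its packed identifier list:

import java.util.*;

final class Reduction {
    // S2 = R2 semijoin R1: keep the R2 entries (URI -> packed IDs) whose URI is in R1.
    static Map<String, String> reduce(Map<String, String> r2, Set<String> r1Uris) {
        Map<String, String> s2 = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : r2.entrySet())
            if (r1Uris.contains(entry.getKey()))
                s2.put(entry.getKey(), entry.getValue());
        return s2;
    }
}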

5.5 Range and value-joined queries
These types of queries, both supported by our language, need special evaluation strategies.

Queries with range predicates. Range look-ups in key-value stores usually imply a full scan, which is very expensive. Thus, we adopt a two-step strategy. First, we perform the index look-up without taking the range predicate into account, in order to restrict the set of documents to be queried. Second, we evaluate the complete query over these documents, as usual.

Queries with value joins. Since one tree pattern only matches one XML document, a query consisting of several tree patterns connected by value joins needs to be answered by combining tree pattern query results from different documents. This is indeed our evaluation strategy for such queries under any given indexing strategy I: first, evaluate each tree pattern individually, exploiting the index; then, apply the value joins on the tree pattern results thus obtained.
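The final combination step is an ordinary equality join on string values; a hash-join sketch (ours), with each tree-pattern result represented as a tuple of strings:

import java.util.*;

final class ValueJoin {
    // Joins left and right tuples on equality of left[lCol] and right[rCol].
    static List<String[]> join(List<String[]> left, int lCol,
                               List<String[]> right, int rCol) {
        Map<String, List<String[]>> byValue = new HashMap<>();
        for (String[] l : left)                       // build side, hashed on join value
            byValue.computeIfAbsent(l[lCol], v -> new ArrayList<>()).add(l);
        List<String[]> out = new ArrayList<>();
        for (String[] r : right)                      // probe side
            for (String[] l : byValue.getOrDefault(r[rCol], Collections.emptyList())) {
                String[] joined = Arrays.copyOf(l, l.length + r.length);
                System.arraycopy(r, 0, joined, l.length, r.length);
                out.add(joined);
            }
        return out;
    }
}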

1st key      1st attribute name   1st attribute values
ename        "delacroix.xml"      /epainting/ename, /epainting/epainter/ename
ename        "manet.xml"          /epainting/ename, /epainting/epainter/ename
aid          "delacroix.xml"      /epainting/aid
aid          "manet.xml"          /epainting/aid
aid 1863-1   "manet.xml"          /epainting/aid 1863-1
wOlympia     "manet.xml"          /epainting/ename/wOlympia

2nd key      2nd attribute name   2nd attribute values
ename        "delacroix.xml"      (3, 3, 2)(6, 8, 3)
ename        "manet.xml"          (3, 3, 2)(6, 8, 3)
aid          "delacroix.xml"      (2, 1, 2)
aid          "manet.xml"          (2, 1, 2)
aid 1863-1   "manet.xml"          (2, 1, 2)
wOlympia     "manet.xml"          (4, 2, 3)

Figure 4: Sample tuples extracted by the 2LUPI strategy from the documents in Figure 3.

Figure 5: Outline of look-up using the 2LUPI strategy: the URIs (πURI) of lookup(1st key = qp1), ..., lookup(1st key = qpm) are intersected into R1, which is semijoined with each of lookup(2nd key = a1), ..., lookup(2nd key = aj) before a final Holistic Twig Join.

Figure 6: Structure of a DynamoDB database: a database contains tables; a table contains items; each item has one or more attributes, each with a name and one or more values; the primary key consists of one or two key attributes.

6. CONCRETE DEPLOYMENT
As outlined before, our architecture can be deployed on top of the main existing commercial cloud platforms (see Table 1). The concrete implementation we have used for our tests relies on Amazon Web Services. In this section, we describe the AWS components employed in the implementation, and discuss their role in the overall architecture.

Amazon Simple Storage Service, or S3 in short, is a file storage service for raw data. S3 stores the data in buckets identified by their name. Each bucket consists of a set of objects, each having an associated unique name (within the bucket), metadata (both system-defined and user-defined), an access control policy for AWS users and a version ID. We opted for storing the whole data set in one bucket because (i) in our setting we did not use, e.g., different access control policies for different users, and (ii) Amazon states that the number of buckets used for a given dataset does not affect S3 storage and retrieval performance.

Amazon DynamoDB is a NoSQL database service for storing and querying a collection of tables. Each table is a collection of items whose size can be at most 64 KB. In turn, each item contains one or more attributes; an attribute has a name, and one or several values. Each table must have a primary key, which can be either (i) a single-attribute hash key with a value of at most 2 KB, or (ii) a composite hash-range key that combines a 2 KB hash key and a 1 KB range key. Figure 6 shows the structure of a DynamoDB database. Different items within a table may have different sets of attributes.

A table can be queried through a get(T, k) operation, retrieving all items in the table T having the hash key value k. If the table uses a composite hash-range key, we may use a call get(T, k, c) which retrieves all items in table T with hash key k and range key satisfying the condition c. A batchGet variant permits executing 100 get operations through a single API request. To create an item, we use the put(T, (a, v+)+) operation, which inserts the attribute(s) (a, v+)+ into a newly created item in table T; in this case, (a, v+)+ must include the attribute(s) defining the primary key of T. If an item already exists with the same primary key, the new item completely replaces the existing one. A batchPut variant inserts 25 items at a time.

Multiple tables cannot be queried by a single request; combining query results from different tables has to be done in the application layer.
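The following interface is our own abstraction of these operations, used in the sketch further below; it is not the actual DynamoDB API, whose request and item types differ.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Our abstraction of the index store: one instance per DynamoDB table.
interface IndexTable {
    // get(T, k): all items with hash key k.
    List<Map<String, List<String>>> get(String hashKey);

    // get(T, k, c): all items with hash key k and range key satisfying c.
    List<Map<String, List<String>>> get(String hashKey, Predicate<String> rangeCond);

    // put(T, (a, v+)+): insert a new item; attributes must include the primary key.
    void put(Map<String, List<String>> attributes);
}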

In our system, DynamoDB is used for storing and querying the indexes. The previously presented indexing strategies are mapped to DynamoDB as follows. For every strategy but 2LUPI, the index is stored in a single table, while for 2LUPI two different tables (one for each sub-index) are used. For any strategy, an entry is mapped into one or more items. Each item has a composite primary key, formed by a hash key and a range key. The first corresponds to the key of the index entry, while the second is a UUID [20], a globally unique identifier that can be created without a central authority, generated at indexing time. The attribute names and values of the entry are respectively stored as item attribute names and values.

Using UUIDs as range keys ensures that we can insert items into the index concurrently, from multiple virtual machines, as items with the same hash key always carry different range keys and thus cannot be overwritten. Also, using a UUID instead of mapping each attribute name to a range key allows the system to reduce the number of items stored for an index entry, and thus to improve performance at query time (we can recover all items for an index entry by means of a simple get operation).
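Under the IndexTable abstraction above, storing an index entry could look as follows; the attribute names "key" and "uuid" for the hash and range key are hypothetical, not necessarily those of the actual implementation.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class EntryWriter {
    // Stores one index entry (k, (a, v+)+) as a single new item.
    static void store(IndexTable table, String indexKey,
                      Map<String, List<String>> entryAttributes) {
        Map<String, List<String>> item = new LinkedHashMap<>();
        item.put("key", Arrays.asList(indexKey));                       // hash key
        item.put("uuid", Arrays.asList(UUID.randomUUID().toString()));  // range key
        item.putAll(entryAttributes);   // e.g. URI(d) -> paths or packed IDs
        table.put(item);
    }
}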

Amazon Elastic Compute Cloud, or EC2 in short, provides resizable computing capacity in the cloud. Using EC2, one can launch as many virtual computer instances as desired, with a variety of operating systems, and execute applications on them.

AWS provides different types of instances, with different hardware specifications and prices, among which users can choose: standard instances are well suited for most applications, high-memory instances for high-throughput applications, and high-CPU instances for compute-intensive applications. We have experimented with two types of standard instances, and we show in Section 8 the performance and cost differences between them.

Amazon Simple Queue Service, or SQS in short, provides reliable and scalable queues that enable asynchronous message-based communication between the distributed components of an application over AWS. We rely on SQS heavily for circulating computing tasks between the various modules of our architecture, that is: from the front end to the virtual instances running our indexing module, for loading and indexing the data; from the front end again to the instances running the query processor, for answering a query; then from the query processor back to the front end, to indicate that the query has been answered and the results can be fetched.

7. APPLICATION COSTS
A relevant question in a cloud-hosted application context is: how much will this cost? Commercial cloud providers charge specific costs for the usage of their services. This section presents our model for estimating the monetary costs of uploading, indexing, hosting and querying Web data in a commercial cloud using the algorithms previously described. Section 7.1 presents the metrics used for calculating these costs, while Section 7.2 introduces the components of the cloud provider's pricing model relevant to our application. Finally, Section 7.3 proposes the cost model for index building, index storage and query answering.

7.1 Data set metrics
The following metrics capture the impact of a given data set, indexing strategy and query on the charged costs.

Data-dependent metrics. Given a set of documents D, |D| and s(D) denote the number of documents in D and the total size (in GB) of all the documents in D, respectively.

Data- and index-determined metrics. For a given set of documents D and indexing strategy I, let |op(D, I)| be the number of put requests needed to store the index for D based on strategy I.

Let tidx(D, I) be the time needed to build and store the index for D based on strategy I. The meaningful way to measure this in a cloud is: the time elapsed between

• the moment when the first message (document loading request) entailed by loading D is retrieved from our application's queue, and

• the moment when the last such message is deleted from the queue.

To compute the index size, we use:

• sr(D, I), the raw size (in GB) of the data extracted from D according to I.

• Cloud key-value stores (in our case, DynamoDB) create their own auxiliary data structures on top of the user data they store. We denote by ovh(D, I) the size (in GB) of the storage overhead incurred for the index data extracted from D according to I.

Thus, the size of the data stored in the indexing service is:

s(D, I) = sr(D, I) + ovh(D, I)

Data-, index- and query-determined metrics. First, let |r(q)| be the size (in GB) of the results for query q.

When using the indexing strategy I, for a query q and document set D, let |op(q, D, I)| be the number of get operations needed to look up documents that may contain answers to q, in an index for D built based on strategy I. Similarly, let D_I^q be the subset of D resulting from the look-up on the index built based on strategy I for query q.

Let pt(q, D) be the time needed by the query processor to answer a query q over a dataset D without using any index, and ptq(q, D, I, D_I^q) the time to process q using an index built according to strategy I (thus, on the reduced document set D_I^q). We measure both as the time elapsed from the moment the message with the query is retrieved from the queue service to the moment the message is deleted from it.

7.2 Cloud services costs
We list here the costs spelled out in the cloud service provider's pricing policy which impact the costs charged by the provider for our Web data management application.

File storage costs. We consider the following three components for calculating the costs associated with a file store.

• ST$m,GB is the cost charged for storing and keeping 1 GB of data in the file store for one month.

• STput$ is the price per document storage operation request.

• STget$ is the price per document retrieval operation request.

Indexing costs. We consider the following components for calculating the costs associated with the index store.

• IDX$m,GB is the cost charged for storing and keeping 1 GB of data in the index store for one month.

• IDXput$ is the cost of a put API request that inserts a row into the index store.

• IDXget$ is the cost of a get API request that retrieves a row from the index store.

Virtual instance costs. VM$h is the price charged for running a virtual machine for one hour. This price depends on the kind of virtual machine.

Queue service costs. QS$ is the price charged for each request to the queue service API, e.g., send message, receive message, delete message, renew lease etc.

Data transfer. The commercial cloud providers considered in this work do not charge anything for data transferred to or within their cloud infrastructure. However, data transferred out of the cloud incurs a cost: egress$GB is the price charged for transferring 1 GB out.

Concrete AWS costs vary depending on the geographic region where AWS hosts the application. Our experiments took place in the Asia Pacific (Singapore) AWS facility, and the respective prices as of September-October 2012 are collected in Table 3. Note that virtual machine (instance) costs are provided for two kinds of instances, "large" (VM$h,l) and "extra-large" (VM$h,xl).


ST$m,GB = $0.125          IDX$m,GB = $1.14
STput$ = $0.000011        IDXput$ = $0.00000032
STget$ = $0.0000011       IDXget$ = $0.000000032
VM$h,l = $0.34            QS$ = $0.000001
VM$h,xl = $0.68           egress$GB = $0.19

Table 3: AWS Singapore costs as of October 2012.

7.3 Web data management costs
We now show how to compute, based on the data-, index- and query-driven metrics (Section 7.1), together with the cloud service costs (Section 7.2), the costs incurred by our Web data storage architecture in a commercial cloud.

Indexing and storing the data. Given a set of documents D, the cost of uploading it to the file store is:

ud$(D) = STput$ × |D| + QS$ × |D|

Then, the cost of building the index for D by means of the indexing strategy I is:

ci$(D, I) = ud$(D) + IDXput$ × |op(D, I)| + STget$ × |D| + VM$h × tidx(D, I) + QS$ × 2 × |D|

Note that we need two queue service requests for each document: the first obtains the URI of the document that needs to be processed, while the second deletes the message from the queue once the document has been indexed.

The cost of storing D in the file store, and the index structure created for D according to I in the index store, for one month, is:

st$m(D, I) = ST$m,GB × s(D) + IDX$m,GB × s(D, I)

Querying. First, we estimate the cost incurred by the front end for sending a query q and retrieving its results as:

rq$(q) = STget$ + egress$GB × |r(q)| + QS$ × 3

Three queue service requests are issued: the first one sends the query, the second one retrieves the reference to the query results, and the third one deletes the message retrieved by the second request.

The cost of answering a query q without using any index is calculated as follows:

cq$(q, D) = rq$(q) + STget$ × |D| + STput$ + VM$h × pt(q, D) + QS$ × 3

Note that, again, three queue service requests are issued: the first one retrieves the message containing the query, the second one sends the message with the reference to the results for the query, while the third removes the message with the query from the corresponding queue. The same holds for the formula below, which calculates the cost of evaluating a query q over D indexed according to I:

cq$(q, D, I, D_I^q) = rq$(q) + IDXget$ × |op(q, D, I)| + STget$ × |D_I^q| + STput$ + VM$h × ptq(q, D, I, D_I^q) + QS$ × 3
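To make the model executable, the sketch below (ours) encodes the formulas above with the Table 3 prices for a large instance; the arguments of any call stand for the measured metrics (document counts, request counts, result sizes, hours), not values fixed by the model.

final class CostModel {
    // Table 3 prices (AWS Singapore, October 2012), in dollars.
    static final double ST_PUT = 0.000011, ST_GET = 0.0000011;
    static final double IDX_PUT = 0.00000032, IDX_GET = 0.000000032;
    static final double VM_H_L = 0.34, QS = 0.000001, EGRESS_GB = 0.19;

    // ud$(D): uploading |docs| documents to the file store.
    static double upload(long docs) {
        return ST_PUT * docs + QS * docs;
    }

    // ci$(D, I): building the index with |idxPuts| put requests in tIdx hours.
    static double buildIndex(long docs, long idxPuts, double tIdxHours) {
        return upload(docs) + IDX_PUT * idxPuts + ST_GET * docs
                + VM_H_L * tIdxHours + QS * 2 * docs;
    }

    // rq$(q): front-end cost of sending q and fetching resultGB of results.
    static double sendAndRetrieve(double resultGB) {
        return ST_GET + EGRESS_GB * resultGB + QS * 3;
    }

    // cq$(q, D, I, D_I^q): answering q over the dqiDocs documents the index returns.
    static double query(double resultGB, long idxGets, long dqiDocs, double ptqHours) {
        return sendAndRetrieve(resultGB) + IDX_GET * idxGets + ST_GET * dqiDocs
                + ST_PUT + VM_H_L * ptqHours + QS * 3;
    }
}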

This finalizes our monetary cost model for Web data stores in commercial clouds, following the architecture we described in Section 3. Some parameters of the cost model are well known (those determined by the input data size and the provider's cost policy), while others are query- and strategy-dependent (e.g., how many documents match a query etc.). In Section 8 we measure the actual charged costs, where the query- and strategy-dependent parameters are instantiated by concrete operations. These measures allow us to highlight the cost savings brought by our indexing.

Indexing    Average extraction   Average uploading   Total time
strategy    time (hh:mm)         time (hh:mm)        (hh:mm)
LU          0:24                 1:33                2:11
LUP         0:32                 3:47                4:25
LUI         0:41                 2:31                3:22
2LUPI       1:13                 6:30                7:46

Table 4: Indexing times using 8 large (L) instances.

8. EXPERIMENTAL RESULTS
This section describes the experimental environment and the results obtained. Section 8.1 describes the experimental setup, Section 8.2 reports our performance results, and finally, Section 8.3 presents our cost study.

8.1 Experimental setup
Our experiments ran on AWS servers from the Asia Pacific region in September-October 2012. We used the (centralized) Java-based XML query processor developed within our ViP2P project [16], implementing an extension of the algorithm of [9] to our larger subset of XQuery. On this dialect, our experiments have shown that ViP2P's performance is close to (or better than) Saxon-B v9.1 (http://saxon.sourceforge.net/).

We use two types of EC2 instances for running the indexing module and the query processor:

• Large (l), with 7.5 GB of RAM and 2 virtual cores with 2 EC2 Compute Units each.

• Extra large (xl), with 15 GB of RAM and 4 virtual cores with 2 EC2 Compute Units each.

An EC2 Compute Unit is equivalent to the CPU capacity of a 1.0-1.2 GHz 2007 Xeon processor.

To test index selectivity, we needed an XML corpus with some heterogeneity. We generated XMark [24] documents (20000 documents in all, adding up to 40 GB), using the split option provided by the data generator (http://www.xml-benchmark.org/faq.txt). We modified a fraction of the documents to alter their path structure (while preserving their labels), and modified another fraction to make them "more" heterogeneous than the original documents, by rendering more elements optional children of their parents, whereas they were compulsory in XMark.

8.2 Performance study
XML indexing. To test the performance of index creation, the documents were initially stored in S3, from which they were gathered in batches by multiple l instances running the indexing module. We batched the documents in order to minimize the number of calls needed to load the index into DynamoDB. Moreover, we used l instances because, in our configuration, DynamoDB was the bottleneck while indexing; thus, using more powerful xl instances could not have increased the throughput.


Figure 7: Indexing in 8 large (L) EC2 instances (indexing time in seconds vs. document size in GB, for LU, LUP, LUI and 2LUPI).

Figure 8: Index size and storage costs per month with full-text indexing (top) and without (bottom); for each strategy, size in GB and cost in $/month are decomposed into XML data size, index content, and DynamoDB overhead data.

Table 4 shows the time spent extracting index entries on 8 l EC2 instances and uploading the index to DynamoDB using each proposed strategy. We show the average time spent by each EC2 machine to extract the entries, the average time spent by DynamoDB to load the index data, and the total observed time (elapsed between the beginning and the end of the overall indexing process). As expected, the more and the larger the entries a strategy produces, the longer indexing takes. Next, Figure 7 shows that indexing time scales well, linearly in the size of the data, for each strategy.

A different perspective on indexing is given in Figure 8, which shows the size of the index entries compared with the original XML size. In addition to the full-text index sizes, the figure includes the size of each strategy's index if keywords are not stored. As expected, the indexes of the latter variants are considerably smaller than the full-text ones. LUP and 2LUPI are the largest indexes; in particular, if we index the keywords, the index is considerably larger than the data. The LUI index is smaller than the LUP one because IDs are more compact than paths; moreover, we exploit the fact that DynamoDB allows storing arbitrary binary objects to store compressed (encoded) sets of IDs in a single DynamoDB value. Finally, the DynamoDB space overhead (Section 7) is noticeable, especially if keywords are not indexed, but in both variants it grows more slowly than the index size.

XML query processing. We now study the query processing performance, using 10 queries from the XMark benchmark; they can be found in [25]. The queries have an average of ten nodes each; the last three queries feature value joins.

Table 5 shows, for each query and indexing strategy, the number of documents retrieved by index look-up, the number of documents which actually contain query results, and the result size for each query. (These are obviously independent of the platform, and we provide them only to help interpret our next results.)

Query   # Doc. IDs from index         # Docs.      Results
        LU     LUP    LUI    2LUPI    w. results   size (KB)
q1      3      2      1      1        1            0.04
q2      523    349    349    349      349          94000.00
q3      144    66     33     33       33           52400.00
q4      1089   1089   775    775      775          519.20
q5      1115   740    370    370      370          7500.00
q6      285    283    283    283      283          278.20
q7      285    283    142    142      142          96.20
q8      1400   1025   882    882      507          13800.00
q9      1115   740    740    740      740          338800.00
q10     1400   1025   512    512      116          9.10

Table 5: Query processing details (20000 documents).

Table 5 shows that LUI and 2LUPI are exact for queries q1-q7, which are tree pattern queries (the look-up in the index returns no false positives in these cases). The imprecision of LU and LUP varies across the queries, but it may reach 200% (twice as many false positives as there are documents with results), even for tree pattern queries like q5. For the last three queries, featuring value joins, even LUI and 2LUPI may bring false positives. For these queries, Table 5 sums the numbers of document IDs retrieved for each tree pattern.

The response times (perceived by the user) for each query, using each indexing strategy and also without any index, are shown in Figure 9a; note the logarithmic y axis. We have evaluated the workload using l and then, separately, xl EC2 instances. We see that all indexes considerably speed up each query, by one or two orders of magnitude in most cases. Figure 9a also demonstrates that our strategies are able to take advantage of more powerful EC2 instances: for every query, the xl running times are shorter than the times on an l instance. The strategy with the shortest evaluation time is LUP, which strikes a good balance between precision and efficiency; most of the time, LU is next, followed by LUI and 2LUPI (recall again that the y axis is log-scale). The difference between the slowest and fastest strategy is a factor of at most 4, whereas the difference between the fastest index and no index is at least a factor of 20.

The charts in Figures 9b and 9c provide more insight. They split query processing time into: the time to consult the index (DynamoDB get), the time to run the physical plan identifying the relevant document URIs out of the data retrieved from the index, and the time to fetch the documents from S3 into EC2 and evaluate the queries there. Importantly, since we make use of the multi-core capabilities of EC2 virtual machines, the times individually reported in Figures 9b and 9c were in fact measured in parallel. In other words, the overall response time observed and reported in Figure 9a is systematically less than the sum of the detailed times reported in Figures 9b and 9c.

We see that the low-granularity indexing strategies (LU and LUP) have systematically shorter index look-up and index post-processing times than the fine-granularity ones (LUI and 2LUPI). The time to transfer the relevant documents to EC2 and evaluate the queries there is proportional to the number of documents retrieved from the index look-ups (these numbers are provided in Table 5). For a given query, the document transfer plus query evaluation time differs between the strategies by a factor of up to 3, corresponding to the number of documents retrieved.

Impact of parallelism. Figure 10 shows how the query

Page 10: Web data indexing in the cloud: efficiency and cost …openproceedings.org/2013/conf/edbt/Camacho-RodriguezCM13.pdfWeb Data Indexing in the Cloud: Efficiency and Cost Reductions Jesús

[Figure 9: Response time (top) and details (middle and bottom) for each query and indexing strategy.
(a) Response time in seconds (log scale) using large (l) and extra large (xl) EC2 instances on the 40 GB database, per query q1-q10, for No Index, LU, LUP, LUI and 2LUPI.
(b) Detail, large (l) EC2 instance on the 40 GB database, per query and strategy, split into: Lookup - DynamoDB Get, Lookup - Plan execution, S3 documents transfer and results extraction.
(c) Detail, extra large (xl) EC2 instance on the 40 GB database, same split as (b).]

[Figure 10: Impact of using multiple EC2 instances. Response time in seconds of the full workload for LU, LUP, LUI and 2LUPI on l and xl instances, comparing 1 instance vs. 8 instances.]

Impact of parallelism. Figure 10 shows how the query response time varies when running multiple EC2 query processing instances. To this end, we sent to the front-end all our workload queries, successively, 16 times: q1, q2, ..., q10, q1, q2, ... etc. We report the running times on a single EC2 instance (no parallelism) versus the running times on eight EC2 instances in parallel. We can see that more instances significantly reduce the running time, more so for l instances than for xl ones: this is because many powerful instances sending index look-up requests in parallel come close to saturating DynamoDB's capacity to absorb them.
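One plausible way to implement this dispatch is sketched below, under the assumption that the front-end and the query processing instances coordinate through an SQS queue (the queue name query-queue and the process_query function are hypothetical; the paper does not spell out its exact coordination protocol):

    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.get_queue_url(QueueName="query-queue")["QueueUrl"]

    def front_end():
        # Enqueue the whole workload 16 times: q1, q2, ..., q10, q1, ...
        for _ in range(16):
            for i in range(1, 11):
                sqs.send_message(QueueUrl=queue_url, MessageBody="q%d" % i)

    def worker(process_query):
        # Each EC2 instance runs this loop; with 8 instances, 8 workers
        # consume queries from the same queue in parallel.
        while True:
            resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                process_query(msg["Body"])
                sqs.delete_message(QueueUrl=queue_url,
                                   ReceiptHandle=msg["ReceiptHandle"])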

8.3 Amazon charged costs

We now study the costs charged by AWS for indexing the data, and for answering queries. We also consider the amortization of the index, i.e., when the query cost savings brought by the index balance the cost of the index itself.

Indexing cost. Table 6 shows the monetary costs for indexing data according to each strategy, broken down across the specific AWS services. The most costly index to build is 2LUPI, while the cheapest is LU.

Indexing strategy   DynamoDB   EC2      S3 + SQS   Total
LU                  $21.15     $5.47    $0.02      $26.64
LUP                 $44.78     $11.95   $0.02      $56.75
LUI                 $33.47     $8.95    $0.02      $42.44
2LUPI               $78.25     $21.17   $0.02      $99.44

Table 6: Indexing costs for 40 GB using L instances.

The combined price for S3 and SQS is constant across strategies, and is negligible compared to EC2 costs. In turn, the DynamoDB cost dominates the EC2 cost in all strategies.

Finally, Figure 8 shows the storage cost per month of each index, which is proportional to its size in DynamoDB.

Query processing cost. Figure 11 shows the cost of answering each query when using no index, and when using the different indexing strategies. Note that with indexes, the cost is practically independent of the machine type. This is because (i) the hourly cost of an xl machine is double that of an l machine, but at the same time (ii) the four cores of an xl machine allow processing queries simultaneously on twice as many documents as the two cores of an l machine, so that the cost differences pretty much cancel each other out (whereas the time differences are noticeable). Figure 11 also shows that indexing significantly reduces monetary costs compared to the case where no index is used; the savings vary between 92% and 97%.
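As a back-of-the-envelope check of (i) and (ii): writing p for the hourly price of an l instance and t for the time a query keeps its two cores busy,

    cost_l(q) = p · t        cost_xl(q) = (2p) · (t/2) = p · t

so the doubled hourly price and the halved processing time cancel out in the charged cost, while the response time is still halved.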

To better understand the monetary costs shown in Figure 11, Figure 12 details the cost of evaluating the query workload on an xl instance, again decomposed across the services we use; to these we add AWSDown, the price charged by Amazon for transferring query results out of AWS.


[Figure 11: Query processing costs decomposition. Cost in $ per query q1-q10, on l and xl instances, for No Index, LU, LUP, LUI and 2LUPI.]

[Figure 12: Workload evaluation cost details on an extra large (XL) instance. The per-service costs recovered from the figure are as follows (the No Index configuration incurs no DynamoDB cost):]

Strategy    DynamoDB    S3          EC2         SQS         AWSDown
No Index    --          $0.22012    $6.38106    $0.00006    $0.09194
LU          $0.00031    $0.00822    $0.15518    $0.00006    $0.09194
LUP         $0.00103    $0.00628    $0.13291    $0.00006    $0.09194
LUI         $0.12960    $0.00377    $0.27844    $0.00006    $0.09194
2LUPI       $0.14003    $0.00377    $0.28215    $0.00006    $0.09194

[Figure 13: Index cost amortization for a single extra large (XL) EC2 instance. The y axis plots #runs × benefit(I, W) − buildingCost(I) against the number of workload runs (0 to 20), with one curve per strategy (LU, LUP, LUI, 2LUPI).]

The AWSDown cost is the same for all strategies, since the same results are obtained. The S3 cost is proportional to the selectivity of the index strategy (recall Table 5). DynamoDB costs reflect the amount of data extracted from the index by each strategy. Finally, the EC2 cost is proportional to the time necessary to answer the workload using each strategy: the shorter it takes to answer a query, the lower its cost. For every strategy, the cost of using EC2 clearly dominates, which is expected and desired, since this is the time actually spent processing the query.

Amortization of the index costs. We now study how indexing pays off when evaluating queries. For an indexing strategy I and workload W, we term benefit of I for W the difference between the monetary cost to answer W using no index, and the cost to answer W based on the index built according to I. At each run of W, we "save" this benefit, whereas we had to pay a certain cost to build I. (The index costs and benefits also depend on the data set, and increase with its size.)
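Stated as a formula (with buildingCost(I) denoting the one-time construction cost of Table 6):

    benefit(I, W) = cost(W, no index) − cost(W, I)

and I is amortized after n runs of W as soon as

    n × benefit(I, W) − buildingCost(I) ≥ 0

which is exactly the quantity plotted on the y axis of Figure 13.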

Indexing    Indexing speed            Indexing cost
strategy    (ms/MB of XML data)       ($/MB of XML data)
            [8]       This work       [8]       This work
LU          7491      196             0.019     0.00067
LUP         8335      398             0.057     0.00142
LUI         12447     302             0.021     0.00106
2LUPI       11265     699             0.070     0.00249

Monthly storage cost ($/GB of XML data)
Index, [8]   Index, this work   Data, [8] and this work
0.275        1.14               0.125

Table 7: Indexing comparison.

Indexing    Query speed               Query costs
strategy    (ms/MB of XML data)       ($/MB of XML data)
            [8]      This work        [8]          This work
LU          141      21               4.7 × 10⁻⁵   0.6 × 10⁻⁵
LUP         121      18               4.2 × 10⁻⁵   0.6 × 10⁻⁵
LUI         186      37               5.6 × 10⁻⁵   1.3 × 10⁻⁵
2LUPI       164      37               5.4 × 10⁻⁵   1.3 × 10⁻⁵

Table 8: Query processing comparison.

Figure 13 shows when the cumulated benefit (over several runs of the workload on an l instance) outweighs the index building cost (similarly, on a single l instance, recall Table 6). The figure shows that any of our strategies allows recovering the index creation cost quite fast: just running the workload 4 times for LU, 8 times for LUP and LUI, and 16 times for 2LUPI, respectively. (The cost is recovered when the curves cross the Y = 0 axis.)

8.4 Comparison with previous works

The closest related works are [6, 19] and our previous work [8], which build database services on top of a commercial cloud, in particular AWS.

The focus in [6] was on implementing transactions in the cloud with various consistency models; they present experiments on the TPC-W relational benchmark, using 10,000 products (thus, a database of 315 MB, about 125 times smaller than ours). The setting is thus quite different, but a rough comparison can still be made.

At that time, Amazon did not provide a key-value store, therefore the authors built B+ tree indexes within S3. Each transaction in [6] retrieves one customer record, searches for six products, and orders three of them. These are all selective (point) queries (and updates); thus, among our queries, they only compare with q1, the only one which can be assimilated to a point query (a very selective path matched in few documents, recall Table 5). For a transaction, they report running times between 2.8 and 11.3 seconds, while individual queries/updates are said to last less than a second. Our q1, running in 0.5 seconds (using one instance) on our 40 GB database, is thus quite competitive. Moreover, [6] reports transaction costs between $1.5 × 10⁻⁴ and $2.9 × 10⁻³, in the same range as our $1.2 × 10⁻⁴ cost of q1 using LUP.


Next, we compare with our previous work [8], evaluated on only 1 GB of XML data. From a performance perspective, the main difference is that [8] stores the index within AWS' previous key-value store, namely SimpleDB. Table 7 compares this work with [8] from the perspective of indexing, and Table 8 from that of querying; for a fair comparison, we report measures per MB (or GB) of data.

The tables show that the present work speeds up indexing by one to two orders of magnitude, while indexing costs are reduced by two to three orders of magnitude; querying is faster (and query costs lower) by a factor of roughly five with respect to [8]. The reason is that DynamoDB allows storing arbitrary binary objects as values, a feature we exploited in order to efficiently encode our index data. Moreover, DynamoDB has a shorter response time and can handle more concurrent requests than SimpleDB.
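As an illustration of this binary-value feature, the sketch below (again Python/boto3, for exposition only) packs a posting list of integer document IDs into a single fixed-width binary attribute; the table and attribute names and the 32-bit big-endian encoding are assumptions, not the paper's actual index format.

    import struct
    import boto3

    index = boto3.resource("dynamodb").Table("xml-index-lu")  # hypothetical

    def put_postings(path, doc_ids):
        # Pack the sorted document IDs as big-endian 32-bit unsigned
        # integers into one binary value, rather than storing one
        # string-typed attribute (or item) per ID.
        blob = struct.pack(">%dI" % len(doc_ids), *sorted(doc_ids))
        index.put_item(Item={"path": path, "postings": blob})

    def get_postings(path):
        item = index.get_item(Key={"path": path})["Item"]
        blob = item["postings"].value  # boto3 wraps binary values in Binary
        return list(struct.unpack(">%dI" % (len(blob) // 4), blob))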

The authors of [6] subsequently also ported their index to SimpleDB [19]. Still on TPC-W transactions over a 10,000-item database, processing with a SimpleDB index was moderately faster (by a factor of less than 2) than using their previous S3-based one. As we have seen above, DynamoDB significantly outperforms SimpleDB for indexing.

8.5 Experiments conclusion

Our experiments demonstrate the feasibility and interest of our architecture based on a commercial cloud, using a distributed file system to store XML data, a key-value store to store the index, and the cloud's processing engines to index the data and process queries. All our indexing strategies have been shown to reduce query response time and monetary cost, by two orders of magnitude in our experiments; moreover, our architecture is capable of scaling up as more instances are added. The monetary costs of query processing are quite competitive compared with previous similar works [6, 8], and we have shown that the overhead of building and maintaining the index is modest, and quickly offset by the cost savings due to the ability to narrow the query to only a subset of the documents. In our tests, the LUP indexing strategy allowed for the most efficient query processing, at the expense of an index size somewhat larger than the data. Further compression of the paths in the LUP index could probably make it even more competitive.

In our experiments, query execution based on the LU and LUP strategies is always faster than using the LUI and 2LUPI strategies. We believe that the cases in which the LUI and 2LUPI strategies behave better are those where query tree patterns are multi-branched, highly selective, and evaluated over a document set where most of the documents only match linear paths of the query. Such cases can be statically detected by using data summaries and some statistical information. We postpone this study to future work.

9. CONCLUSION

We have presented an architecture for building scalable XML warehouses by means of commercial cloud resources, which can exploit parallelism to speed up index building and query processing. We have investigated and compared through experiments several indexing strategies and shown that they achieve query processing speed-ups and monetary cost reductions of several orders of magnitude within AWS.

Our future work includes the development of a platform and index advisor tool which, based on the expected dataset and workload, estimates an application's performance and cost and picks the best indexing strategy to use.

10. REFERENCES

[1] S. Abiteboul, S. Cluet, G. Ferran, and M.-C. Rousset. The Xyleme project. Computer Networks, 39(3), 2002.
[2] S. Abiteboul, I. Manolescu, N. Polyzotis, N. Preda, and C. Sun. XML processing in DHT networks. In ICDE, 2008.
[3] S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, 2002.
[4] A. Aranda-Andújar, F. Bugiotti, J. Camacho-Rodríguez, D. Colazzo, F. Goasdoué, Z. Kaoudi, and I. Manolescu. AMADA: Web Data Repositories in the Amazon Cloud (demo). In CIKM, 2012.
[5] P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manne, L. Novik, and T. Talius. Adapting Microsoft SQL Server for cloud computing. In ICDE, 2011.
[6] M. Brantner, D. Florescu, D. A. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD, 2008.
[7] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In SIGMOD, 2002.
[8] J. Camacho-Rodríguez, D. Colazzo, and I. Manolescu. Building Large XML Stores in the Amazon Cloud. In DMC Workshop (collocated with ICDE), 2012.
[9] Y. Chen, S. B. Davidson, and Y. Zheng. An Efficient XPath Query Processor for XML Streams. In ICDE, 2006.
[10] H. Choi, K.-H. Lee, S.-H. Kim, Y.-J. Lee, and B. Moon. HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries (demo). In CIKM, 2012.
[11] L. Fegaras, C. Li, U. Gupta, and J. Philip. XML Query Optimization in Map-Reduce. In WebDB, 2011.
[12] L. Galanis, Y. Wang, S. Jeffery, and D. DeWitt. Locating data sources in large distributed systems. In VLDB, 2003.
[13] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB, 1997.
[14] V. Kantere, D. Dash, G. Francois, S. Kyriakopoulou, and A. Ailamaki. Optimal service pricing for a cloud cache. IEEE Trans. Knowl. Data Eng., 23(9), 2011.
[15] V. Kantere, D. Dash, G. Gratsias, and A. Ailamaki. Predicting cost amortization for query services. In SIGMOD, 2011.
[16] K. Karanasos, A. Katsifodimos, I. Manolescu, and S. Zoupanos. The ViP2P Platform: XML Views in P2P. In ICWE, 2012.
[17] R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering indexes for branching path queries. In SIGMOD, 2002.
[18] G. Koloniari and E. Pitoura. Peer-to-peer management of XML data: issues and research challenges. SIGMOD Record, 34(2), 2005.
[19] D. Kossmann, T. Kraska, and S. Loesing. An evaluation of alternative architectures for transaction processing in the cloud. In SIGMOD, 2010.
[20] P. Leach, M. Mealling, and R. Salz. A Universally Unique IDentifier (UUID) URN Namespace. IETF RFC 4122, July 2005.
[21] I. Manolescu, K. Karanasos, V. Vassalos, and S. Zoupanos. Efficient XQuery rewriting using multiple views. In ICDE, 2011.
[22] T. Milo and D. Suciu. Index structures for path expressions. In ICDT, 1999.
[23] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1), 2010.
[24] A. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu, and R. Busse. XMark: A benchmark for XML data management. In VLDB, 2002.
[25] Technical report (extended version of this work), 2013. http://www.lri.fr/~camacho/_media/xml-aws/tech.pdf
