Efficient Execution of Conjunctive Complex Queries on Big Multimedia Databases

Karina Fasolin∗, Renato Fileto∗, Marcelo Kruger†, Daniel S. Kaster‡, Mônica R. P. Ferreira§, Robson L. F. Cordeiro§, Agma J. M. Traina§, Caetano Traina Jr.§

∗PPGCC/INE - CTC, Federal University of Santa Catarina, Florianópolis, SC, Brazil – Email: {kfasolin, r.fileto}@ufsc.br
†CCTMar, Itajaí Valley University, Florianópolis, SC, Brazil – Email: marcelo [email protected]
‡Computer Science Department, State University of Londrina, PR, Brazil – Email: [email protected]

§Computer Science Dep., Univ. of São Paulo at São Carlos, Brazil – Email: {monika, robson, caetano, agma}@icmc.usp.br

Abstract—This paper proposes an approach to efficiently execute conjunctive queries on big complex data together with their related conventional data. The basic idea is to horizontally fragment the database according to criteria frequently used in query predicates. The collection of fragments is indexed to efficiently find the fragment(s) whose contents satisfy some query predicate(s). The contents of each fragment are then indexed as well, to support efficient filtering of the fragment data according to other query predicate(s) conjunctively connected to the former. This strategy has been applied to a collection of more than 106 million images together with their related conventional data. Experimental results show considerable performance gain of the proposed approach for queries with conventional and similarity-based predicates, compared to the use of a unique metric index for the entire database contents.

I. INTRODUCTION

Large amounts of complex data [1] (e.g., image, sound, video) are collected and organized in databases every day. These complex data can come in various formats and cannot be ordered by their content in the same way as conventional data (those represented as strings or numbers). This scenario, combined with the huge size and considerable growth of some complex data collections, poses new challenges for information retrieval (IR) [2]–[4].

Complex data can be collected or enriched with conventional data (e.g., metadata such as tags, titles, upload time) that help to describe their contents. Huge collections of complex data with related conventional data have been gathered by a myriad of information systems (e.g., shared photo databases on the Web, medical databases). Such databases can be queried via their conventional data and/or by using some similarity measure of their complex data contents [5]–[7]. Nevertheless, neither of these strategies alone is enough to produce high-quality results in many situations. Thus, several techniques have been proposed to combine these strategies [7]–[12].

Complex queries [13], [14] logically combine traditional predicates (e.g., equality, inequality) on conventional data with similarity-based predicates on conventional or unconventional data. However, current IR systems are not able to efficiently execute such queries on large complex databases [4].

This paper proposes an approach to efficiently execute conjunctive complex queries on huge collections of complex data and related conventional data. The central idea is to

horizontally fragment the database according to criteria frequently used in query predicates (e.g., the tag values). The collection of database fragments is indexed to support efficient identification of the fragment(s) containing data that satisfy particular query predicates (e.g., tag = “dog”). Then, the data of each fragment can be indexed according to other criteria (e.g., some similarity measure of the complex data contents) to support further efficient data filtering according to other predicate(s) of conjunctive queries.

The database fragments can be built and indexed to support efficient execution of queries with multiple conjunctive predicates. Multi-level indexing structures (e.g., B-trees whose entries for particular indexing values point to indexes based on other criteria) enable efficient processing of some conjunctive complex queries. The fragments and access methods are transparent to the user, who poses the queries to a database view that integrates the fragments. The following examples show some queries that can be executed with much better performance by using the proposed approach, based on the combination of access methods, instead of using a single index for the whole database contents.

A. Motivating Examples

Q1: Retrieve the k images most similar to q(im1).

Assume that the image q(im1) is given as the query center and presents a human face. Using this information extracted from q(im1) itself (by a face detection algorithm, for example), an IR system can infer that the user who posed this query is probably interested in retrieving data about people whose face is similar to the one present in q(im1). Generic image descriptors (e.g., color, texture) perform sub-optimally in this case. For retrieving images of similar human faces, it is better to use descriptors and similarity measures designed for this specific task, on horizontal fragment(s) of the database containing images of human faces.

Q2: Retrieve the k images most similar to a given image q(im2), and annotated with the tag “lost dog”.

The contextual information in this query is the tag “lost dog”. It can be used to find an appropriate fragment (whose tuples are tagged with “lost dog”), and to process the similarity-based predicate only on the contents of this fragment. If there is no fragment for this tag value, other predicates

may be used (e.g., whose tuples are tagged just with “dog”). In this case, precision is compromised in favor of recall.

We believe that dividing the database into horizontal fragments, according to predicates typically used in conjunctive complex queries, has the following advantages: (i) it speeds up query execution; (ii) it allows better scalability for processing conjunctive complex queries on very large databases; and (iii) it improves the quality of results by using access methods tailored for the contents of specific database fragments and particular categories of IR predicates. In this paper, we show how to achieve goals (i) and (ii), for a real-world database with millions of images and related metadata, and conjunctive queries composed of a predicate based on tag values and another one based on content similarity.

This paper is organized as follows. Section II presents some foundations and related work. Section III describes the proposed approach to efficiently execute conjunctive complex queries on horizontal fragments of databases with complex data and related conventional data. Section IV presents the system architecture and tools used to implement this approach in a prototype. Section V reports and discusses experimental results. Finally, Section VI closes the paper with conclusions and directions for future work.

II. FOUNDATIONS AND RELATED WORK

This section presents some fundamentals used in this paper. It describes how complex data and their related conventional data can be stored in relational databases, how complex data contents can be compared by using similarity metrics, and how complex queries can be expressed as compositions of conventional and similarity-based predicates. It closes by discussing some related work.

A. Complex Relation

A complex relation ℜ(S,A) has a set of complex attributes S = {S1, S2, . . . , Sm} and a set of conventional attributes A = {A1, A2, . . . , An}. Each complex tuple t ∈ ℜ(S,A) relates complex and conventional attribute values. For example, a complex tuple can relate an image or a series of images with conventional data, such as associated tags, title, description, upload time, and the geographic position where the image(s) has(have) been taken.

Queries can be specified on a complex relation by using both conventional and similarity-based predicates. Conventional predicates are used to compare values of conventional attributes with constants and/or among themselves (e.g., to retrieve the tuples with a given tag value). Complex data (e.g., images), on the other hand, usually cannot be compared by equality (=, ≠), inequality (<, ≤, ≥, >) or even spatial containment predicates (e.g., INSIDE). Complex data are usually compared by similarity. The notion of similarity used for IR can vary with the kind of data and the application. Different descriptors (e.g., color, texture, shape) can be extracted from complex data and represented as vectors of varying sizes. These descriptors and/or conventional data can be compared by using a variety of similarity or dissimilarity measures.

B. Dissimilarity Metrics

Dissimilarity functions measure the distance between objects. The greater the distance between two objects, the less similar they are. Given a data domain D¹, a dissimilarity function δ : D × D → ℝ+ is called a metric iff it satisfies the following properties [5], ∀x, y, z ∈ D:

1) Symmetry: δ(x, y) = δ(y, x);
2) Non-negativity: δ(x, y) ≥ 0;
3) Identity: δ(x, x) = 0; and
4) Triangular inequality: δ(x, z) ≤ δ(x, y) + δ(y, z).

There are several distance metrics that can be used for information retrieval. Some of the simplest, fastest to calculate, and most used metrics to compare general complex data (e.g., collections of images of varied themes) are those from the Minkowski family (Lp) [5]. This set of metrics can be defined as stated in Equation 1.

L_p(x, y) = \sqrt[p]{\sum_{i=1}^{d} |x_i - y_i|^p}     (1)

where d is the dimension of the space and each value p = 1, 2, . . . , ∞ defines a metric of this family, such as L1 (Manhattan) for p = 1, L2 (Euclidean) for p = 2, and L∞ (Chebyshev) for p = ∞.
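To make the notation concrete, the following Python sketch (ours, purely for illustration; the prototype described later does not use Python) computes the three Minkowski metrics named above:

def minkowski(x, y, p):
    # L_p distance between two equal-length feature vectors.
    # p = 1: Manhattan; p = 2: Euclidean; p = float('inf'): Chebyshev.
    if p == float('inf'):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Two toy 4-dimensional descriptors:
u, v = [1.0, 0.0, 2.0, 3.0], [0.0, 1.0, 2.0, 1.0]
print(minkowski(u, v, 1))             # 4.0 (Manhattan, L1)
print(minkowski(u, v, 2))             # ~2.449 (Euclidean, L2)
print(minkowski(u, v, float('inf')))  # 2.0 (Chebyshev, L-infinity)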

Other examples of distance metrics useful for IR are Mahalanobis [5], Canberra [15] and Kullback-Leibler [16]. Different metrics perform more or less accurately for different datasets, features extracted from the data, and categories of queries [17].

A metric space [18], [19] is a pair M = (D, δ), where D denotes the universe of valid elements and δ is a metric. The metric space (D, δ) allows comparing tuples of complex and/or conventional objects from the collection D by using the metric δ. When the compared objects are vectors of numeric values in a d-dimensional space, with a metric distance defined, we have a particular case of the metric space that is called a vector space. Vector and metric spaces can be indexed with multidimensional and metric data structures, respectively, to speed up the execution of queries based on spatial and/or similarity-based predicates [5], [20].

C. Similarity-based Query Operators

In similarity-based IR, one provides a reference query object, named the query center, to retrieve similar objects from the database. The most used operators to specify similarity-based queries are:

• Range Query (Rangeq): Given a query center sq ∈ D and a maximum distance ξ ∈ ℝ+, the range query Rangeq(sq, ξ) retrieves all objects s ∈ D such that δ(s, sq) ≤ ξ, i.e., all the objects of the database that are within a distance of at most ξ from sq.

¹For comparing data of a complex relation ℜ(S,A), consider D = ΠX(dom(S) × dom(A)), where X ⊆ (S ∪ A), and dom(S) and dom(A) are the domains of the sets of attributes S and A, respectively.


• k-Nearest Neighbor Query (kNNq): Given a query center sq ∈ D and a natural number k, the kNN query kNNq(sq, k) retrieves the k objects from D that are the closest ones to sq, i.e., D′ = {si ∈ D | ∀sj ∈ (D − D′), |D′| = k, δ(sq, si) ≤ δ(sq, sj)}.
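As a reference point for the semantics of both operators, here is a naive linear-scan implementation in Python (a sketch only; the system described in this paper answers these queries through a metric access method rather than a full scan):

import heapq

def range_query(data, s_q, xi, delta):
    # Range_q(s_q, xi): every object within distance xi of the center.
    return [s for s in data if delta(s, s_q) <= xi]

def knn_query(data, s_q, k, delta):
    # kNN_q(s_q, k): the k objects closest to the center.
    return heapq.nsmallest(k, data, key=lambda s: delta(s, s_q))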

D. Conjunctive Complex Queries

Given a complex relation ℜ(S,A), as described in Section II-A, a Conjunctive Complex Query on ℜ(S,A) is expressed as a conjunction of l predicates φX1 ∧ φX2 ∧ . . . ∧ φXl. Each predicate φXi : ℜ(S,A) → {TRUE, FALSE} receives a tuple t ∈ ℜ(S,A) and returns TRUE or FALSE depending on the values t[X] of the attributes X ⊆ S ∪ A in the tuple t. Such a predicate can use equalities, inequalities, spatial, and similarity-based query operators.

In this paper, we only consider conjunctive complex queries with two atomic predicates: (i) an equality of a conventional attribute with a constant (e.g., tag = “lost dog”, tag = “dog”), and (ii) a similarity-based query operator (e.g., Rangeq or kNNq). For instance, the following conjunctive complex query searches tuples of the complex relation PhotoSharingData associated with the tag "dog", and whose contents of the image attribute are at a distance of at most 5 from the given image img1.jpg. This query uses the extended SQL syntax proposed by [14] and the "ScalableColor" similarity, which is defined by the scalable color descriptor [21] and the L1 (Manhattan) metric.

SELECT R.*
FROM PhotoSharingData R NAT JOIN Tag
WHERE Tag.value = "dog" AND
      R.picture NEAR "D:/images/img1.jpg"
      BY ScalableColor RANGE 5;

E. Related Work

Several methods have been proposed for efficient IR by exploiting various properties of the datasets and posed queries [2]–[4], [7], [22], and by investigating appropriate data descriptors and access methods for distinct situations [6], [7], [23]–[25]. The combination of content and textual similarity has also been investigated to improve IR on complex data [26]–[28]. In addition, some works use parallelism and techniques such as MapReduce to speed up IR on large datasets [29]–[31].

Our proposal combines several ideas of these previous proposals to improve the performance of IR systems for big collections of complex data with their related conventional data. To the best of our knowledge, it is the first proposal to take advantage of horizontal fragments defined in conformance with typical query predicates to speed up query execution, and to enable the customization of IR techniques according to the contents of each fragment of a possibly huge complex data collection with heterogeneous contents.

III. PROCESSING CONJUNCTIVE COMPLEX QUERIES USING HORIZONTAL FRAGMENTS OF A DATABASE

Our proposal to speed up the execution of conjunctive complex queries uses horizontal fragments of complex relations. These fragments are built in accordance with predicates that are common in queries. For example, consider a database

with photos from different sources and mixed themes (such as cities, homes, offices, landscapes, flowers, trees, animals, people, food, etc.). These photos can be organized in groups, so that a search for some specific photo does not need to consider all the elements in the database. Instead, a more efficient approach is to identify the appropriate group(s) to solve some query predicate(s) and process the search considering only the contents of those group(s).

Figure 1 illustrates the proposed approach. Suppose that a database is divided into four fragments, according to the subjects Human Faces, Cars, Dogs, and Cats. Considering the examples presented in Section I-A, query Q1 can be solved by just checking the contents of the fragment of Human Faces, while query Q2 can be solved in the fragment of Dogs. Conjunctive complex queries on big databases can be solved much more efficiently by accessing only fragments having data that satisfy some of their predicates, instead of searching the whole database. Furthermore, the data in each fragment can be examined by using a distinct access method, tailored for the contents of the respective fragment. This can help to obtain more accurate query results too.

Fig. 1. Query execution strategies for queries Q1 and Q2.

The major challenges to implement the proposed strategy are: (i) partitioning the database into suitable horizontal fragments to support query execution; (ii) devising efficient ways to identify suitable fragments to solve particular query predicates; (iii) properly indexing the contents of each fragment whose size requires efficient access methods to support the verification of further query predicates; and (iv) developing smart strategies for optimized query processing by identifying and searching appropriate horizontal database fragments. The following subsections describe each of these sub-problems in detail.

A. Creating Horizontal Fragments

The tuples of a complex relation ℜ(S,A) can be fragmented for information retrieval purposes by using a wide variety of methods. The proposed approach allows any complex relation fragmentation function of the form:

H : ℜ(S,A) → 2^(2^ℜ(S,A) − ∅)


The fragmentation function H takes as input a complex relation ℜ(S,A) and outputs a set H(ℜ(S,A)) of horizontal fragments, i.e., subsets of the tuples in ℜ(S,A), such that:

1) |H(ℜ(S,A))| ≥ 1, i.e., H(ℜ(S,A)) has at least one horizontal fragment.
2) ∀ℑ(S,A) ∈ H(ℜ(S,A)) : |ℑ(S,A)| ≥ 1, i.e., each horizontal fragment ℑ(S,A) has at least one tuple.
3) Each fragment ℑ(S,A) ∈ H(ℜ(S,A)) has the same schema as ℜ(S,A) and contains a subset of its tuples.

Fragmentation functions can also be based on subsets of the attributes only. For example, a complex relation fragmentation function HX(ℜ(S,A)) generates subsets of ℜ(S,A) by checking only the values of the projection ΠX(ℜ(S,A)). If X ⊆ A then we say that HX(ℜ(S,A)) is based on the projection of conventional attributes, and if X ⊆ S we say that it is based on the projection of complex attributes.

Notice that we allow one tuple t ∈ ℜ(S,A) to appear in more than one fragment ℑ(S,A) ∈ H(ℜ(S,A)). This is allowed because the contents of any tuple may be of interest for different IR purposes. In other words, even when two fragments ℑ, ℑ′ ∈ H(ℜ(S,A)) refer to distinct data groups, sometimes they overlap, i.e., there are some tuples t ∈ ℜ such that t ∈ ℑ and t ∈ ℑ′, enabling their retrieval according to different points of view. For instance, pictures of beaches may be relevant to different kinds of people (fishermen, surfers, travelers, geologists, oceanographers, etc.). These communities can have distinct interests and use different notions of similarity, which use different features to compare data contents, as the same tuple may be of interest to people from different communities, though for distinct reasons (e.g., the fishermen may be interested in a particular texture caused by fish close to the water surface, the surfers may be looking for waves with a particular shape, while some ordinary travelers may just admire the pristine water).

Conversely, the fragmentation process can leave some tuples t ∈ ℜ out of any fragment ℑ ∈ H(ℜ), i.e., ∃t ∈ ℜ : (∀ℑ ∈ H(ℜ) : t ∉ ℑ). It may happen, for example, if t is an outlier with respect to the criteria considered in H to fragment ℜ and/or if t is not of interest to the IR focus of any ℑ ∈ H.
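A minimal sketch of one such fragmentation function, assuming each tuple is a Python dictionary whose 'tags' field holds a set of tag values (the field names are hypothetical, not taken from the prototype). Overlapping fragments arise naturally because a tuple is placed under every tag it carries, and tuples with no tag in the chosen vocabulary are left out of all fragments, matching the two observations above.

from collections import defaultdict

def fragment_by_tags(relation, tag_vocabulary):
    # H_X based on the projection of the conventional attribute 'tags':
    # returns a mapping tag value -> horizontal fragment (list of tuples).
    fragments = defaultdict(list)
    for t in relation:
        for tag in t['tags'] & tag_vocabulary:  # a tuple may match many tags
            fragments[tag].append(t)
    return dict(fragments)

# relation = [{'id': 1, 'tags': {'dog', 'puppy'}, 'coeff': [...]}, ...]
# frags = fragment_by_tags(relation, {'dog', 'puppy', 'wedding'})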

B. Indexing the Fragments Collection

When the number of horizontal fragments |H(ℜ(S,A))| created to support IR from a complex relation ℜ(S,A) is large, it may be necessary to index the collection of fragments H(ℜ(S,A)) (e.g., a collection with one fragment per tag value, for a large number of tag values) to efficiently find the fragment(s) suitable to solve particular query predicates. The indexing method for this purpose may vary with the nature of the predicates that define the fragments. For example, collections of horizontal fragments defined by tag values can be indexed by a conventional index (e.g., a B-tree) or by an inverted file.
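In the sketch above, the in-memory mapping from tag values to fragments already acts as a small inverted file; on disk one would use a B-tree, but the lookup logic is the same. A hedged sketch of that lookup, including the precision-for-recall fallback discussed for query Q2 (function and variable names are ours, not the prototype's):

def select_fragments(fragments, tag_value):
    # Exact match: the fragment built for this tag value.
    if tag_value in fragments:
        return [fragments[tag_value]]
    # Fallback: fragments of the individual words (e.g., "dog" for
    # "lost dog"), compromising precision in favor of recall.
    return [fragments[w] for w in tag_value.split() if w in fragments]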

C. Intra-Fragment Indexing

The contents of each fragment may have to be indexed as well, to accelerate additional filtering of the fragment data according to other predicates. For instance, to support efficient processing of a similarity-based query operator (e.g., Rangeq, kNNq) on the contents of large fragments, a Metric Access Method (MAM) [5] can be used. A MAM indexes the fragment contents in a metric space, defined by a descriptor extracted from the contents of some attribute(s) and a metric to compare the data descriptors by similarity. Several MAMs have been proposed in the literature [5], [20], and many of them are available in well-known DBMS and IR tools [32]. The appropriate descriptor, similarity metric, and MAM to support efficient access to the contents of a fragment depend on the nature of the data contents and on the query predicates to be processed in that fragment [7], [17], [25].

D. Query Execution

Algorithm 1 describes our approach to efficiently process conjunctive complex queries on a big complex database, by using horizontal fragments of that database and multi-level indexing. The user, who is unaware of the database fragmentation and access methods, poses the query referring to the whole database. This query is received in the parameter c_query on line 1. First, the IR system extracts the predicates from the query, by calling the function EXTRACT_PREDICATES (line 2). Then, the system chooses suitable fragments to process the query, i.e., fragments whose tuples satisfy some query predicate(s), by calling the function SELECT_FRAGMENTS (line 3). An index built over the fragments collection may speed up the fragment selection. The next step is to filter the tuples of the chosen fragment(s), according to the query predicates, by calling the function FILTER_DATA (line 6). This function receives all the query predicates in its second parameter, so it can verify the remaining query predicates on the fragment data. Each chosen fragment is expected to be smaller than the whole database. If the fragment size is still large, its contents can be indexed and/or further fragmented to allow efficient processing of particular query predicates. Finally, if the query processing has used more than one fragment, the IR system combines the results obtained for each fragment, by using the APPEND_RESULTS function (line 7). The combination of results may use unions or intersections, depending on the way the query is structured and the criteria used to choose the fragments to process the query.

Algorithm 1 Query execution using horizontal fragments
 1: function EXECUTE_QUERY(c_query)
 2:     predicates = EXTRACT_PREDICATES(c_query)
 3:     fragments = SELECT_FRAGMENTS(predicates)
 4:     results = ∅
 5:     for each f in fragments do
 6:         f_results = FILTER_DATA(f, predicates)
 7:         results.APPEND_RESULTS(f_results)
 8:     end for
 9:     return results
10: end function

The proposed approach is general in terms of the number, nature, and logical connections of predicates in a complex query. Nevertheless, for simplicity and lack of space, in our current implementation and experiments we only consider queries with a conventional predicate of the form tag = value conjunctively connected to a similarity-based query operator (i.e., Rangeq or kNNq). We believe that this is enough to show some potential benefits of the proposed approach.
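Under the same illustrative assumptions as the earlier sketches, Algorithm 1 specialized to this two-predicate query form could look as follows in Python, reusing select_fragments and knn_query from above (here FILTER_DATA is the linear-scan kNN of Section II-C; the prototype uses a Slim-tree instead):

import heapq

def execute_query(fragments, tag_value, center, k, delta):
    # Python analogue of Algorithm 1 for: tag = value AND kNN_q(center, k).
    results = []
    for f in select_fragments(fragments, tag_value):
        vectors = [t['coeff'] for t in f]  # descriptors stored per tuple
        results.extend(knn_query(vectors, center, k, delta))
    # APPEND_RESULTS as a union, re-ranked so at most k objects remain.
    return heapq.nsmallest(k, results, key=lambda s: delta(s, center))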


IV. IMPLEMENTATION

We have implemented a prototype to validate our approach using FMI-SiRO (user-defined Features, Metrics and Indexes for Similarity Retrieval) [32], a module coupled to Oracle to solve queries having similarity-based predicates. This module supports the two kinds of similarity-based query operators mentioned in Section II-C (Rangeq and kNNq), and uses MAMs to efficiently execute these predicates on large data volumes. In our implementation, FMI-SiRO has been changed to read the complex objects’ feature vectors from tables.

A. Architecture

Figure 2 illustrates the implemented architecture. The module Extract Predicates parses the complex conjunctive query written in SQL in the form accepted by FMI-SiRO. The predicates supported by our current implementation fall into two categories: (i) comparison of a conventional attribute with a constant (e.g., tag = “dog”); or (ii) similarity-based predicates on complex or conventional data (Rangeq or kNNq).

Fig. 2. Prototype architecture

A B-tree index allows the Select Fragments module to efficiently find fragments whose tuples satisfy predicates of the first category, when the cardinality of the compared attribute is high and there are many horizontal fragments for the different attribute values. Once the suitable fragment(s) (i.e., those whose tuples satisfy some predicate(s) of the first category) have been selected, the Oracle Query Processor solves the remaining predicates of the conjunctive query on the contents of such fragment(s). FMI-SiRO solves the similarity-based predicates on the fragments' contents, using the Arboretum² [33] MAM library to improve the performance of these operations for large databases. In our experiments, we have used the Slim-tree [34] as the MAM for efficient similarity-based IR from the horizontal fragments of the database. The Slim-tree is dynamic, height-balanced, and constructed bottom-up.

²http://www.gbdi.icmc.usp.br/arboretum

V. EXPERIMENTS

This section reports the experiments done to demonstrate the benefits of the proposed approach for executing complex conjunctive queries on big complex databases. The primary goal is to show that the queries perform better when executed on the fragments than on the entire database.

A. Experiment Setup

Our experiments were performed on CoPhIR³ (Content-based Photo Image Retrieval) [35], a multimedia metadata collection that serves as a basis for experiments with content-based image retrieval techniques. It contains image descriptors (MPEG-7 feature vectors) and textual information (tags, title, description, upload time, location) regarding 106 million images uploaded to Flickr⁴. CoPhIR does not include the images themselves, but just their MPEG-7 feature vectors, and URLs pointing to the original images on Flickr and to their thumbnails on the CoPhIR Web site. The images presented in the following results were obtained via their Flickr URLs.

We have converted each CoPhIR XML file, containing data describing an image, into a tuple related to some other tuples (e.g., with associated tag values). The resulting relational database was loaded into Oracle, to allow the execution of queries with conventional and similarity-based predicates. The efficient execution of the former is supported by Oracle itself (using conventional access methods) and of the latter by FMI-SiRO (using Slim-trees).
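A hedged sketch of this conversion step with Python's xml.etree.ElementTree; the element names below ('ScalableColorType/Coeff', 'tag') are placeholders standing in for CoPhIR's actual XML schema, which we do not reproduce here:

import xml.etree.ElementTree as ET

def cophir_xml_to_tuples(path):
    # Turn one CoPhIR XML description into an image tuple plus the
    # related tag values (element names are illustrative placeholders).
    root = ET.parse(path).getroot()
    image = {
        'photo_id': root.get('id'),
        'coeff': [int(c) for c in
                  root.findtext('.//ScalableColorType/Coeff', '').split()],
    }
    tags = [t.text for t in root.findall('.//tag')]
    return image, tags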

The experiments were performed on a server equipped with an Intel® Core™ i7 3.8 GHz processor and 8 GB of memory. This machine was running Oracle Database 11g on the Debian 7.0 “wheezy” operating system (kernel 3.2.0, x86-64).

B. Fragments Creation

The horizontal fragments of the database were created according to the tag values associated with the images. The total number of tag instances used to annotate the CoPhIR images is 334,254,683, employing a set of 4,666,256 distinct tag values [35]. A tag value can be associated with various images, and an image can be annotated with several tag values.

The strategy used to generate the fragments for the experiments was the following. First, the data collection was filtered to eliminate the tag values used to annotate only one image, leaving 2,111,554 distinct tag values, i.e., 46.86% of the total. Then, a filter based on WordNet was applied to keep only the tag values of the English language. This left 68,767 tag values, i.e., just 2.87% of the total number of tag values in CoPhIR.
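A sketch of this two-stage filter, assuming NLTK's WordNet corpus as one possible stand-in for the (unspecified) WordNet interface used here:

from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def select_tag_values(tag_counts):
    # tag_counts: dict mapping tag value -> number of images annotated
    # with it. Keep tag values used on more than one image and
    # recognized by WordNet as English words.
    return {tag for tag, n in tag_counts.items()
            if n > 1 and wordnet.synsets(tag)}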

Figure 3 shows the frequency distribution of the selected tag values among the CoPhIR image descriptions. The most frequent tag values are “wedding” (used to annotate 1,678,711 images), “party” (1,334,741 images), and “travel” (1,154,688 images). On the other extreme of our selection, the tag values “algonkin”, “precognitive”, and “chamberlains” are used to annotate just 2 images each. We divided this distribution into quartiles, yielding four regions (labeled R1, R2, R3, and R4). We used the 5 tag values at the limits of each region

³http://cophir.isti.cnr.it
⁴http://www.flickr.com


(dashed vertical lines), making 10 fragments for each region's limits. In addition, we randomly chose 10 distinct tag values inside each region, to build further fragments for our experiments. This gave a total of 80 horizontal fragments of the CoPhIR database, each one containing the images annotated with one of the chosen tag values.

Fig. 3. Frequency distribution of tag values in CoPhIR image annotations

C. Contents Indexing with MAMs

The contents of the whole database (all the 106 million images) and of each fragment whose size is above a certain threshold (more than one thousand tuples in these experiments) have been indexed with Slim-trees [34] for efficient content-based image retrieval.

Among the various MPEG-7 feature vectors available for describing images in CoPhIR, we have used the scalable color [21]. This descriptor is derived from a color histogram defined in the HSV (Hue-Saturation-Value) color space. The values extracted from the histogram are normalized and mapped to a non-linear representation with four bits. After that, a Haar transformation is applied. Several distance functions can be used to retrieve the images described by MPEG-7 feature vectors [36]. In these experiments, we used the L1 (Manhattan) metric, because it usually provides more precise results than other simple metrics, such as the others of the Minkowski family, as reported in the literature [37]. This behavior was also observed in our preliminary experiments.
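For intuition about the last step, here is a simplified 1-D Haar decomposition in Python (an unnormalized averaging variant; the actual MPEG-7 scalable color pipeline, including the 4-bit non-linear quantization, has details this sketch omits):

def haar_1d(v):
    # Repeatedly split the vector into pairwise averages and differences
    # until one coefficient remains (length must be a power of two).
    v = [float(x) for x in v]
    detail = []
    while len(v) > 1:
        avg = [(v[i] + v[i + 1]) / 2.0 for i in range(0, len(v), 2)]
        dif = [(v[i] - v[i + 1]) / 2.0 for i in range(0, len(v), 2)]
        detail = dif + detail  # coarser details go in front
        v = avg
    return v + detail  # [overall average, coarse-to-fine differences]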

D. Queries

The next step was to pose queries with equality predicates on tag values and similarity-based predicates on the image contents. The tag values used in the equality predicates were the same ones used to build the fragments for the experiments (Section V-B). A randomly chosen image of each fragment served as the query center of the similarity-based predicate. Thereafter, we compared the average time to execute queries on fragments of each chosen size with the time to execute the same similarity queries on the entire database. Figure 4 shows an example of a complex conjunctive query that looks for images similar to a given one in the fragment with descriptors and metadata of images tagged with 'puppy'. It uses the FMI-SiRO Oracle syntax [32].

SELECT frag_name INTO fragment
FROM cophir_frag_catalog
WHERE tag = 'puppy';

EXECUTE IMMEDIATE
  'SELECT * FROM ' || fragment ||
  ' WHERE MANHATTAN_DIST(coeff,
      (SELECT coeff FROM ' || fragment ||
      ' WHERE PHOTO_ID=123456)) <= 50';

Fig. 4. An example of complex query on Oracle with FMI-SiRO

E. Experimental Results

Figure 5 shows the sizes of the fragment indexes on disk and the time spent to create horizontal fragments of a relation with image metadata and descriptors taken from CoPhIR. Each fragment is defined by a tag value. Fragment sizes vary with the number of occurrences of the tag values. The creation of a fragment includes selecting the tuples that refer to images annotated with the respective tag value, and the construction of the Slim-tree index to support efficient image retrieval by content similarity in that fragment.

Fig. 5. Fragment index sizes and time spent to create the fragments

Figure 6 presents the number of disk accesses and the number of similarity calculations done to execute queries analogous to the one of Figure 4 on database fragments. The total elapsed time encompasses (i) searching a B-tree to find the fragment containing tuples annotated with the tag value appearing in the conventional predicate, and (ii) solving the similarity-based predicate in a Slim-tree that indexes only the contents of that fragment. Unfortunately, the image descriptor, similarity function, indexes, and fragments used in these experiments did not ensure sub-linear growth of the time spent to execute the similarity-based predicates for growing fragment sizes.

The query execution using the Slim-tree that indexes the contents of a fragment is around an order of magnitude faster than using the Slim-tree that indexes the entire database, for most fragments. It is still more than 10 times faster to solve the queries in the biggest fragments than in the entire database. For example, the execution of a query to retrieve images annotated with the tag value “wedding” (1,334,741 images) and within


Fig. 6. Number of disk accesses and number of similarity calculations in query execution on database fragments of different sizes

a distance radius equal to 50 from a given image takes around 1,200 seconds using the Slim-tree for the respective fragment, and 18,577 seconds using the Slim-tree for the entire database. On the other hand, the query to retrieve images annotated with the tag value “chamberlains” (just 2 images) and within the same distance radius of 50 from a given image takes less than 1 second using the respective fragment, and 12,514 seconds using the Slim-tree index for the entire database.

Finally, Figure 7 presents the results of a conjunctive complex query with the equality predicate tag = “puppy” and a kNNq predicate with the center in the image presented in the top left corner (highlighted by the red square). These 4 images are ranked in the results from left to right and from top to bottom. The execution of this query using the fragment referring to the tag “puppy”, which contains the descriptions of 105,570 images, took 108 seconds using a B-tree to find the fragment and a Slim-tree to process the similarity-based predicate on the contents of this fragment. As the kNN predicate is not commutative with other predicates for data filtering [22], we show in Figure 8 the results of a query with a Rangeq predicate with radius 50 and center in the image in the top left corner. These results were produced by using a Slim-tree that indexes the entire database. This query took 13,176 seconds to execute. Filtering these results by the tag value “puppy” to produce the result shown in Figure 7 would require further processing, but the time to solve the Rangeq predicate on the Slim-tree that indexes the entire database contents is dominant.

VI. CONCLUSIONS AND FUTURE WORK

This paper introduces an approach for efficiently processing queries on big complex databases, by using horizontal fragments of the database and multi-level indexing. This approach has three steps: (i) find fragments with data satisfying some query predicate(s); (ii) filter the data in the chosen fragment(s) according to other predicate(s) conjunctively connected to the former; (iii) compose the results obtained from each fragment.

The experimental results demonstrate that this proposal drastically improves query execution speed. They show that it is not viable to run the similarity-based predicates over the

Fig. 7. Results of a conjunctive query executed by using the fragment that describes only images tagged with “puppy”

Fig. 8. Results of a Rangeq predicate on the entire database, which took almost 100 times longer to produce than those in Figure 7

entire CoPhIR database (which describes around 106 million images), even using the Slim-tree metric index to speed up the execution of similarity-based predicates on image content descriptors. In fact, even big fragments (describing more than 100 thousand images, approximately) need to be further fragmented to ensure acceptable response times.

Though the case study presented in this paper only considers conjunctive queries with an equality predicate and a similarity-based predicate, the proposed approach can be employed for efficient execution of queries with arbitrary numbers of predicates, of various kinds, and logically connected in different ways. In fact, our approach opens new research paths towards efficient query execution on big complex data. Among the challenges involved in the full exploitation of the proposed approach, we mention the following ones for future work: (i) develop automatic techniques to create appropriate horizontal fragments of large databases for efficient query execution; (ii) index fragment collections to efficiently find fragments suitable to solve different kinds of predicates; (iii) devise and validate query optimization techniques that exploit appropriate database fragments and access methods.

VII. ACKNOWLEDGMENTS

Thanks to CNPq, CAPES, FEESC, and FAPESP for their financial support.


REFERENCES

[1] J. Darmont, O. Boussaid, J.-C. Ralaivao, and K. Aouiche, “An architecture framework for complex data warehouses,” CoRR, vol. abs/0707.1534, 2007.

[2] A. Goker, J. Davies, and M. Graham, Information Retrieval: Searching in the 21st Century. John Wiley & Sons, 2007.

[3] R. A. Baeza-Yates and B. A. Ribeiro-Neto, Modern Information Retrieval – the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England, 2011.

[4] R. Baeza-Yates and M. Melucci, Eds., Advanced Topics in Information Retrieval. Springer, 2011.

[5] P. Zezula, G. Amato, V. Dohnal, and M. Batko, Similarity Search – The Metric Space Approach. Springer, 2006, vol. 32.

[6] H. Blanken, A. de Vries, H. Blok, and L. Feng, Eds., Multimedia Retrieval, ser. Data-Centric Systems and Applications. Heidelberg: Springer Verlag, 2007, ISBN 978-3-540-72894-8.

[7] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, no. 2, pp. 1–60, 2008.

[8] J. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: semantics-sensitive integrated matching for picture libraries,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947–963, Sep. 2001.

[9] Y. Zhuang, Q. Li, and R. Lau, “Web-based image retrieval: a hybrid approach,” in Computer Graphics International 2001, Proceedings, 2001, pp. 62–69.

[10] J.-R. Wen, Q. Li, W.-Y. Ma, and H.-J. Zhang, “A multi-paradigm querying approach for a generic multimedia database management system,” SIGMOD Rec., vol. 32, pp. 26–34, March 2003.

[11] D. Joshi, R. Datta, Z. Zhuang, W. P. Weiss, M. Friedenberg, J. Li, and J. Z. Wang, “PARAgrab: a comprehensive architecture for web image management and multimodal querying,” in Proceedings of the 32nd International Conference on Very Large Databases, ser. VLDB. VLDB Endowment, 2006, pp. 1163–1166.

[12] N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in ACM Multimedia, A. D. Bimbo, S.-F. Chang, and A. W. M. Smeulders, Eds. ACM, 2010, pp. 251–260.

[13] C. Traina Jr., A. J. M. Traina, M. R. Vieira, A. S. Arantes, and C. Faloutsos, “Efficient processing of complex similarity queries in RDBMS through query rewriting,” in ACM 15th International Conference on Information and Knowledge Management (CIKM 06), P. S. Yu, V. J. Tsotras, E. A. Fox, and B. Liu, Eds. Arlington, VA, USA: ACM Press, 2006, pp. 4–13.

[14] M. C. N. Barioni, H. L. Razente, A. J. M. Traina, and C. Traina Jr., “Seamlessly integrating similarity queries in SQL,” Software: Practice and Experience, vol. 39, no. 4, pp. 355–384, 2009.

[15] F. Long, H. Zhang, and D. Feng, “Fundamentals of content-based image retrieval,” Multimedia Information Retrieval and Management, 2002.

[16] D. R. Wilson and T. R. Martinez, “Improved heterogeneous distance functions,” J. of Artificial Intelligence Research, vol. 6, pp. 1–34, 1997.

[17] P. H. Bugatti, A. J. M. Traina, and C. Traina Jr., “Assessing the best integration between distance-function and image-feature to answer similarity queries,” in 23rd Annual ACM Symposium on Applied Computing (SAC 2008). Fortaleza, Ceará, Brazil: ACM Press, 2008, pp. 1225–1230.

[18] T. Bozkaya and M. Ozsoyoglu, “Indexing large metric spaces for similarity search queries,” ACM Trans. Database Syst., vol. 24, pp. 361–404, September 1999.

[19] P. Ciaccia and M. Patella, “Searching in metric spaces with user-defined and approximate distances,” ACM Trans. Database Syst., vol. 27, pp. 398–437, December 2002. [Online]. Available: http://doi.acm.org/10.1145/582410.582412

[20] H. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[21] L. C. (Mitsubishi Electric ITE-VIL) and J.-R. Ohm (RWTH Aachen, Institute of Communications Engineering), “The MPEG-7 color descriptors.”

[22] M. R. P. Ferreira, L. F. D. Santos, A. J. M. Traina, I. Dias, R. Chbeir, and C. Traina Jr., “Algebraic properties to optimize kNN queries,” in Proc. of the 26th Brazilian Symposium on Databases (SBBD), 2011.

[23] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, pp. 1–19, February 2006. [Online]. Available: http://doi.acm.org/10.1145/1126004.1126005

[24] R. d. S. Torres, A. X. Falcão, M. A. Gonçalves, J. P. Papa, B. Zhang, W. Fan, and E. A. Fox, “A genetic programming framework for content-based image retrieval,” Pattern Recognition, vol. 42, no. 2, pp. 283–292, 2009, Learning Semantics from Multimedia Content. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320308001623

[25] T. Skopal, “Where are you heading, metric access methods?: a provocative survey,” in SISAP, P. Ciaccia and M. Patella, Eds. ACM, 2010, pp. 13–21.

[26] U. Murthy, E. A. Fox, Y. Chen, E. Hallerman, R. d. S. Torres, E. J. Ramos, and T. R. C. Falcão, “Superimposed image description and retrieval for fish species identification,” in ECDL, ser. Lecture Notes in Computer Science, M. Agosti, J. L. Borbinha, S. Kapidakis, C. Papatheodorou, and G. Tsakonas, Eds., vol. 5714. Springer, 2009, pp. 285–296.

[27] K. C. L. Santos, H. M. de Almeida, M. A. Gonçalves, and R. d. S. Torres, “Recuperação de imagens da web utilizando múltiplas evidências textuais e programação genética” [Web image retrieval using multiple textual evidence and genetic programming], in SBBD, A. Brayner, Ed. SBC, 2009, pp. 91–105.

[28] D. C. G. Pedronette and R. da S. Torres, “Exploiting contextual spaces for image re-ranking and rank aggregation,” in Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ser. ICMR ’11. New York, NY, USA: ACM, 2011, pp. 13:1–13:8. [Online]. Available: http://doi.acm.org/10.1145/1991996.1992009

[29] D. Hiemstra and C. Hauff, “MapReduce for information retrieval evaluation: ‘Let’s quickly test this on 12 TB of data’,” in Proceedings of the 2010 International Conference on Multilingual and Multimodal Information Access Evaluation: Cross-Language Evaluation Forum, ser. CLEF’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 64–69. [Online]. Available: http://dl.acm.org/citation.cfm?id=1889174.1889186

[30] N. Alipanah, P. Parveen, L. Khan, and B. Thuraisingham, “Ontology-driven query expansion using map/reduce framework to facilitate federated queries,” in Proceedings of the 2011 IEEE International Conference on Web Services, ser. ICWS ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 712–713. [Online]. Available: http://dx.doi.org/10.1109/ICWS.2011.21

[31] Z. Wu, B. Mao, and J. Cao, “MRGIR: open geographical information retrieval using MapReduce,” in Geoinformatics, 2011 19th International Conference on, 2011, pp. 1–5.

[32] D. S. Kaster, P. H. Bugatti, A. J. M. Traina, and C. Traina Jr., “FMI-SiR: a flexible and efficient module for similarity searching on Oracle database,” JIDM, vol. 1, no. 2, pp. 229–244, 2010.

[33] F. J. T. Chino, M. R. Vieira, A. J. M. Traina, and C. Traina, “MAMView: a visual tool for exploring and understanding metric access methods,” in Proceedings of the 2005 ACM Symposium on Applied Computing, ser. SAC ’05. New York, NY, USA: ACM, 2005, pp. 1218–1223. [Online]. Available: http://doi.acm.org/10.1145/1066677.1066952

[34] C. Traina Jr., A. J. M. Traina, C. Faloutsos, and B. Seeger, “Fast indexing and visualization of metric datasets using slim-trees,” IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 14, no. 2, pp. 244–260, 2002.

[35] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, and F. Rabitti, “CoPhIR: a test collection for content-based image retrieval,” CoRR, vol. abs/0905.4627v2, 2009.

[36] H. Eidenberger, “Distance measures for MPEG-7-based retrieval,” in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, ser. MIR ’03. New York, NY, USA: ACM, 2003, pp. 130–137. [Online]. Available: http://doi.acm.org/10.1145/973264.973286

[37] R. Dorairaj and K. R. Namuduri, “Compact combination of MPEG-7 color and texture descriptors for image retrieval,” in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, vol. 1. IEEE, 2004, pp. 387–391.
