Indexing Dataspaces


Presenter: Aviv Alon

Seminar in Databases (236826)

Dataspaces are collections of heterogeneous and partially unstructured data.

Dataspaces

Dataspaces – Why do we need them?

Looking for an architect with good reviews and cheap materials?

Return “Architect B” as instance


Consider queries that are keyword-based but also structure-aware.

How to effectively query and search a dataspace

Main Problem


An inverted list where each row represents a keyword and each column represents a data item from the data sources.

Indexing Heterogeneous Data


We model the data as a set of triples. Each triple is either of the form

(instance, attribute, value), for example: (“Architect B”, name, ‘Shalom’)

or of the form (instance, association, instance), for example: (“Architect B”, worksWith, “Architect A”)
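The triple model above can be sketched in a few lines of Python. This is only an illustration: the Triple type, its field names and the is_association flag are assumptions made for this example, not part of the original slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the dataspace (field names are hypothetical)."""
    subject: str           # the instance the fact is about
    label: str             # attribute name or association name
    obj: str               # a text value, or another instance identifier
    is_association: bool   # True for (instance, association, instance)


triples = [
    Triple("Architect B", "name", "Shalom", False),
    Triple("Architect B", "worksWith", "Architect A", True),
]

# Split the triple base into the two kinds of facts described on the slide.
attribute_triples = [t for t in triples if not t.is_association]
association_triples = [t for t in triples if t.is_association]
```

Keeping both kinds of triples in one list mirrors the idea that a single index will later cover attributes and associations together.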



We also model:


Person instances: p1, p2, p3

Article instance: a1

Conference instance: c1

Example

Attributes firstName, lastName and nickName are sub-attributes of name.

Association contactAuthor is a sub-association of author.

Set of predicates of the form (v, {K1, ..., Kn})
◦ v - an attribute or association label
◦ {K1, ..., Kn} - a keyword set

Predicate queries

Example 1: (title, ‘Birch’)

attribute predicate


Example 2: (publishedIn, ‘1996 Sigmod’)

association predicate

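Set of keywords K1, ..., Kn. The result contains:
◦ relevant instances: instances that contain at least one of the keywords in an attribute value
◦ associated instances: instances that are associated with a relevant instance

Neighborhood keyword queries

Example: ‘Birch’

[Figure labels: relevant instance, associated instances]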


Build a separate index for each attribute to support structured queries on structured data.
◦ Con: significant overhead to the index structure

Create an inverted list to support keyword search on unstructured data.
◦ Con: does not allow specifications on structure

Existing methods


Capture both text values and structural information using an extended inverted list.

The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.

Proposed solution

Inverted Lists - Example

We cannot tell that “Tian” occurs as p1’s name and p3’s lastName


Indexing Attributes
◦ Attribute inverted lists (ATIL)

Indexing Associations
◦ Attribute-association inverted lists (AAIL)

Indexing hierarchies
◦ Attribute inverted lists with duplication (Dup-ATIL)
◦ Attribute inverted lists with hierarchies (Hier-ATIL)
◦ Hybrid attribute inverted list (Hybrid-ATIL)

Indexing structure outline


Attribute inverted lists (ATIL)

Whenever the keyword k appears in a value of attribute a, there is a row in the inverted list for k//a//

Indexing Attributes

keyword = 1996, attribute = year

                a1  c1  p1  p2  p3
1996//year//     0   1   0   0   0
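As a rough sketch of how such rows could be produced, the snippet below builds an ATIL from (instance, attribute, value) triples. The function name build_atil, the whitespace tokenization and the lowercasing of keywords are assumptions of this example. The three sample triples are the facts mentioned on the slides. Each row is stored as a set of instance identifiers instead of a 0/1 vector.

```python
from collections import defaultdict

# (instance, attribute, value) triples mentioned on the slides
attribute_triples = [
    ("c1", "year", "1996"),
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
]


def build_atil(triples):
    """Map each key 'keyword//attribute//' to the instances that contain it."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        # Assumption: keywords are lowercased, whitespace-separated tokens.
        for keyword in value.lower().split():
            index[f"{keyword}//{attribute}//"].add(instance)
    return dict(index)


atil = build_atil(attribute_triples)
```

Unlike a plain inverted list, this index keeps “tian//name//” and “tian//lastName//” as separate rows, so the attribute in which a keyword occurs is not lost.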



To answer an attribute predicate query (A, {K1, ..., Kn}),

we need to search for {K1//A//, ..., Kn//A//}

Example: (lastName, ‘Tian’)

“tian//lastName//”

Attribute inverted lists (ATIL)

The search will yield p3
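The lookup can be sketched as follows. The function name and the dictionary layout are hypothetical. The sketch also assumes that an instance matches when it contains at least one of the keywords, so the rows are combined by union; an all-keywords interpretation would use intersection instead.

```python
# A small ATIL, stored as key -> set of instances (layout is an assumption).
atil = {
    "tian//name//": {"p1"},
    "tian//lastName//": {"p3"},
    "1996//year//": {"c1"},
}


def answer_attribute_predicate(index, attribute, keywords):
    """Look up one row per keyword and merge the matching instances."""
    hits = set()
    for keyword in keywords:
        # Assumption: union, i.e. at least one keyword must occur.
        hits |= index.get(f"{keyword.lower()}//{attribute}//", set())
    return hits


result = answer_attribute_predicate(atil, "lastName", ["Tian"])
```

With this data, result contains only p3, because p1 holds “Tian” under name and not under lastName.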


Attribute-association inverted lists (AAIL):

Indexing Associations

keyword = Birch, association = authoredPaper (with p1, p2)

                          a1  c1  p1  p2  p3
Birch//authoredPaper//     0   0   1   1   0
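One way to read the row above is that an instance gets a 1 under k//r// when it has an association r to some instance whose attribute values contain k. The sketch below builds rows under that reading. The function name and the sample triples are assumptions of this example, chosen to match the slide (a1 has “Birch” in its title, and p1 and p2 authored a1).

```python
from collections import defaultdict

# Sample data chosen to match the slide; the exact triples are assumptions.
attribute_triples = [("a1", "title", "Birch")]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
]


def build_aail(attribute_triples, association_triples):
    """Index each instance under the keywords of the instances it points to."""
    keywords_of = defaultdict(set)
    for instance, _attribute, value in attribute_triples:
        keywords_of[instance].update(value.lower().split())

    index = defaultdict(set)
    for source, association, target in association_triples:
        for keyword in keywords_of[target]:
            index[f"{keyword}//{association}//"].add(source)
    return dict(index)


aail = build_aail(attribute_triples, association_triples)
```

Here aail holds a single row, “birch//authoredPaper//”, containing p1 and p2, which corresponds to the 1 entries in the table.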


To answer an association predicate query (R, {K1, ..., Kn}),

we need to search for {K1//R//, ..., Kn//R//}

Example: (author, ‘Raghu’)

“raghu//author//”

Attribute-association inverted lists (AAIL)


For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.

Indexing hierarchies


To answer the query (name, ‘Tian’),

we can search for: “tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//”

A Naïve method

Can be very expensive!
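A minimal sketch of the naive expansion, assuming the hierarchy is available as a parent-to-children dictionary (SUB_ATTRIBUTES and both function names are hypothetical). It shows why the cost grows: one index lookup is issued for the attribute and for every one of its descendants.

```python
# Assumed hierarchy: name has three sub-attributes, as on the slides.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName", "nickName"]}

atil = {"tian//name//": {"p1"}, "tian//lastName//": {"p3"}}


def descendants(attribute, hierarchy):
    """Return the attribute followed by all of its sub-attributes."""
    result = [attribute]
    for child in hierarchy.get(attribute, []):
        result.extend(descendants(child, hierarchy))
    return result


def naive_query(index, attribute, keyword, hierarchy):
    """OR together one lookup per attribute in the expanded list."""
    keys = [f"{keyword.lower()}//{a}//" for a in descendants(attribute, hierarchy)]
    hits = set()
    for key in keys:
        hits |= index.get(key, set())
    return hits, keys


hits, keys = naive_query(atil, "name", "Tian", SUB_ATTRIBUTES)
```

For (name, ‘Tian’) this issues four lookups to find p1 and p3. A deeper or wider hierarchy multiplies the number of lookups per keyword.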


Attribute inverted lists with duplication (Dup-ATIL):

Indexing Attributes

Attribute = name, sub-attribute = nickName

                     a1  c1  p1  p2  p3
Jeff//name//          0   0   0   0   1
Jeff//nickName//      0   0   0   0   1

Attribute inverted lists with duplication (Dup-ATIL)

Index with Duplication



To answer an attribute predicate query (A, {K1, ..., Kn}),

we need to search for {K1//A//, ..., Kn//A//}

Example: (name, ‘Tian’)

“tian//name//”

Attribute inverted lists with duplication (Dup-ATIL)

The search will yield both p3 and p1
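The duplication idea can be sketched by indexing every keyword under its own attribute and under each ancestor attribute. The PARENT dictionary and the function names are assumptions of this example. The sample triples are the facts shown on the slides.

```python
from collections import defaultdict

# Assumed child -> parent map for the attribute hierarchy.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def ancestors_and_self(attribute):
    """Return the attribute followed by its ancestors up to the root."""
    chain = [attribute]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_dup_atil(triples):
    """Duplicate each keyword row under the attribute and all its ancestors."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            for a in ancestors_and_self(attribute):
                index[f"{keyword}//{a}//"].add(instance)
    return dict(index)


dup_atil = build_dup_atil([
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
])
```

A single lookup of “tian//name//” now returns both p1 and p3. The price is the extra rows: “jeff” is stored under both nickName and name.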


Pro: simple query answering

Con: may considerably expand the size of the index because of the duplication, especially when:
◦ there are long paths from the root attribute to the leaf attributes
◦ most values in the triple base belong to leaf attributes

Dup-ATIL (cont.)


Attribute inverted lists with hierarchies (Hier-ATIL):

Index with Hierarchy Path

Attribute = name, sub-attribute = nickName

                           a1  c1  p1  p2  p3
Jeff//name//nickName//      0   0   0   0   1



To answer an attribute predicate query (A, {K1, ..., Kn}),

we need to search for {K1//a0//...//am//*, ..., Kn//a0//...//am//*}

a0//...//am: the hierarchy path for attribute A

Example: (name, ‘Tian’)

“tian//name//*”

Attribute inverted lists with hierarchies (Hier-ATIL)

The search will yield both p3 and p1
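A sketch of the hierarchy-path variant: each keyword is stored once, under the full path from the root attribute, and a predicate becomes a prefix search over the sorted keys. PARENT and the function names are assumptions of this example. Binary search via bisect is one possible way to locate the first matching key and is not necessarily what the original system does.

```python
from bisect import bisect_left

# Assumed child -> parent map for the attribute hierarchy.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}


def hierarchy_path(attribute):
    """Return the path a0//...//am from the root attribute down to attribute."""
    chain = [attribute]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return "//".join(reversed(chain))


def build_hier_atil(triples):
    index = {}
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            key = f"{keyword}//{hierarchy_path(attribute)}//"
            index.setdefault(key, set()).add(instance)
    return index


def prefix_search(index, prefix):
    """Union all rows whose key starts with prefix (keys scanned in sorted order)."""
    keys = sorted(index)
    hits = set()
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break  # matching keys are contiguous in sorted order
        hits |= index[key]
    return hits


hier_atil = build_hier_atil([
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
])
name_hits = prefix_search(hier_atil, "tian//name//")
```

No keyword is duplicated here, but the query has to read every row under the prefix (two rows for “tian//name//” in this data).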


Pro: does not increase the number of indexed keywords (although it can lengthen many of them)
◦ real indexing systems typically record a keyword only by the difference from its previous keyword

Con: answers a predicate query by transforming it into a prefix search, which can be more expensive than a keyword search.
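Recording a keyword by its difference from the previous keyword is commonly known as front coding. The sketch below is a generic illustration of that technique, not the encoding of any particular indexing system. It shows why long shared path prefixes cost little space once the keys are sorted.

```python
import os


def front_code(sorted_keys):
    """Encode each key as (shared prefix length with previous key, remaining suffix)."""
    encoded, previous = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys from the (shared length, suffix) pairs."""
    keys, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        keys.append(previous)
    return keys


keys = ["tian//name//", "tian//name//lastName//"]
encoded = front_code(keys)
```

For these two keys the second entry stores only the suffix “lastName//” together with the shared length 12, so the longer Hier-ATIL key adds few extra characters.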

Hier-ATIL (cont.)


Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors

Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors

Hybrid indexing combines the strengths of both methods

Hybrid Index – Why?


Hybrid attribute inverted list (Hybrid-ATIL): an inverted list that can answer any prefix search by reading no more than t rows.

Hybrid Index

                            a1  c1  p1  p2  p3
Jeff//name//nickName//       0   0   0   0   1
Jie//name//firstName//       0   0   0   0   1
Tian//name//                 0   0   1   0   1   (summary row)
Tian//name//lastName//       0   0   0   0   1

Tian//name//lastName// is shadowed by the summary row Tian//name//


To answer a prefix query of the form k//a0//...//am//*, we look at all the rows with prefix k//a0//...//am//, except those shadowed by summary rows.

Example: (name, ‘Tian’), t=1

“tian//name//*”

Hybrid Index

The prefix search is answered after reading 1 row and yields both p1 and p3.
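The query side of the hybrid index can be sketched as a prefix scan that skips shadowed rows. The storage layout (each key mapped to a pair of instance set and is_summary flag) and the function name are assumptions of this example. How summary rows are chosen for a given t is not shown. For brevity the sketch filters all keys linearly instead of using a sorted-key lookup.

```python
# Rows from the slide; the (instances, is_summary) layout is an assumption.
hybrid_atil = {
    "jeff//name//nickName//": ({"p3"}, False),
    "jie//name//firstName//": ({"p3"}, False),
    "tian//name//": ({"p1", "p3"}, True),        # summary row
    "tian//name//lastName//": ({"p3"}, False),   # shadowed for name queries
}


def hybrid_prefix_search(index, prefix):
    """Union the rows under prefix, skipping rows shadowed by a summary row."""
    hits, rows_read, shadow = set(), 0, None
    for key in sorted(k for k in index if k.startswith(prefix)):
        if shadow is not None and key.startswith(shadow):
            continue  # already covered by the summary row that was read
        instances, is_summary = index[key]
        hits |= instances
        rows_read += 1
        shadow = key if is_summary else None
    return hits, rows_read


hits, rows_read = hybrid_prefix_search(hybrid_atil, "tian//name//")
```

For “tian//name//*” the summary row already contains p1 and p3, so one row is read and the lastName row is skipped. A query on “tian//name//lastName//*” still reads the detailed row directly.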


We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL


Example: “Birch”, t=1

“birch//*”

Neighborhood Keyword Queries

The prefix search is answered after reading 1 row and yields p1, p2, a1, c1.
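To make the expected result concrete, the sketch below computes the neighborhood of a keyword by brute force over the triples. It is not the KIL itself, only a way to check what a row such as “birch//” would need to contain. The sample triples, the function name and the choice to follow associations in both directions are assumptions of this example.

```python
# Sample data consistent with the slides; the exact triples are assumptions.
attribute_triples = [("a1", "title", "Birch")]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "publishedIn", "c1"),
]


def neighborhood_query(keyword, attribute_triples, association_triples):
    """Return (relevant instances, associated instances) for one keyword."""
    k = keyword.lower()
    relevant = {i for i, _a, v in attribute_triples if k in v.lower().split()}
    associated = set()
    # Assumption: an association links both of its endpoints to each other.
    for source, _label, target in association_triples:
        if target in relevant:
            associated.add(source)
        if source in relevant:
            associated.add(target)
    return relevant, associated - relevant


relevant, associated = neighborhood_query(
    "Birch", attribute_triples, association_triples
)
```

Here a1 is the relevant instance and p1, p2 and c1 are the associated instances, which matches the four instances returned on the slide.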


Associations between disparate items on the desktop:
◦ LaTeX and BibTeX files
◦ Word documents
◦ PowerPoint presentations
◦ emails and contacts
◦ webpages in the web cache

The instances and associations are stored in an RDF file. The size of the file is 52.4 MB.

Experimental Evaluation

Query types:
◦ Attribute clauses, no sub-attributes
◦ Attribute clauses, with sub-attributes
◦ Association clauses

Data set: 105,320 objects, 300,354 attributes, 468,402 associations

Query-answering time (with no more than 5 keywords):
◦ predicate query: 15.2 ms
◦ neighborhood keyword query: 224.3 ms

Observations about the results

Answering queries using the KIL was very efficient!

Answering queries with / without sub-attributes consumed a similar amount of time

Effectiveness of hybrid indexing

Compared with KIL (on average):

The Naïve method:
◦ query-answering time increased by a factor of 15.9

XML Index (SepIL):
◦ query-answering time increased by a factor of 2

Comparison of methods


Main Contributions:
◦ An indexing method that is designed to support flexible querying over dataspaces
◦ Extended inverted lists to capture both text and structure of data

Future Work:
◦ Extend the index to support value heterogeneity and investigate appropriate ranking algorithms

Conclusions


THE END

Questions?
