Indexing Dataspaces. Presenter: Aviv Alon. Seminar in Databases (236826)

Page 1: Indexing Dataspaces

Indexing Dataspaces

Presenter: Aviv Alon

Seminar in Databases (236826)

Page 2: Indexing Dataspaces

Dataspaces are collections of heterogeneous and partially unstructured data.

Dataspaces

Page 3: Indexing Dataspaces

Dataspaces – Why we need them?

Looking for an architect with good reviews and cheap materials?

Return “Architect B” as instance

Page 4: Indexing Dataspaces

Consider queries that are keyword-based but also structure-aware.

How to effectively query and search a dataspace

Main Problem

Page 5: Indexing Dataspaces


An inverted list where each row represents a keyword and each column represents a data item from the data sources.

Indexing Heterogeneous Data
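A minimal sketch of the keyword-by-instance inverted list described above. The instance names follow the example used later in the slides; the text attached to each instance is an illustrative assumption, and lowercasing keywords is a choice made for this sketch.

```python
from collections import defaultdict

# Toy text per instance; the values are assumptions made for illustration.
DOCS = {
    "a1": "Birch",
    "c1": "Sigmod 1996",
    "p1": "Tian",
    "p2": "Raghu",
    "p3": "Jie Tian Jeff",
}

def build_inverted_list(docs):
    """Map each lowercased keyword to the set of instances containing it."""
    index = defaultdict(set)
    for instance, text in docs.items():
        for keyword in text.lower().split():
            index[keyword].add(instance)
    return index

def as_row(index, keyword, columns):
    """Render one row of the keyword-by-instance matrix as 0/1 flags."""
    hits = index.get(keyword, set())
    return [1 if column in hits else 0 for column in columns]

index = build_inverted_list(DOCS)
columns = ["a1", "c1", "p1", "p2", "p3"]
print("tian", as_row(index, "tian", columns))
```

The row for "tian" marks both p1 and p3 but carries no hint of which attribute the keyword came from.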

Page 6: Indexing Dataspaces

We model the data as a set of triples. Each triple is either of the form

(instance, attribute, value), for example: (“Architect B”, name, ‘Shalom’)

or of the form

(instance, association, instance), for example: (“Architect B”, worksWith, “Architect A”)

Indexing Heterogeneous Data
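One possible way to hold the two triple forms in code, using the slide's own example values. The class and helper names (AttributeTriple, AssociationTriple, attributes_of, neighbours_of) are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class AttributeTriple:
    instance: str
    attribute: str
    value: str

@dataclass(frozen=True)
class AssociationTriple:
    source: str
    association: str
    target: str

Triple = Union[AttributeTriple, AssociationTriple]

# The two example triples from the slide.
triples = [
    AttributeTriple("Architect B", "name", "Shalom"),
    AssociationTriple("Architect B", "worksWith", "Architect A"),
]

def attributes_of(instance, data):
    """Collect attribute -> value pairs for one instance."""
    return {t.attribute: t.value for t in data
            if isinstance(t, AttributeTriple) and t.instance == instance}

def neighbours_of(instance, data):
    """Collect (association, target) pairs leaving one instance."""
    return [(t.association, t.target) for t in data
            if isinstance(t, AssociationTriple) and t.source == instance]

print(attributes_of("Architect B", triples))
print(neighbours_of("Architect B", triples))
```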

Page 7: Indexing Dataspaces


We also model:

Indexing Heterogeneous Data

Page 8: Indexing Dataspaces

Person instances: p1, p2, p3

Article instance: a1

Conference instance: c1

Example

Attributes firstName, lastName and nickName are sub-attributes of name.

Association contactAuthor is a sub-association of author.

Page 9: Indexing Dataspaces

Set of predicates of the form (v, {K1, ..., Kn})
◦ v - an attribute or association label
◦ {K1, ..., Kn} - a keyword set

Predicate queries

Example 1: (title, ‘Birch’)

attribute predicate

Page 10: Indexing Dataspaces

Example 2: (publishedIn, ‘1996 Sigmod’)

association predicate

Page 11: Indexing Dataspaces

Set of keywords K1, ..., Kn
◦ relevant instance
◦ associated instances

Neighborhood keyword queries

Example: ‘Birch’ [figure labels: relevant instance, associated instances]

Page 12: Indexing Dataspaces

Build a separate index for each attribute to support structured queries on structured data.
◦ Con: significant overhead to the index structure

Create an inverted list to support keyword search on unstructured data.
◦ Con: does not allow specifications on structure

Existing methods

Page 13: Indexing Dataspaces

Capture both text values and structural information using an extended inverted list.

The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.

Proposed solution

Page 14: Indexing Dataspaces

Inverted Lists - Example

We cannot tell that “Tian” occurs as p1’s name and p3’s lastName.

Page 15: Indexing Dataspaces

Indexing Attributes
◦ Attribute inverted lists (ATIL)

Indexing Associations
◦ Attribute-association inverted lists (AAIL)

Indexing hierarchies
◦ Attribute inverted lists with duplication (Dup-ATIL)
◦ Attribute inverted lists with hierarchies (Hier-ATIL)
◦ Hybrid attribute inverted list (Hybrid-ATIL)

Indexing structure outline

Page 16: Indexing Dataspaces

Attribute inverted lists (ATIL): whenever the keyword k appears in a value of the attribute a, there is a row in the inverted list for k//a//.

Indexing Attributes

Example: keyword = 1996, attribute = year

                a1  c1  p1  p2  p3
1996//year//     0   1   0   0   0
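A small sketch of how such k//a// rows could be built from attribute triples. The triples are toy data chosen to be consistent with the rows shown in the slides; the exact values, and the decision to lowercase keywords, are assumptions of this sketch.

```python
from collections import defaultdict

# Toy (instance, attribute, value) triples; values are illustrative assumptions.
ATTRIBUTE_TRIPLES = [
    ("a1", "title", "Birch"),
    ("c1", "year", "1996"),
    ("p1", "name", "Tian"),
    ("p2", "name", "Raghu"),
    ("p3", "firstName", "Jie"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]

def atil_key(keyword, attribute):
    """Build the row key k//a// for a keyword and an attribute label."""
    return f"{keyword.lower()}//{attribute}//"

def build_atil(triples):
    """One row per (keyword, attribute) pair, holding the matching instances."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.split():
            index[atil_key(keyword, attribute)].add(instance)
    return dict(index)

atil = build_atil(ATTRIBUTE_TRIPLES)
columns = ["a1", "c1", "p1", "p2", "p3"]
row = [int(column in atil["1996//year//"]) for column in columns]
print("1996//year//", row)
```

Unlike the plain inverted list, "tian//name//" and "tian//lastName//" are now separate rows, so p1 and p3 can be told apart.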

Page 18: Indexing Dataspaces

To answer an attribute predicate query (A, {K1, ..., Kn}) we need to search for {K1//A//, ..., Kn//A//}.

Example: (lastName, ‘Tian’)

“tian//lastName//”

The search will yield p3.

Attribute inverted lists (ATIL)
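The lookup can be sketched as follows over a hand-written ATIL. The slides do not say whether several keywords in one predicate combine as "any" or "all", so this sketch exposes that as a hypothetical match_all flag and defaults to "any"; the row contents are assumed toy data.

```python
# Hand-written ATIL rows; the contents are illustrative assumptions.
ATIL = {
    "tian//name//": {"p1"},
    "raghu//name//": {"p2"},
    "jie//firstName//": {"p3"},
    "tian//lastName//": {"p3"},
    "jeff//nickName//": {"p3"},
    "1996//year//": {"c1"},
}

def answer_attribute_predicate(index, attribute, keywords, match_all=False):
    """Look up K//A// for every keyword K and combine the rows.

    match_all=False unions the rows, match_all=True intersects them.
    Which of the two the original method uses is not stated in the slides.
    """
    rows = [index.get(f"{k.lower()}//{attribute}//", set()) for k in keywords]
    if not rows:
        return set()
    combine = set.intersection if match_all else set.union
    return combine(*rows)

print(answer_attribute_predicate(ATIL, "lastName", ["Tian"]))
```

With this index, (name, ‘Tian’) returns only p1, which is the gap the hierarchy-aware variants address.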

Page 20: Indexing Dataspaces

Attribute-association inverted lists (AAIL):

Indexing Associations

Example: keyword = Birch, association = authoredPaper with p1, p2

                          a1  c1  p1  p2  p3
Birch//authoredPaper//     0   0   1   1   0

Page 21: Indexing Dataspaces


Attribute-association inverted lists (AAIL):

Indexing Associations

Page 22: Indexing Dataspaces

To answer an association predicate query (R, {K1, ..., Kn}) we need to search for {K1//R//, ..., Kn//R//}.

Example: (author, ‘Raghu’)

“raghu//author//”

Attribute-association inverted lists (AAIL)
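A sketch of the association rows and their lookup, under the reading that a row k//R// lists the instances that have association R to some instance whose attribute values contain k. The toy triples are assumptions consistent with the Birch//authoredPaper// row shown earlier; only the association rows are built here, and several keywords are combined as "any", which is also an assumption.

```python
from collections import defaultdict

# Toy data; values and association directions are illustrative assumptions.
ATTRIBUTE_TRIPLES = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
    ("p2", "name", "Raghu"),
]
ASSOCIATION_TRIPLES = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
]

def build_association_rows(attribute_triples, association_triples):
    """For each (source, R, target), index source under k//R// for every
    keyword k found in the target's attribute values."""
    keywords_of = defaultdict(set)
    for instance, _attribute, value in attribute_triples:
        keywords_of[instance].update(value.lower().split())
    index = defaultdict(set)
    for source, association, target in association_triples:
        for keyword in keywords_of[target]:
            index[f"{keyword}//{association}//"].add(source)
    return dict(index)

def answer_association_predicate(index, association, keywords):
    """Union the rows K//R// over all keywords K (an 'any keyword' reading)."""
    hits = set()
    for keyword in keywords:
        hits |= index.get(f"{keyword.lower()}//{association}//", set())
    return hits

rows = build_association_rows(ATTRIBUTE_TRIPLES, ASSOCIATION_TRIPLES)
print(sorted(rows["birch//authoredPaper//"]))
print(answer_association_predicate(rows, "author", ["Raghu"]))
```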

Page 23: Indexing Dataspaces

For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.

Indexing hierarchies

Page 24: Indexing Dataspaces

To answer the query (name, ‘Tian’) we can search for:

“tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//”

A Naïve method

Can be very expensive!
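The naive expansion can be sketched like this: one lookup per attribute in the sub-tree rooted at the queried attribute, OR-ed together. The hierarchy mirrors the slides' example; the index rows and the helper names are assumptions of this sketch.

```python
# Attribute hierarchy from the slides' example: three sub-attributes of name.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName", "nickName"]}

# Hand-written ATIL rows; the contents are illustrative assumptions.
ATIL = {
    "tian//name//": {"p1"},
    "jie//firstName//": {"p3"},
    "tian//lastName//": {"p3"},
    "jeff//nickName//": {"p3"},
}

def descendants(attribute, hierarchy):
    """Return the attribute followed by all of its transitive sub-attributes."""
    result = [attribute]
    for child in hierarchy.get(attribute, []):
        result.extend(descendants(child, hierarchy))
    return result

def naive_answer(index, hierarchy, attribute, keyword):
    """Issue one lookup per attribute in the sub-tree and union the results."""
    lookups = [f"{keyword.lower()}//{a}//" for a in descendants(attribute, hierarchy)]
    hits = set()
    for key in lookups:
        hits |= index.get(key, set())
    return hits, lookups

hits, lookups = naive_answer(ATIL, SUB_ATTRIBUTES, "name", "Tian")
print(sorted(hits), len(lookups))
```

The number of lookups grows with the size of the attribute sub-tree, which is where the cost comes from.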

Page 26: Indexing Dataspaces

Attribute inverted lists with duplication (Dup-ATIL):

Indexing Attributes

Example: attribute = name, sub-attribute = nickName

                    a1  c1  p1  p2  p3
Jeff//name//         0   0   0   0   1
Jeff//nickName//     0   0   0   0   1

Page 27: Indexing Dataspaces

Attribute inverted lists with duplication (Dup-ATIL)

Index with Duplication

Page 28: Indexing Dataspaces

To answer an attribute predicate query (A, {K1, ..., Kn}) we need to search for {K1//A//, ..., Kn//A//}.

Example: (name, ‘Tian’)

“tian//name//”

The search will yield both p3 and p1.

Attribute inverted lists with duplication (Dup-ATIL)

Page 29: Indexing Dataspaces

Pro: simple query answering

Con: may considerably expand the size of the index because of the duplication, especially when:
◦ there are long paths from the root attribute to the leaf attributes
◦ most values in the triple base belong to leaf attributes

Dup-ATIL (cont.)

Page 31: Indexing Dataspaces

Attribute inverted lists with hierarchies (Hier-ATIL):

Index with Hierarchy Path

Example: attribute = name, sub-attribute = nickName

                          a1  c1  p1  p2  p3
Jeff//name//nickName//     0   0   0   0   1

Page 32: Indexing Dataspaces

To answer an attribute predicate query (A, {K1, ..., Kn}) we need to search for {K1//a0//...//am//*, ..., Kn//a0//...//am//*}, where a0//...//am is the hierarchy path for attribute A.

Example: (name, ‘Tian’)

“tian//name//*”

The search will yield both p3 and p1.

Attribute inverted lists with hierarchies (Hier-ATIL)
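A sketch of the hierarchy-path keys and the prefix search they enable. Keeping the keys in a sorted list and scanning from a bisect position is one common way to run a prefix search, chosen here as an assumption and not taken from the slides; the toy triples are also assumed.

```python
import bisect

# child -> parent attribute, following the slides' example hierarchy.
PARENT = {"firstName": "name", "lastName": "name", "nickName": "name"}

# Toy (instance, attribute, value) triples; values are illustrative assumptions.
ATTRIBUTE_TRIPLES = [
    ("p1", "name", "Tian"),
    ("p3", "firstName", "Jie"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]

def hierarchy_path(attribute, parent):
    """Return a0//...//am// for the attribute, root first."""
    path = [attribute]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return "//".join(reversed(path)) + "//"

def build_hier_atil(triples, parent):
    """One row per keyword plus full hierarchy path; nothing is duplicated."""
    index = {}
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            key = f"{keyword}//{hierarchy_path(attribute, parent)}"
            index.setdefault(key, set()).add(instance)
    return index

def prefix_search(index, sorted_keys, prefix):
    """Union all rows whose key starts with the prefix (contiguous when sorted)."""
    hits = set()
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        hits |= index[key]
    return hits

hier_atil = build_hier_atil(ATTRIBUTE_TRIPLES, PARENT)
keys = sorted(hier_atil)
print(sorted(prefix_search(hier_atil, keys, "tian//name//")))
```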

Page 33: Indexing Dataspaces

Pro: does not increase the number of indexed keywords (although it can lengthen many of them)
◦ real indexing systems typically record a keyword only by the difference from its previous keyword

Con: a predicate query is answered by transforming it into a prefix search, which can be more expensive than a keyword search.

Hier-ATIL (cont.)

Page 35: Indexing Dataspaces


Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors

Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors

Hybrid indexing combines the strengths of both methods

Hybrid Index – Why?

Page 36: Indexing Dataspaces

Hybrid attribute inverted list (Hybrid-ATIL): an inverted list that can answer any prefix search by reading no more than t rows.

Hybrid Index

                            a1  c1  p1  p2  p3
Jeff//name//nickName//       0   0   0   0   1
Jie//name//firstName//       0   0   0   0   1
Tian//name////               0   0   1   0   1   (summary row)
Tian//name//lastName//       0   0   0   0   1

Tian//name//lastName// is shadowed by Tian//name//

Page 37: Indexing Dataspaces

To answer a prefix query of the form k//a0//...//am//* we look at all the rows with prefix k//a0//...//am// except those shadowed by summary rows.

Example: (name, ‘Tian’), t=1

“tian//name//*”

The prefix search is answered after reading 1 row and yields both p1 and p3.

Hybrid Index
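A deliberately simplified sketch of the summary-row idea: whenever more than t rows share a path prefix, a summary row holding their union is stored for that prefix, and a prefix search reads the summary instead of the individual rows. The construction below (count rows per prefix, summarize every prefix over the threshold, keep summaries in a separate dict) is an assumption of this sketch and not necessarily how the original method chooses or stores its summary rows.

```python
from collections import defaultdict

# Hier-ATIL rows matching the table in the slides (keywords lowercased).
HIER_ATIL = {
    "jeff//name//nickName//": {"p3"},
    "jie//name//firstName//": {"p3"},
    "tian//name//": {"p1"},
    "tian//name//lastName//": {"p3"},
}

def path_prefixes(key):
    """Yield each prefix of a row key that ends at an attribute boundary."""
    parts = key.split("//")[:-1]          # keyword followed by attribute labels
    for end in range(2, len(parts) + 1):
        yield "//".join(parts[:end]) + "//"

def build_summaries(hier_index, t):
    """Create a summary row for every prefix that covers more than t rows."""
    covered = defaultdict(list)
    for key in hier_index:
        for prefix in path_prefixes(key):
            covered[prefix].append(key)
    return {prefix: set().union(*(hier_index[k] for k in keys))
            for prefix, keys in covered.items() if len(keys) > t}

def hybrid_search(hier_index, summaries, prefix):
    """Return (instances, rows_read); a summary row answers in one read."""
    if prefix in summaries:
        return summaries[prefix], 1
    keys = [k for k in hier_index if k.startswith(prefix)]
    return set().union(*(hier_index[k] for k in keys)), len(keys)

summaries = build_summaries(HIER_ATIL, t=1)
print(hybrid_search(HIER_ATIL, summaries, "tian//name//"))
```

With t=1 only the "tian//name//" prefix covers more than one row, so it is the only summary row created, which matches the table in the slides.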

Page 38: Indexing Dataspaces


We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL

Neighborhood Keyword Queries

Page 39: Indexing Dataspaces

Example: “Birch”, t=1

“birch//*”

The prefix search is answered after reading 1 row and yields p1, p2, a1, c1.

Neighborhood Keyword Queries
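A sketch of the keyword-level lookup, simplified to the t=1 case in which every keyword gets a summary row keyed "keyword//". The rows are assumed toy data: the attribute row makes a1 the relevant instance, and the association rows bring in the associated instances. The label hasPaper is a hypothetical name for the association from c1 to a1, which the slides do not give.

```python
# Attribute and association rows for one keyword; contents are assumptions.
ROWS = {
    "birch//title//": {"a1"},                 # relevant instance
    "birch//authoredPaper//": {"p1", "p2"},   # associated instances
    "birch//hasPaper//": {"c1"},              # hypothetical association label
}

def build_kil(rows):
    """Copy the rows and add one summary row per keyword, keyed 'keyword//'."""
    kil = dict(rows)
    for key, instances in rows.items():
        summary_key = key.split("//", 1)[0] + "//"
        kil[summary_key] = kil.get(summary_key, set()) | instances
    return kil

def neighborhood_query(kil, keywords):
    """Answer each keyword from its summary row and union the results."""
    hits = set()
    for keyword in keywords:
        hits |= kil.get(keyword.lower() + "//", set())
    return hits

kil = build_kil(ROWS)
print(sorted(neighborhood_query(kil, ["Birch"])))
```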

Page 40: Indexing Dataspaces

Associations between disparate items on the desktop:
◦ LaTeX and BibTeX files
◦ Word documents
◦ PowerPoint presentations
◦ emails and contacts
◦ webpages in the web cache

The instances and associations are stored in an RDF file. The size of the file is 52.4 MB.

Experimental Evaluation

Page 41: Indexing Dataspaces

Experimental Evaluation

[Figure panels: attribute clauses without sub-attributes; attribute clauses with sub-attributes; association clauses]

Page 42: Indexing Dataspaces

Observations about the results

105,320 objects, 300,354 attributes, 468,402 associations

predicate query: 15.2 ms

neighborhood keyword query: 224.3 ms

(with no more than 5 keywords)

Answering queries using the KIL was very efficient!

Answering queries with / without sub-attributes consumed a similar amount of time

Effectiveness of hybrid indexing

Page 43: Indexing Dataspaces

Compared with KIL (on average):

The Naïve method:
◦ query-answering time increased by a factor of 15.9

XML Index (SepIL):
◦ query-answering time increased by a factor of 2

Comparison of methods

Page 44: Indexing Dataspaces

Main Contributions:
◦ An indexing method that is designed to support flexible querying over dataspaces
◦ Extended inverted lists to capture both texts and structure of data

Future Work:
◦ Extend the index to support value heterogeneity and to investigate appropriate ranking algorithms

Conclusions

Page 45: Indexing Dataspaces

THE END

Questions?