iTrails: Pay-as-you-go Information Integration in Dataspaces

Presented By Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi

ETH Zurich

2008-02-22

Summarized by Sungchan Park

Copyright 2008 by CEBT

Problem: Querying Several Sources

Center for E-Business Technology


Solution #1: Use a Search Engine


Solution #2: Use an Information Integration System


iTrail Core Idea

Is there an integration solution in-between these two extremes?


Declaratively add lightweight ‘hints’ to a search engine, allowing gradual enrichment of loosely integrated data sources


Example Scenario

Query

“pdf yesterday”

Hints (Trails)

1. The date attribute is mapped to the modified attribute

2. The date attribute is mapped to the received attribute

3. The yesterday keyword is mapped to a query for values of the date attribute equal to the date of yesterday

4. The pdf keyword is mapped to a query for elements whose names end in pdf


Where do hints come from?

Given by the user

Explicitly

Via Relevance Feedback

(Semi-)Automatically

Information extraction techniques

Automatic schema matching

Ontologies and thesauri (e.g., WordNet)

User communities (e.g., trails on gene data, bookmarks)

All these aspects are beyond the scope of this paper


Data and Query Model

Data Model

Assume that all data is represented by a logical graph G

Queries are also represented as graphs


Query Syntax


Query Example

//Home/projects//*["Mike"]
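To make the query concrete, here is a minimal sketch of how it could be evaluated, assuming an XPath-like reading: find a node named Home anywhere in the graph, step to its child projects, and return every descendant whose content contains the keyword "Mike". The `Node` class, its fields, and the sample data are hypothetical, and the sketch assumes an acyclic graph; it is not the iMeMex implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of the logical graph (field names are illustrative)."""
    name: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below `node`, i.e. the `//*` step."""
    for child in node.children:
        yield child
        yield from descendants(child)


def nodes_named(root: Node, name: str) -> List[Node]:
    """`//name`: the root or any descendant carrying the given name."""
    return [n for n in [root, *descendants(root)] if n.name == name]


def query_home_projects(root: Node, keyword: str) -> List[Node]:
    """Evaluate //Home/projects//*["<keyword>"] under the assumed semantics."""
    results = []
    for home in nodes_named(root, "Home"):
        for projects in (c for c in home.children if c.name == "projects"):
            results.extend(n for n in descendants(projects)
                           if keyword in n.content)
    return results


# Hypothetical sample data: only notes.txt lies under Home/projects
# and mentions the keyword in its content.
root = Node("root", children=[
    Node("Home", children=[
        Node("projects", children=[
            Node("PIM", children=[
                Node("notes.txt", content="Call Mike about the demo"),
            ]),
            Node("budget.xls", content="Q3 figures"),
        ]),
        Node("music", children=[Node("mike.mp3")]),
    ]),
])

hits = [n.name for n in query_home_projects(root, "Mike")]
```

The path part restricts where to look and the bracketed keyword restricts what to return, which is why a file named mike.mp3 outside projects is not a hit here.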


Basic Form of a Trail

A unidirectional trail

A bidirectional trail


Trail Example

Trails in an example scenario

Trails

Given query

– “pdf yesterday”

Transformed query

– //*.pdf[modified=yesterday() OR received=yesterday()]
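The rewrite from the keyword query to the transformed query can be sketched at the string level. The two dictionaries below encode the four hints of the example scenario; their layout, the names `ATTRIBUTE_TRAILS`, `KEYWORD_TRAILS`, and `rewrite`, and the string-based approach are assumptions made for illustration. The slides describe rewriting over logical query plans, not strings.

```python
# Hints 1 and 2: the date attribute is mapped to modified and to received.
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}

# Hints 3 and 4: keywords mapped to structured query fragments.
KEYWORD_TRAILS = {
    "yesterday": ("attribute", "date", "yesterday()"),
    "pdf": ("name", "*.pdf"),
}


def rewrite(keyword_query: str) -> str:
    """Rewrite a keyword query into a path query using the trails above."""
    name_pattern = "*"   # default: match elements of any name
    predicates = []
    for keyword in keyword_query.split():
        trail = KEYWORD_TRAILS.get(keyword)
        if trail is None:
            # No trail matches: keep the keyword as a plain content predicate.
            predicates.append(f'"{keyword}"')
        elif trail[0] == "name":
            name_pattern = trail[1]
        else:
            _, attribute, value = trail
            # Follow the attribute trails; fall back to the attribute itself.
            targets = ATTRIBUTE_TRAILS.get(attribute, [attribute])
            predicates.append(" OR ".join(f"{t}={value}" for t in targets))
    path = f"//{name_pattern}"
    if predicates:
        path += "[" + " AND ".join(predicates) + "]"
    return path


transformed = rewrite("pdf yesterday")
```

Under these assumptions `transformed` equals the transformed query shown above. The sketch does not parenthesize OR groups when several predicates are combined, so it is only meant for this two-keyword example.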


iTrail Query Processing

1. Matching

2. Transforming

3. Merging


iTrail Query Processing Example

Given Query

Q1 = //home/projects//*["Mike"]

Trail

Ψ8 := //home/*.name -> //calendar//*.tuple.category

Resulting Query

Q1{Ψ8} = //home/projects//*["Mike"] U //calendar//*[category="project"]//*["Mike"]


Utilizing G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. In PODS, 2002.
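The three steps (matching, transforming, merging) can be illustrated with a toy, string-based sketch applied to Q1 and Ψ8. This is an assumption-laden simplification: matching here is a plain prefix test with a regular expression rather than the query-containment check cited above, and the helper names `match`, `transform`, and `apply_trail` are hypothetical. The value copied into the category predicate is whatever name the `*` on the left side of the trail matched in the query.

```python
import re


def match(query, trail_left):
    """Matching: if the query starts with the trail's left-side path,
    return (name bound to '*', remainder of the query); otherwise None."""
    # "//home/*.name" -> path "//home/*"; ".name" names the matched component.
    left_path = trail_left.rsplit(".", 1)[0]
    pattern = "^" + re.escape(left_path).replace(r"\*", r"(?P<name>[^/\[]+)")
    m = re.match(pattern, query)
    if m is None:
        return None
    return m.group("name"), query[m.end():]


def transform(name, rest, trail_right):
    """Transforming: rebuild the matched part on the trail's right side."""
    # "//calendar//*.tuple.category" -> path "//calendar//*", attribute "category".
    right_path, _, attribute = trail_right.rpartition(".tuple.")
    return f'{right_path}[{attribute}="{name}"]{rest}'


def apply_trail(query, trail_left, trail_right):
    """Merging: union of the original query and its transformed variant."""
    matched = match(query, trail_left)
    if matched is None:
        return query  # the trail does not apply; the query stays unchanged
    name, rest = matched
    return f"{query} U {transform(name, rest, trail_right)}"


q1 = '//home/projects//*["Mike"]'
merged = apply_trail(q1, "//home/*.name", "//calendar//*.tuple.category")
```

Keeping the original query in the union means a trail can only add results; it never removes what the plain search would have found.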


Applying Multiple Trails

MMCA (Multiple Match Colouring Algorithm)

Without a safeguard, trails could be applied recursively without end

To prevent infinite recursion, a trail should not be rematched to nodes in a logical plan generated by itself
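The colouring idea can be illustrated with a deliberately simplified stand-in, not MMCA itself: every generated term remembers the set of trails ("colours") that produced it, and a trail is skipped on terms that already carry its colour. Keyword-to-keyword trails, the function name `expand`, and the sample trails are assumptions made for this sketch.

```python
def expand(keyword, trails, max_levels=10):
    """Apply keyword trails level by level.

    trails: dict mapping trail id -> (left keyword, right keyword).
    Returns the list of terms in the final union and the number of
    levels that were processed.
    """
    frontier = [(keyword, frozenset())]   # (term, colours of producing trails)
    union = [keyword]
    levels = 0
    while frontier and levels < max_levels:
        next_frontier = []
        for term, colours in frontier:
            for trail_id, (left, right) in trails.items():
                # A trail is never rematched to something it generated itself.
                if term == left and trail_id not in colours:
                    next_frontier.append((right, colours | {trail_id}))
                    if right not in union:
                        union.append(right)
        frontier = next_frontier
        levels += 1
    return union, levels


# Two trails that feed each other would loop forever without colouring.
cyclic_trails = {"t1": ("car", "auto"), "t2": ("auto", "car")}
terms, levels_used = expand("car", cyclic_trails)
```

With the two mutually recursive trails, expansion stops on its own after a few levels because the regenerated term "car" is coloured with t1 and can no longer be matched by t1.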


Other Issues

Trail Pruning

Problem: MMCA is exponential in number of levels

Solution: Trail Pruning

– Prune by number of levels

– Prune by top-K trails matched in each level

Assign a weight and a probability to each trail

– Prune by both top-K trails and number of levels
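The combined pruning strategy can be sketched as follows, assuming each matched trail carries a single numeric score. How that score is derived from a trail's weight and probability is not specified here, so the score values, the function name `prune`, and the input layout are hypothetical.

```python
def prune(matches_per_level, k, max_levels):
    """Keep at most `k` best-scoring trails per level, for `max_levels` levels.

    matches_per_level: one list per level of (trail id, score) pairs.
    Returns one list of kept trail ids per retained level.
    """
    kept = []
    for level_matches in matches_per_level[:max_levels]:   # prune by levels
        ranked = sorted(level_matches, key=lambda m: m[1], reverse=True)
        kept.append([trail_id for trail_id, _ in ranked[:k]])   # prune by top-K
    return kept


# Hypothetical matches for three levels of trail application.
levels = [
    [("t1", 0.9), ("t2", 0.4), ("t3", 0.7)],
    [("t4", 0.5), ("t5", 0.8)],
    [("t6", 0.95)],
]
kept = prune(levels, k=2, max_levels=2)
```

This sketch prunes a precomputed list of matches. In an actual rewrite loop, pruning at one level would also reduce which trails can match at the next level, which is where the savings against exponential growth come from.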

Trail Indexing

Precompute trail expressions in order to speed up query processing

Trail materialization


Experiments

Setting

Configured iMeMex to act in three modes

– Baseline: Graph / IR search engine

– iTrails: Rewrite search queries with trails

– Perfect Query: Semantics-aware query

Data


Experiment, Quality

Compare with baseline


Experiment, Overhead

Compare with perfect query

Overhead is not negligible

However, this can be fixed by exploiting trail materializations


Experiment, Scalability #1


Rewrite Time

Query-rewrite time can be controlled with pruning


Experiment, Scalability #2

Quality

Pruning improves precision


Conclusion

Our Contributions

iTrails: a generic method to model semantic relationships (e.g., implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)

We propose a framework and algorithms for Pay-as-you-go Information Integration

Smooth transition between search and data integration

Future Work

Trail Creation

– Use collections (ontologies, thesauri, Wikipedia)

– Work on automatic mining of trails from the dataspace

Other types of trails
