A Self-Healing Approach for A Domain-Specific Deep Web Search Tool
Fan Wang and Gagan Agrawal, The Ohio State University
Presented by: Tantan Liu
Fan Wang [email protected], Gagan Agrawal [email protected]
BIBE 2010, Philadelphia, USA

Page 1: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Fan Wang [email protected], Gagan Agrawal [email protected] BIBE 2010, Philadelphia, USA

A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Fan Wang and Gagan Agrawal
The Ohio State University

Presented by : Tantan Liu

Page 2: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

The Deep Web

• The definition of “Deep web” from Wikipedia

The Deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.

Page 3: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Deep Web in Biological Domain

• 500 times larger than the surface web

• Nearly 800 deep web data sources in the bio-domain

• 95 percent of the deep web is publicly accessible

Page 4: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Search the Deep Web: Solved and Unsolved Issues

• Data Source Integration
  – Schema matching and schema mining

• Query Planning and Answering
  – Keyword search and structured query answering

• Fault Tolerance
  – Data access over wide-area networks
  – Unpredictable data source inaccessibility/unavailability
  – Network contention
  – Yet the user search experience should remain uncompromised

Page 5: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Our Solution: A Redundancy based Self-Healing Approach

• Identify data redundancy across independent data sources

• Find the minimal “have to be replaced” sub-plan caused by data source unavailability/inaccessibility

• Find the sub-query corresponding to the “have to be replaced” sub-plan

• Generate a new replacing sub-plan based on redundancy using other data sources

Page 6: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Roadmap

• Introduction and Motivation

• Problem Formulation in Detail

• Our Self-Healing Approach

• Evaluation

• Conclusion

Page 7: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Data Redundancy Model

• A data source is represented by a three-tuple
  – IN: input attributes
  – O: output attributes
  – Con: attribute conditions imposed on the data source

• Data redundancy condition between data sources A and B
  – They have the same input attributes
  – They have overlapping output attributes
  – They have non-conflicting attribute conditions
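The three-tuple and the three redundancy conditions can be sketched in Python. This is a minimal illustration, not the authors' implementation: the `DataSource` class, the encoding of Con as (attribute, value) pairs, the rule that two conditions conflict when they bind the same attribute to different values, and the attribute names `gene`, `snp`, `protein`, `organism` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """A data source as the three-tuple (IN, O, Con)."""
    name: str
    inputs: frozenset        # IN: input attributes
    outputs: frozenset       # O: output attributes
    # Con: modeled here (an assumption) as (attribute, required value) pairs.
    conditions: tuple = ()


def conditions_conflict(a: DataSource, b: DataSource) -> bool:
    """Assumed rule: a conflict exists when both sources constrain the
    same attribute to different values."""
    cond_a = dict(a.conditions)
    return any(attr in cond_a and cond_a[attr] != value
               for attr, value in b.conditions)


def redundant_outputs(a: DataSource, b: DataSource) -> frozenset:
    """Return the output attributes on which a and b are redundant, or an
    empty set when any of the three conditions fails."""
    if a.inputs != b.inputs:            # same input attributes
        return frozenset()
    if conditions_conflict(a, b):       # non-conflicting conditions
        return frozenset()
    return a.outputs & b.outputs        # overlapping output attributes


# Hypothetical sources; attribute names are illustrative only.
src_a = DataSource("A", frozenset({"gene"}), frozenset({"snp", "protein"}),
                   (("organism", "human"),))
src_b = DataSource("B", frozenset({"gene"}), frozenset({"snp"}),
                   (("organism", "human"),))
src_c = DataSource("C", frozenset({"gene"}), frozenset({"snp"}),
                   (("organism", "mouse"),))
```

Under these assumptions, `redundant_outputs(src_a, src_b)` yields the shared attribute `snp`, while `src_a` and `src_c` share nothing because their conditions on `organism` conflict.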

Page 8: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Query and Query Plan

• Query
  – SQL query format

    select t1, t2, …, tn                      (search term set ST)
    from the deep web
    where in1=e1 and in2=e2, …, inm=em        (input term set INT)

• Query Plan
  – A DAG of data source nodes that “covers” the user query

[Figure: query plan DAG over data sources D1, D2, D3, D4. Labels: starting node (its input attributes are input terms in the query); query plan nodes (output attributes may be user-requested search terms); edges show data source dependency.]
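One way to read “covers” is sketched below in Python. The slide does not define the term precisely, so the check used here is an assumption: every starting node must be answerable from the query's input terms, and the union of all output attributes must contain the search terms. The `Query` class, the `covers` function, and the edges among D1 to D4 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    search_terms: frozenset   # ST: t1 ... tn
    input_terms: frozenset    # INT: in1 ... inm


def covers(edges, sources, query):
    """edges: node -> list of successor nodes (a DAG).
    sources: node -> (input attributes, output attributes) as sets.
    Returns True if the plan covers the query under the assumed reading."""
    nodes = set(sources)
    has_parent = {child for children in edges.values() for child in children}
    starting = nodes - has_parent
    # Starting nodes may only consume input terms given in the query.
    if any(not sources[n][0] <= query.input_terms for n in starting):
        return False
    produced = set().union(*(sources[n][1] for n in nodes))
    return query.search_terms <= produced


# Hypothetical plan shape; the slide does not spell out the edges.
plan_edges = {"D1": ["D2", "D3"], "D2": ["D4"], "D3": ["D4"]}
plan_sources = {
    "D1": ({"in1"}, {"x"}),
    "D2": ({"x"}, {"t1"}),
    "D3": ({"x"}, {"y"}),
    "D4": ({"t1", "y"}, {"t2"}),
}
example_query = Query(frozenset({"t1", "t2"}), frozenset({"in1"}))
```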

Page 9: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Algorithm Overview (1)

• Find the part of the query plan that needs to be replaced
  – Impacted sub-plan
    • The sub-graph reachable from the unavailable data source nodes
  – Minimal impacted sub-plan
    • The impacted sub-plan without usable data source nodes, considering the given data redundancy

[Figure: example query plan with data source nodes A, B, D, E, F and terms t1, t3, t4, t6.]
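The impacted sub-plan, as defined on this slide, is a plain reachability computation. A minimal Python sketch follows, assuming the plan is stored as an adjacency mapping; the function name `impacted_subplan` and the example edges are hypothetical and are not taken from the slide's figure.

```python
from collections import deque


def impacted_subplan(edges, unavailable):
    """Return every node reachable from the unavailable nodes, including
    the unavailable nodes themselves (breadth-first search)."""
    seen = set(unavailable)
    queue = deque(unavailable)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical edges: node -> data sources that depend on it.
example_edges = {"A": ["B", "D"], "B": ["E"], "D": ["E"], "E": ["F"]}
```

With these made-up edges, a failure of B impacts B, E, and F, while A and D stay outside the impacted sub-plan.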

Page 10: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Algorithm Overview (2)

• Find Maximal Fixable Sub-Query
  – The sub-query corresponding to the minimal impacted sub-plan

• New Sub-Plan Generation
  – Use our existing query planning algorithm

[Figure: example query plan with data source nodes A, B, D, E, F and terms t1, t3, t4, t6, annotated with the sub-query below.]

Select t3, t4 where input=t1

Page 11: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Minimal Impacted Sub-Plan Algorithm

[Figure: example query plan with data source nodes A, B:t9, D:t10, E, F:t11, I:t12, J, K:t13, L:t14 and edge terms t1 to t8.]

1. Identify unavailable data sources {B, I}

2. Find the sub-graph reachable from them (impacted sub-plan)

3. Cascading-crash conditions for data source X which is dependent on data source D
   A. At least one data source, sharing redundant data with D, is not crashed
   B. At least one such data source has the same usage as D
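The three steps can be sketched as a crash-propagation loop in Python. This is one possible reading, not the authors' algorithm: the slide does not state how conditions A and B decide whether a dependent source X survives the loss of D, so that decision is delegated to a caller-supplied predicate `is_usable(x, d)`. The function name, the predicate, and the example edges are all hypothetical; only the unavailable set {B, I} comes from the slide.

```python
def minimal_impacted_subplan(edges, unavailable, is_usable):
    """edges: node -> list of dependent nodes.
    unavailable: the self-crashed data sources (step 1).
    is_usable(x, d): assumed predicate, True if x can still be used even
    though the source d it depends on has crashed.
    Returns self-crashed nodes plus all cascading-crashed dependents."""
    crashed = set(unavailable)
    stack = list(unavailable)
    while stack:
        d = stack.pop()
        for x in edges.get(d, ()):      # step 2: walk reachable nodes
            if x not in crashed and not is_usable(x, d):
                crashed.add(x)          # step 3: x crashes in cascade
                stack.append(x)
    return crashed


# Made-up dependency edges; the figure's real edges are not recoverable.
demo_edges = {"B": ["J"], "I": ["J", "K"], "J": ["L"]}


def demo_is_usable(x, d):
    # Assumption for the demo: only K has a usable redundant replacement.
    return x == "K"
```

Nodes that stay usable stop the propagation, so anything reachable only through them is left out of the result.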

Page 12: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Minimal Impacted Sub-Plan Fixability

• Minimal Impacted Sub-Plan Fixability
  – How much of the minimal impacted sub-plan can be fixed using other data sources, taking advantage of data redundancy

• Dead Attribute
  – No un-crashed data source can provide the attribute as its output attribute

• Plan Fixability Categorization
  – Fully fixable: only self-crashed nodes, no dead attribute
  – Partially fixable: only self-crashed nodes, with dead attributes
  – Cascading fully fixable: cascading-crashed nodes, no dead attribute
  – Cascading partially fixable: cascading-crashed nodes, with dead attributes
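The four categories follow from two yes/no questions: did any node crash in cascade, and is any needed attribute dead? A small Python sketch under that reading is given below; the function name and its parameters are hypothetical, and `live_outputs` is assumed to be the union of output attributes of all un-crashed data sources.

```python
def classify_fixability(minimal_subplan, unavailable,
                        needed_attributes, live_outputs):
    """Return (category, dead attributes).
    minimal_subplan: nodes in the minimal impacted sub-plan.
    unavailable: the self-crashed nodes.
    needed_attributes: attributes the sub-plan was supposed to provide.
    live_outputs: outputs offered by un-crashed sources (assumed input)."""
    # Any node beyond the self-crashed ones must have crashed in cascade.
    cascading = bool(set(minimal_subplan) - set(unavailable))
    # A dead attribute is one that no un-crashed source can output.
    dead = set(needed_attributes) - set(live_outputs)
    degree = "partially fixable" if dead else "fully fixable"
    category = f"cascading {degree}" if cascading else degree
    return category, dead
```

For example, a sub-plan consisting only of the failed source B, whose attributes t3 and t4 are both still offered elsewhere, would be classified as fully fixable.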

Page 13: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Maximal Fixable Sub-Query Generation

• For each source in the minimal impacted sub-plan, we compute
  – Input set IN
  – Requested output set RO
  – Linking set L

• Maximal Fixable Sub-Query
  – Input term set: input attributes of all data sources in the minimal impacted sub-plan without incoming edges (self-crashed data sources)
  – Search term set
    • User-requested search terms which are supposed to be covered by the minimal impacted sub-plan
    • Terms in the linking set of the nodes in the minimal impacted sub-plan which have outgoing edges to data sources outside of the minimal impacted sub-plan

[Figure: example query plan with data source nodes A, B, D, E, F and terms t1, t3, t4, t6.]

IN={t1} L={t3,t4}
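The construction of the sub-query from the IN, RO, and L sets can be sketched in Python. Two assumptions are made: “without incoming edges” is read as having no incoming edge from another node inside the minimal impacted sub-plan, and each set is stored as a per-node dictionary. The function name and the edge targets in the example are hypothetical; the values IN={t1} and L={t3,t4} are the ones shown on the slide.

```python
def maximal_fixable_subquery(subplan, edges, inputs, requested, linking):
    """subplan: set of nodes in the minimal impacted sub-plan.
    edges: node -> list of dependent nodes (whole query plan).
    inputs[n], requested[n], linking[n]: the IN, RO and L sets of node n.
    Returns (input term set, search term set) of the sub-query."""
    # Nodes with no incoming edge from inside the sub-plan.
    has_incoming = {c for n in subplan for c in edges.get(n, ()) if c in subplan}
    roots = set(subplan) - has_incoming
    input_terms = set().union(*(inputs[n] for n in roots))

    # User-requested terms the sub-plan was supposed to cover.
    search_terms = set().union(*(requested[n] for n in subplan))
    # Linking terms of nodes that feed data sources outside the sub-plan.
    for n in subplan:
        if any(child not in subplan for child in edges.get(n, ())):
            search_terms |= linking[n]
    return input_terms, search_terms


# Assumed edges around the failed source B.
sq_edges = {"A": ["B"], "B": ["D", "E"]}
sq_inputs = {"B": {"t1"}}
sq_requested = {"B": set()}
sq_linking = {"B": {"t3", "t4"}}
```

With these inputs the function returns the input term set {t1} and the search term set {t3, t4}, which corresponds to the sub-query `Select t3, t4 where input=t1` shown earlier in the deck.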

Page 14: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Roadmap

• Introduction and Motivation

• Problem Formulation in Detail

• Our Self-Healing Approach

• Evaluation

• Conclusion

Page 15: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Evaluation

• 12 biological deep web data sources
• 20 queries, 4 groups
• Each group corresponding to one fixability category
• Methods compared
  – Baseline: start from scratch
  – Our method

Page 16: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Query Answering Time Comparison

1. Our method is more efficient in fixing failed query plans than the baseline method
2. Our method is at least 20% faster for all queries in this figure.

Page 17: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Query Result Quality Comparison

For 18 out of 20 cases, the recall from our method is exactly the same as the ideal recall from the baseline method

Page 18: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Conclusion

• Propose a self-healing approach to support fault tolerance for deep web searches
• Find the minimal impacted sub-plan caused by unavailable/inaccessible data sources
• Find a new plan to replace the minimal impacted sub-plan
• Our method outperforms a baseline method in terms of both efficiency and result quality

Page 19: A Self-Healing Approach for A Domain-Specific Deep Web Search Tool

Questions?

Contact us: Fan Wang [email protected]

Gagan Agrawal [email protected]