
Bootstrapping Pay-As-You-Go Data Integration Systems

By Anish Das Sarma, Xin Dong, Alon Halevy
Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008

Presented by Andrew Zitzelberger

Data Integration

Offer a single-point interface to a set of data sources
Mediated schema
Semantic mappings
Queries are posed through the mediated schema

Pay-as-you-go
Useful in many contexts without full integration
The system starts with few (or inaccurate) semantic mappings
Mappings are improved over time

Problem: full integration requires significant upfront and ongoing effort

Contributions

Self-configuring data integration system
Provides an advanced starting point for pay-as-you-go systems
The initial configuration provides good precision and recall

Algorithms
Mediated schema generation
Semantic mapping generation

Concept
Probabilistic mediated schema

Probabilistic Mediated Schema

Mediated Schema Generation

1) Remove infrequent attributes
Ensures the mediated schema contains the most relevant attributes

2) Construct a weighted graph
Nodes are the remaining attributes
Edges are weighted by a similarity measure s(ai, aj)
Cull edges below a threshold τ

3) Cluster nodes
A cluster is a connected component of the graph (see the sketch below)
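As a rough illustration, here is a minimal Python sketch of steps 2 and 3, assuming the pairwise similarity scores s(ai, aj) are already computed (e.g. with Jaro-Winkler, as in the experimental setup); all names are illustrative, not from the paper.

from collections import defaultdict

def cluster_attributes(attributes, similarity, tau):
    """Cluster attributes: keep edges with similarity >= tau, then
    return the connected components of the resulting graph."""
    adj = defaultdict(set)
    for (ai, aj), score in similarity.items():
        if score >= tau:
            adj[ai].add(aj)
            adj[aj].add(ai)
    clusters, seen = [], set()
    for a in attributes:
        if a in seen:
            continue
        # Depth-first search to collect one connected component.
        component, stack = set(), [a]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adj[node] - seen)
        clusters.append(component)
    return clusters

For example, cluster_attributes(["name", "title", "year"], {("name", "title"): 0.9, ("title", "year"): 0.3}, 0.85) yields the clusters {name, title} and {year}.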

Probabilistic Mediated Schema Generation

Allow for an error є in the weighted graph:
Certain edges: s(ai, aj) ≥ τ + є
Uncertain edges: τ − є ≤ s(ai, aj) < τ + є
Culled edges: s(ai, aj) < τ − є

Remove unnecessary uncertain edges
Create a mediated schema from every subset of the uncertain edges

Probabilistic Mediated Schema Generation

Assign a probability to each resulting mediated schema
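A minimal sketch of the enumeration step, reusing cluster_attributes from the earlier sketch; the uniform probabilities at the end are a simplifying placeholder, not the paper's actual assignment.

from itertools import combinations

def probabilistic_mediated_schemas(attributes, certain_edges, uncertain_edges):
    """Build one candidate mediated schema per subset of the
    uncertain edges; certain edges are always included."""
    candidates = []
    for r in range(len(uncertain_edges) + 1):
        for subset in combinations(uncertain_edges, r):
            # Treat every included edge as having full weight.
            edges = {e: 1.0 for e in list(certain_edges) + list(subset)}
            candidates.append(cluster_attributes(attributes, edges, 0.0))
    # Placeholder: a uniform distribution over the candidates.
    p = 1.0 / len(candidates)
    return [(schema, p) for schema in candidates]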

Probabilistic Mediated Schema

Probabilistic Semantic Mappings

Probabilistic Mapping Generation

Weighted correspondences: attribute correspondences (ai, bj) with weights wi,j produced by the schema matcher

Choose the consistent p-mapping with the maximum entropy.

Probabilistic Mapping Generation

1) Enumerate one-to-one mappings
Each mapping must contain a subset of the correspondences

2) Assign probabilities that maximize entropy
Solve the constrained maximization problem sketched below
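Reconstructed from the definitions above (and so an approximation rather than the paper's exact statement): let m1, …, ml be the candidate one-to-one mappings with unknown probabilities p1, …, pl, and let wi,j be the weight of correspondence (ai, bj). In LaTeX form:

\begin{aligned}
\max_{p_1,\dots,p_l}\quad & -\sum_{k=1}^{l} p_k \log p_k \\
\text{subject to}\quad & \sum_{k\,:\,(a_i,\,b_j)\,\in\, m_k} p_k = w_{i,j} \quad \text{for each correspondence } (a_i, b_j), \\
& \sum_{k=1}^{l} p_k = 1, \qquad p_k \ge 0 .
\end{aligned}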

Probabilistic Mediated Schema Consolidation

Why?
The user expects a single deterministic schema
More efficient query answering

How?

Schema Consolidation Example

M = {M1, M2}
M1 contains {a1, a2, a3}, {a4}, and {a5, a6}
M2 contains {a2, a3, a4} and {a1, a5, a6}
T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
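One way to reproduce T from this example is to group attributes that share a cluster in every Mi (a common refinement). The following is a sketch under that assumption, not necessarily the paper's exact consolidation procedure.

def consolidate_schemas(schemas):
    """schemas: list of clusterings, each a list of attribute sets.
    Two attributes end up in the same consolidated cluster only if
    they share a cluster in every input schema."""
    def signature(attr):
        # Index of the cluster containing attr in each schema.
        return tuple(
            next(i for i, cluster in enumerate(schema) if attr in cluster)
            for schema in schemas
        )
    attributes = set().union(*schemas[0])
    groups = {}
    for a in attributes:
        groups.setdefault(signature(a), set()).add(a)
    return list(groups.values())

m1 = [{"a1", "a2", "a3"}, {"a4"}, {"a5", "a6"}]
m2 = [{"a2", "a3", "a4"}, {"a1", "a5", "a6"}]
print(consolidate_schemas([m1, m2]))
# Prints {a1}, {a2, a3}, {a4}, and {a5, a6}, matching T (order may vary).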

Probabilistic Mapping Consolidation

Modify p-mappings
Update the mappings to match the new mediated schema

Modify probabilities
Scale each mapping's probability by Pr(Mi), the probability of its mediated schema

Consolidate
Add all new mappings to a new set
If a mapping is already in the new set during addition, add the probabilities (see the sketch below)
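A minimal sketch of the probability bookkeeping, assuming each p-mapping is a dict from a mapping (a frozenset of correspondences, already rewritten against the consolidated schema) to its probability; the representation is illustrative.

def consolidate_pmappings(pmappings):
    """pmappings: list of (p_mapping, schema_prob) pairs, where
    p_mapping maps each mapping to its probability and schema_prob
    is Pr(Mi) for the mediated schema it was generated against."""
    consolidated = {}
    for p_mapping, schema_prob in pmappings:
        for mapping, prob in p_mapping.items():
            # Scale by Pr(Mi); if the mapping is already present,
            # the probabilities are added.
            consolidated[mapping] = consolidated.get(mapping, 0.0) + prob * schema_prob
    return consolidated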

Experimental Setup

UDI – the data integration system
Accepts select-project queries (over a single table)

Source data – MySQL
Query processor – Java
Jaro-Winkler similarity computation – SecondString
Entropy maximization problem – Knitro
Operating system – Windows Vista
CPU – Intel Core 2 GHz
Memory – 2 GB

Experimental Setup

τ = 0.85
є = 0.02
θ = 10%

Experiments

Domains: Movie, Car, People, Course, Bibliography

Gold standards
Manually created for People and Bibliography
Partially created for the others

10 test queries
One to four attributes in the SELECT clause
Zero to three predicates in the WHERE clause

Results

Estimated actual recall between 0.8 and 0.85

Experiments

Compare to other methods:
KEYWORDNAIVE, KEYWORDSTRUCT, KEYWORDSTRICT – MySQL keyword-search baselines
SOURCE – unions the results of each data source
TOPMAPPING – only considers the p-mapping with the highest probability

Results

Experiments

Compare against other query-answering methods:
SINGLEMED – a single deterministic mediated schema
UNIONALL – a single deterministic mediated schema that contains a singleton cluster for each frequent source attribute

Results

Experiment and Results

Quality of the mediated schema
Tested against a manually created schema

Experiment and Results

Setup efficiency
3.5 minutes for 817 data sources
Time increases roughly linearly with the number of data sources
The maximum-entropy problem is the most time-consuming step

Future Work

Different schema matchers
Dealing with multiple-table sources
Including multi-table schemas
Normalizing mediated schemas

Analysis

Positives
Lots of support (proofs and experiments)

Negatives
Detail
Pictures