Discovering Data Sources in a Dynamic Grid Environment

Jürgen Göres
Heterogeneous Information Systems Group
University of Kaiserslautern
goeres@informatik.uni-kl.de

2nd VLDB Workshop on Data Management in Grids
Seoul, South Korea
September 11th, 2006



Outline

• Motivation – The Role of Data on the Grid
• The Discovery Problem
• Data Source Utility
• Conclusion & Outlook


The Role of Data in the Grid: A Database Perspective

• From: Moving input and output data for number crunching

• Via: File-oriented bulk data storage

– Store large volumes of unstructured data (“BLOBs”)

– Retrieved and used in its original format and context

• To: Reuse and sharing of existing data

– Data becomes a resource in its own right

– The Grid is aware of the structure of the data

• Individual data sources will rarely fulfill all application requirements
→ Data from different sources has to be combined!
• Problem: Data sources are highly heterogeneous
→ Effective use of data requires an application-specific integrated view


Goals of Information Integration in Brief

• Provide an integrated, homogeneous view over a number of heterogeneous data sources, i.e.

– Create a mapping from the sources to an integrated schema

• Resolve heterogeneity:

– Technical Issues

– Different data models and structuring

– Uncertainties in the semantics of data

– Duplicate/ambiguous/contradictory records

Integration is difficult! (“AI-complete”)

To this day a largely manual task

• Bad news: Integration in the Grid won’t get any easier

• Good news: Lots of new research opportunities


Conventional Information Integration

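[Figure: phases of the integration process with their inputs and outputs. Analysis: user/application requirements, concrete requirements (target schema). Discovery: candidate data sources (10^1 to 10^2). Planning: integration plan. Deployment: integration system, data sources. Further annotations: dynamic; autonomous and changing sources; more sources (10^3 to 10^6).]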


The Challenge of Data Source Discovery in the Grid

• Number of potential sources is several orders of magnitude larger
– Informal manual discovery is not an option
– Cannot start integration planning with all sources
– Idea: Only consider the most useful sources
• What makes a data source useful?
– Source must have the same “universe of discourse” as the target
– Source and target must deal with identical or related concepts
– Concepts are represented by tables, classes, (XML) elements, ...
– Concept coverage
– Choose Top-N sources
• Problems:
– No support for concept-oriented search in current registries
– How to identify identical or related concepts?


Schema Matching

• Identify schema elements that are in some way similar
• Result: semantic correspondences (a.k.a. "matches")
– Usually have a confidence value in [0..1]
– Can be basic or complex
• Problem: this is really hard to do automatically
– Lots of automatic matching approaches
  • Linguistic
  • Structural
  • Hybrid
  • …
– Quality and performance are limited
– User needs to review/correct/amend matches
→ Matching is done semi-automatically
• Schema matching against 10^3 to 10^6 sources?!
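A purely linguistic matcher can be sketched in a few lines. This is only an illustration, not the matcher used in the talk: the function names, the threshold of 0.6, and the use of Python's `difflib` string similarity as the confidence value are all assumptions. The element names are taken from the procurement scenario later in the deck.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Confidence in [0, 1] derived from case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return (source_element, target_element, confidence) for all pairs
    whose name similarity reaches the (assumed) threshold."""
    matches = []
    for s in source:
        for t in target:
            conf = name_similarity(s, t)
            if conf >= threshold:
                matches.append((s, t, conf))
    # Best matches first, so a user can review them in order.
    return sorted(matches, key=lambda m: m[2], reverse=True)


source = ["Name", "Price", "Spec", "EAN"]
target = ["Name", "Price", "Description", "GTIN"]
matches = match_schemas(source, target)
matched_pairs = {(s, t) for s, t, _ in matches}
```

Identical names such as Name and Price are found, but synonyms such as EAN and GTIN are missed, which is one reason the user still has to review and amend the result.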


Indirect Schema Matching

• Idea: Provide reference schemas for schema matching
– Any “good” schema that models a given domain
– Purpose-built domain schemas (cf. “ontologies”)
• Deployment
– Match sources against domain schema(s)
– Store matches in the registry
• Discovery
– Only match target schema against selected domain schema(s)
– Semi-automatic matching feasible
– Assuming transitivity, infer matches between source and target via domain schema
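The transitive inference step can be sketched as a join on the shared domain concept. The slides do not say how confidences are combined, so multiplying them is an assumption here, as are the function name and the domain concept names `ProductCode` and `ProductDescription`.

```python
def infer_matches(source_to_domain, target_to_domain):
    """Infer source-to-target matches via shared domain concepts.

    Both arguments map (element, domain_concept) -> confidence in [0, 1].
    Assumption: the inferred confidence is the product of the two
    confidences; if several paths exist, the best one is kept.
    """
    inferred = {}
    for (src, dom_s), conf_s in source_to_domain.items():
        for (tgt, dom_t), conf_t in target_to_domain.items():
            if dom_s == dom_t:
                key = (src, tgt)
                inferred[key] = max(conf_s * conf_t, inferred.get(key, 0.0))
    return inferred


# Stored in the registry at deployment time (hypothetical values).
source_to_domain = {
    ("EAN", "ProductCode"): 0.9,
    ("Spec", "ProductDescription"): 0.8,
}
# Produced at discovery time by matching only the target schema.
target_to_domain = {
    ("GTIN", "ProductCode"): 1.0,
    ("Description", "ProductDescription"): 0.9,
}

inferred = infer_matches(source_to_domain, target_to_domain)
```

With this split, the expensive matching of every source happens once at deployment, and discovery only needs one match of the target schema plus a lookup in the registry.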


Indirect Schema Matching

[Figure, built up over several slides. Labels: Data Source 1, Data Source 2, Target Schema, Domain Schema; concepts A, B, C. Legend: equivalent concept, superconcept, related concept.]



Thoughts about Utility

• Isn’t utility just a similarity measure?
– Similarity is intuitively symmetric: sim(A, B) = sim(B, A)
– Utility is asymmetric/directed:
  – Schema A is very useful for Schema B: util(A, B) ≈ 1
  – Schema B is not as useful for Schema A: util(B, A) ≈ 0.4
• Basic Utility Measure (Base)
– Base = # corresponding concepts in source / # concepts in target
• Weighted Utility Measure (Weighted)
– Reduced weight for concepts farther away from schema root
– Consider match types
– Consider match confidence

[Figure: Schema A and Schema B]
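Both measures can be sketched as follows. Base follows the definition on the slide. The slides give no formula for Weighted, so the depth weight 1/(depth + 1), the per-type factors, and the way they are multiplied with the confidence are assumptions for illustration only. The concept paths follow the target schema of the procurement scenario; the matches assumed for source S3 are hypothetical.

```python
# Assumed factors per match type; the slides name the types but give no numbers.
TYPE_FACTOR = {"equivalent": 1.0, "superconcept": 0.7, "related": 0.4}


def base_utility(covered, target):
    """Base = # target concepts with a corresponding source concept / # target concepts."""
    target = set(target)
    return len(set(covered) & target) / len(target)


def depth_weight(path):
    """Assumed weighting: concepts farther from the schema root count less."""
    return 1.0 / (path.count("/") + 1)


def weighted_utility(matches, target):
    """matches maps a target concept path to (match_type, confidence)."""
    total = sum(depth_weight(c) for c in target)
    gained = sum(
        depth_weight(c) * TYPE_FACTOR[kind] * conf
        for c, (kind, conf) in matches.items()
        if c in target
    )
    return gained / total


# Asymmetry: A covers all of B, but B covers only 2 of A's 5 concepts.
schema_a = {"a1", "a2", "a3", "a4", "a5"}
schema_b = {"b1", "b2"}
correspondences = [("a1", "b1"), ("a2", "b2")]
util_a_for_b = base_utility([b for _, b in correspondences], schema_b)  # 1.0
util_b_for_a = base_utility([a for a, _ in correspondences], schema_a)  # 0.4

target = [
    "Product", "Product/GTIN", "Product/Name", "Product/Description",
    "Product/Category", "Product/Supplier", "Product/Supplier/Name",
    "Product/Supplier/URL", "Product/Supplier/OrderNo",
    "Product/Supplier/Price", "Product/Supplier/DeliveryTime",
]
# Hypothetical matches for a source that covers 5 of the 11 target concepts.
s3_matches = {
    "Product": ("equivalent", 1.0),
    "Product/GTIN": ("equivalent", 0.9),
    "Product/Name": ("equivalent", 1.0),
    "Product/Description": ("related", 0.8),
    "Product/Supplier/Price": ("equivalent", 1.0),
}
s3_base = base_utility(s3_matches, target)
s3_weighted = weighted_utility(s3_matches, target)
```

Because the denominator is always the target's concept count, swapping the roles of source and target changes the result, which is what makes utility directed rather than symmetric.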


Scenario “Procurement” − Data Source 1

Procurement Department: purchase pencils, paper, toner, envelopes, … = “office supplies”
Query: //Product[Category = “Office Supplies”]

Target data (XML):

  <Product>
    <GTIN>01234567891011</GTIN>
    <Name>Typewriter X-1000</Name>
    ...
    <Description>The X-1000 represents the culmination of typewriter development... </Description>
    <Category>Office Supplies</Category>
    <Supplier>
      <Name>Office World</Name>
      <URL>www.officeworld.com</URL>
      <Price>99.99</Price>
      <DeliveryTime>5 min</DeliveryTime>
    </Supplier>
    <Supplier> ... </Supplier>
  </Product>
  <Product> ...

Target Schema:

  Product
    GTIN
    Name
    Description
    Category (= "office supplies")
    Supplier
      Name
      URL
      OrderNo
      Price
      DeliveryTime

Data Source S1: Price Search Engine (PriceSearch)

  Commodity
  EAN      Name      Spec  Group
  00930…   X-1000…   …     Typewriters
  …        …         …     …

  avail_at
  PID      Price    SID
  00930…   109.99   4711
  …        …        …

  Shop
  ID     Name          Address
  4711   WriteTypers   www.write...
  …      …             …

Base = 7 / 11 ≈ 0.73 (note: 7 / 11 ≈ 0.64; the 0.73 used in the ranking corresponds to 8 / 11), Weighted ≈ 0.61


Scenario “Procurement” − Data Source 2

Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime

Data Source S2: Grocery Store (GroSupply)
  Elements: Product, Barcode, Name, Description, Delivery, URL, Address, Phone, Contact, Price, Type
  Annotation: = “groceries”

Base = 9 / 11 ≈ 0.82, Weighted ≈ 0.74?


Scenario “Procurement” − Data Source 3

Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime

Data Source S3: Office Supply Store (OfficeWorld)
  Elements: Product, UPC, Name, Information, Price

Base = 5 / 11 ≈ 0.45, Weighted ≈ 0.45?


Evaluation of the basic measures

• Ranking:

  Rank  Source              Base  Weighted
  1     Grocery Store (S2)  0.82  0.74
  2     Price Search (S1)   0.73  0.61
  3     Office Supply (S3)  0.45  0.45

• Basic measures only consider similar concepts
– Instances of concepts can be completely disjoint!
→ Utility measure should consider instance properties
– Using constraints
  • Satisfiability is NP-complete
  • Satisfiability does not indicate presence of useful instances
– Using histograms
  • Independent for each atomic feature/attribute
  • No information about the combination of values (complex objects)
  • But useful as a filter: lower weight to 0 if constraint is not satisfied
→ Instance-based measure Inst


Scenario “Procurement” − Data Source 2

Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime

Data Source S2: Grocery Store (GroSupply)
  Elements: Product, Barcode, Name, Description, Delivery, Name, URL, Address, Phone, Contact, Price, Type

P = //Product[Category = “office supplies”] =

Histogram for Type:

  value        count
  “sweets”     263
  “cereals”    21
  “beverages”  45
  ...          ...

Inst ≈ 0.3
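The histogram filter can be sketched with the counts shown for the Type attribute. This is a minimal sketch: the function name and the case-insensitive comparison are assumptions, the rows elided on the slide are simply left out, and the code does not attempt to reproduce the Inst value given above.

```python
def histogram_filter(weight, histogram, required_value):
    """Lower the weight of a matched concept to 0 if the source's value
    histogram shows no instance satisfying the target's constraint."""
    count = histogram.get(required_value.lower(), 0)
    return weight if count > 0 else 0.0


# Value counts for S2's Type attribute as shown on the slide
# (further rows are elided there and omitted here).
type_histogram = {"sweets": 263, "cereals": 21, "beverages": 45}

# The target asks for Category = "office supplies"; Category is assumed
# to be matched to S2's Type attribute.
category_weight = histogram_filter(1.0, type_histogram, "office supplies")
cereal_weight = histogram_filter(1.0, type_histogram, "cereals")
```

The check only looks at one attribute at a time, so it can rule a source out but cannot confirm that complete, useful objects exist.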


Evaluation of Instance Completeness Measure

• Ranking:

  Rank  Source              Inst
  1     Price Search (S1)   0.61
  2     Office Supply (S3)  0.45
  3     Grocery Store (S2)  0.3

• Instance completeness
– Devaluates false positives
• What about the Office Supply Store (S3)?


Scenario “Procurement” − Data Source 3

Target Schema: Product, GTIN, Name, Description, Category (= "office supplies"), Supplier, Name, URL, OrderNo, Price, DeliveryTime

Data Source S3: Office Supply Store (OfficeWorld)
  Elements: Product, UPC, Name, Information, Price

Schema Augmentation: added attributes with their value histograms

  Attribute  value              count
  Category   “office supplies”  3454
  ShopName   “Office World”     1
  URL        “www.officew...    1

Inst+ ≈ 0.77


Ranking with Augmentation

• Ranking:

  Rank  Source              Inst+
  1     Office Supply (S3)  0.77
  2     Price Search (S1)   0.61
  3     Grocery Store (S2)  0.3

• Augmentation and instance completeness reproduce the intuitive ranking


Conclusion

• Data source discovery as a grid-specific problem
– Very large number of data sources
– Only the most useful sources should be considered
• Basic utility measure based on concept coverage
– Use schema matching to identify similar concepts
– Use indirect schema matching during deployment
• Limitations of the basic measure
• Instance completeness
– Use histograms to filter out sources that cannot possibly be useful
• Missing context information in data sources
– Implicitly known in original usage
– Schema augmentation by data provider


Outlook & Open Questions

• Who provides domain schemas?
• Instance-based utility − Caveats
– Record matching problem
– “Like schema matching with data” instead of metadata
– Scalability?
– Concept hierarchies on values
• Limitations of the “Top-N” approach
– Sources which are very specific to a subset of concepts might be filtered out
→ Partition target schema
– The best n sources might not provide all concepts
→ Repeat discovery with the missing concepts only
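Repeating discovery for the missing concepts can be sketched as a simple loop. The function names, the round limit, and the concept names A to D are hypothetical, and ranking by plain coverage count stands in for whichever utility measure is actually used.

```python
def discover(sources, wanted, n):
    """Return the names of the n sources covering most of the wanted
    concepts; sources that cover none of them are dropped."""
    ranked = sorted(sources, key=lambda name: len(sources[name] & wanted), reverse=True)
    return [name for name in ranked[:n] if sources[name] & wanted]


def iterative_discovery(sources, target_concepts, n, max_rounds=5):
    """Repeat Top-N discovery, each round asking only for concepts
    that the sources selected so far do not provide."""
    selected = []
    missing = set(target_concepts)
    for _ in range(max_rounds):
        if not missing:
            break
        remaining = {k: v for k, v in sources.items() if k not in selected}
        picked = discover(remaining, missing, n)
        if not picked:
            break  # no remaining source provides any missing concept
        for name in picked:
            selected.append(name)
            missing -= sources[name]
    return selected, missing


# Each source is described by the set of target concepts it covers.
sources = {"S1": {"A", "B", "C"}, "S2": {"A", "B"}, "S3": {"D"}}
selected, missing = iterative_discovery(sources, {"A", "B", "C", "D"}, n=1)
```

In this example a single Top-1 run would return only S1 and never consider S3, although S3 is the only source providing concept D; the second round picks it up.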



Thank you!

Questions?
