Data Integration: Achievements and Perspectives in the Last Ten Years AiJing



Outline

• Motivation & Background
• Best Paper: Information Manifold
• Building on the Foundation
• Data Integration Industry
• Future Challenges
• Conclusion


Motivation & Background

• Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources.
• Data integration is crucial in large enterprises that own a multitude of data sources.
• It is also needed for better cooperation among agencies, each with its own data sources.


[Figure: data integration across legacy databases, services and applications, and enterprise databases]


Ten-Year Best Paper

Querying Heterogeneous Information Sources Using Source Descriptions. VLDB 1996

Alon Halevy was a principal member of technical staff at AT&T Bell Laboratories, and then at AT&T Laboratories.

• Main idea: the Information Manifold

• It led to tremendous progress on data integration and to quite a few commercial data integration products.


The Information Manifold

• An implemented data integration system
• Goal: provide a uniform query interface to a heterogeneous collection of Web data sources.
• Main contribution: the way it described the contents of the data sources it knew about.
• IM contains declarative descriptions of the contents and capabilities of the information sources (source descriptions).


An example of a complex query

Find reviews of movies directed by Woody Allen that are playing in my area. Answering this requires joining three web sites:

1. a movie site containing actor and director information (IMDB)

2. movie-playing sources (e.g., 777film.com)

3. movie review sites (e.g., a newspaper)
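
The three-way join in this example can be sketched in a few lines. This is only an illustration: the rows, the area names, and the function `reviews_for` are made up, and real sources would be reached through wrappers rather than in-memory lists.

```python
# Hypothetical toy extracts from the three kinds of sites; all rows are invented.
movies = [  # movie site: (title, director)
    ("Annie Hall", "Woody Allen"),
    ("Manhattan", "Woody Allen"),
    ("Vertigo", "Alfred Hitchcock"),
]
showtimes = [  # movie-playing source: (title, area)
    ("Manhattan", "Seattle"),
    ("Vertigo", "Seattle"),
    ("Annie Hall", "Boston"),
]
reviews = [  # review site: (title, review text)
    ("Manhattan", "Sample review text A."),
    ("Vertigo", "Sample review text B."),
]


def reviews_for(director, area):
    """Join the three sources on the movie title."""
    directed = {title for title, d in movies if d == director}
    playing = {title for title, a in showtimes if a == area}
    wanted = directed & playing
    return [(title, text) for title, text in reviews if title in wanted]


result = reviews_for("Woody Allen", "Seattle")
print(result)
```

Joining on the title string is a simplification; in practice the sources may not agree on how a movie is identified.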


[Figure: data integration architecture. Labels: wrappers (one per source), mediated schema, semantic mappings, query reformulation, optimization & execution; design time vs. run time]


Semantic Mappings

Mediated Schema
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

Mapping logic

Information sources
Books: Title, ISBN, Price, DiscountPrice, Edition
CDs: Album, ASIN, Price, DiscountPrice, Studio
BookCategories: ISBN, Category
CDCategories: ASIN, Category
Artists: ASIN, ArtistName, GroupName
Authors: ISBN, FirstName, LastName


Global-as-View (GAV) (previous approaches)

Sources: R1, R2, R3, R4, R5

Mediated Schema
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

Mapping: each mediated relation is defined as a view over the sources.
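
A GAV mapping can be sketched as a function that computes a mediated relation from the sources. The sketch below assumes the `CDs` and `CDCategories` source schemas shown on the Semantic Mappings slide; the rows and the function names are hypothetical, and this is not the formula from the original slide.

```python
# Hypothetical source contents, following the source schemas named on the slides.
CDs = [  # (Album, ASIN, Price, DiscountPrice, Studio)
    ("Night Songs", "A1", 12.0, 9.0, "Studio One"),
    ("Red Sky", "A2", 15.0, 11.0, "Studio Two"),
]
CDCategories = [("A1", "Jazz"), ("A2", "Rock")]  # (ASIN, Category)


def mediated_CD():
    """GAV: the mediated relation CD(ASIN, Title, Genre) is a view over the sources."""
    genre_of = dict(CDCategories)  # assumes one category per ASIN
    return [
        (asin, album, genre_of[asin])
        for album, asin, _price, _discount, _studio in CDs
        if asin in genre_of
    ]


def query_jazz_titles():
    """A query over the mediated schema is answered by unfolding the view."""
    return [title for _asin, title, genre in mediated_CD() if genre == "Jazz"]


print(query_jazz_titles())
```

With this style of mapping, adding a source means revisiting the definition of every mediated relation the source contributes to.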


Local-as-View (LAV)

Sources: R1, R2, R3, R4, R5

Mediated Schema
CD: ASIN, Title, Genre, Year
Artist: ASIN, Name, …

Mapping: each source is described as a view over the mediated schema (one mediated view per source).
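
A LAV source description can be sketched as a predicate over mediated tuples. The sketch assumes the mediated relation CD(ASIN, Title, Genre, Year) from this slide; the class `SourceDescription`, the three descriptions, and the sample tuples are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CDRow = Tuple[str, str, str, int]  # mediated CD(ASIN, Title, Genre, Year)


@dataclass
class SourceDescription:
    name: str
    contains: Callable[[CDRow], bool]  # which mediated CD tuples the source may hold


# Each source is described independently of the others.
descriptions = [
    SourceDescription("R1", lambda cd: cd[2] == "Jazz"),
    SourceDescription("R2", lambda cd: cd[3] >= 2000),
]

# Accommodating a new source only adds one description.
descriptions.append(
    SourceDescription("R3", lambda cd: cd[2] == "Rock" and cd[3] < 1980)
)


def sources_that_may_hold(cd):
    """Names of the sources whose description admits the given mediated tuple."""
    return [d.name for d in descriptions if d.contains(cd)]


print(sources_that_may_hold(("A9", "Some Album", "Jazz", 2005)))
```

The descriptions say what a source may contain, not that it contains everything matching the description.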


Benefits of LAV

• Describing information sources became easier: a data integration system could accommodate new sources easily.
• The descriptions of the information sources could be more precise: describing precise constraints on the contents of the sources became easier.


Query reformulation

Mediated Schema
CD: ASIN, Title, Genre, …

A query posed over CD(A,T,G) is reformulated into a set of queries on the data sources.

Information sources
Books: Title, ISBN, Price, DiscountPrice, Edition
CDs: Album, ASIN, Price, DiscountPrice, Studio
BookCategories: ISBN, Category
CDCategories: ASIN, Category
Artists: ASIN, ArtistName, GroupName
Authors: ISBN, FirstName, LastName


Query Answering in LAV = Answering Queries Using Views (AQUV)

This problem was earlier considered in the context of query optimization.

Given a set of views V1,…,Vn and a query Q, can we answer Q using only the answers to V1,…,Vn?


AQUV

• Earlier uses: query optimization and supporting physical data independence.
• AQUV for data integration: the rewriting is not necessarily equivalent; the task is to find a maximally contained rewriting.
• Main AQUV algorithms: Bucket, Inverse rules, MiniCon
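
The idea of a contained (rather than equivalent) rewriting can be shown with a hand-written example. This is not the Bucket, inverse-rules, or MiniCon algorithm; the views `V1` and `V2`, their contents, and the query are hypothetical and assume a mediated relation CD(ASIN, Title, Genre, Year).

```python
# Hypothetical view extensions over mediated CD(ASIN, Title, Genre, Year).
# V1 is described as "Jazz CDs", V2 as "CDs released in 2000 or later";
# a source may hold only some of the tuples that its description allows.
V1 = [("A1", "Old Jazz", "Jazz", 1959), ("A3", "New Jazz", "Jazz", 2015)]
V2 = [("A4", "New Rock", "Rock", 2016), ("A5", "Recent Jazz", "Jazz", 2012)]


def q(cd):
    """Query Q: Jazz CDs released in 2000 or later."""
    return cd[2] == "Jazz" and cd[3] >= 2000


def contained_rewriting():
    """Union of two branches, each using one view plus the rest of Q's condition."""
    from_v1 = [cd for cd in V1 if cd[3] >= 2000]   # V1 already guarantees Jazz
    from_v2 = [cd for cd in V2 if cd[2] == "Jazz"]  # V2 already guarantees year >= 2000
    return sorted(set(from_v1) | set(from_v2))


answers = contained_rewriting()
print(answers)
```

Every returned tuple satisfies Q, but Jazz CDs from 2000 or later that neither source happens to hold are missed, which is why the rewriting is contained in Q rather than equivalent to it.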


Building on the Foundation

• Generating schema mappings
• Adaptive query processing
• XML
• Model management
• Peer-to-peer data management
• The role of Artificial Intelligence


Generating Schema Mappings

Observation: who is going to write all these LAV/GAV formulas (the semantic mappings between the sources and the mediated schema)?

1. Creating the source descriptions

2. Writing the semantic mappings

This was the main bottleneck.


Techniques for Schema Mapping

• Semi-automatically generating schema mappings
• Goal: create tools that speed up the creation of the mappings and reduce the amount of human effort involved.
• Compare schema elements based on linguistic similarities and on overlaps in data values or data types.
• Schema mapping tasks are often repetitive.
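
Comparing schema elements by name and by data values can be sketched with the standard library. This is a minimal illustration, not a tool described in the slides: the scoring functions, the equal weighting, and the sample columns are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Linguistic similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(xs, ys):
    """Jaccard overlap of the sample values of two columns."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0


def suggest_matches(source, target, weight=0.5):
    """For each source column, propose the best-scoring target column."""
    suggestions = {}
    for s_name, s_vals in source.items():
        scored = [
            (weight * name_similarity(s_name, t_name)
             + (1 - weight) * value_overlap(s_vals, t_vals), t_name)
            for t_name, t_vals in target.items()
        ]
        suggestions[s_name] = max(scored)[1]
    return suggestions


# Hypothetical columns with sample values.
source = {"Album": ["Night Songs", "Red Sky"], "ASIN": ["A1", "A2"]}
target = {"Title": ["Red Sky", "Night Songs", "Low Tide"], "ASIN": ["A1", "A2", "A3"]}
suggestions = suggest_matches(source, target)
print(suggestions)
```

Here `Album` is matched to `Title` mainly through shared values, since the names themselves have little in common. The output is a suggestion for a person to confirm, in line with the semi-automatic goal above.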


A Machine Learning Approach

• Map multiple schemas in the same domain to the same mediated schema.
• Learn from previous experience: use the manually created schema mappings as training data, and generalize from them to predict mappings between unseen schemas.

[Figure: a mediated schema with given matches, used to predict new ones]
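
Learning from earlier matches can be illustrated with a deliberately naive learner that votes by name token. This is an assumption made for illustration, not the learners used in the research the slides summarize; the training pairs and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokens(name):
    """Split a name such as 'DiscountPrice' or 'artist_name' into lowercase tokens."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", name)]


def train(past_matches):
    """past_matches: (source element name, mediated element) pairs created by hand."""
    votes = defaultdict(Counter)
    for src_name, mediated in past_matches:
        for tok in tokens(src_name):
            votes[tok][mediated] += 1
    return votes


def predict(votes, src_name):
    """Predict the mediated element for an unseen name, or None if no token is known."""
    tally = Counter()
    for tok in tokens(src_name):
        tally.update(votes.get(tok, {}))
    return tally.most_common(1)[0][0] if tally else None


# Hypothetical matches from schemas that were mapped earlier.
past = [("AlbumTitle", "Title"), ("cd_title", "Title"),
        ("ArtistName", "Artist"), ("artist", "Artist"), ("ASIN", "ASIN")]
model = train(past)
print(predict(model, "TitleOfCD"), predict(model, "artist_full_name"))
```

Names with no known token return `None`, which leaves the decision to a person.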


Adaptive query processing

• Observation: once we have mappings, how can we execute queries?
• Traditional plan-then-execute doesn’t work.
• Root cause: the dynamic nature of data integration contexts


Adaptive query processing

• In a data integration system, the context is very dynamic and the optimizer has much less information than in the traditional setting.
• Two results: the optimizer cannot decide on a good plan, and a plan may be arbitrarily bad.
• Response: dynamically adjust the query plan.
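
Adjusting a plan during execution can be illustrated with a filter pipeline that re-orders its predicates from observed pass rates. The slides do not name a specific technique, so this is a simplified assumption; `adaptive_filter`, `reorder_every`, and the two predicates are hypothetical.

```python
def adaptive_filter(rows, predicates, reorder_every=100):
    """Apply all predicates to each row, re-ordering them by observed pass rate."""
    stats = [{"pred": p, "seen": 0, "passed": 0} for p in predicates]
    kept = []
    for i, row in enumerate(rows, 1):
        keep = True
        for s in stats:
            s["seen"] += 1
            if s["pred"](row):
                s["passed"] += 1
            else:
                keep = False
                break  # later predicates are skipped for this row
        if keep:
            kept.append(row)
        if i % reorder_every == 0:
            # The predicate that rejects the most rows so far moves to the front.
            stats.sort(key=lambda s: s["passed"] / s["seen"] if s["seen"] else 1.0)
    return kept, [s["pred"].__name__ for s in stats]


def is_even(n):
    return n % 2 == 0


def is_multiple_of_50(n):
    return n % 50 == 0


kept, final_order = adaptive_filter(range(1000), [is_even, is_multiple_of_50])
print(final_order)
```

The initial order is a guess made without statistics; after the first 100 rows the more selective predicate moves to the front, so most rows are rejected after a single check. The answers do not depend on the order.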


XML characteristics for data integration

• XML offered a common syntactic format for sharing data among data sources.
• Interest in integration grew, since it appeared as if data could actually be shared.
• Integration systems were built using XML as the underlying data model and XML query languages (XQuery).


Model Management

• Goal: provide an algebra for manipulating schemas and mappings.
• With such an algebra, complex operations on data sources become simple sequences of operators in the algebra.
• Some of the operators in Model Management: create & compose mappings, merge & diff models
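
The operators can be sketched over a toy representation in which a model is a list of attribute names and a mapping is a dictionary from one model's attributes to another's. Real Model Management operators work on richer schemas and mapping expressions; the representation, the function bodies, and the sample models here are assumptions.

```python
def compose(m1, m2):
    """Compose mappings A->B and B->C into A->C, dropping elements with no image."""
    return {a: m2[b] for a, b in m1.items() if b in m2}


def diff(model, mapping):
    """Elements of a model that the mapping does not cover."""
    return sorted(set(model) - set(mapping))


def merge(model_a, model_b, mapping):
    """Union of two models, identifying mapped elements of A with their images in B."""
    merged = set(model_b)
    merged.update(a for a in model_a if a not in mapping)
    return sorted(merged)


# Hypothetical models and mappings.
source = ["Album", "ASIN", "Studio"]
mediated = ["Title", "ASIN", "Genre"]
src_to_med = {"Album": "Title", "ASIN": "ASIN"}
med_to_report = {"Title": "name", "Genre": "category"}

print(compose(src_to_med, med_to_report))
print(diff(source, src_to_med))
print(merge(source, mediated, src_to_med))
```

A task such as "carry the source through to the report schema and list what is left unmapped" then becomes a short sequence of these calls.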


Peer Data Management Systems

[Figure: a network of peers (Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), UW (Waterloo), CiteSeer) with LAV and GLAV mappings, and queries Q and Q1 to Q6 at the peers]


Two Additional Benefits

• A P2P architecture offers a truly distributed mechanism for sharing data.
  - Every data source only provides semantic mappings to a set of neighbors.
  - Complex integrations emerge by following semantic paths.
• A P2P architecture is more appropriate than a single mediated schema in a data sharing context.
  - There is never a single global mediated schema.
  - Data sharing occurs in local neighborhoods of the network.
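
Following semantic paths can be sketched as a breadth-first traversal that composes pairwise mappings. The peer names are taken from the figure; the attribute names, the mapping dictionaries, and the function `reformulate` are hypothetical, and real peer mappings are LAV or GLAV formulas rather than simple renamings.

```python
from collections import deque

# Hypothetical pairwise mappings: mappings[(p, q)] renames p's attributes into q's.
mappings = {
    ("UW", "Stanford"): {"paper_title": "title", "writer": "author"},
    ("Stanford", "DBLP"): {"title": "name", "author": "creator"},
    ("Stanford", "Berkeley"): {"title": "pub_title"},
}


def reformulate(start_peer, attrs):
    """Translate a set of attributes to every peer reachable along semantic paths."""
    reached = {start_peer: {a: a for a in attrs}}
    queue = deque([start_peer])
    while queue:
        peer = queue.popleft()
        for (src, dst), m in mappings.items():
            if src != peer or dst in reached:
                continue
            # Compose the translation so far with the next mapping on the path.
            translated = {a: m[x] for a, x in reached[peer].items() if x in m}
            if translated:
                reached[dst] = translated
                queue.append(dst)
    return reached


plan = reformulate("UW", ["paper_title", "writer"])
print(plan["DBLP"])
```

UW has no direct mapping to DBLP or Berkeley, yet both are reached through Stanford. Berkeley can only answer the part of the query that its mapping covers.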


The Role of Artificial Intelligence

• Description Logics describe relationships between data sources.
• Data sources need to be represented declaratively.
• The mediated schema of IM was based on the Classic Description Logic.
• Description Logics offered more flexible mechanisms for representing a mediated schema.
• Recent work: combine the expressive power of Description Logics with the ability to manage large amounts of data.


The Data Integration Industry

• Late 1990s: commercialization
• Enterprise Information Integration (EII): integrating data from multiple sources without having to first load all the data into a central warehouse
• The development of the EII industry was driven by:
  - technologies from research labs that had matured enough
  - the needs of data management
  - XML
• Inappropriate: data warehousing solutions, ad-hoc solutions


A data integration scenario

• data sources that will participate in the application
• build a mediated schema; applications are built and queries are posed over it
• semantic mappings

Query processing

• a query posed over the virtual schema
• query reformulation: a query over the data sources
• execute with an engine that creates plans that span multiple data sources


Other EII Products

• XML data model and XQuery
  Challenge: the research on integration for XML was only in its infancy.
• Customer-relationship management
  Challenge: how to provide the customer-facing worker with a global view of a customer whose data resides in multiple sources, and how to track information from multiple sources in real time.


Future Challenges

The factors behind data integration challenges:

• Social: data integration is fundamentally about getting people to collaborate and share data.
• Complexity of integration: data integration has been referred to as a problem as hard as AI, maybe even harder!

Our goal: create tools that facilitate data integration in a variety of scenarios.


Several Specific Challenges

Dataspaces: Pay-as-you-go data management

Uncertainty and lineage

Reusing human attention


Dataspaces

• Database system: create the schema first!
• Data integration system: create the semantic mappings first!
• Fundamental shortcoming: long setup time!
• Dataspaces: the idea of pay-as-you-go data management


Pay-as-you-go

• Offer some services immediately without any setup time, and improve the services as more investment is made into creating semantic relationships.
• A dataspace should offer keyword search over any data in any source with no setup time.
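
Keyword search with no setup can be sketched as an index over raw values that ignores schemas and mappings altogether. The source names, the records, and the functions `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(sources):
    """Index every value of every record, with no schema or semantic mappings."""
    index = defaultdict(set)
    for source_name, records in sources.items():
        for pos, record in enumerate(records):
            values = record.values() if isinstance(record, dict) else record
            for value in values:
                for word in str(value).lower().split():
                    index[word].add((source_name, pos))
    return index


def search(index, keyword):
    """Return (source name, record position) pairs that contain the keyword."""
    return sorted(index.get(keyword.lower(), ()))


# Hypothetical sources with different shapes: tuples in one, dictionaries in the other.
sources = {
    "contacts.csv": [("Ada Lovelace", "ada@example.org")],
    "papers.json": [{"title": "Notes on the Analytical Engine", "author": "Ada Lovelace"}],
}
index = build_index(sources)
print(search(index, "Ada"))
```

The search finds records in both sources without knowing that a name in one corresponds to an author in the other; that relationship is the kind of investment that could be added later.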


Pay-as-you-go Data Management

[Figure: benefit versus investment (time, cost), comparing dataspaces with data integration solutions]

Dataspaces: Franklin, Halevy, Maier [see PODS 2006]


Uncertain data & data lineage

• A necessity in a data integration system: introspecting about the certainty of the data.
• When the system cannot automatically determine the certainty of the data, it should refer the user to the lineage of the data.
• Example: Web search engines provide URLs along with their search results, so users can consider the URLs in the decision of which results to explore further.
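
Keeping lineage next to each answer can be sketched as follows. The class `Answer`, the function `integrate`, and the source names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    value: str
    sources: tuple  # lineage: the sources that contributed this value


def integrate(source_rows):
    """Merge values from several sources, keeping the lineage of each value."""
    lineage = {}
    for source, values in source_rows.items():
        for value in values:
            lineage.setdefault(value, []).append(source)
    return [Answer(v, tuple(sorted(srcs))) for v, srcs in sorted(lineage.items())]


answers = integrate({
    "siteA": ["555-0100", "555-0199"],
    "siteB": ["555-0100"],
})
for a in answers:
    print(a.value, "from", ", ".join(a.sources))
```

A value reported by both sources and a value reported by only one are both returned; the lineage lets the user judge which to trust.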


Reusing human attention

• Goal: achieving tighter semantic integration among data sources
• Any operation that users perform on data sources gives a semantic clue about the data or about relationships between data sources.
• Systems that leverage these semantic clues can obtain semantic integration much faster.
• This is an area for additional research and development.


Conclusion

• Data integration not so long ago: a nice feature and an area for intellectual curiosity.
• Data integration today: a necessity.
• Today’s economy further emphasizes the need for data integration solutions.
• Thomas Friedman: The World is Flat.

[Figure: data integration over time]


A Framework for Deep Web Integration

[Figure: framework diagram]

Modules: Interface Integration Module, Query Process Module, Result Process Module

Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging

Other labels: Integrated Interface, Deep Web, Web DB, RDB

Legend: developed issue, developing issue, undeveloped issue, our focuses


Q & A