33
Kambhampati & Knoblock Information Integration on the Web (MA-1) 1 Information Integration

Information Integration

  • Upload
    jenski

  • View
    23

  • Download
    0

Embed Size (px)

DESCRIPTION

Information Integration. Services. Source Trust. Webpages. Structured data. Sensors (streaming Data). Source Fusion/ Query Planning. Query. Mediator. Executor. Monitor. Answers. The Problem. Services. Source Trust Ontologies; Source/Service Descriptions. Probing Queries. - PowerPoint PPT Presentation

Citation preview

Page 1: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 1

Information Integration

Page 2: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 2

Services

Webpages

Structureddata

Sensors(streamingData)

The Problem

Query

Executor

Source Fusion/Query Planning

Source Trust

Answers

Monitor

Mediator

Page 3: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 3

Query

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

Probing Queries

Sour

ce C

alls

Monitor

Updating Statistics

• User queries refer to the mediated schema.

• Data is stored in the sources in a local schema.

• Content descriptions provide the semantic mappings between the different schemas.

• Mediator uses the descriptions to translate user queries into queries on the sources.

DWIM

Page 4: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 4

Who is dying to have it? (Applications)

• WWW:– Comparison shopping

– Portals integrating data from multiple sources

– B2B, electronic marketplaces

• Science and culture:– Medical genetics: integrating genomic data

– Astrophysics: monitoring events in the sky.

– Culture: uniform access to all cultural databases produced by countries in Europe provinces in Canada

• Enterprise data integration– An average company has 49 different databases and

spends 35% of its IT dollars on integration efforts

Skeptic’s c

orner

Page 5: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 9

Why isn’t this just

• Search engines do text-based retrieval of URLS– Works reasonably well for single document texts, or for finding sites

based on single document text

• Cannot integrate information from multiple documents

• Cannot do effective “query relaxation” or generalization

• Cannot link documents and databases

• The aim of Information integration is to support query processing over structured and semi-structured sources as well as services.

Skeptic’s c

orner

Page 6: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 10

Is it like Expedia/Travelocity/Orbitz…

• Surpringly, NO!• The online travel sites don’t quite need to do data integration; they just

use SABRE– SABRE was started in the 60’s as a joint project between American

Airlines and IBM– It is the de facto database for most of the airline industry (who voluntarily

enter their data into it)• There are very few airlines that buck the SABRE trend—SouthWest airlines is

one (which is why many online sites don’t bother with South West)• So, online travel sites really are talking to a single database (not

multiple data sources)…– To be sure, online travel sources do have to solve a hard problem. Finding

an optimal fare (even at a given time) is basically computationally intractable (not to mention the issues brought in by fluctuating fares). So don’t be so hard on yourself

• Check out http://www.maa.org/devlin/devlin_09_02.html

Skeptic’s c

orner

Page 7: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 11

Why isn’t this just

• No common schema– Sources with heterogeneous schemas (and ontologies)– Semi-structured sources

• Legacy Sources– Not relational-complete– Variety of access/process limitations

• Autonomous sources– No central administration– Uncontrolled source content overlap

• Unpredictable run-time behavior– Makes query execution hard

• Predominantly “Read-only”– Could be a blessing—less worry about transaction management– (although the push now is to also support transactions on web)

Database(relational)

Database Manager(DBMS)

-Storage mgmt-Query processing-View management-(Transaction processing)

Query(SQL)

Answer(relation)

Databases

Distributed DatabasesSkeptic

’s corner

Page 8: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 12

Netbot

Junglee

DealPilot.Com

Are we talking “comparison shopping” agents?

• Certainly closer to the aims of these

• But:

• Wider focus

• Consider larger range of databases

• Consider services

• Implies more challenges

• “warehousing” may not work

• Manual source characterization/ integration won’t scale-up

Page 9: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 14

11/7

--Project part C due 11/30

--Homework 4 due 11/14

--Google guy talk on 11/18 (TBA)

Search Engine

Reported Size

Page Depth

Google 8.1 billion 101K

MSN 5.0 billion 150K

Yahoo4.2 billion (estimate)

500K

Ask Jeeves 2.5 billion 101K+

Page 10: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 15

Query

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

Probing Queries

Sour

ce C

alls

Monitor

Updating Statistics

• User queries refer to the mediated schema.

• Data is stored in the sources in a local schema.

• Content descriptions provide the semantic mappings between the different schemas.

• Mediator uses the descriptions to translate user queries into queries on the sources.

DWIM

Page 11: Information Integration

Information Integration

• Discovering information sources (e.g. deep web modeling, schema learning, …)

• Gathering data (e.g., wrapper learning & information extraction, federated search, …)

Queries

• Querying integrated information sources (e.g. queries to views, execution of web-based queries, …)

• Data mining & analyzing integrated information (e.g., collaborative filtering/classification learning using extracted data, …)

Linkage

• Cleaning data (e.g., de-duping and linking records) to form a single [virtual] database

Another view

Page 12: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 17

Different “Integration” scenarios• “Data Aggregation” (Vertical)

– All sources export (parts of a) single relation

• No need for joins etc• Could be warehouse or virtual

– E.g. BibFinder, Junglee, Employeds etc

– Challenges: Schema mapping; data overlap

• “Data Linking” (Horizontal)

– Joins over multiple relations stored in multiple DB

• E.g. Softjoins in WHIRL

• Ted Kennedy episode

– Challenges: record linkage over text fields (object mapping); query reformulation

• “Collection Selection”

– All sources export text documents

– E.g. meta-crawler etc.

• Challenges: Similarity definition; relevance handling

• All together (vertical & horizontal)

– Many interesting research issues

– ..but few actual fielded systems

Page 13: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 18

Dimensions to Consider

• How many sources are we accessing?• How autonomous are they?• Can we get meta-data about sources?• Is the data structured?

– Discussion about soft-joins. See slide next

• Supporting just queries or also updates?• Requirements: accuracy, completeness,

performance, handling inconsistencies.• Closed world assumption vs. open world?

– See slide next

Page 14: Information Integration

19

Soft Joins..WHIRL [Cohen] We can extend the notion of Joins to

“Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output n a ranked form—with the rank proportional to the similarity

Neat idea… but does have some implementation difficulties

Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples

Also other similarity/distance metrics may be used

E.g. Edit distance

Page 15: Information Integration

20

Page 16: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 21

Local Completeness Information• If sources are incomplete, we need to look at each one of them.

• Often, sources are locally complete.

• Movie(title, director, year) complete for years after 1960, or for American directors.

• Question: given a set of local completeness statements, is a query Q’ a complete answer to Q?

True source contentsAdvertised description

Guarantees (LCW; Inter-source comparisons)

Problems: 1. Sources may not be interested in giving these! Need to learn hard to learn!

2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless

Page 17: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 23

Source Descriptions• Contains all meta-information about the

sources:

– Logical source contents (books, new cars).

– Source capabilities (can answer SQL queries)

– Source completeness (has all books).

– Physical properties of source and network.

– Statistics about the data (like in an RDBMS)

– Source reliability

– Mirror sources

– Update frequency.

QueryQuery

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainty,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefe

renc

e/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

Page 18: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 24

Source Access

• How do we get the “tuples”?– Many sources give “unstructured” output

• Some inherently unstructured; while others “englishify” their database-style output

– Need to (un)Wrap the output from the sources to get tuples

– “Wrapper building”/Information Extraction– Can be done manually/semi-manually

Page 19: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 25

Source Fusion/Query Planning

QueryQuery

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainty,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

• Accepts user query and generates a plan for accessing sources to answer the query

– Needs to handle tradeoffs between cost and coverage

– Needs to handle source access limitations

– Needs to reason about the source quality/reputation

Page 20: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 26

Monitoring/Execution

QueryQuery

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainty,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

• Takes the query plan and executes it on the sources

– Needs to handle source latency

– Needs to handle transient/short-term network outages

– Needs to handle source access limitations

– May need to re-schedule or re-plan

Page 21: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 29

Models for Integration

Modified from Alon Halevy’s slides

Overview

• Motivation for Information Integration [Rao]

• Accessing Information Sources [Craig]

• Models for Integration [Rao]

• Query Planning & Optimization [Rao]

• Plan Execution [Craig]

• Standards for Integration/Mediation [Rao]

• Ontology & Data Integration [Craig]

• Future Directions [Craig]

Query

Query

Ser vi ces

Webpages

Struc tureddata

Sensors(streamingData)

Ser vi ces

Webpages

Struc tureddata

Sensors(streamingData)

ExecutorNee ds t o hand leSource/ net work

Int errupt ions,Runt ime unce rt ain it y,

repl anning

Source Fusion/Query Planning

Ne eds t o handl e:Multi pl e ob je cti ves,Se rv ice composit ion,

Source qua lit y & ove rl ap

Source TrustOnt ol og ies;

Source /Servic eDescrip ti ons

Replanning

Requests

Prefe

renc

e/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statis ticsExecutor

Nee ds t o hand leSource/ net work

Int errupt ions,Runt ime unce rt ain it y,

repl anning

Source Fusion/Query Planning

Ne eds t o handl e:Multi pl e ob je cti ves,Se rv ice composit ion,

Source qua lit y & ove rl ap

Source TrustOnt ol og ies;

Source /Servic eDescrip ti ons

Replanning

Requests

Prefe

renc

e/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statis tics

Page 22: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 30

Solutions for small-scale integration

• Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.

• Data warehousing: load all the data periodically into a warehouse.– 6-18 months lead time– Separates operational DBMS

from decision support DBMS. (not only a solution to data integration).

– Performance is good; data may not be fresh.

– Need to clean, scrub you data.

QueryQuery

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

Datasource

Datasource

Datasource

Relational database (warehouse)

User queries

Data extractionprograms

Data cleaning/scrubbing

OLAP / Decision support/Data cubes/ data miningJunglee

did this,

for

employmen

t clas

sified

s

Page 23: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 32

The Virtual Integration Architecture

• Leave the data in the sources.• When a query comes in:

– Determine the relevant sources to the query

– Break down the query into sub-queries for the sources.

– Get the answers from the sources, and combine them appropriately.

• Data is fresh. Approach scalable• Issues:

– Relating Sources & Mediator– Reformulating the query– Efficient planning & execution

Datasource

wrapper

Datasource

wrapper

Datasource

wrapper

Mediator:

User queriesMediated schema

Data sourcecatalog

Reformulation engine

optimizer

Execution engineWhich data

model?

QueryQuery

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Replanning

Requests

Prefer

ence

/Util

ity M

odel

Answers

ProbingQueries

Sour

ce C

alls

Monitor

Updating Statistics

Garlic [IBM], Hermes[UMD];Tsimmis, InfoMaster[Stanford]; DISCO[INRIA]; Information Manifold [AT&T]; SIMS/Ariadne[USC];Emerac/Havasu[ASU]

Page 24: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 35

Desiderata for Relating Source-Mediator Schemas

• Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.

• Easy addition: make it easy to add new data sources.

• Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.

• Nonlossy: be able to handle all queries that can be answered by directly accessing the sources

Datasource

wrapper

Datasource

wrapper

Datasource

wrapper

Mediator:

User queriesMediated schema

Data sourcecatalog

Reformulation engine

optimizer

Execution engine

• Given:– A query Q posed over the mediated schema

– Descriptions of the data sources

• Find:– A query Q’ over the data source relations, such

that:• Q’ provides only correct answers to Q, and

• Q’ provides all possible answers to Q given the sources.

Reformulation

Page 25: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 36

11/9

--Google Mashups—what type of integration are they?

--Mturk.comcollaborative computing the capitalist way

Page 26: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 38

Information Integration

Data Integration Text/Data Integration Service integration

Data aggregation(vertical integration)

Data Linking(horizontal integration)

Collection Selection

Page 27: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 39

Extremes of automation in Information Integration

• Fully automated II (blue sky for now)– Get a query from the user

on the mediator schema– Go “discover” relevant data

sources– Figure out their “schemas”– Map the schemas on to the

mediator schema– Reformulate the user query

into data source queries– Optimize and execute the

queries– Return the answers

• Fully hand-coded II– Decide on the only query

you want to support– Write a (java)script that

supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources

• Examples include Google Map Mashups

E.g. We may start with known sources and their known schemas, do hand-mapping and support automated reformulation and optimization

(most interesting action is

“in between”)

Page 28: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 40

Approaches for relating source & Mediator Schemas

• Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations

• Local-as-view (LAV): express the source relations as views over the mediated schema.

• Can be combined…?

CREATE VIEW Seattle-view AS

SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer

We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes”

Virtual vsMaterialized

“View” Refresher

Let’s compare them in a movie

Database integration scenario..

Differences minor for data aggregation…

Page 29: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 42

Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title

Express mediator schemarelations as views oversource relations

Page 30: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 43

Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title

Express mediator schemarelations as views oversource relations

Mediator schema relations are Virtual views on source relations

Page 31: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 46

Local-as-View: example 1Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”Sources are “materialized views” ofmediator schema

Express source schemarelations as views overmediator relations

Page 32: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 49

GAV vs. LAVMediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Create View Movie AS

select NULL, NULL, NULL, genre

from S4

Create View Schedule AS

select cinema, NULL, NULL

from S4.

But what if we want to find which cinemas are playing comedies?

Create Source S4

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

Now if we want to find which cinemas are playing comedies, there is hope!

Source S4: S4(cinema, genre)

Lossy mediation

Page 33: Information Integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 50

GAV vs. LAV• Not modular

– Addition of new sources changes the mediated schema

• Can be awkward to write mediated schema without loss of information

• Query reformulation easy– reduces to view unfolding

(polynomial)– Can build hierarchies of

mediated schemas

• Best when– Few, stable, data sources– well-known to the mediator (e.g.

corporate integration)• Garlic, TSIMMIS, HERMES

• Modular--adding new sources is easy

• Very flexible--power of the entire query language available to describe sources

• Reformulation is hard– Involves answering queries only

using views (can be intractable—see below)

• Best when– Many, relatively unknown data

sources– possibility of addition/deletion of

sources• Information Manifold,

InfoMaster, Emerac, Havasu