Information Integration: A Status Report
Alon Halevy
University of Washington, Seattle
IJCAI 2003
Mediated Schema
[Figure: a mediated schema whose elements include Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence and Microarray Experiment, over the sources OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link and GEO.]
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
Motivation and Activity
Application areas of data integration:
  Enterprise information integration ($$)
  The government
  Data sources on the web
  Scientific data sharing
Several data sharing architectures: virtual data integration, warehousing, message-passing, web services.
Many research projects. Mine: Information Manifold, Tukwila, LSD, Piazza.
EII: a new industry buzzword.
Today’s Agenda
Recent progress:
  Mediation languages
  Query processing (XML and other)
  Some lessons from the commercial world
Current challenges:
  Enabling large-scale data sharing: peer data management systems
  The age-old problem: semantic heterogeneity
  A new agenda item for AI: corpus-based KR
AI is more vital than ever for progress here!
Mediation Languages
Goal: a language for specifying semantic relationships between the mediated schema and the sources (not full FOL).
[Figure: a query Q posed over the mediated schema is reformulated into queries Q’ over five sources.]
Assume: data at the sources is structured (or seems so).
Global-as-View (GAV)
Mediated Schema: Title, Actor, …
Sources: R1, R2, R3, R4, R5
Actor(x,y) :- R1(x,y,z)
Actor(x,y) :- R2(x,z), R3(z,y)
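A minimal sketch of how the two GAV rules above can be read operationally: the mediated relation Actor is the union of the rule bodies evaluated over the sources, and a mediated-schema query is answered by unfolding Actor into those bodies. The source contents, the column meanings and the function names are assumptions made for illustration; they are not from the slides.

```python
# Hypothetical source instances (illustrative only, not from the slides).
R1 = {("Casablanca", "Bogart", 1942)}   # assumed columns: title, actor, year
R2 = {("Amelie", "m17")}                # assumed columns: title, movie id
R3 = {("m17", "Tautou")}                # assumed columns: movie id, actor


def actor():
    """Actor(x, y) under GAV: the union of the two rule bodies."""
    # Actor(x,y) :- R1(x,y,z)
    from_r1 = {(x, y) for (x, y, _z) in R1}
    # Actor(x,y) :- R2(x,z), R3(z,y)
    from_r2_r3 = {(x, y) for (x, z) in R2 for (z2, y) in R3 if z == z2}
    return from_r1 | from_r2_r3


def movies_with(name):
    """Mediated-schema query q(x) :- Actor(x, name), answered by unfolding."""
    return {x for (x, y) in actor() if y == name}
```

Calling `movies_with("Tautou")` touches only R2 and R3 through the second rule; the user never names a source.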
Local-as-View (LAV,GLAV)
Mediated Schema: Title, Actor, …
Sources: R1, R2, R3, R4, R5
R1(x,y,z) :- Title(x,y), Actor(x,z), y < 1970
R5(x,y,z) :- Movie(x,y,"French")
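In LAV the direction is reversed: each source is described as a view over the mediated schema, so answering a query means reasoning from source tuples back to mediated facts. The sketch below uses an inverse-rules style reading of the R1 description (every R1 tuple implies one Title fact and one Actor fact). It is one possible technique, not necessarily the one used in the systems named in this talk, and the data and function names are hypothetical.

```python
# Hypothetical contents of source R1(title, year, actor); by its LAV
# description it only ever holds movies made before 1970.
R1 = {("Casablanca", 1942, "Bogart"), ("Vertigo", 1958, "Stewart")}


def inverse_rules(r1):
    """Derive the mediated-schema facts implied by R1's description."""
    title = {(x, y) for (x, y, _z) in r1}   # Title(x,y) :- R1(x,y,z)
    actor = {(x, z) for (x, _y, z) in r1}   # Actor(x,z) :- R1(x,y,z)
    return title, actor


def actors_in_year(r1, year):
    """q(z) :- Title(x, year), Actor(x, z), answered from R1 alone."""
    title, actor = inverse_rules(r1)
    return {z for (x, y) in title if y == year
            for (x2, z) in actor if x == x2}
```

The answers are sound but may be incomplete: a query about 1980 returns nothing from R1, because R1 is known to contain only pre-1970 movies.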
Mediation Languages: Summary
A lot of nice theory and practical algorithms.
Careful choice of expressive power mattered.
Algorithms for answering queries using views are in every commercial DBMS.
Description Logics – also an attractive formalism for mediation.
Bottleneck is coming up with the mapping expressions.
Adaptive Query Processing
Problem: no statistics, and the network is unstable.
  Cannot "plan and then execute".
  Need to adapt the plan during execution.
Ideas already present in:
  Ingres (1976), an early database system
  Interleaving planning and execution (AI)
Key question: when to adapt, and at what granularity. For every tuple? At materialization points? See [Ives et al. 2002] for our solution.
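To make the granularity question concrete, here is a toy sketch that adapts between batches of tuples: it applies a set of filter predicates and, after each batch, re-orders them so that the most selective one observed so far runs first. This is only an illustration of adapting a plan during execution; it is not the algorithm of [Ives et al. 2002], and `adaptive_filter` and `batch_size` are names invented here.

```python
def adaptive_filter(tuples, predicates, batch_size=100):
    """Apply all predicates, re-ordering them between batches.

    After each batch the predicates are sorted by observed pass rate, so the
    most selective one runs first on the next batch.
    """
    order = list(predicates)               # initial "plan": order as given
    seen = {p: 0 for p in predicates}      # tuples each predicate examined
    passed = {p: 0 for p in predicates}    # tuples each predicate let through
    out = []
    for start in range(0, len(tuples), batch_size):
        for t in tuples[start:start + batch_size]:
            ok = True
            for p in order:
                seen[p] += 1
                if p(t):
                    passed[p] += 1
                else:
                    ok = False
                    break
            if ok:
                out.append(t)
        # Adaptation point: re-plan using the statistics gathered so far.
        order.sort(key=lambda p: passed[p] / seen[p] if seen[p] else 1.0)
    return out, order
```

Setting `batch_size=1` adapts after every tuple, while a large value approaches "plan and then execute"; the trade-off is re-planning overhead against time spent running a poor plan.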
Convergent Query Processing [Ives et al., 2002]
[Figure: query plans for the join of In-stock (I), Orders (O) and Shipping (S). Separate join plans are shown over the partitions I0/O0/S0, I1/O1/S1 and I2/O2/S2, together with a “cleanup” query plan.]
XML Query Processing
XML facilitates integration.
The mediator’s query processor may manipulate XML directly.
Challenges:
  XML is not flat, but nested; path queries.
  XML can be irregular; it doesn’t adhere to a strict schema.
Progress:
  Defining and optimizing XQuery.
  Going back and forth: XML to relational.
The Commercial World
Some startups: Nimble, MetaMatrix, Calixa, Composite, Enosys.
Big guys making announcements: IBM, BEA, MS (Oracle still being defiant).
Integration technology in different layers, e.g., reporting companies want it (Actuate).
Progress: analysts have a buzzword: EII.
Challenges:
  Integration with EAI?
  Yet another middleware?
  Horizontal vs. vertical?
What Worked?
Performance was not an issue.
Tools, tools, tools: for managing sources and creating mediated schemas.
XML query processing was needed.
Concordance: need common keys to join sources. Active research area!
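As a small illustration of the concordance point, the sketch below joins two sources that share no key by going through an explicit table of identifier pairs. All records, identifiers and names here are made up; building the concordance table itself is the hard part and is not shown.

```python
# Hypothetical records, each keyed by its own source's identifier.
source_a = {"A1": {"gene": "gene-alpha"}, "A2": {"gene": "gene-beta"}}
source_b = {"B7": {"protein": "prot-alpha"}, "B9": {"protein": "prot-beta"}}

# Concordance table: pairs of identifiers assumed to denote the same entity.
concordance = [("A1", "B7"), ("A2", "B9")]


def join_via_concordance(a, b, pairs):
    """Join two sources that share no key, using an explicit key mapping."""
    return [{**a[ka], **b[kb]} for ka, kb in pairs if ka in a and kb in b]
```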
Limitations of Mediated Schema
[Figure: a single mediated schema over five sources; a query Q over the mediated schema is reformulated into source queries Q’.]
Peer Data-Management
PDMS: a network of peers (data sources)
Peers can:
  Export base data, or combinations of data
  Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally (between small sets of peers).
This is a Semantic Web (different angle)
Network of Mappings (Piazza)
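[Figure: a network of peers (UW, Stanford, DBLP, Roma, Paris, CiteSeer, Vienna) connected by GAV, LAV and GLAV mappings; a query Q is reformulated into Q’ and Q’’ along the mappings.]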
Advantages of PDMS
No need for a central mediated schema.
Can map data opportunistically, as is most convenient.
Queries are posed using the peer’s schema.
Answers come from anywhere in the system.
Infrastructure for Semantic Web applications.
This is not P2P file sharing:
  Data has rich semantics.
  Membership is not as dynamic.
Schema Mediation for PDMS
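[Figure: a network of peers (UW, Stanford, DBLP, Roma, Paris, CiteSeer, Vienna) connected by GAV, LAV and GLAV mappings; a query Q is reformulated into Q’ and Q’’ along the mappings.]
When can LAV and GAV be combined to form such a network structure? (The semantics are not yet obvious.)
[ICDE-03], [WWW-03 for XML]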
Efficient Query Answering
[Figure: a network of peers (UW, Stanford, DBLP, Roma, Paris, CiteSeer, Vienna); a query Q is reformulated into Q’ and Q’’ along the mappings.]
Problems:
  • redundant paths
  • expensive reformulation
Possible solution:
  • Pre-compose some paths
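The redundant-path problem can be seen in a few lines: if reformulation follows every chain of mappings from the querying peer to a target peer, shared segments get reformulated once per chain. The sketch below only enumerates the chains; the mapping edges are hypothetical (the figure's actual edges do not survive in the text) and `mapping_paths` is a name invented here.

```python
# Hypothetical mapping edges between peers (illustrative only).
MAPPINGS = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["DBLP"],
    "DBLP": ["CiteSeer"],
    "CiteSeer": [],
}


def mapping_paths(graph, src, dst, path=None):
    """Yield every acyclic chain of mappings leading from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, []):
        if nxt not in path:            # never revisit a peer on the same chain
            yield from mapping_paths(graph, nxt, dst, path)
```

With these edges there are two chains from UW to CiteSeer, and both end with the DBLP to CiteSeer mapping. A segment shared in this way is a natural candidate for pre-composition.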
Mapping Composition [Madhavan and Halevy, VLDB 2003]
Incredibly subtle! In general, a composition can be an infinite set of GLAV formulas.
Results:
  Finite in many cases.
  Even when infinite, it often has a finite, useful encoding.
Hence, compositions can usually be pre-optimized.
Other Research Issues
[Figure: a network of peers (UW, Stanford, DBLP, Saarbruecken, Leipzig, CiteSeer, Berlin); a query Q is reformulated into Q’ and Q’’ along the mappings.]
Intelligent data placement
Management of mapping networks
Improving networks: finding additional connections.
Handling inconsistencies
PDMS-Related Projects
Hyperion (Toronto)
PeerDB (Singapore)
Local relational models (Trento)
Edutella (Hannover, Germany)
Semantic Gossiping (EPFL, Lausanne)
Raccoon (UC Irvine)
Orchestra (Ives, U. Penn)
Schema/Ontology Matching
Schema heterogeneity: a key roadblock for information integration.
  Different data sources speak their own schema.
  Mapping is key to any data sharing architecture.
[Figure: a consumer queries a mediator over several data sources. Vocabularies shown: "Hotel, Gaststätte, Brauerei, Kathedrale"; "Lodges, Restaurants, Beaches, Volcanoes"; "Hotel, Restaurant, AdventureSports, HistoricalSites".]
Schema Matching
Schema matching: discovering correspondences between similar elements.
Eventually… BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Typical Approaches
Multiple sources of evidence in the schemas:
  Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation: "ItemID: unique identifier for a book or a CD"; "ISBN: unique identifier for any book"
  Data types, data instances: DateTime, Integer; addresses have similar formats
  Schema structure: all books have similar attributes
Use domain knowledge.
Combine multiple techniques to exploit all available evidence: in isolation, techniques are incomplete or brittle.
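A minimal sketch of combining two of the evidence sources listed above, element names and data instances, into one match score. The scoring functions, the equal weights and the element records are assumptions chosen for illustration; this is not the LSD matcher or any published combination scheme.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """String similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_score(values_a, values_b):
    """Jaccard overlap of sample data values, in [0, 1]."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_score(elem_a, elem_b, w_name=0.5, w_inst=0.5):
    """Weighted average of the two evidence sources (weights are arbitrary)."""
    return (w_name * name_score(elem_a["name"], elem_b["name"])
            + w_inst * instance_score(elem_a["values"], elem_b["values"]))
```

Neither signal is reliable alone: Title and Album look nothing alike as strings, yet shared instance values can still raise their combined score above that of an unrelated pair such as Title and ISBN.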
Philosophy of Solutions
Effective schema matching requires a principled combination of techniques.
Like human experts, the matcher should improve over time.
LSD: mapping data sources to a mediated schema.
  Use a few mappings as training examples to learn hypotheses for elements of the mediated schema.
  See [Doan et al., SIGMOD-2001, MLJ-2003].
Next step: corpus-based matching.
Corpus-Based Matching
Collection of schemas and mappings.
Reuse extracted information to match new schemas.
[Figure: a corpus of books and inventory schemas, with elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature and Publisher.]
Identify common concepts and patterns:
  Books, Authors, Publishers, …
  Books: Title, Author, Price, Publisher
Mapping Knowledge Base
Learners: extract knowledge from schemas and mappings.
  Name Learner (NL), Data Instances Learner (DIL), Data Type Learner (DTL), Description Learner (DL), Structure Learner (SL), Meta Learner (ML)
Schemas and mappings: accumulated over time.
Learned models: for each unique element in any schema (C1 … CN), one model per learner (NL, DIL, DTL, DL, SL, ML).
Preliminary results: Corpus is useful
[Figure: Shipping Domain. Average number of matches (axis from -15 to 15) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series: Only MKB, Only BASIC.]
With and without the corpus
[Figure: Inventory Domain. Recall (axis from 0 to 1) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series: MKB, BASIC, COMB.]
Corpus vs. Traditional KR
A large corpus of uncoordinated knowledge fragments
vs.
A carefully designed knowledge base
Can a corpus offer a more attractive solution for some KR problems?
Pause: KR vs. Corpus
Knowledge base:
  Hard to engineer, brittle at the boundaries.
  Only one way of saying things.
Corpus:
  “Easier” to build, coverage not predefined.
  Many views of the domain.
See proceedings for full argument.
Corpus-based KR
Contents: schemas, ontologies, meta-data, data, queries, mappings.
Collect statistics on the corpus:
  How often does a word appear as a relation name?
  When it does, what tend to be the attribute names?
  What other tables are there?
Support a KR-style interface on the corpus (OKBC-like).
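The three statistics named above can be collected with simple counting. The sketch below assumes a corpus represented as a list of schemas, each a mapping from relation name to attribute names; that representation, the sample schemas and the function name `corpus_stats` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: each schema maps relation names to attributes.
CORPUS = [
    {"Books": ["Title", "ISBN", "Price"], "Authors": ["ISBN", "LastName"]},
    {"Books": ["Title", "Author", "Price"], "CDs": ["Album", "Price"]},
    {"Items": ["Title", "Books"]},   # here "Books" is an attribute name
]


def corpus_stats(corpus):
    """Count relation-name usage, attributes per relation, and sibling tables."""
    as_relation = Counter()              # word -> times used as a relation name
    attrs_of = defaultdict(Counter)      # relation -> attribute name counts
    siblings_of = defaultdict(Counter)   # relation -> co-occurring relations
    for schema in corpus:
        for rel, attrs in schema.items():
            key = rel.lower()
            as_relation[key] += 1
            attrs_of[key].update(a.lower() for a in attrs)
            siblings_of[key].update(
                other.lower() for other in schema if other != rel)
    return as_relation, attrs_of, siblings_of
```

A KR-style interface could then answer a question such as "what attributes does a Books relation usually have?" by ranking `attrs_of["books"]` by count.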
Other Applications of C-B-KR
Question answering on the web
Focused crawling
Natural language interfaces to DB’s
Schema and ontology authoring
Semantic query optimization.
Whenever we need knowledge to help us rank multiple answers/plans.
Example Queries
How are two terms related?
  GPA(studentID, $value), Student(studentID, GPA, address)
Find different ways of saying the same thing:
  Class(Lexus, Luxury)
  LuxuryCar(Lexus, Toyota)
When do two terms play similar roles?
  IJCAIReview(p1, rev2, accept)
  AIJReferees(round2, p3, rev4, reject)
Challenges for C-B-KR
Building the corpus.
How focused should the corpus be?
Is human tuning needed or helpful?
How do we accommodate inference?
How do we leverage traditional KR?
Summary
The vision: data authoring, querying and sharing by everyone.
  We got the plumbing to work.
  To go further, we need AI techniques.
Challenge: cross the structure chasm.
  It’s hard to author & query structured data!
  PDMS: architecture for ad-hoc sharing.
  Ontology/schema matching is key!
Are we providing the right tools?
  Corpus-based knowledge representation.
We need benchmarks!
Some References
www.cs.washington.edu/homes/alon
Piazza: ICDE-03, WWW-03, VLDB-03
The Structure Chasm: CIDR-03
Mediation surveys: VLDB Journal 01; Lenzerini tutorial.
Schema matching: Rahm and Bernstein, VLDB Journal 01.
Workshops: IJCAI, Semantic Web Conf.
Teaching integration to undergraduates: SIGMOD Record, September 2003.