Crossing the Structure Chasm
Alon Halevy, University of Washington, Seattle
UBC, January 15, 2004
The Structure Chasm
Authoring: writing text vs. creating a schema
Querying: keywords vs. using someone else’s schema
Data sharing: easy vs. committees, standards
But with structure we can pose complex queries.
Why is This a Problem?
Databases used to be isolated and administered only by experts.
Today’s applications call for large-scale data sharing:
  Big science (bio-medicine, astrophysics, …)
  Government agencies
  Large corporations
  The web (over 100,000 searchable data sources)
The vision:
  Content authoring by anyone, anywhere
  Powerful database-style querying
  Use relevant data from anywhere to answer the query
  The Semantic Web
Fundamental problem: reconciling different models of the world.
Outline
Other benefits of structure:
  (Semantic) email
  Personal data management
A tour of recent data sharing architectures:
  Data integration systems
  Peer-data management systems
The algorithmic problems:
  Query reformulation
  Reconciling semantic heterogeneity
  What can we do with a large corpus of schemas?
Adding Structure to Email
Email is often used for lightweight data management tasks:
  Organizing a PC meeting + dinner
  Arranging a ‘balanced’ potluck
  Giving away opera tickets
  Announcing an event and associated reminders
Some specialized tools/services: Outlook scheduling, evite.com
Can we delegate some email tasks easily?
Semantic Email Processes
[Figure: an originator sends “Start a potluck process”. The process asks each recipient “What will you bring?” and records the replies in a database (john@cs: Dessert, mary@ee: Appetizer, jayant@u: Dessert, jane@cs: Entree). Each reply is checked against the constraints: a further “I’ll bring a dessert” is answered with “Too many desserts. Appetizer or entrée?”, after which the recipient replies “I’ll bring an entree”. The originator receives “Here is what everyone is bringing…”.]
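The potluck exchange above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Semantic Email implementation: the class name `PotluckProcess`, the limit of two replies per category, and the wording of the rejection message are all hypothetical.

```python
from collections import Counter


class PotluckProcess:
    """Collects replies and rejects any dish category that is already full."""

    def __init__(self, categories, max_per_category):
        self.categories = list(categories)
        self.max_per_category = max_per_category  # assumed 'balance' constraint
        self.database = {}  # recipient address -> dish category

    def reply(self, sender, category):
        # Count what everyone else has committed to so far.
        counts = Counter(c for s, c in self.database.items() if s != sender)
        if counts[category] >= self.max_per_category:
            open_slots = [c for c in self.categories
                          if c != category and counts[c] < self.max_per_category]
            return f"Too many {category}s. " + " or ".join(open_slots) + "?"
        self.database[sender] = category
        return "OK"

    def summary(self):
        """What the originator sees: who brings what."""
        return dict(self.database)


potluck = PotluckProcess(["appetizer", "entree", "dessert"], max_per_category=2)
potluck.reply("john@cs", "dessert")
potluck.reply("mary@ee", "appetizer")
potluck.reply("jayant@u", "dessert")
rejected = potluck.reply("jane@cs", "dessert")  # third dessert violates the constraint
accepted = potluck.reply("jane@cs", "entree")
```

The point of the sketch is that the replies become rows in a small database, so the constraint check and the final summary are ordinary queries rather than manual bookkeeping by the originator.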
Semantic Email [Etzioni, McDowell, (Ha)Levy]
Creating the structure? We’ll help with template interfaces.
Incorporating additional knowledge?
  I always bring desserts
  I don’t schedule morning meetings
  Another data sharing challenge.
But it’s free (and cross-platform): www.cs.washington.edu/research/semweb
Personal Data Management
Data is organized by application (HTML, mail & calendar, papers, files, presentations).
[Figure: a graph of personal data objects (Person, Message, Event, Document, Paper, Web Page, Presentation) connected by associations (Sender, Recipients, Organizer, Participants, Author, Homepage, Cites, Cached, Softcopy).]
[Semex: Sigurdsson, Nemes, H.]
Finding Publications
[Screenshot: a Publication search for “Bernstein”. Results shown include Person: A. Halevy, Person: Dan Suciu, Person: Maya Rodrig, Person: Steven Gribble, Person: Zachary Ives, and Publication: What Can Peer-to-Peer Do for Databases, and Vice Versa.]
Following Associations (1) to (4)
[Screenshots: starting from the Publication search for “Bernstein”, the user follows associations step by step: Publication, Cited by, Citations, Cited Authors. Publications shown:
  “A survey of approaches to automatic schema matching”
  “Corpus-based schema matching”
  “Database management for peer-to-peer computing: A vision”
  “Matching schemas by learning from others”]
Structure for Personal Data
High-level concepts are given, but later we:
  extend and personalize the concept hierarchy,
  share (parts of) our data with others,
  incorporate external data into our view.
Concepts are populated automatically with instances.
Need instance-level reconciliation:
  Alon Halevy, A. Halevy, Alon Y. Levy: same guy!
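A minimal sketch of instance-level reconciliation for the three name variants above. It is not the Semex method: the rule (same first initial plus a similar surname) and the 0.75 threshold are assumptions chosen for illustration, and real reconciliation would also use other evidence such as email addresses and co-authors.

```python
from difflib import SequenceMatcher


def parse(name):
    """Return (first initial, surname), both lowercased."""
    parts = name.replace(".", "").split()
    return parts[0][0].lower(), parts[-1].lower()


def same_person(a, b, threshold=0.75):
    """Heuristic: first initials agree and surnames are similar enough."""
    initial_a, surname_a = parse(a)
    initial_b, surname_b = parse(b)
    similarity = SequenceMatcher(None, surname_a, surname_b).ratio()
    return initial_a == initial_b and similarity >= threshold


variants = ["Alon Halevy", "A. Halevy", "Alon Y. Levy"]
all_match = all(same_person(variants[0], v) for v in variants[1:])
```

Here “halevy” and “levy” share the block “levy”, giving a similarity of 0.8, which clears the assumed threshold; a different first initial blocks the match regardless of surname.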
Data Integration
Goal: provide a uniform interface to a set of autonomous data sources.
First step towards data sharing.
Many research projects (DB & AI)
  Mine: Information Manifold, Tukwila, LSD
Recent industry:
  Startups: Nimble, Enosys, Composite, MetaMatrix
  Products from big players: BEA, IBM
Relational DBMS Refresher
Schema: the template for data.

Students:
  SSN          Name     Category
  123-45-6789  Charles  undergrad
  234-56-7890  Dan      grad
  …            …

Takes:
  SSN          CID
  123-45-6789  CSE444
  123-45-6789  CSE444
  234-56-7890  CSE142
  …

Courses:
  CID     Name               Quarter
  CSE444  Databases          fall
  CSE541  Operating systems  winter

Queries:
  SELECT C.name
  FROM Students S, Takes T, Courses C
  WHERE S.name = 'Mary' and S.ssn = T.ssn and T.cid = C.cid
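The refresher query can be run as written with Python's built-in `sqlite3` module. The sketch below loads only the rows visible on the slide (the repeated Takes row is entered once, and the elided “…” rows are not filled in); the helper name `courses_taken_by` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
""")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
    ("123-45-6789", "Charles", "undergrad"),
    ("234-56-7890", "Dan", "grad"),
])
conn.executemany("INSERT INTO Takes VALUES (?, ?)", [
    ("123-45-6789", "CSE444"),
    ("234-56-7890", "CSE142"),
])
conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)", [
    ("CSE444", "Databases", "fall"),
    ("CSE541", "Operating systems", "winter"),
])


def courses_taken_by(student_name):
    """Names of the courses taken by the given student (three-way join)."""
    rows = conn.execute(
        """SELECT C.name
           FROM Students S, Takes T, Courses C
           WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid""",
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

With these rows, asking for Charles returns the Databases course, while asking for Mary returns nothing because no Mary appears among the rows shown.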
Data Integration: Higher-level Abstraction
[Figure: a query Q is posed on the mediated schema. Semantic mappings relate the mediated schema to the data sources, each with its own tables (such as Students, Takes, Courses), and Q is translated into source queries Q1, Q2, Q3.]
Mediated Schema
[Figure: a mediated schema over the sources OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, Locus-Link, and GEO. Its concepts include Entity, Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, and Microarray Experiment.]
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
www.biomediator.org (Tarczy-Hornoch, Mork)
Semantic Mappings
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
Differences in:
  Names in schema
  Attribute grouping
  Coverage of databases
  Granularity and format of attributes
Issues for Semantic Mappings
[Figure: a query Q on the mediated schema is reformulated, through the semantic mappings, into queries Q’ on the data sources.]
Formalism for mappings
Reformulation algorithms
How will we create them?
Beyond Data Integration
Mediated schema is a bottleneck for large-scale data sharing.
It’s hard to create, maintain, and agree upon.
Peer Data Management Systems
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, and Toronto connected by mappings; a query Q and its reformulations Q1 through Q6 at the other peers.]
Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
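A minimal sketch of "traversing semantic paths", not Piazza's algorithm: a breadth-first search that finds, for each peer reachable from the querying peer, one chain of mappings along which the query could be reformulated. The peer names come from the figure, but the mapping edges below are hypothetical, since the slide text does not record which peers are mapped to which.

```python
from collections import deque


def mapping_paths(mappings, start):
    """Return one shortest mapping path from `start` to each reachable peer."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in mappings.get(peer, []):
            if neighbor not in paths:  # visit each peer once
                paths[neighbor] = paths[peer] + [neighbor]
                queue.append(neighbor)
    return paths


# Hypothetical local mappings: each peer maps only to convenient neighbors.
mappings = {
    "UW": ["Stanford", "DBLP"],
    "Stanford": ["UBC"],
    "DBLP": ["CiteSeer"],
    "UBC": ["Waterloo"],
    "Waterloo": ["Toronto"],
}

paths_from_uw = mapping_paths(mappings, "UW")
```

In a real system each edge carries a schema mapping, and the query is rewritten once per edge along the path; the sketch only shows how local mappings yield global reachability.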
PDMS-Related Projects
  Hyperion (Toronto)
  PeerDB (Singapore)
  Local relational models (Trento)
  Edutella (Hannover, Germany)
  Semantic Gossiping (EPFL Zurich)
  Raccoon (UC Irvine)
  Orchestra (Ives, U. Penn)
A Few Comments about Commerce
Until 5 years ago: data integration = data warehousing.
Since then:
  A wave of startups: Nimble, MetaMatrix, Calixa, Composite, Enosys
  Big guys made announcements (IBM, BEA).
  [Delay] Big guys released products.
Success: analysts have a new buzzword, EII.
  New addition to acronym soup (with EAI).
Lessons:
  Performance was fine. Need management tools.
Data Integration: Before
[Figure: a query Q on the mediated schema is reformulated into queries Q’ on five sources.]
Data Integration: After
[Figure: Nimble product architecture. Front-end: User Applications, Lens™, File InfoBrowser™, Software Developers Kit, NIMBLE™ APIs, Lens Builder™. Management Tools: Integration Builder, Security Tools, Data Administrator, Concordance Developer. Integration Layer: Nimble Integration Engine™ (Compiler, Executor), Metadata Server, Cache. XML queries are posed on a Common XML View over the sources: Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages.]
Sound Business Models
Explosion of intranet and extranet information
80% of corporate information is unmanaged
By 2004, 30X more enterprise data than 1999
The average company:
  maintains 49 distinct enterprise applications
  spends 35% of total IT budget on integration-related efforts
[Chart: Enterprise Information, 1995 to 2005. Source: Gartner, 1999]
Languages for Schema Mapping
[Figure: a query Q on the mediated schema is reformulated into queries Q’ on five sources.]
Mapping languages: GAV, LAV, GLAV
Local-as-View (LAV)
Mediated schema:
  Book: ISBN, Title, Genre, Year
  Author: ISBN, Name
Sources: R1, R2, R3, R4, R5, each described as a view over the mediated schema:
  R1(x,y,n) :- Book(x, y, z, t), Author(x, n), t < 1970    (books before 1970)
  R5(x,y) :- Book(x, y, "Humor", t)    (humor books)
Query Reformulation
Mediated schema: Book(ISBN, Title, Genre, Year), Author(ISBN, Name)
Sources: R1(ISBN, Title, Name): books before 1970; R5(ISBN, Title): humor books

Query: Find authors of humor books
Plan: R1 Join R5

Query: Find authors of humor books before 1960
Plan: Can’t do it! (subtle reasons)
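The two cases above can be checked on toy data. The sketch below is an illustration under assumptions, not a reformulation algorithm: the book rows and author names are invented, and R5 is assumed to range over the four-attribute Book relation. It shows that the plan R1 Join R5 returns only correct authors (though not necessarily all of them), and that two databases differing only in a hidden Year value produce identical source contents but different answers to the 1960 query, which is why no plan over R1 and R5 can answer it.

```python
def r1(book, author):
    """Source R1: (ISBN, Title, Name) for books published before 1970."""
    return {(i, t, n) for (i, t, g, y) in book
            for (i2, n) in author if i == i2 and y < 1970}


def r5(book):
    """Source R5: (ISBN, Title) for humor books."""
    return {(i, t) for (i, t, g, y) in book if g == "Humor"}


def plan(r1_rows, r5_rows):
    """The plan from the slide: join R1 and R5 on ISBN, return author names."""
    return {n for (i, t, n) in r1_rows for (i2, t2) in r5_rows if i == i2}


def humor_authors(book, author, before=None):
    """The query evaluated directly on the (normally unavailable) mediated data."""
    return {n for (i, t, g, y) in book for (i2, n) in author
            if i == i2 and g == "Humor" and (before is None or y < before)}


# Two hypothetical databases that differ only in the year of book b1.
author = {("b1", "Ann"), ("b2", "Bob")}
book_a = {("b1", "T1", "Humor", 1955), ("b2", "T2", "Humor", 1985)}
book_b = {("b1", "T1", "Humor", 1965), ("b2", "T2", "Humor", 1985)}

plan_answer = plan(r1(book_a, author), r5(book_a))
```

Because the sources look the same for both databases, any plan must return the same answer for both, yet the true answer to the 1960 query is {Ann} for the first and empty for the second.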
Query Reformulation
Query is posed on mediated schema that contains no data.
Sources are answers to queries (views).
Problem: answering queries using views
  (Conceptually) Need to invert query expression.
Traditional databases also use this:
  Can you reuse previously cached results?
Answering Queries Using Views
NP-Complete for basic queries [LMSS, PODS 95].
Results depend on:
  Query language used for sources and queries
  Open-world vs. closed-world assumption
  Allowable access patterns to the sources
A lot of beautiful theory!
Theory?
A lot of beautiful theory.
“There is in these words the beautiful maneuverability of the abstract, rushing in to replace the intractability of the concrete.”
Milan Kundera, The Book of Laughter and Forgetting
Practical Query Reformulation
A lot of nice theory.
But also very practical algorithms:
  MiniCon [Pottinger and H., 2001]: scales to thousands of sources.
  Every commercial DBMS implements some version of answering queries using views.
See [Halevy, 2001] for survey.
Reformulation in PDMS
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, and Toronto connected by mappings.]
Can’t follow all paths naively
  Pruning techniques [Tatarinov, H.]
Can we pre-compute some paths?
  Need to compose mappings [Madhavan, H., VLDB-2003]
Open PDMS Research Issues
[Figure: peers UW, Stanford, DBLP, UBC, Waterloo, CiteSeer, and Toronto connected by mappings.]
Managing large networks of mappings:
  • Consistency
  • Trust
Improving networks: finding additional mappings
Indexing: heterogeneous data across the network
Caching: Where? What?
Semantic Mappings
[Figure repeated: the schemas of Inventory Database A (BooksAndMusic) and Inventory Database B (Books, CDs, BookCategories, CDCategories, Artists, Authors).]
Need mappings in every data sharing architecture
“Standards are great, but there are too many.”
Why is it so Hard?
Schemas never fully capture their intended meaning:
  Schema elements are just symbols.
  We need to leverage any additional information we may have.
‘Theorem’: Schema matching is AI-Complete.
  Hence, a human will always be in the loop.
  Goal is to improve designer’s productivity.
  Solution must be extensible.
Matching Heuristics
Multiple sources of evidence in the schemas:
  Schema element names
    BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation
    ItemID: unique identifier for a book or a CD
    ISBN: unique identifier for any book
  Data types, data instances
    DateTime, Integer; addresses have similar formats
  Schema structure
    All books have similar attributes
Use domain knowledge
All these techniques consider only the two schemas.
In isolation, techniques are incomplete or brittle: need principled combination.
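Two of the heuristics above, name evidence and description evidence, can be sketched as follows. This is an assumption-laden illustration rather than any published matcher: character trigrams and word overlap (Jaccard similarity) stand in for the name and description heuristics, and the equal-weight average in `combined_score` is a placeholder for the principled combination the slide calls for.

```python
import re


def trigrams(name):
    """Character trigrams of a schema element name, ignoring case and punctuation."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    return {letters[i:i + 3] for i in range(len(letters) - 2)}


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def name_score(name_a, name_b):
    return jaccard(trigrams(name_a), trigrams(name_b))


def description_score(desc_a, desc_b):
    return jaccard(words(desc_a), words(desc_b))


def combined_score(elem_a, elem_b):
    """Equal-weight average of the two kinds of evidence (assumed weights)."""
    return 0.5 * name_score(elem_a["name"], elem_b["name"]) + \
           0.5 * description_score(elem_a["doc"], elem_b["doc"])


item_id = {"name": "ItemID", "doc": "unique identifier for a book or a CD"}
isbn = {"name": "ISBN", "doc": "unique identifier for any book"}

categories = name_score("BooksAndCDs/Categories", "BookCategories/Category")
unrelated = name_score("BooksAndCDs/Categories", "Artists/GroupName")
```

The example shows why each heuristic is brittle alone: the names ItemID and ISBN share no trigrams, so only the descriptions reveal the correspondence, while the two Categories elements are found from names alone.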
Using Past Experience
Matching tasks are often repetitive.
  Humans improve over time at matching.
  A matching system should improve too!
LSD: learns to recognize elements of mediated schema. [Doan, Domingos, H., SIGMOD-01, MLJ-03]
  Doan: 2003 ACM Distinguished Dissertation Award.
Example: Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description

Schema of realestate.com (mappings to the mediated schema are given):
  location      listed-price   phone            comments
  Miami, FL     $250,000       (305) 729 0831   Fantastic house
  Boston, MA    $110,000       (617) 253 1429   Great location
  ...           ...            ...              ...

Learned hypotheses:
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description

homes.com (to be matched):
  price         contact-phone    extra-info
  $550,000      (278) 345 7215   Beautiful yard
  $320,000      (617) 335 2315   Great beach
  ...           ...              ...
Learning Source Descriptions
We learn a classifier for each element of the mediated schema.
Training examples are provided by the given mappings.
Multi-strategy learning:
  Base learners: name, instance, description
  Combine using stacking.
Accuracy of 70-90% in experiments.
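The real-estate example can be turned into a small multi-strategy sketch. It is not LSD itself: the base learners here are simple token-overlap scorers, the description learner is omitted, and a fixed equal-weight average stands in for the stacking meta-learner (which would learn the weights from training data). The class name `ColumnClassifier` and its methods are hypothetical.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ColumnClassifier:
    """Scores a source column against each mediated-schema element."""

    def __init__(self):
        self.name_tokens = {}   # label -> tokens seen in training column names
        self.value_tokens = {}  # label -> tokens seen in training data values

    def train(self, label, column_name, values):
        self.name_tokens.setdefault(label, set()).update(tokens(column_name))
        seen = self.value_tokens.setdefault(label, set())
        for value in values:
            seen.update(tokens(value))

    def name_learner(self, label, column_name):
        new, seen = tokens(column_name), self.name_tokens[label]
        return len(new & seen) / len(new | seen)

    def instance_learner(self, label, values):
        new = set().union(*(tokens(v) for v in values))
        return len(new & self.value_tokens[label]) / len(new) if new else 0.0

    def predict(self, column_name, values, weights=(0.5, 0.5)):
        # Fixed weights replace the stacking meta-learner in this sketch.
        scores = {
            label: weights[0] * self.name_learner(label, column_name)
                   + weights[1] * self.instance_learner(label, values)
            for label in self.name_tokens
        }
        return max(scores, key=scores.get)


clf = ColumnClassifier()
# Training examples come from the given realestate.com mappings.
clf.train("address", "location", ["Miami, FL", "Boston, MA"])
clf.train("price", "listed-price", ["$250,000", "$110,000"])
clf.train("agent-phone", "phone", ["(305) 729 0831", "(617) 253 1429"])
clf.train("description", "comments", ["Fantastic house", "Great location"])

# Columns of the new source, homes.com.
guess_price = clf.predict("price", ["$550,000", "$320,000"])
guess_phone = clf.predict("contact-phone", ["(278) 345 7215", "(617) 335 2315"])
guess_info = clf.predict("extra-info", ["Beautiful yard", "Great beach"])
```

The extra-info column shows the value of combining learners: its name matches nothing seen in training, so only the shared word “great” in the data values points to description.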
Corpus-Based Schema Matching
Can we use previous experience to match two new schemas?
Can a corpus of schemas and matches be a general purpose resource?
Information Retrieval and NLP progressed by using corpora. Can the same be done for structured data?
[Figure: a corpus of schemas and matches, with elements such as CDs, Categories, Artists, Items, Authors, Books, Music, Information, Literature, and Publisher. General purpose knowledge is learned from the corpus: a classifier for every corpus element, built by multi-strategy learning (Name Learner, Data Instances Learner, Data Type Learner, Description Learner, Structure Learner, combined by a Meta Learner). The extracted knowledge is reused to match new schemas.]
The Corpus vs. Other Matchers
Inventory Domain
[Chart: Recall (0 to 1) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series MKB, BASIC, COMB.]
Exploiting Previous Experience
Shipping Domain
[Chart: Avg Number of Matches (-15 to 15) for schema pairs P1a, P1b, P2a, P2b, P3a, P3b, P4a, P4b; series Only MKB, Only BASIC.]
Corpus Challenges
What exactly should we learn?
Generalizing with few training examples
Balancing previous experience with other clues
Size and scope of the corpus
Other Corpus-Based Tools
Conjecture: a corpus of schemas can be the basis for many useful tools.
  Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
  Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
Now we can cross the structure chasm.
Conclusion
Vision: data authoring, querying and sharing by everyone, everywhere.
Structure is useful in our daily tasks.
Key challenge: reconciling semantic heterogeneity
[Figure: corpus of schemas; schema mapping.]
Some References
www.cs.washington.edu/homes/alon
Piazza: ICDE03, WWW03, VLDB-03
The Structure Chasm: CIDR-03
Surveys on schema matching languages: Halevy, VLDB Journal 01; Lenzerini, PODS 2002
Semi-automatic schema matching: Rahm and Bernstein, VLDB Journal 01.
Teaching integration to undergraduates: SIGMOD Record, September, 2003.