Automatic Schema Matching

Nicole Oldham
CSCI 8350 (Semantic Web Course @ Univ of Georgia)
Topic Presentation

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Schema Matching

• Match: takes two schemas as input and produces a mapping between the elements that correspond to each other semantically
• It is usually performed manually, which is:
  - Tedious
  - Time consuming
  - Error prone
  - Expensive

We must automate this process.

Example

• GTE Telecommunications needed to integrate 40 databases with a total of 27,000 elements
• Project planners estimated that manual matching would take 12 person-years

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Various Levels of Heterogeneity

ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

How to deal with Semantic Heterogeneity

1. Standardize: agree on a common representation
2. Translate: create mappings between different schemas
   - requires human input and machine reasoning
   - mappings can be difficult and expensive
3. Annotate: create relationships between agreed-upon conceptualizations
   - requires human input and machine reasoning
   - annotation can be difficult and expensive

ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Challenges

• The actual semantics of the involved elements are typically known only to the creators or recorded in documentation, so we must use clues in the schema and data instead
• These clues are often misleading
  - E.g. 'Area' can refer to different entities
  - E.g. the same entities can have very different names
• Clues are often ambiguous
  - E.g. 'Contact-agent': agent name or phone number?
• The matching process can be very costly
  - Each element of the schema must be examined to ensure discovery of the best match
• Matching is often subjective, depending on the application

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey


Where is Schema Matching used?

• Database Application Domains
  - Data Integration
  - Data Warehousing
  - E-Business
  - Query Processing
• Semantic Web
  - XML/HTML to an Ontology
  - Semantic Web Services

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Integration

Problem: construct a global view from a set of independently constructed schemas (e.g. ontologies)
- Different structures and terminologies

Solution: schema matching is performed to find relationships between concepts in each schema. The matching elements can then be unified.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Data Warehouses

Problem: integrating data sources into a data warehouse
- Different formats between the source and the warehouse

Solution: use matching to find the elements of the source that are also present in the warehouse. The details of the semantics can then be examined to integrate the two.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

E-Commerce

Problem: message translation
- Each trading partner uses its own message format

Solution: a match operation would reduce the amount of manual work needed to specify how the formats are related

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Query Processing

Problem: the terms used in the user's query may be different from those in the database

Solution: matching is used to map the user-specified concepts in the query to schema elements

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any other form suitable for the Semantic Web
• We must annotate them with concepts from ontologies
• Solution: use schema matching to map between elements represented in OWL and the different schemas of Web documents

Semantic Web Services

• Problem: Web Services are currently searched for using keywords
• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently
• WSDLs are in XML; ontologies are in OWL
• Solution: use schema matching approaches to map between the two different schemas


Term Definitions

• Schema: a set of elements connected by some structure
• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2
• Mapping Expression: tells how the S1 and S2 elements are related

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C                CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
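The mapping elements in this example can be read as small expressions that compute S2 values from S1 values. The following is a minimal sketch under stated assumptions, not a method from the survey: the dictionary layout, the `translate` helper, the sample record, and the single-space separator used for Concatenate are all hypothetical.

```python
# Hypothetical in-memory form of the mapping from the slide.
# Each mapping element pairs an S2 target with an expression over S1 (Cust) fields.
mapping = {
    "Customer.CustID": lambda cust: cust["C"],
    # Concatenate(FirstName, LastName); the space separator is an assumption.
    "Customer.Contact": lambda cust: f'{cust["FirstName"]} {cust["LastName"]}',
    "Customer.Company": lambda cust: cust["CName"],
}


def translate(cust_row):
    """Apply every mapping expression to one S1 (Cust) record."""
    return {target: expr(cust_row) for target, expr in mapping.items()}


# Invented sample record, for illustration only.
s1_row = {"C": 17, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}
s2_row = translate(s1_row)
```

Note that Customer.Phone has no mapping element, so it does not appear in the translated record; a mapping does not have to cover every element of either schema.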

Example

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information
• Element vs. Structure matching: a match can be performed for individual schema elements or for combinations of elements
• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships)
• Matching Cardinality: the match result may relate one or more elements of one schema to one or more elements of another
• Auxiliary Information: the matcher relies on other information besides the input schemas, such as dictionaries, user input, or global schemas

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level
        Linguistic (sample approaches: Name Similarity, Description Similarity, Global Namespaces)
        Constraint (sample approaches: Type Similarity, Key Properties)
      Structure Level
        Constraint (sample approach: Group Matching)
    Instance/Contents
      Element Level
        Linguistic (sample approach: Word Frequency)
        Constraint (sample approach: Value Pattern and Ranges)
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition

Further criteria: match cardinality, auxiliary information used, ...

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure
• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2
• The match comparison can be based on the name, description, or data type of the element
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
  - Full structure match
  - Partial structure match
• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Full structure match:

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

Partial structure match:

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
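The difference between a full and a partial structure match can be illustrated by counting how many sub-elements of an S1 structure find a partner in the S2 structure. This is only a sketch: the `structure_match` function, the coverage measure, and the `leaf_matches` set (assumed to come from some element-level matcher) are hypothetical and are not taken from the survey. It also checks coverage from the S1 side only.

```python
def structure_match(children1, children2, leaf_matches):
    """Classify a structure match by how many S1 sub-elements find a partner in S2."""
    matched = [c for c in children1
               if any((c, d) in leaf_matches for d in children2)]
    coverage = len(matched) / len(children1)
    kind = "full" if coverage == 1.0 else "partial" if coverage > 0 else "none"
    return kind, coverage


# Leaf-level correspondences, assumed to be the output of an element-level matcher.
leaf_matches = {("Street", "Street"), ("City", "City"), ("State", "USState"),
                ("Zip", "PostalCode"), ("Name", "Cname"), ("Address", "CAddress")}

# Address vs. CustAddress: every sub-element has a partner.
full = structure_match(["Street", "City", "State", "Zip"],
                       ["Street", "City", "USState", "PostalCode"], leaf_matches)

# AccountOwner vs. Customer: only Name and Address have partners.
partial = structure_match(["Name", "Address", "Birthdate", "TaxExempt"],
                          ["Cname", "CAddress", "CPhone"], leaf_matches)
```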

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local match cardinalities     S1 element(s)                     S2 element(s)         Matching expression
1:1, element level            Price                             Amount                Amount = Price
n:1, element level            Price, Tax                        Cost                  Cost = Price * (1 + Tax/100)
1:n, element level            Name                              FirstName, LastName   FirstName, LastName = Name
n:m, element level            B.Title, B.PuNo, P.PuNo, P.Name   A.Book, A.Publisher   A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
(also n:1, structure level)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
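Two of the matching expressions in the table can be written directly as functions. The sketch below is illustrative only: the function names are hypothetical, and the 1:n case assumes that Name has the form "First Last", which the slide does not state.

```python
def cost_from_price_and_tax(price, tax_percent):
    """n:1 match: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)


def split_name(name):
    """1:n match: derive FirstName and LastName from Name.

    Assumes a 'First Last' layout; real data would need a more careful split.
    """
    first, _, last = name.partition(" ")
    return first, last


cost = cost_from_price_and_tax(100, 25)
first_name, last_name = split_name("Ada Lovelace")
```

The n:1 direction is a plain computation, whereas the 1:n direction needs extra knowledge about how the combined value is laid out, which is one reason complex matches are harder to discover automatically.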

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema
• Only a few works on complex matching exist
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology
• We need domain knowledge to perform complex matching accurately
• The best match is not always the top match returned by the matcher, so human involvement is still needed

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
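Several of these similarity criteria can be chained into one scoring function. The following is a rough sketch, not an implementation from the survey: the synonym table, the plural-stripping stand-in for stemming, and the 0.9 score for synonyms are all assumptions, and `difflib.SequenceMatcher` from the Python standard library is used as a simple substitute for a common-substring measure.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs; a real matcher would consult a thesaurus.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"zip", "postalcode"})}


def normalize(name):
    """Lowercase and strip a trailing plural 's' (a crude stand-in for stemming)."""
    name = name.lower()
    return name[:-1] if name.endswith("s") and len(name) > 3 else name


def name_similarity(a, b):
    """Return a score in [0, 1] based on equality, synonyms, then string similarity."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0                                  # equality after normalization
    if frozenset({a, b}) in SYNONYMS:
        return 0.9                                  # assumed score for synonyms
    return SequenceMatcher(None, a, b).ratio()      # shared-substring similarity


plural_score = name_similarity("Customers", "customer")
synonym_score = name_similarity("Zip", "PostalCode")
substring_score = name_similarity("ShipTo", "Ship2")
```

Hypernyms, soundex, and user-provided matches are left out here; each would be one more branch before the string-similarity fallback.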

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn (employee name)
  S2: name (name of employee)
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
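The simple end of that range, keyword extraction followed by a comparison of the keyword sets, might look like the sketch below. The stop-word list and the use of Jaccard overlap are assumptions made for illustration; synonym handling is omitted.

```python
STOP_WORDS = {"of", "the", "a", "an", "for"}  # assumed, minimal stop-word list


def keywords(description):
    """Extract lowercase keywords from a free-text description."""
    return {w for w in description.lower().split() if w not in STOP_WORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the two keyword sets, in [0, 1]."""
    k1, k2 = keywords(d1), keywords(d2)
    return len(k1 & k2) / len(k1 | k2) if k1 | k2 else 0.0


# The two comments from the example: the element names differ (empn vs. name),
# but the descriptions share the same keywords.
score = description_similarity("employee name", "name of employee")
```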

Constraint Based

• Schemas often contain constraints that define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - E.g. in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

Schema S1: Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
Schema S2: Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
Schema S: POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is itself a match problem
  2. Similarity values may depend on the domain, e.g. Salary and Income may be identical in a payroll application but not in a tax reporting application

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g. ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques
• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
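A constraint-based characterization of instance data can be as simple as abstracting each value into a pattern and comparing the patterns seen in two columns. The sketch below is one possible reading of "value patterns"; the pattern alphabet (9 for digits, A for letters), the overlap measure, and the sample values are assumptions.

```python
import re


def value_pattern(value):
    """Abstract a value into a pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(value)))


def pattern_overlap(col1, col2):
    """Jaccard overlap of the pattern sets observed in two columns of instance data."""
    p1 = {value_pattern(v) for v in col1}
    p2 = {value_pattern(v) for v in col2}
    return len(p1 & p2) / len(p1 | p2) if p1 | p2 else 0.0


# Invented sample columns, for illustration only.
zip_a = ["30602", "30605"]
zip_b = ["30301"]
phones = ["706-542-2911", "404-555-0100"]

same_shape = pattern_overlap(zip_a, zip_b)
different_shape = pattern_overlap(zip_a, phones)
```

Columns that share patterns are only candidates: two five-digit columns need not mean the same thing, which is why such evidence is usually combined with other matchers.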

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - E.g. two elements match if they have the same name and the same number of subelements

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey
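The example rule can be written down almost literally. In this sketch the `Element` class and the case-insensitive name comparison are assumptions; the rule itself is the one stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_same_name_and_arity(e1, e2):
    """Two elements match if they have the same name and the same number of subelements."""
    return (e1.name.lower() == e2.name.lower()
            and len(e1.children) == len(e2.children))


a = Element("Address", [Element("Street"), Element("City")])
b = Element("address", [Element("Street"), Element("Town")])
c = Element("Address", [Element("Street")])

a_matches_b = rule_same_name_and_arity(a, b)
a_matches_c = rule_same_name_and_arity(a, c)
```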

Learner Based Solutions

• Learner-based: exploit both schema and data
• Require a lot of training data, but can exploit instance data in addition to the schema
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
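A composite matcher can be sketched as a function that merges the score tables produced by independently run matchers and then picks the best candidate for each element. The weighted average, the table layout `{(s1, s2): score}`, and the sample scores below are assumptions chosen for illustration; the survey does not prescribe a particular combination function.

```python
def combine(results, weights):
    """Merge per-matcher score tables {(s1, s2): score} into one weighted average."""
    total = sum(weights)
    pairs = set().union(*results)
    return {p: sum(w * r.get(p, 0.0) for r, w in zip(results, weights)) / total
            for p in pairs}


def best_candidates(combined):
    """For every S1 element keep the S2 element with the highest combined score."""
    best = {}
    for (s1, s2), score in combined.items():
        if s1 not in best or score > best[s1][1]:
            best[s1] = (s2, score)
    return best


# Invented outputs of two independently executed matchers.
name_scores = {("Zip", "PostalCode"): 0.25, ("Zip", "City"): 0.5}
type_scores = {("Zip", "PostalCode"): 1.0, ("Zip", "City"): 0.0}

combined = combine([name_scores, type_scores], weights=[1, 1])
best = best_candidates(combined)
```

Here the name matcher alone would prefer City, while the combined score prefers PostalCode. Changing the weights or swapping a matcher does not require touching the others, which is the flexibility attributed to composite matchers above.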


LSD (Univ of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, and it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input is required:
  - The user must provide application-specific match/mismatch relations
  - The user must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e. to provide a new rule)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: linguistic, element-level
  - categorizes elements based on name, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs a bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution
• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching


MWSAF: Meteor-S Web Service Annotation Framework
LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up Web service descriptions with ontologies
  - It helps in describing services semantically and aids in efficient Web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: the difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework
XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of XML schema. Labels: Direction, DirectionCompass, compass, degrees, hasElement]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
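One way to picture this conversion is a small parser that turns each complexType into a node and adds a hasElement edge for every nested element. The sketch below uses Python's standard `xml.etree.ElementTree`; the `to_schema_graph` function, the (nodes, edges) output shape, the enclosing `xsd:schema` wrapper, and the `xsd1` namespace URI are all assumptions, and the actual MWSAF conversion rules may differ.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# The complexType from the slide, wrapped in a schema element so that it parses.
# The xsd1 namespace URI is a placeholder, not taken from the original WSDL.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def to_schema_graph(xml_text):
    """Return (nodes, edges): one hasElement edge per element nested in a complexType."""
    root = ET.fromstring(xml_text)
    nodes, edges = set(), []
    for ctype in root.iter(XSD + "complexType"):
        parent = ctype.get("name")
        nodes.add(parent)
        for elem in ctype.iter(XSD + "element"):
            nodes.add(elem.get("name"))
            edges.append((parent, "hasElement", elem.get("name")))
    return nodes, edges


nodes, edges = to_schema_graph(SCHEMA)
```

This sketch ignores the element types (DirectionCompass, int); a fuller conversion would presumably record them as well.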

MWSAF: Meteor-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Labels: WindEvent, windDirection, windSpeed, Speed, hasProperty]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked.
  - Schema Match: structural similarity, based on sub-concept similarities
• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
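A q-gram comparison of the kind described for NGram can be sketched as follows. The choice of q = 3, the lowercasing, and the Dice-style normalization to the range 0 to 1 are assumptions; the slide only states that the number of shared q-grams is considered.

```python
def qgrams(name, q=3):
    """Return the set of overlapping substrings of length q in a lowercased name."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over q-gram sets: 2 * shared / (|A| + |B|), in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# "windspeed" has 7 trigrams, "speed" has 3, and all 3 are shared.
score = ngram_similarity("windSpeed", "Speed")
```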

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from the mapping:
  - Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
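The slide does not give formulas for the two measures. The sketch below assumes one plausible reading: the average concept match divides the summed match scores by the number of mapped concepts, while the average service match divides the same sum by the total number of concepts in the WSDL, so that an ontology covering more of the service scores higher. The function names, the `categorize` helper, and the sample scores are hypothetical.

```python
def average_concept_match(mapped_scores):
    """Assumed definition: mean match score over the concepts that were mapped."""
    return sum(mapped_scores) / len(mapped_scores) if mapped_scores else 0.0


def average_service_match(mapped_scores, total_wsdl_concepts):
    """Assumed definition: summed match scores divided by all concepts in the WSDL."""
    return sum(mapped_scores) / total_wsdl_concepts if total_wsdl_concepts else 0.0


def categorize(scores_by_ontology, total_wsdl_concepts):
    """Pick the ontology (domain) with the highest average service match."""
    return max(scores_by_ontology,
               key=lambda o: average_service_match(scores_by_ontology[o],
                                                   total_wsdl_concepts))


# Invented match scores for a WSDL with 4 concepts, compared against two ontologies.
scores_by_ontology = {"weather": [0.75, 0.75], "geography": [0.5]}

concept_match = average_concept_match(scores_by_ontology["weather"])
service_match = average_service_match(scores_by_ontology["weather"], 4)
domain = categorize(scores_by_ontology, 4)
```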


Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches to matching
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Vassilis C. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 2: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Schema MatchingSchema Matching

bull Match Takes two schemas as input and produces a mapping between the elements that correspond to each other semantically

bull It is usually performed manually- Tedious- Time Consuming- Error Prone- Expensive

We must automate this process

ExampleExample

bull GTE telecommunications needed to integrate 40 databases with a total of 27000 elements

bull Project planners estimated that manual matching would take 12 person years to integrate

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Various Levels of HeterogenityVarious Levels of Heterogenity

ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

How to deal with Semantic How to deal with Semantic HeterogenityHeterogenity

1 Standardize agree on a common representation

2 Translate create mappings between different schemas1048766 -requires human input and machine reasoning1048766 -mappings can be difficult and expensive

3 Annotate create relationships between agreed upon conceptualizations

1048766 -requires human input and machine reasoning1048766 -annotation can be difficult and expensive1048766

ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

ChallengesChallengesbull Actual semantics of the involved elements are typically only from the

creators or documentation ndash so we must use clues in the schema and data instead

bull These clues are often misleading bull Ie lsquoArearsquo can refer to different entitiesbull Ie The same entities can have very different names

bull Clues are often ambiguousbull Ie lsquoContact-agentrsquo Agent name or phone number

bull Matching process can be very costlybull Each element of the schema must be examined to ensure discovery of

the best match

bull Matching is often subjective depending on the application

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Where is Schema Matching Where is Schema Matching usedused

bull Database Application Domains- Data Integration- Data Warehousing- E-Business- Query Processing

bull Semantic Web- XMLHTML to an Ontology- Semantic Web Services

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema IntegrationSchema Integration

Problem Construct a global view from a set of independently constructed schemas

(ie ontologies)

- Different structure and terminologies

Solution Schema Matching is performed to find relationships between concepts in each schema Then the matching elements can be unified

Bernstein P Rahm E A survey of approaches to automatic schema matching

Data WarehousesData Warehouses

Problem Integrating data sources into a data warehouse

- Different formats between the source and warehouse

Solution Use matching to find the elements of the source that are also present in the warehouse Then the details of the semantics can be examined to integrate the two

Bernstein P Rahm E A survey of approaches to automatic schema matching

E-CommerceE-Commerce

Problem Message translation

-Each trading partner uses its own message format

Solution A match operation would reduce the amount of manual work to specify how the formats are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Query ProcessingQuery Processing

Problem The terms used in the userrsquos query may be different from those in the database

Solution Matching is used to map the user-specified concepts in the query to schema elements

Bernstein P Rahm E A survey of approaches to automatic schema matching

Need for Data Integration on the Need for Data Integration on the Semantic WebSemantic Web

bull Problem Web documents are not in RDF or any form suitable for the SW

bull We must annotate them with concepts from ontologies

bull Solution Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:

• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C                CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching
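The three mapping elements above can be sketched in Python as expressions evaluated over an S1 record. This is only an illustration: the sample record, the `MAPPING` table, and the `apply_mapping` helper are assumptions made for the example and do not come from the survey.

```python
# Hypothetical S1 record for the schema Cust(C, CName, FirstName, LastName).
cust = {"C": 42, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}

# Each mapping element pairs an S2 field with a mapping expression over S1 fields.
MAPPING = {
    "CustID": lambda r: r["C"],                                # Cust.C = Customer.CustID
    "Company": lambda r: r["CName"],                           # Cust.CName = Customer.Company
    "Contact": lambda r: f'{r["FirstName"]} {r["LastName"]}',  # Concatenate(FirstName, LastName)
}


def apply_mapping(record, mapping):
    """Build an S2 Customer record by evaluating every mapping expression."""
    return {target: expr(record) for target, expr in mapping.items()}


customer = apply_mapping(cust, MAPPING)
print(customer)
```

Customer.Phone is left unmapped here because no S1 element corresponds to it in the example.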

Example

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or only schema-level information

• Element vs. Structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs. Constraint: matching can be linguistic (names) or constraint-based (keys and relationships)

• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary Information: the matcher relies on information besides the input schemas, such as dictionaries, user input, and global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level
        Linguistic (sample approaches: Name Similarity, Description Similarity, Global Namespaces)
        Constraint (sample approaches: Type Similarity, Key Properties)
      Structure Level
        Constraint (sample approach: Group Matching)
    Instance/Contents
      Element Level
        Linguistic (sample approach: Word Frequency)
        Constraint (sample approach: Value Pattern and Ranges)
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition

Further Criteria:
- Match Cardinality
- Auxiliary information used
- …

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full Structure Match

• Partial Structure Match

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching
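One way to picture the difference between a full and a partial structure match is to count how many subelements of an S1 structure have a counterpart in the S2 structure. The sketch below is an assumption-laden illustration, not the survey's algorithm: the `EQUIVALENT` table stands in for the output of some element-level matcher, and `structure_match` is a hypothetical helper.

```python
# Hypothetical element-level correspondences, e.g. produced by a name matcher.
EQUIVALENT = {
    ("Street", "Street"), ("City", "City"),
    ("State", "USState"), ("Zip", "PostalCode"),
    ("Name", "Cname"), ("Address", "CAddress"),
}


def structure_match(s1_children, s2_children, equivalent=EQUIVALENT):
    """Return the fraction of S1 subelements that have a counterpart in S2."""
    if not s1_children:
        return 0.0
    matched = [c1 for c1 in s1_children
               if any((c1, c2) in equivalent for c2 in s2_children)]
    return len(matched) / len(s1_children)


# Address vs. CustAddress: every subelement has a counterpart.
full = structure_match(["Street", "City", "State", "Zip"],
                       ["Street", "City", "USState", "PostalCode"])

# AccountOwner vs. Customer: Birthdate and TaxExempt have no counterpart.
partial = structure_match(["Name", "Address", "Birthdate", "TaxExempt"],
                          ["Cname", "CAddress", "CPhone"])
print(full, partial)
```

A score of 1.0 corresponds to a full structure match; anything between 0 and 1 is a partial match.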

Match Cardinality

• One or more S1 elements can match one or more S2 elements

• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price * (1 + Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m, element level (also n:1, structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching
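The four matching expressions in the table can be written out as small Python functions to make the cardinalities concrete. The function names, the record layout, and the whitespace-based name split are assumptions made for this sketch only.

```python
def one_to_one(price):
    """1:1, Amount = Price."""
    return price


def n_to_one(price, tax):
    """n:1, Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax / 100)


def one_to_n(name):
    """1:n, split Name into (FirstName, LastName) at the first space (a simplification)."""
    first, _, last = name.partition(" ")
    return first, last


def n_to_m(books, publishers):
    """n:m, join B and P on PuNo and return (Book, Publisher) pairs."""
    name_by_puno = {p["PuNo"]: p["Name"] for p in publishers}
    return [(b["Title"], name_by_puno[b["PuNo"]])
            for b in books if b["PuNo"] in name_by_puno]


books = [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Orphan", "PuNo": 9}]
publishers = [{"PuNo": 1, "Name": "Example Press"}]
print(n_to_m(books, publishers))
```

Only the first expression is a plain renaming; the other three need a function, a split, or a join, which is why they are harder to discover automatically.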

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Little work has been done on complex matching
  - Some approaches hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name Matching: match elements with similar names

• Description Matching: match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names

• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, Soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element-level or structure-level

• Cardinality is not limited to 1:1

Bernstein P Rahm E A survey of approaches to automatic schema matching
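A few of the similarity definitions above (name equality, synonyms, common substrings) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the `SYNONYMS` table is invented for the example, and `difflib.SequenceMatcher` is used as a stand-in for a substring-based measure rather than being the measure the survey describes.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus or dictionary.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"zip", "postalcode"})}


def normalize(name):
    """Lower-case a name and drop separators so 'Ship_To' and 'ShipTo' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Return a similarity in [0, 1] for two element names."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0                               # equality of names
    if frozenset({a, b}) in SYNONYMS:
        return 1.0                               # equality of synonyms
    return SequenceMatcher(None, a, b).ratio()   # shared-substring similarity


print(name_similarity("ShipTo", "Ship2"))
print(name_similarity("Zip", "PostalCode"))
```

Stemming, hypernyms, and Soundex are left out to keep the sketch short; each would slot in as a further check before the substring fallback.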

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

  S1: empn // employee name

  S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching or as complex as natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching
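The simple end of that spectrum, keyword extraction followed by comparison, might look like the sketch below. The stop-word list and the use of Jaccard overlap are assumptions chosen for illustration; the survey does not prescribe a particular measure.

```python
# Hypothetical stop-word list; real systems use larger, language-specific lists.
STOP_WORDS = {"of", "the", "a", "an", "for"}


def keywords(description):
    """Extract the set of non-stop-word tokens from a schema comment."""
    return {w for w in description.lower().split() if w not in STOP_WORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The element names differ (empn vs. name), but the comments agree.
print(description_similarity("employee name", "name of employee"))
```

Here the element names empn and name give little evidence of a match, while the two comments reduce to the same keyword set.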

Constraint-Based

• Schemas often contain constraints that define data types, value ranges, optionality, relationship types, cardinalities, etc.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas

  e.g., in E-Commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example: three purchase-order schemas (labelled S1, S2 and S on the slide)

  Purchase-order
    Product
    BillTo (Name, Address)
    ShipTo (Name, Address)
    ContactPhone

  Purchase-order
    Product
    BillTo (Name, Address)
    ShipTo (Name, Address)
    Contact (Name, Address)

  POrder
    Article
    Payee
    BillAddress
    Recipient
    ShipAddress

• Problems:

  1. Determining which part of a new schema is similar to some part of a previously matched one is itself a match problem

  2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used

• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule-Based Solutions

• Rule-Based: hand-crafted rules exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey
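The example rule above (same name and same number of subelements) is small enough to write down directly. The `Element` class and the case-insensitive comparison are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(e1, e2):
    """Hand-crafted rule: same name (ignoring case) and same number of subelements."""
    return (e1.name.lower() == e2.name.lower()
            and len(e1.children) == len(e2.children))


a = Element("Address", [Element("Street"), Element("City")])
b = Element("address", [Element("Road"), Element("Town")])
c = Element("Address", [Element("Street")])
print(rule_match(a, b), rule_match(a, c))
```

The rule needs no training data, but it also accepts `a` and `b` without looking at what their subelements are, which shows how brittle a single hand-written rule can be.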

Learner-Based Solutions

• Learner-Based: exploit both schema and data

• Require a lot of training data, but can exploit instance data

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P Rahm E A survey of approaches to automatic schema matching
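A composite matcher can be sketched as a function that runs several independent matchers and merges their scores. The two toy matchers, the dictionary-based element representation, and the weighted average used for merging are all assumptions for illustration; other combination strategies (maximum, voting, learned weights) are equally possible.

```python
def name_matcher(a, b):
    """Toy matcher: 1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a, b):
    """Toy matcher: 1.0 if the declared data types are equal, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_match(a, b, matchers, weights):
    """Run each matcher independently and combine the results by weighted average."""
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


price = {"name": "Price", "type": "decimal"}
amount = {"name": "Amount", "type": "decimal"}
score = composite_match(price, amount, [name_matcher, type_matcher], [3, 1])
print(score)
```

Because the matchers are independent, one can be added, removed, or reweighted without touching the others, which is the flexibility the composite approach is credited with above.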

LSD (Univ of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs and learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required:
  - The user must provide application-specific match/mismatch relations
  - The user must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (e.g., to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher

• Element-level and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching
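Phases 2 and 3 can be pictured as a weighted mean followed by a selection step. The sketch below is not Cupid's implementation: the weight `w_struct`, the acceptance `threshold`, and the "best candidate per S1 element" selection rule are assumptions used only to show how the weighted mean feeds the final decision.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1 - w_struct) * lsim


def choose_mapping(candidates, threshold=0.5, w_struct=0.5):
    """Pick, for each S1 element, the best-scoring S2 element at or above the threshold.

    candidates maps (s1_element, s2_element) to a (lsim, ssim) pair.
    """
    best = {}
    for (e1, e2), (lsim, ssim) in candidates.items():
        score = weighted_similarity(lsim, ssim, w_struct)
        if score >= threshold and score > best.get(e1, (None, 0.0))[1]:
            best[e1] = (e2, score)
    return best


candidates = {
    ("Zip", "PostalCode"): (0.5, 1.0),  # weak name evidence, strong structural evidence
    ("Zip", "City"): (0.0, 0.5),
}
print(choose_mapping(candidates))
```

With equal weights, the structural evidence lifts Zip/PostalCode to 0.75 even though the names alone score only 0.5.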

Clio (IBM Almaden and Univ of Toronto)

• Aims at the semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

  A tool for semi-automatically marking up web service descriptions with ontologies

  It helps describe services semantically and aids efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

It then applies a matching algorithm to find the mappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="compass" nillable="true"
                 type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Nodes: Direction, compass, degrees, DirectionCompass; edge label: hasElement]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
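The idea of turning the complex type above into graph nodes and hasElement edges can be sketched with Python's standard XML parser. This is a simplified illustration, not the MWSAF conversion functions: the wrapping `xsd:schema` element and its namespace declaration are added here so the fragment parses, only the complexType-to-element rule is shown, and the `to_schema_graph` helper and its node/edge representation are assumptions.

```python
import xml.etree.ElementTree as ET

# The Direction type from the slide, wrapped in a schema element (an assumption)
# so that the xsd prefix is declared and the fragment is well-formed.
XSD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}


def to_schema_graph(xsd_text):
    """Return (nodes, edges): each complex type is linked to its elements by hasElement."""
    root = ET.fromstring(xsd_text)
    nodes, edges = set(), set()
    for ctype in root.iterfind(".//xsd:complexType", NS):
        parent = ctype.get("name")
        nodes.add(parent)
        for elem in ctype.iterfind(".//xsd:element", NS):
            child = elem.get("name")
            nodes.add(child)
            edges.add((parent, "hasElement", child))
    return nodes, edges


nodes, edges = to_schema_graph(XSD)
print(sorted(edges))
```

The sketch ignores the `type` attributes, so the DirectionCompass node from the figure is not produced; a fuller conversion would add it and connect it to `compass`.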

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:

  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked

  - Schema Match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter stemming, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
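An n-gram comparison of the kind NGram performs can be sketched as follows. The choice of q = 3, the lower-casing, and the Dice coefficient used to turn the shared q-grams into a value between 0 and 1 are assumptions for this example; the slides do not say which normalization MWSAF uses.

```python
def qgrams(name, q=3):
    """Return the set of q-character substrings of a lower-cased name."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over the q-gram sets of two names, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(ngram_similarity("windSpeed", "WindSpeed"))
print(ngram_similarity("windSpeed", "windDirection"))
```

Names shorter than q characters produce no q-grams and score 0.0 here, which is one reason such a measure is combined with synonym and abbreviation checks.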

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology

  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
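The slides name the two measures without giving formulas. The sketch below assumes one plausible reading: Average Concept Match divides the summed match scores by the number of mapped concepts, while Average Service Match divides by the total number of concepts in the WSDL, so unmapped concepts pull it down. The function names, the sample scores, and the ontology names are hypothetical.

```python
def average_concept_match(match_scores):
    """Mean score over the concepts that were actually mapped (assumed definition)."""
    if not match_scores:
        return 0.0
    return sum(match_scores) / len(match_scores)


def average_service_match(match_scores, wsdl_concept_count):
    """Summed score divided by all WSDL concepts, mapped or not (assumed definition)."""
    if wsdl_concept_count == 0:
        return 0.0
    return sum(match_scores) / wsdl_concept_count


def categorize(mappings, wsdl_concept_count):
    """Pick the ontology (domain) with the highest average service match."""
    return max(mappings,
               key=lambda o: average_service_match(mappings[o], wsdl_concept_count))


# Hypothetical match scores of one 4-concept WSDL against two ontologies.
mappings = {"weather": [1.0, 0.5, 0.5], "geography": [1.0]}
print(categorize(mappings, wsdl_concept_count=4))
```

In this example the geography ontology has the higher average concept match (its single mapping is perfect), but the weather ontology covers more of the WSDL and therefore wins the categorization.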

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches to matching:

  - Schema vs. Instance-level

  - Element vs. Structure-level

  - Language and Constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P, Rahm E. A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

• Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf

• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004pdf

• Vassilis C. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions

Page 3: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Schema MatchingSchema Matching

bull Match Takes two schemas as input and produces a mapping between the elements that correspond to each other semantically

bull It is usually performed manually- Tedious- Time Consuming- Error Prone- Expensive

We must automate this process

ExampleExample

bull GTE telecommunications needed to integrate 40 databases with a total of 27000 elements

bull Project planners estimated that manual matching would take 12 person years to integrate

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Various Levels of HeterogenityVarious Levels of Heterogenity

ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

How to deal with Semantic How to deal with Semantic HeterogenityHeterogenity

1 Standardize agree on a common representation

2 Translate create mappings between different schemas1048766 -requires human input and machine reasoning1048766 -mappings can be difficult and expensive

3 Annotate create relationships between agreed upon conceptualizations

1048766 -requires human input and machine reasoning1048766 -annotation can be difficult and expensive1048766

ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

ChallengesChallengesbull Actual semantics of the involved elements are typically only from the

creators or documentation ndash so we must use clues in the schema and data instead

bull These clues are often misleading bull Ie lsquoArearsquo can refer to different entitiesbull Ie The same entities can have very different names

bull Clues are often ambiguousbull Ie lsquoContact-agentrsquo Agent name or phone number

bull Matching process can be very costlybull Each element of the schema must be examined to ensure discovery of

the best match

bull Matching is often subjective depending on the application

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Where is Schema Matching Where is Schema Matching usedused

bull Database Application Domains- Data Integration- Data Warehousing- E-Business- Query Processing

bull Semantic Web- XMLHTML to an Ontology- Semantic Web Services

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema IntegrationSchema Integration

Problem Construct a global view from a set of independently constructed schemas

(ie ontologies)

- Different structure and terminologies

Solution Schema Matching is performed to find relationships between concepts in each schema Then the matching elements can be unified

Bernstein P Rahm E A survey of approaches to automatic schema matching

Data WarehousesData Warehouses

Problem Integrating data sources into a data warehouse

- Different formats between the source and warehouse

Solution Use matching to find the elements of the source that are also present in the warehouse Then the details of the semantics can be examined to integrate the two

Bernstein P Rahm E A survey of approaches to automatic schema matching

E-CommerceE-Commerce

Problem Message translation

-Each trading partner uses its own message format

Solution A match operation would reduce the amount of manual work to specify how the formats are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Query ProcessingQuery Processing

Problem The terms used in the userrsquos query may be different from those in the database

Solution Matching is used to map the user-specified concepts in the query to schema elements

Bernstein P Rahm E A survey of approaches to automatic schema matching

Need for Data Integration on the Need for Data Integration on the Semantic WebSemantic Web

bull Problem Web documents are not in RDF or any form suitable for the SW

bull We must annotate them with concepts from ontologies

bull Solution Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

bull Instance vs Schema matching approaches can consider instance data or schema-level information

bull Element vs Structure matching match can be performed for individual schema elements or combinations of elements

bull Language vs Constraint linguistic (names) or constraint-based (keys and relationships)

bull Matching Cardinality match result may relate one or more elements of one schema to one or more elements of another

bull Auxiliary Information matcher relies on other information besides the input schemas such as dictionaries user input global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

Schema Matching Approaches

Individual Matchers Combining Matchers

Schema-only

Structure LevelElement Level

InstanceContents

ConstraintLinguistic Constraint

hellip hellip hellip

Element Level

ConstraintLinguistic

hellip hellip

Hybrid Matchers Composite Matchers

Manual Composition Automatic Composition

Further Criteria -Match Cardinality -Auxiliary information usedhellip

bullName SimilaritybullDescription SimilaritybullGlobal Namespaces

bullWord Frequency

bullGroup Matching

bullType SimilaritybullKey Properties

bullValue Pattern and Ranges

Sample Approaches

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level MatchersSchema Level Matchers

bull Consider schema information instead of instance data Name Description Data Type Relationship Types Constraints Structure

bull Often produces multiple candidates and estimates a degree of similarity for each

1 Granularity of match (element level vs structure level)2 Match Cardinality3 Linguistic Approaches Name or Description Matching4 Constraint-Based Approaches5 Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
    S1: empn // employee name
    S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
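The simple end of that spectrum, keyword extraction followed by an overlap measure, might look like the sketch below. The stopword list and the choice of Jaccard overlap are assumptions made for illustration; synonym matching is omitted.

```python
# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"of", "the", "a", "an", "for", "and", "in"}


def keywords(description):
    """Extract lowercase keywords from a schema comment."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the keyword sets, a value between 0 and 1."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)
```

For the example above, "employee name" and "name of employee" yield the same keyword set, so empn and name would be proposed as a match even though their names differ.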

Constraint Based

• Schemas often contain constraints that define data types, value ranges, optionality, relationship types, cardinalities, etc.
• If two elements carry similar constraints, that is evidence that they match

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - E.g. in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example: [Figure: three purchase-order schemas, labeled Schema S1, Schema S2 and Schema S, with these element lists]
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress
• Problems:
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself
  2. Similarity values may depend on the domain, e.g. salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How?
  1. Linguistic characterization (the preferred method)
  2. Constraint-based characterization, e.g. value ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used
• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
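Constraint-based characterization of instance data can be illustrated by summarizing each column by its kind and value range, then comparing the summaries. The function names and the overlap formula below are assumptions chosen for the sketch, not a method taken from the survey.

```python
def is_number(text):
    """Return True if the string parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def characterize(values):
    """Summarize a non-empty column of string values by kind and range."""
    if not values:
        raise ValueError("values must be non-empty")
    numeric = [float(v) for v in values if is_number(v)]
    if len(numeric) == len(values):
        return {"kind": "numeric", "min": min(numeric), "max": max(numeric)}
    lengths = [len(v) for v in values]  # for text, use the length range
    return {"kind": "text", "min": min(lengths), "max": max(lengths)}


def range_overlap(c1, c2):
    """Shared part of the two ranges divided by their combined span."""
    if c1["kind"] != c2["kind"]:
        return 0.0
    lo = max(c1["min"], c2["min"])
    hi = min(c1["max"], c2["max"])
    span = max(c1["max"], c2["max"]) - min(c1["min"], c2["min"])
    if span == 0:
        return 1.0
    return max(0.0, hi - lo) / span
```

A salary column and an income column with similar ranges get a high overlap, while a numeric column compared with a text column scores 0. Range overlap alone is weak evidence, which is one reason instance-level matching produces many irrelevant candidates.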

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - Element names, data types, structures, and subelements
  - E.g. two elements match if they have the same name and the same number of subelements

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey
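The example rule above (same name and same number of subelements) can be written directly as a predicate. The Element class and the function names are hypothetical, introduced only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A schema element with a name and a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_same_name_and_arity(e1, e2):
    """Hand-crafted rule: same name (ignoring case) and same number of subelements."""
    return e1.name.lower() == e2.name.lower() and len(e1.children) == len(e2.children)


def match_by_rule(schema1, schema2, rule=rule_same_name_and_arity):
    """Return every pair of element names for which the rule fires."""
    return [(e1.name, e2.name) for e1 in schema1 for e2 in schema2 if rule(e1, e2)]
```

The rule is cheap and needs no training data, but it is brittle: a Name element with no subelements does not match a Name element that has one.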

Learner Based Solutions

• Learner-based: exploit both schema and data
• Require a lot of training data but, unlike rule-based techniques, can exploit instance data
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria in one matcher. Better performance
  2. Composite: combines the results of independently executed matchers. More flexible; can be done automatically or manually

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
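A composite matcher can be sketched as a weighted average over independently executed matchers. The two toy matchers, the dictionary representation of elements, and the weights are assumptions for illustration; the survey does not prescribe a particular combination formula.

```python
def name_matcher(e1, e2):
    """Independent matcher 1: exact name equality, ignoring case."""
    return 1.0 if e1["name"].lower() == e2["name"].lower() else 0.0


def type_matcher(e1, e2):
    """Independent matcher 2: equality of declared data types."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite_score(e1, e2, weighted_matchers):
    """Combine the results of separately run matchers by weighted average."""
    total_weight = sum(weight for _, weight in weighted_matchers)
    return sum(weight * matcher(e1, e2) for matcher, weight in weighted_matchers) / total_weight
```

Because each matcher runs on its own, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the composite approach trades for the better performance of a hybrid matcher.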

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, from which it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input is required:
  - The user must provide application-specific match/mismatch relations
  - The user must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (e.g. to provide a new rule)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: Linguistic element-level matching
  - Categorizes elements based on names, data types, and domains
  - Calculates a linguistic similarity coefficient
Phase 2:
  - Transforms the original schema into a tree, then performs bottom-up structure matching
  - Calculates a structural similarity value
  - Calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - Uses the weighted mean from phase 2 to decide on a mapping

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
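The last two steps, forming a weighted mean and using it to pick a mapping, can be sketched as follows. This is a simplified illustration rather than Cupid's actual computation: the equal weighting, the 0.5 threshold, and the "best target per source element" selection are all assumptions.

```python
def weighted_similarity(linguistic, structural, w_struct=0.5):
    """Weighted mean of linguistic and structural similarity (weight is assumed)."""
    return w_struct * structural + (1 - w_struct) * linguistic


def select_mapping(candidates, threshold=0.5, w_struct=0.5):
    """Keep, for each source element, its best target scoring at or above the threshold.

    candidates maps (source, target) to a (linguistic, structural) score pair.
    """
    best = {}
    for (src, tgt), (lsim, ssim) in candidates.items():
        score = weighted_similarity(lsim, ssim, w_struct)
        if score >= threshold and score > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, score)
    return best
```

With equal weights, a pair such as Zip and PostalCode can still be selected on the strength of its structural similarity even though its linguistic similarity is low.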

Clio (IBM Almaden and Univ of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions that map data in the source schema to data in the target schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The metadata about an attribute is grouped into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithms

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

• Problem: the difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly
• MWSAF converts both models to a common representation format called SchemaGraph
• A SchemaGraph is a set of nodes connected by edges, created using conversion functions
• MWSAF then applies a matching algorithm to the two SchemaGraphs to find the mappings between them

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
        <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
      </xsd:sequence>
    </xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Nodes: Direction, compass, DirectionCompass, degrees; edge label: hasElement]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
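One way to picture such a conversion is to walk the complex type and emit a hasElement edge for each child element. The sketch below is not MWSAF's actual set of conversion rules, which the slide does not spell out; the function name, the edge-triple representation, the enclosing xsd:schema wrapper, and the namespace URI are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the xsd prefix; the slide does not show the declaration.
XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

XSD_TEXT = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def xsd_to_edges(xsd_text):
    """Emit (complex type, 'hasElement', element) triples for each child element."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(XSD_NS + "complexType"):
        for elem in ctype.iter(XSD_NS + "element"):
            edges.append((ctype.get("name"), "hasElement", elem.get("name")))
    return edges
```

Applied to the Direction type, this yields one hasElement edge to compass and one to degrees, which corresponds to the node-and-edge picture in the figure.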

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="#Speed"/>
    </daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the match score:
  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked
  - Schema match: structural similarity, based on sub-concept similarities
• The getBestMapping function then looks at the match scores and determines a map set

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
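The NGram idea, counting the q-grams two names have in common and normalizing to a value between 0 and 1, can be illustrated with a Dice coefficient over character trigrams. The choice of q = 3 and of the Dice normalization is an assumption; the slide does not say how MWSAF normalizes the count.

```python
def qgrams(name, q=3):
    """Return the set of lowercase character q-grams of a name."""
    text = name.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_score(n1, n2, q=3):
    """Dice coefficient over shared q-grams, a value between 0 and 1."""
    g1, g2 = qgrams(n1, q), qgrams(n2, q)
    if not g1 or not g2:
        return 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))
```

Names that differ only in a separator, such as windSpeed and wind_speed, still share most of their trigrams, while unrelated names such as compass and degrees share none.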

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• A machine learning alternative is available for categorization

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
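The slide does not give formulas for the two measures. One plausible reading, stated here purely as an assumption, averages the match scores over the mapped concepts for the first measure and over all concepts in the WSDL for the second, then assigns the service to the ontology domain with the highest service match. All function names below are hypothetical.

```python
def average_concept_match(mapping_scores):
    """Mean match score over the WSDL concepts that were mapped (assumed definition)."""
    if not mapping_scores:
        return 0.0
    return sum(mapping_scores.values()) / len(mapping_scores)


def average_service_match(mapping_scores, total_wsdl_concepts):
    """Sum of match scores divided by all WSDL concepts, mapped or not (assumed definition)."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(mapping_scores.values()) / total_wsdl_concepts


def categorize(scores_by_domain, total_wsdl_concepts):
    """Pick the domain whose ontology gives the highest average service match."""
    return max(
        scores_by_domain,
        key=lambda domain: average_service_match(scores_by_domain[domain], total_wsdl_concepts),
    )
```

Under this reading, an ontology that matches a few concepts very well scores high on concept match but low on service match, so only an ontology that covers most of the WSDL wins the categorization.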

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens when the matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches to matching:
  - Schema-level vs. instance-level
  - Element-level vs. structure-level
  - Language-based and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Christophides V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic element-level matching
 - categorizes elements based on name, data types, and domains
 - calculates a linguistic similarity coefficient
Phase 2:
 - transforms the original schema into a tree, then performs bottom-up structure matching
 - calculates a similarity value
 - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
 - uses the mean from Phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
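
The weighted mean of Phase 2 and the decision of Phase 3 can be sketched as follows. This is only an illustration under stated assumptions, not Cupid's implementation: the weight `w_struct`, the acceptance `threshold`, the function names, and the sample similarity values are all hypothetical.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def choose_mapping(pairs, w_struct=0.5, threshold=0.5):
    """For each source element, keep the best target whose weighted similarity reaches the threshold.

    pairs maps (source_element, target_element) -> (lsim, ssim).
    """
    best = {}
    for (src, tgt), (lsim, ssim) in pairs.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, wsim)
    return best


# Hypothetical similarity values (lsim, ssim) for three candidate pairs.
candidates = {
    ("Address", "CustAddress"): (0.8, 0.9),
    ("Address", "CPhone"): (0.1, 0.2),
    ("Zip", "PostalCode"): (0.4, 0.8),
}
mapping = choose_mapping(candidates)
```

With these made-up numbers, `Address` maps to `CustAddress` and `Zip` to `PostalCode`, while the `Address`/`CPhone` pair falls below the threshold and is dropped.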

Clio (IBM Almaden and Univ. of Toronto)

• Aims at the semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
 - Schema Readers: read a schema and translate it into an internal representation
 - Correspondence Engine: identifies matching parts of the schemas or databases
 - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
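
The two steps above (an initial name match, then structural propagation over the graphs) can be sketched roughly as below. This is a simplified illustration, not the published algorithm: it assumes uniform propagation weights, uses Python's `difflib.SequenceMatcher` as a stand-in name matcher, and runs a fixed number of iterations. The function names and the two sample graphs are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Initial element-level similarity from a simple string matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flood(edges1, edges2, iterations=10):
    """Propagate similarity between node pairs joined by same-labeled edges.

    edges1, edges2: lists of (source, label, target) triples, one list per schema graph.
    Returns a dict mapping (node1, node2) -> similarity in [0, 1].
    """
    nodes1 = {n for s, _, t in edges1 for n in (s, t)}
    nodes2 = {n for s, _, t in edges2 for n in (s, t)}
    initial = {(a, b): name_similarity(a, b) for a, b in product(nodes1, nodes2)}

    # Two node pairs are neighbors when same-labeled edges connect them in both graphs.
    neighbors = {pair: set() for pair in initial}
    for (s1, l1, t1), (s2, l2, t2) in product(edges1, edges2):
        if l1 == l2:
            neighbors[(s1, s2)].add((t1, t2))
            neighbors[(t1, t2)].add((s1, s2))

    sim = dict(initial)
    for _ in range(iterations):
        # Each pair keeps its own score and receives the scores of its neighbors.
        updated = {
            pair: initial[pair] + sim[pair] + sum(sim[n] for n in neighbors[pair])
            for pair in sim
        }
        top = max(updated.values()) or 1.0  # normalize so the best pair scores 1.0
        sim = {pair: value / top for pair, value in updated.items()}
    return sim


# Two tiny hypothetical schema graphs.
s1 = [("Cust", "hasElement", "CName"), ("Cust", "hasElement", "FirstName")]
s2 = [("Customer", "hasElement", "Company"), ("Customer", "hasElement", "FirstName")]
sim = flood(s1, s2)
```

In this example the pair (`Cust`, `Customer`) ends up with the highest score, because its children reinforce it, while (`Cust`, `Company`) has no structural support and stays low.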

Delta (MITRE)

• Uses attribute descriptions to determine attribute matches
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies.

It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: SchemaGraphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1"
                     name="compass" nillable="true"
                     type="xsd1:DirectionCompass"/>
        <xsd:element maxOccurs="1" minOccurs="1"
                     name="degrees" type="xsd:int"/>
      </xsd:sequence>
    </xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Nodes: Direction, compass, DirectionCompass, degrees; edge label: hasElement]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
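
The conversion illustrated on this slide can be approximated with a short script. This is a minimal sketch under assumptions, not MWSAF's actual conversion functions: it assumes that each named complexType becomes a node and that each element inside it becomes a child node connected by a `hasElement` edge (the edge label shown in the figure). The function name and the `xsd1` namespace URI are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The slide's fragment wrapped in a schema element; the xsd1 namespace URI is made up.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def xsd_to_schema_graph(xsd_text):
    """Return (nodes, edges); edges are (parent, "hasElement", child) triples."""
    root = ET.fromstring(xsd_text.strip())
    nodes, edges = set(), []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        if parent is None:  # anonymous types are skipped in this sketch
            continue
        nodes.add(parent)
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            child = elem.get("name")
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges


nodes, edges = xsd_to_schema_graph(SCHEMA)
```

The sketch ignores the element types (such as `DirectionCompass`), which the figure also shows as a node; a fuller version would add them as well.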

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="#Speed"/>
    </daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the match score:
 - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked
 - Schema match: structural similarity (sub-concept similarities)
• The getBestMapping function then looks at the match scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
 - NGram: considers the number of q-grams that the names have in common
 - CheckSynonym: uses WordNet to find synonyms
 - CheckAbbreviations: uses an abbreviation dictionary
 - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
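
As an illustration of the NGram measure, the sketch below scores two names by the q-grams they share. The slide does not give the exact formula, so this assumes character trigrams (q = 3) and the Dice coefficient, which conveniently yields a value between 0 and 1; the function names are hypothetical and this is not MWSAF's code.

```python
def qgrams(name: str, q: int = 3) -> set:
    """Set of lowercase character q-grams; names shorter than q yield the whole name."""
    text = name.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over shared q-grams: 2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


same = ngram_similarity("windSpeed", "windSpeed")      # identical names
close = ngram_similarity("windSpeed", "wind_speed")    # 5 shared trigrams out of 7 and 8
unrelated = ngram_similarity("degrees", "compass")     # no shared trigrams
```

A measure of this kind tolerates small spelling differences but knows nothing about meaning, which is why the slide pairs it with the synonym and abbreviation checks.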

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology. Two measures are then derived from each mapping:
 - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
 - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
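
The slide names the two measures without giving formulas. The sketch below assumes one plausible reading: the average concept match divides the summed match scores by the number of mapped elements, while the average service match divides the same sum by the total number of WSDL elements, so that an ontology covering more of the service scores higher. The function names, the formulas, and the sample scores are assumptions for illustration.

```python
def average_concept_match(match_scores):
    """Mean score over the WSDL elements that were actually mapped."""
    return sum(match_scores.values()) / len(match_scores) if match_scores else 0.0


def average_service_match(match_scores, total_wsdl_elements):
    """Sum of mapped scores divided by all WSDL elements, mapped or not."""
    if not total_wsdl_elements:
        return 0.0
    return sum(match_scores.values()) / total_wsdl_elements


def categorize(mappings_by_ontology, total_wsdl_elements):
    """Pick the ontology (domain) with the highest average service match."""
    return max(
        mappings_by_ontology,
        key=lambda name: average_service_match(
            mappings_by_ontology[name], total_wsdl_elements
        ),
    )


# Hypothetical match scores for a WSDL with four elements, against two ontologies.
mappings = {
    "weather": {"windSpeed": 0.9, "windDirection": 0.8, "degrees": 0.7},
    "geography": {"degrees": 0.6},
}
best_domain = categorize(mappings, total_wsdl_elements=4)
```

Here the made-up weather ontology covers three of the four elements and is chosen as the service's domain.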

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches for matching:
 - Schema- vs. instance-level
 - Element- vs. structure-level
 - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Vassilis, C.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions?


QuestionsQuestions

Page 6: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

How to Deal with Semantic Heterogeneity

1. Standardize: agree on a common representation
2. Translate: create mappings between different schemas
  - requires human input and machine reasoning
  - mappings can be difficult and expensive
3. Annotate: create relationships between agreed-upon conceptualizations
  - requires human input and machine reasoning
  - annotation can be difficult and expensive

ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Challenges

• The actual semantics of the involved elements are typically available only from the creators or the documentation, so we must use clues in the schema and data instead
• These clues are often misleading
  - e.g. 'Area' can refer to different entities
  - e.g. the same entities can have very different names
• Clues are often ambiguous
  - e.g. 'Contact-agent': agent name or phone number?
• The matching process can be very costly
  - each element of the schema must be examined to ensure discovery of the best match
• Matching is often subjective, depending on the application

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Where Is Schema Matching Used?

• Database Application Domains
  - Data Integration
  - Data Warehousing
  - E-Business
  - Query Processing
• Semantic Web
  - XML/HTML to an Ontology
  - Semantic Web Services

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Integration

Problem: Construct a global view from a set of independently constructed schemas (e.g. ontologies)
  - Different structures and terminologies
Solution: Schema matching is performed to find relationships between the concepts in each schema. The matching elements can then be unified.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Data Warehouses

Problem: Integrating data sources into a data warehouse
  - Different formats between the source and the warehouse
Solution: Use matching to find the elements of the source that are also present in the warehouse. The details of the semantics can then be examined to integrate the two.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

E-Commerce

Problem: Message translation
  - Each trading partner uses its own message format
Solution: A match operation would reduce the amount of manual work needed to specify how the formats are related

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Query Processing

Problem: The terms used in the user's query may be different from those in the database
Solution: Matching is used to map the user-specified concepts in the query to schema elements

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any form suitable for the Semantic Web
• We must annotate them with concepts from ontologies
• Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web Services

• Problem: Web services are currently searched for using keywords
• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently
• WSDLs are in XML; ontologies are in OWL
• Solution: Use schema matching approaches to map between the two different schemas


Term Definitions

• Schema: a set of elements connected by some structure
• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements of S2
• Mapping Expression: tells how the S1 and S2 elements are related

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C                CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
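
The three mapping elements above can be read as a small data structure: each S2 element is paired with an expression over an S1 record. The following is a minimal sketch under that reading. The sample record, the `apply_mapping` helper, and the choice to concatenate with a single space are assumptions made for illustration, not part of the survey.

```python
# Hypothetical S1 record for the Cust element (values invented for illustration).
s1_record = {"C": 17, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}

# Each S2 (Customer) element is paired with a mapping expression over S1.
mapping = {
    "CustID": lambda cust: cust["C"],
    # Concatenate(FirstName, LastName); the space separator is an assumption.
    "Contact": lambda cust: f'{cust["FirstName"]} {cust["LastName"]}',
    "Company": lambda cust: cust["CName"],
}


def apply_mapping(mapping, record):
    """Translate one S1 record into an S2 record by evaluating each expression."""
    return {target: expression(record) for target, expression in mapping.items()}


customer = apply_mapping(mapping, s1_record)
```

Note that Customer.Phone has no mapping element, so it is simply absent from the translated record.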

Example

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information
• Element vs. Structure: a match can be performed for individual schema elements or for combinations of elements
• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships)
• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another
• Auxiliary Information: the matcher relies on information besides the input schemas, such as dictionaries, user input, or global schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers: Manual Composition, Automatic Composition

Further Criteria: Match Cardinality, Auxiliary Information Used, ...

Sample Approaches: Name Similarity, Description Similarity, Global Namespaces, Word Frequency, Group Matching, Type Similarity, Key Properties, Value Pattern and Ranges

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema-Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure
• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2
• The match comparison can be based on the name, description, or data type of the element
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
  - Full structure match
  - Partial structure match
• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Full structure match:

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

Partial structure match:

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
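
The difference between a full and a partial structure match can be made concrete by scoring how many sub-elements of one structure have a counterpart in the other. This is a rough sketch only: the `structure_match` helper and the given element-level correspondences are assumptions for illustration, and the survey does not prescribe this scoring rule.

```python
def structure_match(s1_children, s2_children, element_matches):
    """Return the fraction of S1 sub-elements that have a matching S2 sub-element."""
    if not s1_children:
        return 0.0
    matched = sum(
        1
        for child in s1_children
        if any((child, other) in element_matches for other in s2_children)
    )
    return matched / len(s1_children)


# Element-level correspondences are assumed to be known already,
# for example as the output of a name matcher.
element_matches = {
    ("Street", "Street"), ("City", "City"),
    ("State", "USState"), ("Zip", "PostalCode"),
    ("Name", "Cname"), ("Address", "CAddress"),
}

full = structure_match(
    ["Street", "City", "State", "Zip"],
    ["Street", "City", "USState", "PostalCode"],
    element_matches,
)
partial = structure_match(
    ["Name", "Address", "Birthdate", "TaxExempt"],
    ["Cname", "CAddress", "CPhone"],
    element_matches,
)
```

Under these assumed correspondences, Address/CustAddress scores 1.0 (every component matches), while AccountOwner/Customer scores 0.5 because Birthdate and TaxExempt have no counterpart.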

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinality      S1 Element(s)       S2 Element(s)         Matching Expression
1:1, element level           Price               Amount                Amount = Price
n:1, element level           Price, Tax          Cost                  Cost = Price*(1+Tax/100)
1:n, element level           Name                FirstName, LastName   FirstName, LastName = Name
n:m, element level           B.Title, B.PuNo,    A.Book, A.Publisher   A.Book, A.Publisher =
(also n:1, structure level)  P.PuNo, P.Name                              Select B.Title, P.Name From B, P
                                                                         Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
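
Two of the cardinality cases in the table can be written as small functions to show what a matching expression computes. This is an illustrative sketch: the function names are invented, and splitting Name at the first space is an assumption, since the slide does not say how Name is decomposed.

```python
def cost_from_price_and_tax(price, tax_percent):
    """n:1 case: Cost = Price * (1 + Tax / 100)."""
    return price * (1 + tax_percent / 100)


def split_name(name):
    """1:n case: derive FirstName and LastName from Name.

    Assumes the name has the form 'First Last' and splits at the first space.
    """
    first, _, last = name.partition(" ")
    return first, last
```

For example, a Price of 100 with a Tax of 25 gives a Cost of 125.0, and "Ada Lovelace" splits into ("Ada", "Lovelace").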

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema
• Little work has been done on complex matching
  - some approaches hard-code complex matches into rules
  - some rely on a domain-specific ontology
• We need domain knowledge to perform complex matching accurately
• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, Soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element-level or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
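
The similarity criteria listed above can be tried in order, from strictest to loosest. The sketch below covers criteria 1, 2, 3, and 5 only. The crude suffix stripper stands in for a real stemmer, the one-entry synonym table stands in for a thesaurus, and the scores 0.9 and 0.8 are arbitrary placeholders; none of these come from the survey.

```python
# Hypothetical stand-in for a thesaurus lookup.
SYNONYMS = {frozenset({"car", "automobile"})}


def crude_stem(name):
    """Lowercase and strip a plural 's' or an 'ing' suffix (not a real stemmer)."""
    name = name.lower()
    if name.endswith("ing") and len(name) > 5:
        return name[:-3]
    if name.endswith("s") and not name.endswith("ss") and len(name) > 3:
        return name[:-1]
    return name


def common_substring_ratio(a, b):
    """Longest common substring length divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    best = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            if j - i > best and a[i:j] in b:
                best = j - i
    return best / max(len(a), len(b))


def name_similarity(a, b):
    """Return a score in [0, 1] using the first criterion that applies."""
    if a == b:
        return 1.0  # 1. equality of names
    if crude_stem(a) == crude_stem(b):
        return 0.9  # 2. equality after stemming (placeholder score)
    if frozenset({a.lower(), b.lower()}) in SYNONYMS:
        return 0.8  # 3. equality of synonyms (placeholder score)
    return common_substring_ratio(a, b)  # 5. common substrings
```

With this sketch, ShipTo and Ship2 share the substring "ship", so they score 4/6 through the last criterion.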

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn // employee name
  S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Constraint-Based

• Schemas often contain constraints that define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - e.g. in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example
  [Figure: three schemas, labeled S1, S2, and S]
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress
• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem in itself
  2. Similarity values may depend on the domain, e.g. salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance-Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g. ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used
• Main problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Rule-Based Solutions

• Rule-Based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g. two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
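
The example rule above is simple enough to state directly in code. This is a minimal sketch: the `Element` class and the sample elements are hypothetical, and a real rule-based matcher would combine many such rules.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(e1, e2):
    """Rule from the slide: same name and same number of subelements."""
    return e1.name == e2.name and len(e1.children) == len(e2.children)


a = Element("Address", [Element("Street"), Element("City")])
b = Element("Address", [Element("Road"), Element("Town")])
c = Element("Address", [Element("Street")])
```

Here `a` and `b` match even though their subelements are named differently, which shows how coarse a single hand-crafted rule can be. `a` and `c` do not match because the subelement counts differ.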

Learner-Based Solutions

• Learner-Based: exploit both schema and data
• Require a lot of training data, but can exploit data
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
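
A composite matcher runs several matchers independently and then combines their results. One simple way to combine them, assumed here purely for illustration, is a weighted average of the individual scores. The two toy matchers and the weights below are hypothetical.

```python
def name_equal(a, b):
    """Toy matcher: 1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix_overlap(a, b):
    """Toy matcher: shared prefix length divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def composite(matchers, weights, a, b):
    """Weighted average of the scores of independently executed matchers."""
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


score = composite([name_equal, prefix_overlap], [1, 1], "CustID", "CustomerID")
```

Because each matcher is a plain function, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to the composite method.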


LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, and it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input is required:
  - the user must provide application-specific match/mismatch relations
  - the user must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e. to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs a bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
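
The weighted mean in phase 2 and the decision in phase 3 can be sketched as follows. The weight `w_struct`, the acceptance threshold, and the sample similarity values are assumptions for illustration. The slide states only that a weighted mean of linguistic and structural similarity is used to decide on a mapping.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def choose_mapping(candidates, threshold=0.5, w_struct=0.5):
    """Keep the pairs whose weighted similarity reaches the threshold.

    Each candidate is a tuple (s1_element, s2_element, lsim, ssim).
    """
    return [
        (a, b)
        for a, b, lsim, ssim in candidates
        if weighted_similarity(lsim, ssim, w_struct) >= threshold
    ]


candidates = [
    ("Price", "Amount", 1.0, 0.5),   # hypothetical similarity values
    ("Price", "Phone", 0.25, 0.25),
]
mapping = choose_mapping(candidates)
```

Raising `w_struct` makes the result depend more on how well the surrounding structure agrees and less on the names alone.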

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes a definition of the old schema and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching


MWSAF: METEOR-S Web Service Annotation Framework
LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

Problem: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

• MWSAF converts both models to a common representation format called SchemaGraph
• A SchemaGraph is a set of nodes connected by edges, created using conversion functions
• MWSAF then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework
XML to SchemaGraph Conversion Rules

  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1"
                   name="compass" nillable="true"
                   type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1"
                   name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>

[Figure: SchemaNode representation of XML schema. Node labels: Direction, compass, degrees, DirectionCompass; edge label: hasElement]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
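
One possible reading of the conversion shown on this slide is that each complexType becomes a node and each element declared inside it becomes a node linked to it by a hasElement edge. The sketch below implements only that reading with Python's standard `xml.etree.ElementTree`. The wrapping `xsd:schema` element, the `urn:example:weather` namespace, and the edge-triple format are assumptions added so the fragment parses. MWSAF's actual conversion functions are not given in the slides.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# The slide's fragment, wrapped in a schema element so that it is well-formed.
# The xsd1 namespace URI is a placeholder.
XSD_TEXT = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees"
                   type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def schema_graph_edges(xsd_text):
    """Return (complexType name, 'hasElement', element name) triples."""
    root = ET.fromstring(xsd_text)
    edges = []
    for complex_type in root.iter(XSD_NS + "complexType"):
        for element in complex_type.iter(XSD_NS + "element"):
            edges.append((complex_type.get("name"), "hasElement", element.get("name")))
    return edges


edges = schema_graph_edges(XSD_TEXT)
```

The same edge-triple form could hold hasProperty edges from an ontology, which is what would let a single matching algorithm run over both models.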

MWSAF: METEOR-S Web Service Annotation Framework
Ontology to SchemaGraph Conversion Rules

  <daml:Class rdf:ID="WindEvent">
    <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
    <rdfs:label>Wind event</rdfs:label>
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:label>Wind direction</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:label>Wind speed</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Node labels: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:
  - Element-Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema Match: structural similarity (sub-concept similarities)
• The getBestMapping function then looks at the Match Scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
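
The slides do not describe how getBestMapping chooses the map set. One plausible strategy, assumed here only for illustration, is to keep for each WSDL element the ontology concept with the highest Match Score, provided that score reaches a threshold. The function name `get_best_mapping`, the threshold, and the sample scores are all hypothetical.

```python
def get_best_mapping(match_scores, threshold=0.5):
    """For each source element, keep the highest-scoring target at or above the threshold.

    match_scores maps (source_element, target_concept) pairs to scores in [0, 1].
    """
    best = {}
    for (source, target), score in match_scores.items():
        current_score = best.get(source, (None, -1.0))[1]
        if score >= threshold and score > current_score:
            best[source] = (target, score)
    return {source: target for source, (target, _) in best.items()}


# Hypothetical Match Scores between WSDL elements and ontology concepts.
match_scores = {
    ("degrees", "windDirection"): 0.25,
    ("degrees", "Speed"): 0.125,
    ("compass", "windDirection"): 0.75,
    ("compass", "WindEvent"): 0.5,
}
map_set = get_best_mapping(match_scores)
```

In this sketch an element with no sufficiently good candidate, such as degrees above, is left unmapped rather than forced onto a weak match.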

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
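
As an illustration of the NGram idea, the sketch below counts the q-grams two names share and normalizes the count to a value between 0 and 1 with the Dice coefficient. The normalization, the default q of 3, and the function names are assumptions. The slides say only that the algorithm considers the number of common q-grams and returns a value between 0 and 1.

```python
def qgrams(name, q=3):
    """Return the set of lowercase substrings of length q in name."""
    text = name.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over the q-gram sets of a and b, in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For example, windSpeed has seven 3-grams and Speed has three, all of which are shared, so their similarity is 2 * 3 / (7 + 3) = 0.6.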

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from the mapping:
  - Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
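
The slides name the two measures but do not give their formulas. The sketch below assumes that Average Concept Match averages the match scores over the mapped concepts only, and that Average Service Match divides the same total by the number of all concepts in the WSDL, so that an ontology covering more of the service scores higher. Both definitions, the function names, and the sample scores are assumptions for illustration.

```python
def average_concept_match(scores):
    """Mean match score over the WSDL concepts that were actually mapped."""
    return sum(scores.values()) / len(scores) if scores else 0.0


def average_service_match(scores, total_wsdl_concepts):
    """Total match score divided by the number of all concepts in the WSDL."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(scores.values()) / total_wsdl_concepts


def best_ontology(mappings, total_wsdl_concepts):
    """Pick the ontology with the highest average service match."""
    return max(
        mappings,
        key=lambda name: average_service_match(mappings[name], total_wsdl_concepts),
    )


# Hypothetical mappings of one WSDL (4 concepts) against two ontologies.
mappings = {
    "weather": {"windSpeed": 1.0, "windDirection": 0.5},
    "geo": {"location": 0.5},
}
category = best_ontology(mappings, total_wsdl_concepts=4)
```

With these sample scores the weather ontology has an average concept match of 0.75 and an average service match of 0.375, and it is chosen as the category.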


Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback
• Real-World Analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one of them changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches to matching:
  - Schema vs. instance level
  - Element vs. structure level
  - Language-based and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Christophides, V.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions?

Page 7: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

ChallengesChallengesbull Actual semantics of the involved elements are typically only from the

creators or documentation ndash so we must use clues in the schema and data instead

bull These clues are often misleading bull Ie lsquoArearsquo can refer to different entitiesbull Ie The same entities can have very different names

bull Clues are often ambiguousbull Ie lsquoContact-agentrsquo Agent name or phone number

bull Matching process can be very costlybull Each element of the schema must be examined to ensure discovery of

the best match

bull Matching is often subjective depending on the application

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Where is Schema Matching Where is Schema Matching usedused

bull Database Application Domains- Data Integration- Data Warehousing- E-Business- Query Processing

bull Semantic Web- XMLHTML to an Ontology- Semantic Web Services

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema IntegrationSchema Integration

Problem Construct a global view from a set of independently constructed schemas

(ie ontologies)

- Different structure and terminologies

Solution Schema Matching is performed to find relationships between concepts in each schema Then the matching elements can be unified

Bernstein P Rahm E A survey of approaches to automatic schema matching

Data WarehousesData Warehouses

Problem Integrating data sources into a data warehouse

- Different formats between the source and warehouse

Solution Use matching to find the elements of the source that are also present in the warehouse Then the details of the semantics can be examined to integrate the two

Bernstein P Rahm E A survey of approaches to automatic schema matching

E-CommerceE-Commerce

Problem Message translation

-Each trading partner uses its own message format

Solution A match operation would reduce the amount of manual work to specify how the formats are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Query ProcessingQuery Processing

Problem The terms used in the userrsquos query may be different from those in the database

Solution Matching is used to map the user-specified concepts in the query to schema elements

Bernstein P Rahm E A survey of approaches to automatic schema matching

Need for Data Integration on the Need for Data Integration on the Semantic WebSemantic Web

bull Problem Web documents are not in RDF or any form suitable for the SW

bull We must annotate them with concepts from ontologies

bull Solution Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information.
• Element vs. Structure: a match can be performed for individual schema elements or for combinations of elements.
• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships).
• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another.
• Auxiliary Information: the matcher relies on information besides the input schemas, such as dictionaries, user input, and global schemas.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers: Manual Composition, Automatic Composition

Further Criteria: Match Cardinality, Auxiliary information used, …

Sample Approaches:
• Name Similarity
• Description Similarity
• Global Namespaces
• Word Frequency
• Group Matching
• Type Similarity
• Key Properties
• Value Pattern and Ranges

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure.
• Often produce multiple candidates and estimate a degree of similarity for each.

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2.
• The match comparison can be based on the name, description, or data type of the element.
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2.
• Full Structure Match
• Partial Structure Match
• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library.

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements.
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinalities      S1 Element(s)                      S2 Element(s)          Matching Expression
1:1, element level             Price                              Amount                 Amount = Price
n:1, element level             Price, Tax                         Cost                   Cost = Price * (1 + Tax/100)
1:n, element level             Name                               FirstName, LastName    FirstName, LastName = Name
n:m, element level             B.Title, B.PuNo, P.PuNo, P.Name    A.Book, A.Publisher    A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
(also n:1, structure level)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
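The cardinality cases in the table can be sketched in a few lines of Python. This is only an illustration: `local_cardinality` and `cost` are hypothetical helper names, and the classification simply counts the elements on each side of one mapping element.

```python
def local_cardinality(s1_elements: list[str], s2_elements: list[str]) -> str:
    """Classify one mapping element as 1:1, n:1, 1:n, or n:m by counting elements."""
    left = "1" if len(s1_elements) == 1 else "n"
    if len(s2_elements) == 1:
        right = "1"
    else:
        # Use "m" on the right only when the left side is already "n".
        right = "m" if left == "n" else "n"
    return f"{left}:{right}"


def cost(price: float, tax_percent: float) -> float:
    """The n:1 matching expression from the table: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)


if __name__ == "__main__":
    print(local_cardinality(["Price"], ["Amount"]))
    print(local_cardinality(["Price", "Tax"], ["Cost"]))
    print(local_cardinality(["Name"], ["FirstName", "LastName"]))
    print(cost(200.0, 25.0))
```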

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema.
• Little work has been done on complex matching.
• Some approaches hard-code complex matches into rules.
• Some rely on a domain-specific ontology.
• We need domain knowledge to perform complex matching accurately.
• The best match isn't always the top match returned by the matcher, so human involvement is still needed.

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements.
• Name Matching: match elements with similar names.
• Description Matching: match comments in the schemas.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names.
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level.
• Cardinality is not limited to 1:1.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
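A few of the similarity criteria listed above (equality, synonyms, common substrings) can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus, and `difflib.SequenceMatcher` is used here as one possible common-substring measure, not as the method of any particular system.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus or dictionary.
SYNONYMS = {
    frozenset({"car", "automobile"}),
    frozenset({"zip", "postalcode"}),
}


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two element names."""
    x, y = a.lower(), b.lower()
    if x == y:                          # criterion 1: equality of names
        return 1.0
    if frozenset({x, y}) in SYNONYMS:   # criterion 3: equality of synonyms
        return 1.0
    # criterion 5 (partially): similarity based on common substrings
    return SequenceMatcher(None, x, y).ratio()


if __name__ == "__main__":
    for pair in [("Zip", "ZIP"), ("Zip", "PostalCode"), ("ShipTo", "Ship2")]:
        print(pair, round(name_similarity(*pair), 2))
```

Stemming, hypernyms, soundex, and user-provided matches are omitted to keep the sketch short.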

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements.
• Example:
  S1: empn // employee name
  S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings.
• Many schemas are very similar to each other and to previously matched schemas.
  E.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields).
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example (figure with three schemas, labeled S1, S2, and S):
  Purchase-order: Product; BillTo (Name, Address); ShipTo (Name, Address); ContactPhone
  Purchase-order: Product; BillTo (Name, Address); ShipTo (Name, Address); Contact (Name, Address)
  POrder: Article; Payee; BillAddress; Recipient; ShipAddress

• Problems:
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself.
  2. Similarity values may depend on the domain, e.g., Salary and Income may be identical in a payroll application but not in a tax reporting application.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance Level Approaches

• Why:
  1. Little or no schema information is available.
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements.
  3. To match instance-level data.
• How:
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques
• Main Problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-Based: hand-crafted rules to exploit schema information
• Element names, data types, structures, and subelements
• E.g., two elements match if they have the same name and the same number of subelements.

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
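The example rule on this slide (same name and same number of subelements) is simple enough to write down directly. The following Python sketch is an illustration only; the `Element` class and `rule_match` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A schema element with a name and an ordered list of subelements."""
    name: str
    subelements: list["Element"] = field(default_factory=list)


def rule_match(a: Element, b: Element) -> bool:
    """Hand-crafted rule: same name and same number of subelements."""
    return a.name == b.name and len(a.subelements) == len(b.subelements)


if __name__ == "__main__":
    s1_address = Element("Address", [Element("Street"), Element("City")])
    s2_address = Element("Address", [Element("Street"), Element("Town")])
    s2_short = Element("Address", [Element("Street")])
    print(rule_match(s1_address, s2_address))
    print(rule_match(s1_address, s2_short))
```

A rule like this is cheap to evaluate and easy to explain, but it only inspects schema information, which is the limitation the learner-based techniques aim to address.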

Learner Based Solutions

• Learner-Based: exploit both schema and data.
• Require a lot of training data, but can exploit data instances.
• Rule-based and learner-based techniques combined provide an effective matching solution.

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy.
• More match candidates will be produced if the previous approaches are combined.
• Two Combination Methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
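The composite method can be illustrated as a weighted average over independently executed matchers. This Python sketch is one possible combination scheme under stated assumptions: the two toy matchers (`exact_name`, `common_prefix`), the weights, and the `composite` function are hypothetical and not taken from the survey.

```python
from typing import Callable

# A matcher maps two element names to a similarity score in [0, 1].
Matcher = Callable[[str, str], float]


def exact_name(a: str, b: str) -> float:
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def common_prefix(a: str, b: str) -> float:
    """Length of the shared prefix divided by the longer name's length."""
    x, y = a.lower(), b.lower()
    shared = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        shared += 1
    return shared / max(len(x), len(y), 1)


def composite(matchers: list[tuple[Matcher, float]], a: str, b: str) -> float:
    """Combine independently computed scores as a weighted average."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(m(a, b) * weight for m, weight in matchers) / total_weight


if __name__ == "__main__":
    matchers = [(exact_name, 1.0), (common_prefix, 1.0)]
    print(composite(matchers, "City", "city"))
    print(composite(matchers, "ShipTo", "ShipAddress"))
```

Because each matcher runs independently, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to composite matchers.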

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema.
• Uses a name matcher and several instance-level matchers.
• The system is trained with sample user inputs, and it learns patterns and matching rules.
• Mostly instance-oriented, but can use schema information too.
• Also supports user-supplied domain constraints on the global schema.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies.
• User input required: the user must provide application-specific match/mismatch relations. The user must approve or reject matches.
• SKAT matching is used within the ONION architecture for ontology integration.
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances.
• Schemas are transformed into labeled graphs.
• Matching is performed node by node (element-level, 1:1), starting at the top.
• Requires user intervention if no match is found (i.e., to provide a new rule).

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in.
• These pairs are given a match score between 0 and 1.
• The user must specify synonyms, homonyms, and inclusion properties.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on name, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs a bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
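The weighted mean of phase 2 and the decision of phase 3 can be sketched as follows. This is a simplified illustration, not Cupid's actual implementation: the weight `w_struct`, the acceptance `threshold`, and the function names are assumptions, and the linguistic and structural scores are taken as given inputs.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1 - w_struct) * lsim


def choose_mapping(
    scores: dict[tuple[str, str], tuple[float, float]],
    w_struct: float = 0.5,
    threshold: float = 0.5,
) -> dict[str, str]:
    """For each S1 element, keep the best-scoring S2 element at or above the threshold.

    `scores` maps (s1_element, s2_element) to a (lsim, ssim) pair.
    """
    best: dict[str, tuple[str, float]] = {}
    for (s1, s2), (lsim, ssim) in scores.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > best.get(s1, ("", 0.0))[1]:
            best[s1] = (s2, wsim)
    return {s1: s2 for s1, (s2, _) in best.items()}


if __name__ == "__main__":
    scores = {
        ("Zip", "PostalCode"): (0.9, 0.7),
        ("Zip", "City"): (0.1, 0.2),
    }
    print(choose_mapping(scores))
```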

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema.
• Three Components:
  - Schema Readers: read a schema and translate it into an internal representation.
  - Correspondence Engine: used to identify matching parts of the schemas or databases.
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs.
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches.
• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution.
• Takes definitions of the old and new schemas and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework
LSDIS Lab, UGA

• What is it?
  A tool for semi-automatically marking up web service descriptions with ontologies.
  It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain.
  2. The WSDL is classified into a domain.
  3. The matches are given to the user to accept or reject.
  4. Upon the user's acceptance, the annotations are written to the WSDL.
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main Components of the System:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs.
3. Matcher Library: provides the schema matching algorithm.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness of XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions.

It then applies a matching algorithm to find the mappings between them.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework
XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of XML schema. Nodes: Direction, degrees, DirectionCompass; edge labels: hasElement, compass.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
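A conversion from the XML Schema fragment above to graph edges might be sketched as follows. This is a guess at the rules suggested by the figure labels, not the MWSAF implementation: it assumes that an element of a built-in `xsd:` type becomes a node reached by a `hasElement` edge, and that an element of any other type becomes an edge, labeled with the element name, to a node for that type. The function name `schema_graph_edges`, the wrapping `xsd:schema` element, and the `xsd1` namespace URI are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# The complexType from the slide, wrapped in a schema element so it parses.
# The xsd1 namespace URI is a placeholder.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/types">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def schema_graph_edges(xsd_text: str) -> list[tuple[str, str, str]]:
    """Return (source node, edge label, target node) triples for each complexType."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(f"{{{XSD}}}complexType"):
        owner = ctype.get("name")
        for elem in ctype.iter(f"{{{XSD}}}element"):
            name = elem.get("name")
            # Simplification: compare the literal prefix instead of resolving the QName.
            prefix, _, local = elem.get("type", "").rpartition(":")
            if prefix == "xsd":
                edges.append((owner, "hasElement", name))
            else:
                edges.append((owner, name, local))
    return edges


if __name__ == "__main__":
    for edge in schema_graph_edges(SCHEMA):
        print(edge)
```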

MWSAF: METEOR-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Nodes: WindEvent, windDirection, Speed; edge labels: hasProperty, windSpeed.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the Match Score:
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked.
  - Schema Match: structural similarity (sub-concept similarities).
• The getBestMapping function then looks at the match scores and determines a map set.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common.
  - CheckSynonym: uses WordNet to find synonyms.
  - CheckAbbreviations: uses an abbreviation dictionary.
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques.
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
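The idea behind the NGram algorithm, counting the q-grams two names have in common and returning a value between 0 and 1, can be sketched as below. This is not the MWSAF code: the Dice coefficient over q-gram sets, the default `q = 3`, and the function names are assumptions chosen for illustration.

```python
def qgrams(name: str, q: int = 3) -> set[str]:
    """Return the set of lower-cased substrings of length q in a name."""
    s = name.lower()
    if len(s) < q:
        # Names shorter than q contribute themselves as a single gram.
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over the two q-gram sets, a value in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(ngram_similarity("windSpeed", "windSpeed"))
    print(ngram_similarity("windSpeed", "windDirection"))
    print(ngram_similarity("wind", "xyz"))
```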

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

Two measures are then derived from the mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology.
  - Average Service Match: helps to categorize the service.

We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback.
• Real World Analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process.
• Schema matching is very difficult and expensive.
• We have looked at a taxonomy and the descriptions of the existing approaches for matching:
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques.

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Vassilis, C.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Solution Matching is used to map the user-specified concepts in the query to schema elements

Bernstein P Rahm E A survey of approaches to automatic schema matching

Need for Data Integration on the Need for Data Integration on the Semantic WebSemantic Web

bull Problem Web documents are not in RDF or any form suitable for the SW

bull We must annotate them with concepts from ontologies

bull Solution Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

bull Instance vs Schema matching approaches can consider instance data or schema-level information

bull Element vs Structure matching match can be performed for individual schema elements or combinations of elements

bull Language vs Constraint linguistic (names) or constraint-based (keys and relationships)

bull Matching Cardinality match result may relate one or more elements of one schema to one or more elements of another

bull Auxiliary Information matcher relies on other information besides the input schemas such as dictionaries user input global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

Schema Matching Approaches

Individual Matchers Combining Matchers

Schema-only

Structure LevelElement Level

InstanceContents

ConstraintLinguistic Constraint

hellip hellip hellip

Element Level

ConstraintLinguistic

hellip hellip

Hybrid Matchers Composite Matchers

Manual Composition Automatic Composition

Further Criteria -Match Cardinality -Auxiliary information usedhellip

bullName SimilaritybullDescription SimilaritybullGlobal Namespaces

bullWord Frequency

bullGroup Matching

bullType SimilaritybullKey Properties

bullValue Pattern and Ranges

Sample Approaches

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level MatchersSchema Level Matchers

bull Consider schema information instead of instance data Name Description Data Type Relationship Types Constraints Structure

bull Often produces multiple candidates and estimates a degree of similarity for each

1 Granularity of match (element level vs structure level)2 Match Cardinality3 Linguistic Approaches Name or Description Matching4 Constraint-Based Approaches5 Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g. two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
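The example rule on this slide is simple enough to write down directly. The following is a minimal sketch; the Element class is a hypothetical stand-in for a schema element, and comparing names case-insensitively is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus its subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(a, b):
    """Match if names are equal (ignoring case, an assumption) and the
    two elements have the same number of subelements."""
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)


addr1 = Element("Address", [Element("Street"), Element("City")])
addr2 = Element("address", [Element("Street"), Element("Town")])
addr3 = Element("Address", [Element("Street")])
```

Here rule_match(addr1, addr2) holds, while addr3 is rejected because it has a different number of subelements. The rule is cheap to evaluate but easy to fool, which is one reason rules are usually combined with other evidence.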

Learner Based Solutions

• Learner-based: exploit both schema and data

• Requires a lot of training data but, unlike rule-based matchers, can exploit the data instances

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy.

• More match candidates will be produced if the previous approaches are combined.

• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
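A composite matcher can be sketched as a function that merges the score tables produced by matchers that ran independently. The weighted average used below is only one possible combination strategy and is an assumption here, as are the function name and the sample scores.

```python
def combine_scores(score_tables, weights):
    """Weighted average of per-pair scores from independently run matchers.

    score_tables: list of dicts mapping (s1_elem, s2_elem) -> score in [0, 1]
    weights:      one non-negative weight per table
    A pair missing from a table is treated as score 0.0 for that matcher.
    """
    total = float(sum(weights))
    pairs = set()
    for table in score_tables:
        pairs.update(table)
    return {
        pair: sum(w * t.get(pair, 0.0) for t, w in zip(score_tables, weights)) / total
        for pair in pairs
    }


# Illustrative outputs of a name matcher and a data-type matcher.
name_scores = {("Zip", "PostalCode"): 0.2, ("City", "City"): 1.0}
type_scores = {("Zip", "PostalCode"): 0.8, ("City", "City"): 1.0}

combined = combine_scores([name_scores, type_scores], [1, 1])
```

Because each matcher is a separate component, one can be added or reweighted without touching the others, which is the flexibility the slide attributes to the composite approach.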


LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, from which it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input is required: the user must provide application-specific match/mismatch relations and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e. to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2: structure matching
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3
  - uses the weighted mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
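The weighted mean in phase 2 and the decision in phase 3 can be sketched as follows. This is a simplified illustration, not Cupid's implementation: the weight w_struct, the acceptance threshold, and the sample similarity values are all assumed for the example.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def select_pairs(pairs, threshold=0.5, w_struct=0.5):
    """Keep element pairs whose weighted similarity reaches the threshold.

    pairs maps (s1_elem, s2_elem) -> (lsim, ssim).
    """
    return sorted(
        pair
        for pair, (lsim, ssim) in pairs.items()
        if weighted_similarity(lsim, ssim, w_struct) >= threshold
    )


# Illustrative similarity values only.
candidates = {
    ("Zip", "PostalCode"): (0.4, 0.9),
    ("Zip", "City"): (0.1, 0.3),
}
accepted = select_pairs(candidates)
```

With equal weights, ("Zip", "PostalCode") is accepted on the strength of its structural similarity even though its names differ, which is the point of mixing the two measures.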

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The metadata about an attribute is grouped into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching


MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies.

It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file

  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.

2. Parser Library: consists of the parsers used to generate the SchemaGraphs.

3. Matcher Library: provides the schema matching algorithms.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: SchemaGraphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between them.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

[Figure: graph with nodes Direction, compass, degrees, and DirectionCompass, and edges labeled hasElement]

SchemaNode representation of XML schema

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
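To make the conversion idea concrete, the sketch below walks an XML Schema fragment and emits one hasElement edge per child element of each complex type. It is an illustration only and does not reproduce MWSAF's actual conversion functions; the function name, the edge-triple representation, and the xsd1 namespace URI are assumptions.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def complex_type_edges(xsd_text):
    """Return (parent, 'hasElement', child) triples for each complexType."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(XSD + "complexType"):
        for elem in ctype.iter(XSD + "element"):
            edges.append((ctype.get("name"), "hasElement", elem.get("name")))
    return edges


# The Direction type from the slide; the xsd1 namespace URI is a placeholder.
sample = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element name="compass" nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

edges = complex_type_edges(sample)
```

Once both the XML Schema and the ontology are reduced to edge lists of this kind, the same matcher can be run over either side.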

MWSAF: Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: graph with nodes WindEvent, windDirection, windSpeed, and Speed, and edges labeled hasProperty]

SchemaGraph representation of part of ontology

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the match score:

  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.

  - Schema match: structural similarity, based on sub-concept similarities.

• The getBestMapping function then looks at the match scores and determines a map set.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
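The slide does not say how getBestMapping chooses its map set. One plausible reading, sketched here purely as an assumption, is to keep for each WSDL element the highest-scoring ontology concept, provided the score clears a threshold. The function name best_mapping, the threshold, and the scores are hypothetical.

```python
def best_mapping(scores, threshold=0.5):
    """Pick, per WSDL element, the ontology concept with the top score.

    scores maps (wsdl_element, ontology_concept) -> match score in [0, 1].
    Elements whose best score is below the threshold are left unmapped.
    """
    best = {}
    for (element, concept), score in scores.items():
        if score >= threshold and score > best.get(element, (None, 0.0))[1]:
            best[element] = (concept, score)
    return {element: concept for element, (concept, _) in best.items()}


# Illustrative scores only.
scores = {
    ("windSpd", "windSpeed"): 0.9,
    ("windSpd", "windDirection"): 0.4,
    ("station", "Speed"): 0.2,
}
mapping = best_mapping(scores)
```

Unmapped elements such as "station" are the ones a user would most need to review when accepting or rejecting the suggested annotations.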

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter stemming, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
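As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share. The Dice coefficient used to normalize the count into the range 0 to 1 is an assumption, as is the default q = 3; the slide does not give MWSAF's exact formula.

```python
def qgrams(name, q=3):
    """Set of lower-cased substrings of length q."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_score(a, b, q=3):
    """Dice coefficient over shared q-grams: 0.0 (none) to 1.0 (identical sets)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


same = ngram_score("windSpeed", "WindSpeed")
close = ngram_score("windSpeed", "wind_speed")
unrelated = ngram_score("abc", "xyz")
```

A score of this kind is robust to small spelling differences such as separators, but it cannot see that two differently spelled names are synonyms, which is why it is paired with the WordNet and abbreviation checks.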

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology

  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework


Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback

• Real-world analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches for matching:
  - Schema-level vs. instance-level
  - Element-level vs. structure-level
  - Language-based and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004pdf

• Christophides, V.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions?


Where is Schema Matching used?

• Database application domains
  - Data Integration
  - Data Warehousing
  - E-Business
  - Query Processing

• Semantic Web
  - XML/HTML to an ontology
  - Semantic Web Services

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Integration

Problem: Construct a global view from a set of independently constructed schemas (e.g. ontologies)
  - Different structures and terminologies

Solution: Schema matching is performed to find relationships between concepts in each schema. The matching elements can then be unified.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Data Warehouses

Problem: Integrating data sources into a data warehouse
  - Different formats between the source and the warehouse

Solution: Use matching to find the elements of the source that are also present in the warehouse. The details of the semantics can then be examined to integrate the two.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

E-Commerce

Problem: Message translation
  - Each trading partner uses its own message format

Solution: A match operation would reduce the amount of manual work needed to specify how the formats are related.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Query Processing

Problem: The terms used in the user's query may be different from those in the database.

Solution: Matching is used to map the user-specified concepts in the query to schema elements.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any form suitable for the Semantic Web

• We must annotate them with concepts from ontologies

• Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web Services

• Problem: Web services are currently searched for using keywords

• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

• WSDLs are in XML; ontologies are in OWL

• Solution: Use schema matching approaches to map between the two different schemas


Term Definitions

• Schema: a set of elements connected by some structure

• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements of S2

• Mapping expression: tells how the S1 and S2 elements are related

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C# = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C#               CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Example

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. schema: matching approaches can consider instance data or schema-level information

• Element vs. structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs. constraint: linguistic (names) or constraint-based (keys and relationships)

• Matching cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary information: the matcher relies on information other than the input schemas, such as dictionaries, user input, or global schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  - Individual Matchers
    - Schema-only
      - Element Level
        - Linguistic (sample approaches: Name Similarity, Description Similarity, Global Namespaces)
        - Constraint (sample approaches: Type Similarity, Key Properties)
      - Structure Level
        - Constraint (sample approach: Graph Matching)
    - Instance/Contents
      - Element Level
        - Linguistic (sample approach: Word Frequency)
        - Constraint (sample approach: Value Pattern and Ranges)
  - Combining Matchers
    - Hybrid Matchers
    - Composite Matchers
      - Manual Composition
      - Automatic Composition

Further criteria:
  - Match cardinality
  - Auxiliary information used
  - ...

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

  1. Granularity of match (element level vs. structure level)
  2. Match cardinality
  3. Linguistic approaches: name or description matching
  4. Constraint-based approaches
  5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full structure match:

  S1 Elements      S2 Elements
  Address          CustAddress
    Street           Street
    City             City
    State            USState
    Zip              PostalCode

• Partial structure match:

  S1 Elements      S2 Elements
  AccountOwner     Customer
    Name             Cname
    Address          CAddress
    Birthdate        CPhone
    TaxExempt

• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
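The difference between a full and a partial structure match can be expressed as the share of one structure's subelements that have a matched counterpart in the other. The sketch below assumes element-level matches are already available as a dictionary; the function name and the coverage measure are illustrative and not taken from the survey.

```python
def structure_coverage(children1, children2, element_matches):
    """Fraction of S1 subelements whose matched partner is an S2 subelement.

    1.0 indicates a full structure match; values in between, a partial one.
    """
    if not children1:
        return 0.0
    matched = sum(1 for c in children1 if element_matches.get(c) in children2)
    return matched / len(children1)


# Address vs. CustAddress: every subelement has a counterpart.
full = structure_coverage(
    ["Street", "City", "State", "Zip"],
    ["Street", "City", "USState", "PostalCode"],
    {"Street": "Street", "City": "City", "State": "USState", "Zip": "PostalCode"},
)

# AccountOwner vs. Customer: only Name and Address have counterparts.
partial = structure_coverage(
    ["Name", "Address", "Birthdate", "TaxExempt"],
    ["Cname", "CAddress", "CPhone"],
    {"Name": "Cname", "Address": "CAddress"},
)
```

A matcher could accept pairs of structures above some coverage level, which is one way partial matches might be admitted without requiring every subelement to align.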

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local match cardinality     S1 element(s)         S2 element(s)         Matching expression
1:1, element level          Price                 Amount                Amount = Price
n:1, element level          Price, Tax            Cost                  Cost = Price*(1+Tax/100)
1:n, element level          Name                  FirstName, LastName   FirstName, LastName = Name
n:m, element level          B.Title, B.PuNo,      A.Book, A.Publisher   A.Book, A.Publisher = Select B.Title, P.Name
(also n:1 structure level)  P.PuNo, P.Name                              From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
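The n:1 and 1:n rows can be read as small functions over instance values. The following sketch evaluates the Cost expression from the table and one possible splitting rule for Name; splitting at the first space is an assumption, since the table does not say how the name is divided.

```python
def cost(price, tax_percent):
    """n:1 case: Cost = Price * (1 + Tax / 100)."""
    return price * (1 + tax_percent / 100)


def split_name(name):
    """1:n case: derive (FirstName, LastName) from Name.

    Splitting at the first space is an illustrative assumption.
    """
    first, _, last = name.partition(" ")
    return first, last


total = cost(100.0, 8.0)
first_name, last_name = split_name("Ada Lovelace")
```

The n:m row differs in kind: its expression is a join across two tables, so it cannot be evaluated one value at a time the way these two can.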

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Only a few works on complex matching have been done
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements

• Name matching: match elements with similar names
• Description matching: match comments in the schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element-level or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
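Several of these criteria can be chained into a single scoring function that tries the strongest evidence first. The sketch below covers only name equality, a synonym lookup, and common-substring similarity; the synonym table, the fixed score of 0.9 for synonyms, and the use of Python's difflib for the substring step are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher might consult a thesaurus.
SYNONYMS = {
    frozenset({"car", "automobile"}),
    frozenset({"zip", "postalcode"}),
}


def name_similarity(a, b):
    """Score two element names in [0, 1], strongest criterion first."""
    x, y = a.lower(), b.lower()
    if x == y:                            # 1. equality of names
        return 1.0
    if frozenset({x, y}) in SYNONYMS:     # 3. equality of synonyms
        return 0.9                        # assumed score for a synonym hit
    # 5. similarity based on common substrings
    return SequenceMatcher(None, x, y).ratio()


exact = name_similarity("Zip", "zip")
synonym = name_similarity("Zip", "PostalCode")
substring = name_similarity("ShipTo", "Ship2")
```

Stemming, hypernyms, and soundex would slot in as further steps in the same cascade; they are omitted here to keep the example dependency-free.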

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

  S1: empn // employee name
  S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
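At the simple end of that range, description matching can be sketched as keyword extraction followed by set overlap. The stop-word list and the Jaccard measure below are assumptions chosen for brevity.

```python
STOPWORDS = {"of", "the", "a", "an"}  # minimal illustrative stop-word list


def keywords(description):
    """Lower-cased words of a comment, minus stop words."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The two comments from the example above.
score = description_similarity("employee name", "name of employee")
```

The two comments reduce to the same keyword set, so empn and name are matched even though the element names themselves share little.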

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed. Edge label: hasProperty.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the match score:
  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema match: structural similarity, based on the similarity of sub-concepts

• The getBestMapping function then examines the match scores and determines the set of mappings

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
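As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share. The slide does not say how MWSAF normalizes the count, so the Dice coefficient, the `#` padding, and the default `q=3` are assumptions made for this example.

```python
def qgrams(name, q=3):
    """Set of q-grams of the lower-cased name, padded with '#' so that
    prefixes and suffixes also contribute grams."""
    padded = "#" * (q - 1) + name.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def ngram_similarity(a, b, q=3):
    """Dice coefficient over the q-gram sets; the result lies in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

score = ngram_similarity("windSpeed", "wind_speed")
```

Identical names score 1, names with no q-gram in common score 0, and near-variants such as `windSpeed` and `wind_speed` fall in between.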

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from each mapping:
  - Average Concept Match: tells the user the degree of similarity between the matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service

We also have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
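The slide names the two measures but gives no formulas. The sketch below uses one plausible reading, stated here as an assumption: the average concept match is the mean score over the concepts that were mapped, and the average service match divides the same sum by the total number of WSDL concepts, so unmapped concepts pull it down. The function names and sample scores are hypothetical.

```python
def average_concept_match(match_scores):
    """Mean match score over the WSDL concepts that were actually mapped."""
    return sum(match_scores) / len(match_scores) if match_scores else 0.0

def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of mapped scores divided by the number of all WSDL concepts."""
    return sum(match_scores) / total_wsdl_concepts if total_wsdl_concepts else 0.0

def best_ontology(scores_by_ontology, total_wsdl_concepts):
    """Pick the ontology (domain) with the highest average service match."""
    return max(
        scores_by_ontology,
        key=lambda name: average_service_match(scores_by_ontology[name], total_wsdl_concepts),
    )

# Hypothetical scores for a WSDL with 4 concepts, matched against two ontologies.
SCORES = {"weather": [0.9, 0.8, 0.7], "geography": [0.95]}
domain = best_ontology(SCORES, total_wsdl_concepts=4)
```

Under this reading, the geography ontology has the higher average concept match, but the weather ontology covers more of the service and therefore wins the categorization.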

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens when our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches for matching:
  - Schema vs. instance-level
  - Element vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf
• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework. POSV-WWW2004pdf
• Vassilis, C. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions


Schema Integration

Problem: Construct a global view from a set of independently constructed schemas (i.e., ontologies)
  - Different structures and terminologies

Solution: Schema matching is performed to find relationships between the concepts in each schema. The matching elements can then be unified.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Data Warehouses

Problem: Integrating data sources into a data warehouse
  - Different formats between the source and the warehouse

Solution: Use matching to find the elements of the source that are also present in the warehouse. The details of the semantics can then be examined to integrate the two.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

E-Commerce

Problem: Message translation
  - Each trading partner uses its own message format

Solution: A match operation would reduce the amount of manual work needed to specify how the formats are related.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Query Processing

Problem: The terms used in the user's query may be different from those in the database.

Solution: Matching is used to map the user-specified concepts in the query to schema elements.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any form suitable for the Semantic Web
• We must annotate them with concepts from ontologies
• Solution: Use schema matching to map between elements represented in OWL and the different schemas of Web documents

Semantic Web Services

• Problem: Web services are currently searched for using keywords
• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently
• WSDLs are in XML; ontologies are in OWL
• Solution: Use schema matching approaches to map between the two different schemas


Term Definitions

• Schema: a set of elements connected by some structure
• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements of S2
• Mapping expression: specifies how the S1 and S2 elements are related

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Example

S1 elements | S2 elements
Cust | Customer
  C | CustID
  CName | Company
  FirstName | Contact
  LastName | Phone

A mapping between S1 and S2 might contain these elements:
• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
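To make the idea of mapping elements with mapping expressions concrete, the sketch below encodes the three mapping elements of this example as functions over an S1 record. The dictionary layout, the `translate` helper, the sample record, and the use of a single space in the concatenation are assumptions made for illustration.

```python
# Each mapping element pairs an S2 (Customer) attribute with a mapping
# expression evaluated over an S1 (Cust) record.
MAPPING = {
    "CustID": lambda cust: cust["C"],
    "Contact": lambda cust: f'{cust["FirstName"]} {cust["LastName"]}',  # Concatenate(...)
    "Company": lambda cust: cust["CName"],
}

def translate(cust, mapping=MAPPING):
    """Build an S2 Customer record from an S1 Cust record.
    S2 attributes without a mapping element (here: Phone) are left out."""
    return {target: expression(cust) for target, expression in mapping.items()}

CUST = {"C": 17, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}
customer = translate(CUST)
```

The first and third elements are plain 1:1 correspondences, while the second combines two S1 attributes into one S2 attribute.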

Example

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. schema: matching approaches can consider instance data or only schema-level information
• Element vs. structure matching: a match can be performed for individual schema elements or for combinations of elements
• Language vs. constraint: matching can be linguistic (names) or constraint-based (keys and relationships)
• Matching cardinality: a match result may relate one or more elements of one schema to one or more elements of another
• Auxiliary information: the matcher relies on information besides the input schemas, such as dictionaries, user input, or global schemas

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  - Individual matchers
    - Schema-only
      - Element level: linguistic, constraint-based
      - Structure level: constraint-based
    - Instance/contents
      - Element level: linguistic, constraint-based
  - Combining matchers
    - Hybrid matchers
    - Composite matchers: manual composition, automatic composition

Further criteria: match cardinality, auxiliary information used, …

Sample approaches: name similarity, description similarity, global namespaces; word frequency; group matching; type similarity, key properties; value pattern and ranges

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure
• Often produce multiple match candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2
• The match comparison can be based on the name, description, or data type of the element
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
• Full structure match:

S1 elements | S2 elements
Address | CustAddress
  Street | Street
  City | City
  State | USState
  Zip | PostalCode

• Partial structure match:

S1 elements | S2 elements
AccountOwner | Customer
  Name | Cname
  Address | CAddress
  Birthdate | CPhone
  TaxExempt |

• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local match cardinality | S1 element(s) | S2 element(s) | Matching expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price*(1+Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m, element level (also n:1, structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
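The non-1:1 cases in this table can be written out as small functions. The sketch below is illustrative only: the function names are hypothetical, and splitting a name at the first space is an assumption, since the slide does not say how Name is decomposed.

```python
def cost(price, tax):
    """n:1 element level: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax / 100)

def split_name(name):
    """1:n element level: derive FirstName and LastName from Name
    (assumes the first space separates the two parts)."""
    first, _, last = name.partition(" ")
    return first, last

def book_publisher(books, publishers):
    """n:m element level: join B and P on PuNo, like the Select expression."""
    publisher_names = {p["PuNo"]: p["Name"] for p in publishers}
    return [
        (b["Title"], publisher_names[b["PuNo"]])
        for b in books
        if b["PuNo"] in publisher_names
    ]

B = [{"Title": "Databases", "PuNo": 1}, {"Title": "Orphan", "PuNo": 9}]
P = [{"PuNo": 1, "Name": "Springer"}]
rows = book_publisher(B, P)
```

The last case needs a join across two S1 structures, which is why it counts as n:1 at the structure level.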

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema
• Only a few works on complex matching exist
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology
• We need domain knowledge to perform complex matching accurately
• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements
• Name matching: match elements with similar names
• Description matching: match comments in the schemas

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes and suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, Soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
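Several of these similarity definitions can be chained into a single name matcher. The sketch below is a simplified illustration, not a method from the survey: the toy synonym list, the fixed scores 0.95 and 0.9, and the use of `difflib.SequenceMatcher` as the common-substring measure are all assumptions. Stemming, hypernyms, and Soundex are left out.

```python
import difflib
import re

# Toy thesaurus for illustration; a real matcher would consult a dictionary.
SYNONYMS = [{"car", "automobile"}, {"zip", "postalcode"}]

def normalize(name):
    """Lower-case the name and drop separators such as '_' or '-'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def name_similarity(a, b):
    """Return a score in [0, 1], trying the cheaper tests first."""
    if a == b:                                   # equality of names
        return 1.0
    na, nb = normalize(a), normalize(b)
    if na == nb:                                 # equality after normalization
        return 0.95
    if any(na in group and nb in group for group in SYNONYMS):
        return 0.9                               # equality of synonyms
    # fall back to similarity based on common substrings
    return difflib.SequenceMatcher(None, na, nb).ratio()

score = name_similarity("ShipTo", "Ship2")
```

For ShipTo and Ship2 only the last rule applies, and the shared prefix `ship` still yields a fairly high score.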

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn // employee name
  S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
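The simple end of that spectrum, keyword extraction followed by comparison, might look like the sketch below. The stop-word list and the choice of Jaccard overlap as the similarity measure are assumptions made for the example.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "for"}  # tiny illustrative stop list

def keywords(description):
    """Extract lower-cased words from a comment, dropping stop words."""
    return {w for w in re.findall(r"[a-z]+", description.lower()) if w not in STOPWORDS}

def description_similarity(d1, d2):
    """Jaccard overlap of the two keyword sets, in [0, 1]."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)

score = description_similarity("employee name", "name of employee")
```

With this measure the two comments from the example match fully, even though the element names `empn` and `name` differ.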

Constraint Based

• Schemas often contain constraints that define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - For example, in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example (figure with schemas S, S1, and S2):
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems:
  1. Determining which part of a new schema is similar to some part of a previously matched one is itself a match problem
  2. Similarity values may depend on the domain; e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization (e.g., value ranges)
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are used

• Main problem: when comparing data at the instance level, there are likely to be very many possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - Element names, data types, structures, and subelements
  - E.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey
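The example rule on this slide translates directly into code. In the sketch below, the `Element` class and the helper names are hypothetical; the rule itself (same name and same number of subelements) is the one stated above.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    subelements: list = field(default_factory=list)

def same_name_and_arity(e1, e2):
    """Hand-crafted rule: same name and same number of subelements."""
    return e1.name == e2.name and len(e1.subelements) == len(e2.subelements)

def rule_based_matches(schema1, schema2, rule=same_name_and_arity):
    """Return the name pairs of all elements that satisfy the rule."""
    return [(e1.name, e2.name) for e1 in schema1 for e2 in schema2 if rule(e1, e2)]

S1 = [Element("Address", [Element("Street"), Element("City")]), Element("Name")]
S2 = [Element("Address", [Element("Street"), Element("Town")]),
      Element("Name", [Element("First")])]
matches = rule_based_matches(S1, S2)
```

Here Address matches Address, while the two Name elements do not match because they differ in their number of subelements.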

Learner Based Solutions

• Learner-based: exploit both schema and data
• Require a lot of training data, but can exploit data instances in addition to schema information
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates are produced when the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria in one matcher. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible; the combination can be done automatically or manually.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
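A composite matcher can be sketched as a weighted average over matchers that each run independently. The two toy matchers, the element dictionaries, and the weights below are assumptions chosen for illustration; real systems may combine results differently.

```python
def name_matcher(a, b):
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0

def type_matcher(a, b):
    """1.0 if the declared data types are identical, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0

def composite_match(a, b, matchers):
    """Weighted average of independently executed matchers.
    `matchers` is a list of (matcher_function, weight) pairs."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * matcher(a, b) for matcher, weight in matchers) / total_weight

S1_ELEM = {"name": "Zip", "type": "string"}
S2_ELEM = {"name": "zip", "type": "int"}
score = composite_match(S1_ELEM, S2_ELEM, [(name_matcher, 0.7), (type_matcher, 0.3)])
```

Because each matcher is a separate function, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to the composite approach.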


LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs and learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input is required: the user must provide application-specific match/mismatch relations and must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic element-level matching
  - Categorizes elements based on names, data types, and domains
  - Calculates a linguistic similarity coefficient
Phase 2:
  - Transforms the original schema into a tree, then performs bottom-up structure matching
  - Calculates a similarity value
  - Calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - Uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
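The weighted mean from phase 2 and the mapping decision in phase 3 can be sketched as follows. The slide gives neither the weight nor the decision rule, so the `w_struct` weight, the `threshold`, the best-candidate selection, and the sample similarity values are all assumptions made for illustration; this is not Cupid's actual algorithm.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1 - w_struct) * lsim

def select_mapping(similarities, w_struct=0.5, threshold=0.5):
    """For each S1 element, keep the S2 element with the highest weighted
    similarity, provided that similarity reaches the threshold."""
    best = {}
    for (s1, s2), (lsim, ssim) in similarities.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > best.get(s1, (None, 0.0))[1]:
            best[s1] = (s2, wsim)
    return {s1: s2 for s1, (s2, _) in best.items()}

# Hypothetical (lsim, ssim) values for candidate element pairs.
SIMS = {
    ("Zip", "PostalCode"): (0.6, 0.9),
    ("Zip", "City"): (0.1, 0.9),
    ("State", "Phone"): (0.2, 0.3),
}
mapping = select_mapping(SIMS)
```

In this sample, Zip is mapped to PostalCode, and State stays unmapped because its only candidate falls below the threshold.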

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: used to identify matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching


MWSAF: Meteor-S Web Service Annotation Framework
LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up Web service descriptions with ontologies
  - It helps in describing services semantically and aids efficient Web service discovery and composition


Page 11: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Data WarehousesData Warehouses

Problem Integrating data sources into a data warehouse

- Different formats between the source and warehouse

Solution Use matching to find the elements of the source that are also present in the warehouse Then the details of the semantics can be examined to integrate the two

Bernstein P Rahm E A survey of approaches to automatic schema matching

E-CommerceE-Commerce

Problem Message translation

-Each trading partner uses its own message format

Solution A match operation would reduce the amount of manual work to specify how the formats are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Query ProcessingQuery Processing

Problem The terms used in the userrsquos query may be different from those in the database

Solution Matching is used to map the user-specified concepts in the query to schema elements

Bernstein P Rahm E A survey of approaches to automatic schema matching

Need for Data Integration on the Need for Data Integration on the Semantic WebSemantic Web

bull Problem Web documents are not in RDF or any form suitable for the SW

bull We must annotate them with concepts from ontologies

bull Solution Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

bull Instance vs Schema matching approaches can consider instance data or schema-level information

bull Element vs Structure matching match can be performed for individual schema elements or combinations of elements

bull Language vs Constraint linguistic (names) or constraint-based (keys and relationships)

bull Matching Cardinality match result may relate one or more elements of one schema to one or more elements of another

bull Auxiliary Information matcher relies on other information besides the input schemas such as dictionaries user input global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers: Manual Composition, Automatic Composition

Further Criteria: Match Cardinality, Auxiliary Information Used, ...

Sample Approaches: Name Similarity, Description Similarity, Global Namespaces, Word Frequency, Group Matching, Type Similarity, Key Properties, Value Pattern and Ranges

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full Structure Match

• Partial Structure Match

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

S1 Elements | S2 Elements
Address | CustAddress
Street | Street
City | City
State | USState
Zip | PostalCode

S1 Elements | S2 Elements
AccountOwner | Customer
Name | Cname
Address | CAddress
Birthdate | CPhone
TaxExempt |

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinality | S1 Element(s) | S2 Element(s) | Matching Expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price*(1+Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:1, structure level (also n:m, element level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
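Two of the matching expressions in the table can be written out as functions. A minimal sketch: the function names are invented for illustration, and splitting Name at the first space is an assumption about how the 1:n case would extract FirstName and LastName.

```python
def cost(price, tax_percent):
    """n:1 element level: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax_percent / 100)

def split_name(name):
    """1:n element level: derive (FirstName, LastName) from Name.

    Assumes the value has the form 'First Last' and splits at the first space.
    """
    first, _, last = name.partition(" ")
    return first, last
```

For example, `cost(100.0, 8.0)` evaluates the n:1 expression for a price of 100 with 8 percent tax.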

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Little work has been done on complex matching
  • Some approaches hard-code complex matches into rules
  • Some rely on a domain-specific ontology

• Domain knowledge is needed to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
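Several of the similarity definitions above can be chained into one name matcher. The sketch below is illustrative only: the synonym table, the crude suffix stripper, and the scores 0.9 and 0.8 are invented assumptions, and `difflib.SequenceMatcher` stands in for "similarity based on common substrings".

```python
from difflib import SequenceMatcher

# Toy synonym table; a real matcher would consult a thesaurus or dictionary.
SYNONYMS = {frozenset({"car", "automobile"})}
SUFFIXES = ("ing", "es", "s")

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def name_similarity(a, b):
    """Return a similarity in [0, 1] for two element names."""
    a, b = a.lower(), b.lower()
    if a == b:                          # 1. equality of names
        return 1.0
    if stem(a) == stem(b):              # 2. equality after stemming
        return 0.9
    if frozenset({a, b}) in SYNONYMS:   # 3. equality of synonyms
        return 0.8
    # 5. similarity based on common substrings
    return SequenceMatcher(None, a, b).ratio()
```

With this sketch, `name_similarity("ShipTo", "Ship2")` falls through to the substring comparison and scores well above unrelated names because of the shared "ship" prefix.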

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

S1: empn // employee name

S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
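The simple end of that spectrum, keyword extraction followed by overlap, might look like the following. The stopword list and the use of Jaccard overlap are assumptions chosen for illustration; synonym handling is left out.

```python
# Small, hypothetical stopword list for schema comments.
STOPWORDS = {"of", "the", "a", "an", "for"}

def keywords(comment):
    """Extract lower-cased keywords from a schema comment."""
    return {word for word in comment.lower().split() if word not in STOPWORDS}

def description_similarity(comment1, comment2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(comment1), keywords(comment2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)
```

Applied to the slide's example, the comments "employee name" and "name of employee" reduce to the same keyword set, so empn and name would be matched even though the element names differ.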

Constraint Based

• Schemas often contain constraints that define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
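A constraint-based matcher compares such properties rather than names. A minimal sketch, assuming a hypothetical `Column` description with three constraints and an unweighted fraction of agreeing constraints as the score:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    """Hypothetical description of a schema element's constraints."""
    name: str
    data_type: str
    nullable: bool
    is_key: bool

def constraint_similarity(a, b):
    """Fraction of the compared constraints on which two elements agree."""
    checks = [
        a.data_type == b.data_type,
        a.nullable == b.nullable,
        a.is_key == b.is_key,
    ]
    return sum(checks) / len(checks)

c = Column("C", "int", nullable=False, is_key=True)
cust_id = Column("CustID", "int", nullable=False, is_key=True)
phone = Column("Phone", "string", nullable=True, is_key=False)
```

Here C and CustID agree on every constraint despite their different names, while C and Phone agree on none.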

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas

e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

[Figure: three schemas, labeled Schema S1, Schema S2 and Schema S, with these element lists]

Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone

Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)

POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems

1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why:
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How:
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques

• Main Problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-Based: hand-crafted rules that exploit schema information
  • element names, data types, structures, and subelements
  • e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey
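The example rule on this slide is simple enough to write down directly. The dictionary layout for an element (a `name` plus a list of `children`) and the case-insensitive name comparison are assumptions made for this sketch.

```python
def rule_match(e1, e2):
    """Hand-crafted rule: same name and same number of subelements."""
    same_name = e1["name"].lower() == e2["name"].lower()
    same_arity = len(e1["children"]) == len(e2["children"])
    return same_name and same_arity

s1_address = {"name": "Address", "children": ["Street", "City", "State", "Zip"]}
s2_address = {"name": "address", "children": ["Street", "City", "USState", "PostalCode"]}
s2_short = {"name": "Address", "children": ["Street", "City"]}
```

The rule accepts the first pair and rejects the second, which also shows its brittleness: a renamed element such as CustAddress would never match under it.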

Learner Based Solutions

• Learner-Based: exploit both schema and data

• Require a lot of training data, but can exploit data as well as schema information

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two Combination Methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
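A composite matcher needs some way to merge the scores of its independently executed matchers. One common choice, used here purely as an assumption, is a weighted average; the matcher names and weights are hypothetical.

```python
def composite_score(scores, weights):
    """Weighted average of the scores produced by independent matchers.

    scores  -- matcher name -> similarity in [0, 1] for one candidate pair
    weights -- matcher name -> relative weight (hypothetical values)
    """
    total_weight = sum(weights[matcher] for matcher in scores)
    weighted_sum = sum(scores[matcher] * weights[matcher] for matcher in scores)
    return weighted_sum / total_weight

pair_scores = {"name": 0.8, "constraint": 0.4}
pair_weights = {"name": 3, "constraint": 1}
combined = composite_score(pair_scores, pair_weights)
```

Because the matchers run independently, a new one can be added by giving it a weight, which is the flexibility the slide attributes to composite matchers.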

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs and learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on name, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three Components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: used to identify matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithm

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

Problem: the difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

It then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of XML schema. Nodes: Direction, compass, degrees, DirectionCompass; edge label: hasElement]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
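The conversion of the complex type above into hasElement edges can be sketched with the standard library XML parser. This is not the MWSAF parser: the `schema_graph_edges` function, the edge-tuple representation, the wrapping `xsd:schema` element, and both namespace URIs are assumptions added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"  # assumed namespace URI

# The slide's fragment, wrapped in a schema element with assumed namespaces.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

def schema_graph_edges(xsd_text):
    """Return (complexType, 'hasElement', element) edges in document order."""
    root = ET.fromstring(xsd_text)
    edges = []
    for complex_type in root.iter(f"{{{XSD_NS}}}complexType"):
        for element in complex_type.iter(f"{{{XSD_NS}}}element"):
            edges.append((complex_type.get("name"), "hasElement", element.get("name")))
    return edges

edges = schema_graph_edges(SCHEMA)
```

Each complex type becomes a node, and each of its child elements becomes a node reached through a hasElement edge, mirroring the figure.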

MWSAF: Meteor-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:

- Element Level Match: linguistic similarity of two concepts based on names. Uses WordNet to check for synonyms. Abbreviations are also checked

- Schema Match: structural similarity (sub-concept similarities)

• The getBestMapping function then looks at the match scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
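How a best mapping might be selected from match scores can be illustrated as follows. This is not MWSAF's getBestMapping: the `get_best_mapping` name, the pair-keyed score dictionary, the threshold of 0.5, and the "highest score per WSDL element" policy are all assumptions.

```python
def get_best_mapping(scores, threshold=0.5):
    """Keep, for each WSDL element, the ontology concept with the highest score.

    scores    -- (wsdl_element, ontology_concept) -> match score in [0, 1]
    threshold -- hypothetical cutoff below which no mapping is proposed
    """
    best = {}
    for (element, concept), score in scores.items():
        current_best = best.get(element, (None, 0.0))[1]
        if score >= threshold and score > current_best:
            best[element] = (concept, score)
    return best

match_scores = {
    ("windSpeed", "windSpeed"): 1.0,
    ("windSpeed", "windDirection"): 0.4,
    ("degrees", "Speed"): 0.2,
}
map_set = get_best_mapping(match_scores)
```

Elements whose best candidate falls below the cutoff, such as degrees here, stay unmapped and would be left for the user to annotate.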

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

- NGram: considers the number of q-grams that the names have in common

- CheckSynonym: uses WordNet to find synonyms
- CheckAbbreviations: uses an abbreviation dictionary
- TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
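A q-gram comparison of the kind NGram performs can be sketched in a few lines. The slides do not give MWSAF's exact formula, so the Dice coefficient over sets of 3-grams used here is an assumption, as are the function names.

```python
def qgrams(name, q=3):
    """Set of lower-cased substrings of length q."""
    text = name.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def ngram_similarity(a, b, q=3):
    """Dice coefficient of the two q-gram sets, a value between 0 and 1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        # Names shorter than q have no q-grams; fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For "wind" and "windy" the shared 3-grams are "win" and "ind", out of two and three 3-grams respectively, which gives 2*2/5 = 0.8.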

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from the mapping:

- Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology

- Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
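The slide does not state how the two measures are computed. One plausible reading, given here only as an assumption, averages the match scores over the mapped concepts for Average Concept Match and over all concepts in the WSDL for Average Service Match.

```python
def average_concept_match(match_scores):
    """Mean match score over the concepts that were actually mapped (assumed definition)."""
    if not match_scores:
        return 0.0
    return sum(match_scores) / len(match_scores)

def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of match scores divided by all concepts in the WSDL (assumed definition)."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(match_scores) / total_wsdl_concepts

mapped_scores = [1.0, 0.5]  # hypothetical scores of two mapped concepts
concept_match = average_concept_match(mapped_scores)
service_match = average_service_match(mapped_scores, total_wsdl_concepts=4)
```

Under this reading a WSDL with a few strong matches but many unmapped concepts gets a high concept match and a low service match, so the ontology with the highest service match would be the natural candidate domain.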

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches to matching:

- Schema vs. instance-level

- Element vs. structure-level

- Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework. POSV-WWW2004pdf

• Christophides, V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions


E-Commerce

Problem: message translation

- Each trading partner uses its own message format

Solution: a match operation would reduce the amount of manual work needed to specify how the formats are related

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Query Processing

Problem: the terms used in the user's query may be different from those in the database

Solution: matching is used to map the user-specified concepts in the query to schema elements

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any form suitable for the Semantic Web

• We must annotate them with concepts from ontologies

• Solution: use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web Services

• Problem: Web services are currently searched for using keywords

• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

• WSDLs are in XML; ontologies are in OWL

• Solution: use schema matching approaches to map between the two different schemas

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Term Definitions

• Schema: a set of elements connected by some structure

• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements of S2

• Mapping Expression: tells how the S1 and S2 elements are related

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching



OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic, element-level
  - categorizes elements based on name, data types and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
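The weighted mean in phase 2 and the decision in phase 3 can be illustrated as follows. This is a minimal sketch, not Cupid's implementation: the parameter names `w_struct` and `threshold`, their default values, and the simple "accept if above threshold" rule are assumptions made for the example.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def select_mapping(candidates, threshold=0.6, w_struct=0.5):
    """Keep candidate pairs whose weighted similarity reaches the threshold.

    candidates: iterable of (s1_element, s2_element, lsim, ssim) tuples.
    """
    mapping = []
    for s1, s2, lsim, ssim in candidates:
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold:
            mapping.append((s1, s2, wsim))
    return mapping


candidates = [
    ("Zip", "PostalCode", 1.0, 0.5),   # strong linguistic, moderate structural
    ("Zip", "Phone", 0.25, 0.25),      # weak on both measures
]
print(select_mapping(candidates))
```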

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: used to identify matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies.

It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file

  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithms

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions.

Then it applies a matching algorithm to find the mappings between them.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
        <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
      </xsd:sequence>
    </xsd:complexType>

[Figure: graph with nodes Direction, compass, degrees and DirectionCompass; edge label hasElement]

SchemaNode representation of XML schema

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
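A conversion of this kind can be approximated with Python's standard XML parser. The sketch below is not the MWSAF parser: the function name `complex_types_to_edges`, the edge tuple layout, the enclosing `xsd:schema` element and the `urn:example:weather` namespace are assumptions made so that the example is self-contained, and it covers only the single rule "each element of a complex type becomes a child node linked by hasElement".

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The Direction type from the slide, wrapped in a schema element so it parses.
SCHEMA_TEXT = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def complex_types_to_edges(schema_text):
    """Return (parent, label, child) edges, one per element of a complex type."""
    root = ET.fromstring(schema_text)
    edges = []
    for complex_type in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = complex_type.get("name")
        for element in complex_type.iter(f"{{{XSD_NS}}}element"):
            edges.append((parent, "hasElement", element.get("name")))
    return edges


for edge in complex_types_to_edges(SCHEMA_TEXT):
    print(edge)
```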

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent" />
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="#Speed" />
    </daml:Property>

[Figure: graph with nodes WindEvent, windDirection, windSpeed and Speed; edge label hasProperty]

SchemaGraph representation of part of ontology

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the Match Score:

  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked

  - Schema Match: structural similarity (sub-concept similarities)

• The getBestMapping function then looks at the Match Scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
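The slide does not say how the map set is chosen from the Match Scores. One plausible reading, shown here purely as an assumption, is to keep for every WSDL element the ontology concept with the highest score, provided that score clears a threshold. The function name `best_mapping`, the score dictionary layout and the threshold value are hypothetical and are not taken from MWSAF.

```python
def best_mapping(match_scores, threshold=0.5):
    """Pick the highest-scoring concept per WSDL element.

    match_scores: dict mapping (wsdl_element, concept) -> score in [0, 1].
    Returns a dict mapping wsdl_element -> (concept, score).
    """
    best = {}
    for (element, concept), score in match_scores.items():
        if score < threshold:
            continue  # too weak to be offered to the user
        if element not in best or score > best[element][1]:
            best[element] = (concept, score)
    return best


scores = {
    ("windSpeed", "WindSpeed"): 0.9,
    ("windSpeed", "Speed"): 0.6,
    ("degrees", "WindEvent"): 0.2,
}
print(best_mapping(scores))
```

Elements with no candidate above the threshold are simply left unmapped, which matches the tool's workflow of handing only proposed matches to the user for acceptance.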

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
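As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share. It is an assumption-laden example rather than the MWSAF algorithm: the choice of q = 3, lower-casing, and the Dice coefficient as the normalisation into [0, 1] are all picked for the example.

```python
def qgrams(name, q=3):
    """Set of overlapping character q-grams of a lower-cased name."""
    text = name.lower()
    if len(text) < q:
        return {text}  # short names are treated as a single gram
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, a value between 0 and 1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("wind", "windy"))
print(ngram_similarity("windSpeed", "WindSpeed"))
```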

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

Then two measures are derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology

  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, then what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and the descriptions of the existing approaches for matching:

  - Schema- vs. instance-level

  - Element- vs. structure-level

  - Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf

• Christophides, V.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Query Processing

Problem: The terms used in the user's query may be different from those in the database.

Solution: Matching is used to map the user-specified concepts in the query to schema elements.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any form suitable for the Semantic Web

• We must annotate them with concepts from ontologies

• Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents

Semantic Web Services

• Problem: Web services are currently searched for using keywords

• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

• WSDLs are in XML; ontologies are in OWL

• Solution: Use schema matching approaches to map between the two different schemas

Term Definitions

• Schema: a set of elements connected by some structure

• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2

• Mapping Expression: tells how the S1 and S2 elements are related

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C# = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C#               CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
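Read as executable rules, the three mapping elements above translate one S1 record into an S2 record. The sketch below is illustrative only: records are modelled as Python dictionaries, the key attribute is assumed to be named `C#`, `Concatenate` is assumed to join the two names with a single space, and the sample values are made up. Phone has no source attribute in the mapping and is therefore left out.

```python
def cust_to_customer(cust):
    """Apply the three mapping expressions to one S1 'Cust' record."""
    return {
        "CustID": cust["C#"],                                      # Cust.C# = Customer.CustID
        "Contact": f"{cust['FirstName']} {cust['LastName']}",      # Concatenate(...)
        "Company": cust["CName"],                                  # Cust.CName = Customer.Company
    }


cust = {"C#": 17, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}
print(cust_to_customer(cust))
```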

Example

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information

• Element vs. Structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships)

• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary Information: the matcher relies on other information besides the input schemas, such as dictionaries, user input or global schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level
        Linguistic (sample approaches: Name Similarity, Description Similarity, Global Namespaces)
        Constraint (sample approaches: Type Similarity, Key Properties)
      Structure Level
        Constraint (sample approach: Group Matching)
    Instance/Contents
      Element Level
        Linguistic (sample approach: Word Frequency)
        Constraint (sample approach: Value Pattern and Ranges)
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition

Further criteria: match cardinality, auxiliary information used, ...

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full Structure Match

• Partial Structure Match

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
1:1 element level | Price | Amount | Amount = Price
n:1 element level | Price, Tax | Cost | Cost = Price * (1 + Tax/100)
1:n element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m element level (also n:1 structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
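The matching expressions in the table can be tried out directly. The following sketch uses hypothetical function names and made-up sample data. It assumes Tax is a percentage, that a Name splits into first and last name at the first space (the slide does not say how Name is decomposed), and that tables B and P are lists of dictionaries.

```python
def cost(price, tax_percent):
    """n:1 element level: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)


def split_name(name):
    """1:n element level: derive FirstName and LastName from Name.

    Assumption: the name is split at its first space.
    """
    first, _, last = name.partition(" ")
    return first, last


def books_with_publishers(b_rows, p_rows):
    """n:m element level: Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo."""
    return [
        (b["Title"], p["Name"])
        for b in b_rows
        for p in p_rows
        if b["PuNo"] == p["PuNo"]
    ]


print(cost(80, 25))
print(split_name("Ada Lovelace"))
print(books_with_publishers(
    [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Ontologies", "PuNo": 2}],
    [{"PuNo": 1, "Name": "Example Press"}],
))
```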

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Only a few works on complex matching have been done
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to accurately perform complex matching

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
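Several of these similarity definitions can be layered into one small name matcher. The sketch below is only an illustration: the tiny `SYNONYMS` table stands in for a real thesaurus, the score of 0.9 for synonyms is an arbitrary assumption, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in for the common-substring measure.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs; a real matcher would consult a thesaurus.
SYNONYMS = {frozenset(("car", "automobile")), frozenset(("zip", "postalcode"))}


def normalize(name):
    """Lower-case a name and drop separators such as '_' or '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Return a similarity in [0, 1] using equality, synonyms, then substrings."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0                      # equality of (normalized) names
    if frozenset((na, nb)) in SYNONYMS:
        return 0.9                      # equality of synonyms (assumed score)
    return SequenceMatcher(None, na, nb).ratio()  # common substrings


print(name_similarity("Ship_To", "shipto"))
print(name_similarity("Zip", "PostalCode"))
print(name_similarity("ShipTo", "Ship2"))
```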

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

  S1: empn (employee name)

  S2: name (name of employee)

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
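The simple end of that spectrum, keyword extraction followed by a set comparison, might look like the sketch below. The stop-word list and the choice of Jaccard overlap as the score are assumptions for the example; synonym handling is left out.

```python
STOPWORDS = {"a", "an", "the", "of", "for", "in"}


def keywords(description):
    """Extract lower-cased keywords, dropping common stop words."""
    return {word for word in description.lower().split() if word not in STOPWORDS}


def description_similarity(a, b):
    """Jaccard overlap of the keyword sets of two element descriptions."""
    ka, kb = keywords(a), keywords(b)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# The two comments from the slide describe the same thing in different words.
print(description_similarity("employee name", "name of employee"))
```

Here `empn` and `name` share no useful substring, yet their descriptions match completely, which is why description matching complements name matching.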

Constraint-Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas

  e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and the schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

[Figure: three example schemas, labeled Schema S1, Schema S2 and Schema S]
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems

1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques

• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Rule-Based Solutions

• Rule-Based: hand-crafted rules to exploit schema information
  - element names, data types, structures and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
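The example rule on this slide is simple enough to state in code. In the sketch below, the `Element` class and the function name `rule_match` are hypothetical, and names are compared exactly (case-sensitively), which is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A schema element with a name and a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(a, b):
    """Hand-crafted rule: same name and same number of subelements."""
    return a.name == b.name and len(a.children) == len(b.children)


s1_address = Element("Address", [Element("Street"), Element("City")])
s2_address = Element("Address", [Element("Road"), Element("Town")])
s2_short = Element("Address", [Element("Line1")])

print(rule_match(s1_address, s2_address))
print(rule_match(s1_address, s2_short))
```

The rule never looks at the subelement names or at any data, which shows both why such rules are cheap to apply and why they miss matches that instance data would reveal.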

Learner-Based Solutions

• Learner-Based: exploit both schema and data

• Requires a lot of training data, but can exploit data

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey


Page 14: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Need for Data Integration on the Semantic Web

• Problem: Web documents are not in RDF or any other form suitable for the Semantic Web.
• We must annotate them with concepts from ontologies.
• Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents.

Semantic Web Services

• Problem: Web Services are currently searched for using keywords.
• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently.
• WSDLs are in XML; ontologies are in OWL.
• Solution: Use schema matching approaches to map between the two different schemas.

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Term Definitions

• Schema: a set of elements connected by some structure.
• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2.
• Mapping Expression: tells how the S1 and S2 elements are related.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements      S2 Elements
Cust             Customer
  C                CustID
  CName            Company
  FirstName        Contact
  LastName         Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching
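A mapping of this kind can be held as plain data: each mapping element names its S1 sources, its S2 target, and the expression that relates them. The sketch below is only an illustration. The names `MappingElement` and `apply_mapping`, the sample row, and the choice to concatenate with a single space are assumptions, not part of the survey.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MappingElement:
    sources: List[str]                 # S1 element names
    target: str                        # S2 element name
    expression: Callable[..., object]  # mapping expression over the sources


# The three mapping elements from the example (hypothetical encoding).
MAPPING = [
    MappingElement(["Cust.C"], "Customer.CustID", lambda c: c),
    MappingElement(["Cust.FirstName", "Cust.LastName"], "Customer.Contact",
                   lambda first, last: f"{first} {last}"),  # assumed separator
    MappingElement(["Cust.CName"], "Customer.Company", lambda n: n),
]


def apply_mapping(record: Dict[str, object]) -> Dict[str, object]:
    """Translate one S1 record into an S2 record using MAPPING."""
    out = {}
    for m in MAPPING:
        out[m.target] = m.expression(*(record[s] for s in m.sources))
    return out


s1_row = {"Cust.C": 42, "Cust.FirstName": "Ada",
          "Cust.LastName": "Lovelace", "Cust.CName": "Analytical Engines"}
print(apply_mapping(s1_row))
```

Note that Customer.Phone has no source in S1, so it simply does not appear in the translated record.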

Example

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information.
• Element vs. Structure matching: a match can be performed for individual schema elements or for combinations of elements.
• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships).
• Matching Cardinality: the match result may relate one or more elements of one schema to one or more elements of another.
• Auxiliary Information: the matcher relies on information besides the input schemas, such as dictionaries, user input, or global schemas.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
- Individual Matchers
  - Schema-only
    - Element Level: Linguistic, Constraint
    - Structure Level: Constraint
  - Instance/Contents
    - Element Level: Linguistic, Constraint
- Combining Matchers
  - Hybrid Matchers
  - Composite Matchers: Manual Composition, Automatic Composition

Further Criteria: Match Cardinality, Auxiliary information used, ...

Sample Approaches: Name Similarity, Description Similarity, Global Namespaces, Word Frequency, Group Matching, Type Similarity, Key Properties, Value Pattern and Ranges

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure.
• Often produce multiple match candidates and estimate a degree of similarity for each.

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2.
• The match comparison can be based on the name, description, or data type of the element.
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2.
• Full Structure Match

S1 Elements      S2 Elements
Address          CustAddress
  Street           Street
  City             City
  State            USState
  Zip              PostalCode

• Partial Structure Match

S1 Elements      S2 Elements
AccountOwner     Customer
  Name             Cname
  Address          CAddress
  Birthdate        CPhone
  TaxExempt

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements.
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinality       S1 Element(s)                      S2 Element(s)          Matching Expression
1:1, element level            Price                              Amount                 Amount = Price
n:1, element level            Price, Tax                         Cost                   Cost = Price*(1+Tax/100)
1:n, element level            Name                               FirstName, LastName    FirstName, LastName = Name
n:m, element level            B.Title, B.PuNo, P.PuNo, P.Name    A.Book, A.Publisher    A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
(also n:1, structure level)

Bernstein P Rahm E A survey of approaches to automatic schema matching
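The four cardinality cases can be read as four small functions, one per matching expression. This is a minimal sketch: the function names, the sample rows in `B` and `P`, and the rule of splitting a name at the first space are assumptions made for illustration.

```python
def amount(price):
    """1:1, element level: Amount = Price."""
    return price


def cost(price, tax):
    """n:1, element level: Cost = Price*(1+Tax/100), with Tax in percent."""
    return price * (1 + tax / 100)


def first_last(name):
    """1:n, element level: derive FirstName and LastName from Name.
    Splitting at the first space is an assumption, not a general rule."""
    first, _, last = name.partition(" ")
    return first, last


def book_publisher(b_rows, p_rows):
    """n:m, element level: join B and P on PuNo, keep Title and Name."""
    return [(b["Title"], p["Name"])
            for b in b_rows for p in p_rows
            if b["PuNo"] == p["PuNo"]]


# Hypothetical sample data for the join.
B = [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Ontologies", "PuNo": 2}]
P = [{"PuNo": 1, "Name": "Acme Press"}, {"PuNo": 2, "Name": "Globe Books"}]

print(cost(80, 25))
print(first_last("Ada Lovelace"))
print(book_publisher(B, P))
```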

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema.
• Only a few works on complex matching exist:
  - Some hard-code complex matches into rules.
  - Some rely on a domain-specific ontology.
• We need domain knowledge to perform complex matching accurately.
• The best match isn't always the top match returned by the matcher, so human involvement is still needed.

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements.
• Name Matching: match elements with similar names.
• Description Matching: match comments in the schemas.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names.
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element-level or structure-level.
• Cardinality is not limited to 1:1.

Bernstein P Rahm E A survey of approaches to automatic schema matching
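A few of these similarity criteria can be sketched in a handful of lines. The example below covers only name equality, synonym lookup, and common-substring similarity. The `SYNONYMS` table and the score of 0.9 for synonyms are assumptions, and `difflib.SequenceMatcher` is swapped in as a stand-in for whatever substring measure a real matcher would use.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus.
SYNONYMS = {
    frozenset({"car", "automobile"}),
    frozenset({"zip", "postalcode"}),
}


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two element names."""
    x, y = a.lower(), b.lower()
    if x == y:                             # criterion 1: equal names
        return 1.0
    if frozenset({x, y}) in SYNONYMS:      # criterion 3: synonyms
        return 0.9                         # assumed score for a synonym hit
    # criterion 5: similarity based on common substrings
    return SequenceMatcher(None, x, y).ratio()


print(name_similarity("ShipTo", "Ship2"))
print(name_similarity("Zip", "PostalCode"))
```

ShipTo and Ship2 share the substring "ship", so they score well above unrelated names without being treated as equal.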

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements.
• Example:
  S1: empn (employee name)
  S2: name (name of employee)
• Can be as simple as keyword extraction and synonym matching, or as complex as natural language understanding.

Bernstein P Rahm E A survey of approaches to automatic schema matching
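The simple end of that spectrum, keyword extraction followed by a set comparison, might look like the sketch below. The stopword list and the use of Jaccard overlap are assumptions chosen for illustration; the survey does not prescribe either.

```python
# Hypothetical stopword list; real systems use a much larger one.
STOPWORDS = {"of", "the", "a", "an", "for"}


def keywords(description: str) -> set:
    """Extract lowercase keywords, dropping stopwords."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1: str, d2: str) -> float:
    """Jaccard overlap of the keyword sets of two descriptions."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 and not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The comments of empn (S1) and name (S2) from the example.
print(description_similarity("employee name", "name of employee"))
```

The element names empn and name differ, but their comments reduce to the same keyword set, which is what lets a description matcher find the pair.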

Constraint Based

• Schemas often contain constraints that define data types, value ranges, optionality, relationship types, cardinalities, etc. A matcher can compare these constraints to judge how similar two schema elements are.

Bernstein P Rahm E A survey of approaches to automatic schema matching
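As a rough illustration of constraint-based comparison, the sketch below scores two elements by the fraction of constraints on which they agree. The `ElementConstraints` record, the three constraints chosen, and the equal weighting are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementConstraints:
    """Hypothetical summary of the constraints declared on one element."""
    data_type: str
    nullable: bool
    is_key: bool


def constraint_similarity(a: ElementConstraints, b: ElementConstraints) -> float:
    """Fraction of the compared constraints on which a and b agree."""
    checks = [
        a.data_type == b.data_type,
        a.nullable == b.nullable,
        a.is_key == b.is_key,
    ]
    return sum(checks) / len(checks)


cust_id = ElementConstraints("int", nullable=False, is_key=True)
c_no = ElementConstraints("int", nullable=False, is_key=True)
phone = ElementConstraints("string", nullable=True, is_key=False)
print(constraint_similarity(cust_id, c_no), constraint_similarity(cust_id, phone))
```

On its own such a score is weak evidence, since many unrelated elements share a type, which is one reason constraint-based matchers are usually combined with others.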

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings.
• Many schemas are very similar to each other and to previously matched schemas, e.g. in e-commerce, substructures often repeat within different message formats (address fields, name fields).
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

[Figure: three schemas, labeled Schema S1, Schema S2, and Schema S. The element lists that survive are:]
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is itself a match problem.
  2. Similarity values may depend on the domain, e.g. salary and income may be identical in a payroll application but not in a tax reporting application.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available.
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements.
  3. To match instance-level data.
• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g. value ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used.
• Main Problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant.

Bernstein P Rahm E A survey of approaches to automatic schema matching
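One way to picture a constraint-based characterization of instance data is to summarize each column by its observed value range and compare the ranges. The sketch below does only that. The overlap measure (shared length divided by total span) and the sample columns are assumptions, not a method taken from the survey.

```python
def value_range(values):
    """Characterize a numeric column by its observed minimum and maximum."""
    return min(values), max(values)


def range_overlap(values_a, values_b) -> float:
    """Overlap of two observed ranges, as shared length / total span."""
    lo_a, hi_a = value_range(values_a)
    lo_b, hi_b = value_range(values_b)
    span = max(hi_a, hi_b) - min(lo_a, lo_b)
    if span == 0:          # both columns hold the same single value
        return 1.0
    shared = min(hi_a, hi_b) - max(lo_a, lo_b)
    return max(shared, 0) / span


salary = [30000, 52000, 75000]       # hypothetical instance values
income = [28000, 60000, 80000]
zip_code = [30601, 30602, 30605]
print(range_overlap(salary, income))
print(range_overlap(salary, zip_code))
```

The second comparison shows the main problem named above: unrelated columns can overlap by accident, so range evidence needs to be combined with other evidence.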

Rule Based Solutions

• Rule-Based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g. two elements match if they have the same name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey
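The example rule, same name and same number of subelements, is short enough to write out directly. The dictionary layout used for schema elements here is an assumption made for the sketch.

```python
def rule_match(e1: dict, e2: dict) -> bool:
    """Hand-crafted rule: two elements match if they have the same name
    and the same number of subelements."""
    return (e1["name"] == e2["name"]
            and len(e1["children"]) == len(e2["children"]))


# Hypothetical schema elements.
address_s1 = {"name": "Address", "children": ["Street", "City", "State", "Zip"]}
address_s2 = {"name": "Address", "children": ["Street", "City", "USState", "PostalCode"]}
address_s3 = {"name": "Address", "children": ["Line1", "Line2"]}

print(rule_match(address_s1, address_s2))
print(rule_match(address_s1, address_s3))
```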

Learner Based Solutions

• Learner-Based: exploit both schema and data.
• Require a lot of training data, but can exploit data instances.
• Combining rule-based and learner-based techniques provides an effective matching solution.

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy.
• More match candidates will be produced if the previous approaches are combined.
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P Rahm E A survey of approaches to automatic schema matching
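A composite matcher can be sketched as a function that runs each matcher independently and then merges the scores. The two matchers, the element layout, and the weighted average used to combine them are assumptions made for this illustration; other combination strategies (maximum, voting, learned weights) fit the same shape.

```python
from difflib import SequenceMatcher


def name_matcher(a: dict, b: dict) -> float:
    """Independent matcher 1: string similarity of the element names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a: dict, b: dict) -> float:
    """Independent matcher 2: 1.0 if the data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_score(a: dict, b: dict, weighted_matchers) -> float:
    """Combine independently computed scores with a weighted average."""
    total_weight = sum(w for _, w in weighted_matchers)
    return sum(w * m(a, b) for m, w in weighted_matchers) / total_weight


zip_s1 = {"name": "Zip", "type": "string"}
zip_s2 = {"name": "Zip", "type": "int"}
matchers = [(name_matcher, 1.0), (type_matcher, 1.0)]
print(composite_score(zip_s1, zip_s2, matchers))
```

Because each matcher is a separate callable, matchers can be added, dropped, or reweighted without touching the others, which is the flexibility attributed to the composite approach.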

LSD (Univ of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema.
• Uses a name matcher and several instance-level matchers.
• The system is trained with sample user inputs and learns patterns and matching rules.
• Mostly instance-oriented, but can use schema information too.
• Also supports user-supplied domain constraints on the global schema.

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies.
• User input is required: the user must provide application-specific match/mismatch relations and must approve or reject matches.
• SKAT matching is used within the ONION architecture for ontology integration.
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances.
• Schemas are transformed into labeled graphs.
• Matching is performed node by node (element-level, 1:1), starting at the top.
• Requires user intervention if no match is found (e.g. to provide a new rule).

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships in which they are involved.
• These pairs are given a match score between 0 and 1.
• The user must specify synonyms, homonyms, and inclusion properties.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the weighted mean from Phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching
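The weighted mean of Phase 2 and the decision of Phase 3 can be sketched as follows. This is a simplified reading: the parameter names, the default weight of 0.5, and the acceptance threshold are assumptions, not values taken from Cupid itself.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity.
    w_struct is the weight given to the structural part (assumed default)."""
    return w_struct * ssim + (1 - w_struct) * lsim


def accept(lsim: float, ssim: float,
           threshold: float = 0.5, w_struct: float = 0.5) -> bool:
    """Decide on a mapping: keep the pair if the weighted mean reaches
    the threshold (the threshold value is an assumption)."""
    return weighted_similarity(lsim, ssim, w_struct) >= threshold


print(weighted_similarity(0.8, 0.4))
print(accept(0.8, 0.4), accept(0.2, 0.4))
```

Raising `w_struct` makes the decision depend more on how the surrounding subtrees match than on the names of the two elements themselves.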

Clio (IBM Almaden and Univ of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema.
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation.
  - Correspondence Engine: identifies matching parts of the schemas or databases.
  - Mapping Generator: generates view definitions that map data in the source schema to data in the target schema.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs.
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches.
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution.
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema.

Bernstein P Rahm E A survey of approaches to automatic schema matching

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies.
  - It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain.
  2. The WSDL is classified into a domain.
  3. The matches are given to the user to accept or reject.
  4. Upon the user's acceptance, the annotations are written to the WSDL.
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs.
3. Matcher Library: provides the schema matching algorithms.

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges, created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework
XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="compass" nillable="true"
                 type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of XML schema. Node labels: Direction, compass, degrees, DirectionCompass; edge label: hasElement]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
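A heavily simplified version of such a conversion can be sketched with the standard library XML parser: every complex type becomes a node, and every element declared inside it becomes a child node reached by a hasElement edge. The function name `xsd_to_edges`, the wrapping `xsd:schema` element, and the single conversion rule are assumptions. The actual MWSAF conversion rules are richer than this.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

# The Direction type, wrapped in a schema element so that it parses on its own.
XSD_TEXT = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def xsd_to_edges(xsd_text: str):
    """Return (parent, 'hasElement', child) edges for each complex type."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(XSD_NS + "complexType"):
        parent = ctype.get("name")
        for elem in ctype.iter(XSD_NS + "element"):
            edges.append((parent, "hasElement", elem.get("name")))
    return edges


for edge in xsd_to_edges(XSD_TEXT):
    print(edge)
```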

MWSAF: Meteor-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Node labels: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema Match: structural similarity, based on sub-concept similarities.
• The getBestMapping function then looks at the Match Scores and determines the set of mappings.

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
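The q-gram idea can be sketched as follows: split each name into overlapping substrings of length q and score the pair by how many they share. The Dice coefficient used for normalization and the default q = 2 are assumptions. The slides do not give the exact formula MWSAF uses.

```python
def qgrams(name: str, q: int = 2) -> set:
    """Set of overlapping substrings of length q, case-insensitive."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over the two q-gram sets, a value in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(ngram_similarity("windSpeed", "WindSpeed"))
print(ngram_similarity("windSpeed", "speedOfWind"))
```

Like the other ElemMatch algorithms, the function returns a value between 0 and 1, so its result can be fed into a combined match score.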

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.
• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user the degree of similarity between the matched concepts of the WSDL and the ontology.
  - Average Service Match: helps to categorize the service.
• We have a machine learning alternative for categorization.

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
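The slides do not spell out how the two measures are computed. One plausible reading, assumed here, is that both average the match scores of the mapped elements, with the concept match dividing by the number of mapped elements and the service match dividing by the total number of elements in the WSDL. The function names and sample scores below are hypothetical.

```python
def average_concept_match(mapped_scores) -> float:
    """Mean match score over the WSDL elements that were mapped
    (assumed definition)."""
    if not mapped_scores:
        return 0.0
    return sum(mapped_scores) / len(mapped_scores)


def average_service_match(mapped_scores, total_wsdl_elements: int) -> float:
    """Sum of match scores divided by all WSDL elements, mapped or not
    (assumed definition)."""
    if total_wsdl_elements == 0:
        return 0.0
    return sum(mapped_scores) / total_wsdl_elements


# Hypothetical result: 2 of 4 WSDL elements were mapped to one ontology.
scores = [0.5, 1.0]
print(average_concept_match(scores))
print(average_service_match(scores, 4))
```

Under this reading, an ontology that matches a few elements very well but leaves most of the WSDL unmapped gets a high concept match and a low service match, which is why the second measure is the one suited to picking a domain.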

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback.
• Real World Analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Conclusion

• It is necessary to automate the matching process.
• Schema matching is very difficult and expensive.
• We have looked at a taxonomy and descriptions of the existing approaches for matching:
  - Schema vs. Instance-level
  - Element vs. Structure-level
  - Language and Constraint-based matchers
• We also discussed several implementations of the matching techniques.

References

• Bernstein P, Rahm E. A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf
• Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf
• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004pdf
• Vassilis C. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions

Page 15: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Semantic Web ServicesSemantic Web Services

bull Problem Web Services are currently searched for using keywords

bull We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently

bull WSDLs are in XML Ontologies in OWL

bull Solution Use schema matching approaches to map between the two different schemas

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Term DefinitionsTerm Definitionsbull Schema a set of elements connected by some

structure

bull Mapping a set of mapping elements each of which indicates that certain elements of schema s1 are mapped to certain elements in s2

bull Mapping Expression Tells how s1 and s2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

bull Instance vs Schema matching approaches can consider instance data or schema-level information

bull Element vs Structure matching match can be performed for individual schema elements or combinations of elements

bull Language vs Constraint linguistic (names) or constraint-based (keys and relationships)

bull Matching Cardinality match result may relate one or more elements of one schema to one or more elements of another

bull Auxiliary Information matcher relies on other information besides the input schemas such as dictionaries user input global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

Schema Matching Approaches

Individual Matchers Combining Matchers

Schema-only

Structure LevelElement Level

InstanceContents

ConstraintLinguistic Constraint

hellip hellip hellip

Element Level

ConstraintLinguistic

hellip hellip

Hybrid Matchers Composite Matchers

Manual Composition Automatic Composition

Further Criteria -Match Cardinality -Auxiliary information usedhellip

bullName SimilaritybullDescription SimilaritybullGlobal Namespaces

bullWord Frequency

bullGroup Matching

bullType SimilaritybullKey Properties

bullValue Pattern and Ranges

Sample Approaches

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level MatchersSchema Level Matchers

bull Consider schema information instead of instance data Name Description Data Type Relationship Types Constraints Structure

bull Often produces multiple candidates and estimates a degree of similarity for each

1 Granularity of match (element level vs structure level)2 Match Cardinality3 Linguistic Approaches Name or Description Matching4 Constraint-Based Approaches5 Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  • element names, data types, structures, and subelements
  • e.g., two elements match if they have the same name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey
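
The example rule on this slide (same name and same number of subelements) can be sketched as follows. The `Element` class and the case-insensitive name comparison are assumptions made for illustration, not part of any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)

def same_name_and_arity(a, b):
    """Hand-crafted rule: equal names (ignoring case) and equal subelement counts."""
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)

def rule_match(s1, s2, rule=same_name_and_arity):
    """Return every (S1 name, S2 name) pair that satisfies the rule."""
    return [(a.name, b.name) for a in s1 for b in s2 if rule(a, b)]

s1 = [Element("Address", [Element("Street"), Element("City")]), Element("Name")]
s2 = [Element("address", [Element("Street"), Element("Zip")]),
      Element("Name", [Element("First")])]

matches = rule_match(s1, s2)
print(matches)
```

Here the two `Name` elements do not match, because one has a subelement and the other has none.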

Learner Based Solutions

• Learner-based: exploit both schema and data

• Require a lot of training data, but can exploit instance data

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two combination methods
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P Rahm E A survey of approaches to automatic schema matching
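
A composite matcher can be sketched as a weighted average of independently executed matchers. The two toy matchers and the equal weights below are assumptions chosen for illustration; the slide does not prescribe how the results are combined.

```python
def exact_name(a, b):
    """1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0

def shared_prefix(a, b):
    """Length of the common prefix divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def composite(matchers, weights, a, b):
    """Combine the scores of independently run matchers by weighted average."""
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total

score = composite([exact_name, shared_prefix], [1, 1], "ShipTo", "ShipAddress")
print(round(score, 3))
```

Because each matcher runs on its own, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to the composite approach.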

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, and it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations, and the user must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (e.g., to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: linguistic, element-level
  - categorizes elements based on name, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching
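
The last two phases can be illustrated with a small sketch: a weighted mean of linguistic and structural similarity, followed by a threshold-based choice of mapping. The weight, the threshold, the selection strategy, and the similarity values below are all assumptions for illustration; the slide does not give the parameters or the selection procedure.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim

def choose_mapping(pairs, w_struct=0.5, threshold=0.5):
    """pairs: {(s1_element, s2_element): (lsim, ssim)}.
    Keep, for each S1 element, the best S2 element at or above the threshold."""
    best = {}
    for (a, b), (lsim, ssim) in pairs.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > best.get(a, (None, 0.0))[1]:
            best[a] = (b, wsim)
    return {a: b for a, (b, _) in best.items()}

# Hypothetical similarity values for a few element pairs.
pairs = {
    ("Zip", "PostalCode"): (0.4, 0.9),
    ("Zip", "City"): (0.1, 0.9),
    ("State", "USState"): (0.8, 0.9),
    ("Birthdate", "CPhone"): (0.0, 0.3),
}

mapping = choose_mapping(pairs)
print(mapping)
```

With these numbers, structural similarity lifts Zip/PostalCode above the threshold even though the names differ.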

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching
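
The idea of feeding an initial name-based match into a structural matcher can be sketched as an iterative propagation over pairs of nodes. This is a deliberately simplified illustration, not the published algorithm: it omits propagation coefficients, uses a fixed number of rounds instead of a convergence test, and the graphs and initial scores are hypothetical.

```python
from itertools import product

def flood(edges_a, edges_b, initial, rounds=10):
    """edges_*: lists of (source, label, target); initial: {(a, b): similarity}."""
    # Two node pairs are neighbours when both graphs contain an edge with the
    # same label connecting the respective nodes.
    neighbours = {}
    for (s1, l1, t1), (s2, l2, t2) in product(edges_a, edges_b):
        if l1 == l2:
            neighbours.setdefault((s1, s2), set()).add((t1, t2))
            neighbours.setdefault((t1, t2), set()).add((s1, s2))
    pairs = set(initial) | set(neighbours)
    sim = {p: initial.get(p, 0.0) for p in pairs}
    for _ in range(rounds):
        # Each pair keeps its initial and current score and receives the
        # current scores of its neighbours; then all scores are normalized.
        raw = {p: initial.get(p, 0.0) + sim[p]
                  + sum(sim[q] for q in neighbours.get(p, ()))
               for p in pairs}
        top = max(raw.values()) or 1.0
        sim = {p: value / top for p, value in raw.items()}
    return sim

graph_a = [("Cust", "has", "Name"), ("Cust", "has", "Addr")]
graph_b = [("Customer", "has", "CName"), ("Customer", "has", "CAddress")]
initial = {  # hypothetical output of a name matcher
    ("Cust", "Customer"): 0.5,
    ("Name", "CName"): 0.8,
    ("Addr", "CAddress"): 0.8,
    ("Name", "CAddress"): 0.1,
    ("Addr", "CName"): 0.1,
}

scores = flood(graph_a, graph_b, initial)
```

In this toy run the pair (Cust, Customer) ends up with the highest score because all of its neighbouring pairs reinforce it, even though its initial name-based score was modest.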

Delta (MITRE)

• Uses attribute descriptions to determine attribute matches

• The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness of XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find the mappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1"
                     name="compass" nillable="true"
                     type="xsd1:DirectionCompass" />
        <xsd:element maxOccurs="1" minOccurs="1"
                     name="degrees" type="xsd:int" />
      </xsd:sequence>
    </xsd:complexType>

[Figure: SchemaNode representation of XML schema. Nodes: Direction, compass, degrees, DirectionCompass; edge label: hasElement]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
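
As a rough illustration of this kind of conversion, the sketch below parses the complex type with Python's standard `xml.etree.ElementTree` and emits one `hasElement` edge per child element. The enclosing `xsd:schema` wrapper, the function name, and the edge-tuple representation are assumptions; the actual MWSAF conversion rules are not spelled out on the slide.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# The slide's fragment, wrapped in a schema element so that the xsd prefix
# is declared and the text parses on its own.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""

def schema_graph_edges(xsd_text):
    """Return (complex type, 'hasElement', element name) edges in document order."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(XSD + "complexType"):
        for elem in ctype.iter(XSD + "element"):
            edges.append((ctype.get("name"), "hasElement", elem.get("name")))
    return edges

edges = schema_graph_edges(SCHEMA)
print(edges)
```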

MWSAF: Meteor-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent" />
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="#Speed" />
    </daml:Property>

[Figure: SchemaGraph representation of part of ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Mapping

• Measures of the match score

  - Element-level match: linguistic similarity of two concepts based on names. Uses WordNet to check for synonyms. Abbreviations are also checked

  - Schema match: structural similarity, i.e., sub-concept similarities

• The getBestMapping function then looks at the match scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter stemming, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
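
A q-gram comparison that returns a value between 0 and 1 can be sketched as below. The Dice coefficient over character trigrams is an assumed scoring choice for illustration; the slide does not state which formula the NGram matcher uses.

```python
def qgrams(name, q=3):
    """Set of lowercase character q-grams of a name."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, in the range [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        # Names shorter than q characters: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(round(ngram_similarity("windSpeed", "wind_speed"), 3))
print(ngram_similarity("windSpeed", "degrees"))
```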

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology

  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback

• Real-world analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches for matching

  - Schema vs. instance-level

  - Element vs. structure-level

  - Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Vassilis C. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Term Definitions

• Schema: a set of elements connected by some structure

• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2

• Mapping expression: specifies how the S1 and S2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:
• Cust.C = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

  S1 Elements    S2 Elements
  Cust           Customer
    C              CustID
    CName          Company
    FirstName      Contact
    LastName       Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching
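
These mapping elements can be read as small functions from an S1 record to an S2 field. The sketch below encodes them that way; the list-of-pairs representation, the space used as separator in the concatenation, and the sample record are assumptions for illustration.

```python
# Each mapping element pairs an S2 (Customer) field with a mapping
# expression over an S1 (Cust) record.
mapping = [
    ("CustID", lambda cust: cust["C"]),
    ("Contact", lambda cust: cust["FirstName"] + " " + cust["LastName"]),
    ("Company", lambda cust: cust["CName"]),
]

def apply_mapping(mapping, record):
    """Build an S2 record by evaluating every mapping expression."""
    return {target: expr(record) for target, expr in mapping}

# Hypothetical S1 instance.
cust = {"C": 42, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}
customer = apply_mapping(mapping, cust)
print(customer)
```

Note that Customer.Phone has no counterpart in S1, so it stays unmapped.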

Example

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. schema: matching approaches can consider instance data or schema-level information

• Element vs. structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs. constraint: linguistic (names) or constraint-based (keys and relationships)

• Matching cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary information: the matcher relies on information besides the input schemas, such as dictionaries, user input, or global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers: Manual Composition, Automatic Composition

Further criteria: match cardinality, auxiliary information used, ...

Sample approaches: name similarity, description similarity, global namespaces, word frequency, group matching, type similarity, key properties, value patterns and ranges

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full structure match

  S1 Elements    S2 Elements
  Address        CustAddress
    Street         Street
    City           City
    State          USState
    Zip            PostalCode

• Partial structure match

  S1 Elements    S2 Elements
  AccountOwner   Customer
    Name           Cname
    Address        CAddress
    Birthdate      CPhone
    TaxExempt

• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

  Local match cardinality       S1 element(s)      S2 element(s)         Matching expression
  1:1, element level            Price              Amount                Amount = Price
  n:1, element level            Price, Tax         Cost                  Cost = Price*(1+Tax/100)
  1:n, element level            Name               FirstName, LastName   FirstName, LastName = Name
  n:m, element level            B.Title, B.PuNo,   A.Book, A.Publisher   A.Book, A.Publisher = Select B.Title, P.Name
  (also n:1, structure level)   P.PuNo, P.Name                           From B, P Where B.PuNo = P.PuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching
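
The non-trivial cardinality cases in the table can be written out as small functions. This is only a sketch: the rule for splitting a name at the first space, the dictionary-based join, and the sample rows are assumptions, since the table gives the expressions but not how they would be executed.

```python
def cost(price, tax):
    """n:1 element level: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax / 100)

def split_name(name):
    """1:n element level: derive FirstName and LastName from Name.
    Assumes a 'First Last' layout, split at the first space."""
    first, _, last = name.partition(" ")
    return first, last

def books_with_publishers(b_rows, p_rows):
    """n:m element level: Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo."""
    publisher_by_no = {p["PuNo"]: p["Name"] for p in p_rows}
    return [
        {"Book": b["Title"], "Publisher": publisher_by_no[b["PuNo"]]}
        for b in b_rows
        if b["PuNo"] in publisher_by_no
    ]

# Hypothetical instance data.
b_rows = [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Orphan", "PuNo": 9}]
p_rows = [{"PuNo": 1, "Name": "Example Press"}]
joined = books_with_publishers(b_rows, p_rows)
print(joined)
```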

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Only a few works on complex matching have been done
  • Some hard-code complex matches into rules
  • Some rely on a domain-specific ontology

• We need domain knowledge to accurately perform complex matching

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name matching: match elements with similar names
• Description matching: match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein P Rahm E A survey of approaches to automatic schema matching
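
The first five similarity criteria can be sketched as a cascade that tries the strictest test first. The score assigned to each level, the tiny synonym and hypernym tables, and the crude plural-stripping stemmer are all assumptions for illustration; a real matcher would rely on a thesaurus and a proper stemmer.

```python
from difflib import SequenceMatcher

SYNONYMS = {frozenset({"car", "automobile"})}  # hypothetical thesaurus entries
HYPERNYMS = {"suv": "car"}                     # hypothetical is-a table

def stem(word):
    """Deliberately crude stemmer: strips a single plural 's' only."""
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word

def name_similarity(a, b):
    """Return a score in [0, 1]; the level values are arbitrary."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0                               # 1. equal names
    if stem(a) == stem(b):
        return 0.9                               # 2. equal after stemming
    if frozenset({a, b}) in SYNONYMS:
        return 0.8                               # 3. synonyms
    if HYPERNYMS.get(a) == b or HYPERNYMS.get(b) == a:
        return 0.7                               # 4. hypernyms
    # 5. similarity based on common substrings
    return 0.6 * SequenceMatcher(None, a, b).ratio()

print(name_similarity("Orders", "Order"))
print(name_similarity("SUV", "Car"))
```

User-provided name matches (criterion 6) could be checked first with a lookup table, which is omitted here.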

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example

  S1: empn // employee name

  S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching
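
The simple end of that range, keyword extraction followed by a set comparison, can be sketched as below. The stopword list and the use of Jaccard overlap are assumptions for illustration.

```python
STOPWORDS = {"of", "the", "a", "an", "for"}  # assumed minimal stopword list

def keywords(comment):
    """Lowercase the comment and drop stopwords."""
    return {w for w in comment.lower().split() if w not in STOPWORDS}

def description_similarity(c1, c2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(c1), keywords(c2)
    union = k1 | k2
    return len(k1 & k2) / len(union) if union else 0.0

print(description_similarity("employee name", "name of employee"))
```

For the slide's example, the element names empn and name differ, but their comments reduce to the same keyword set, so the descriptions match fully.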


bull Vassilis C Integrating XML Data Sources using RDFS Schemas The ICS-FORTH Semantic Web Integration Middleware (SWIM) Dagsthul SeminarftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

QuestionsQuestions


Term Definitions

• Schema: a set of elements connected by some structure

• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2

• Mapping Expression: tells how the S1 and S2 elements are related

Bernstein P Rahm E A survey of approaches to automatic schema matching

Example

A mapping between S1 and S2 might contain these elements:

• Cust.C# = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company

S1 Elements     S2 Elements
Cust            Customer
  C#              CustID
  CName           Company
  FirstName       Contact
  LastName        Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching
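The mapping elements above can be read as small functions from an S1 record to an S2 field. The following is a minimal sketch of that idea, not something taken from the survey: the dictionary encoding, the `translate` helper, and the choice of a single space as the separator used by Concatenate are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical encoding: each mapping element pairs a target (S2 Customer)
# field with a mapping expression evaluated over one S1 Cust record.
MappingExpr = Callable[[Dict[str, str]], str]

MAPPING: Dict[str, MappingExpr] = {
    # Cust.C# = Customer.CustID
    "CustID": lambda cust: cust["C#"],
    # Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
    # (joining with a single space is an assumption)
    "Contact": lambda cust: f'{cust["FirstName"]} {cust["LastName"]}',
    # Cust.CName = Customer.Company
    "Company": lambda cust: cust["CName"],
}


def translate(cust: Dict[str, str]) -> Dict[str, str]:
    """Apply every mapping expression to one S1 record."""
    return {target: expr(cust) for target, expr in MAPPING.items()}
```

Note that Customer.Phone has no mapping element, so it is simply absent from the translated record.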

Example

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Approaches

• Instance vs Schema: matching approaches can consider instance data or schema-level information

• Element vs Structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs Constraint: linguistic (names) or constraint-based (keys and relationships)

• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary Information: the matcher relies on other information besides the input schemas, such as dictionaries, user input, and global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition

Further Criteria: Match Cardinality, Auxiliary information used, …

Sample Approaches: Name Similarity, Description Similarity, Global Namespaces, Word Frequency, Group Matching, Type Similarity, Key Properties, Value Pattern and Ranges

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: Name, Description, Data Type, Relationship Types, Constraints, Structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs structure level)
2. Match Cardinality
3. Linguistic Approaches: Name or Description Matching
4. Constraint-Based Approaches
5. Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
  - Full Structure Match
  - Partial Structure Match

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

S1 Elements     S2 Elements
Address         CustAddress
  Street          Street
  City            City
  State           USState
  Zip             PostalCode

S1 Elements     S2 Elements
AccountOwner    Customer
  Name            Cname
  Address         CAddress
  Birthdate       CPhone
  TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinality        S1 Element(s)                      S2 Element(s)          Matching Expression
1:1, element level             Price                              Amount                 Amount = Price
n:1, element level             Price, Tax                         Cost                   Cost = Price*(1+Tax/100)
1:n, element level             Name                               FirstName, LastName    FirstName, LastName = Name
n:m, element level             B.Title, B.PuNo, P.PuNo, P.Name    A.Book, A.Publisher    A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
(also n:1, structure level)

Bernstein P Rahm E A survey of approaches to automatic schema matching
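To make the four cardinality cases concrete, here is a small sketch that evaluates each matching expression from the table on plain Python values. The function names are hypothetical, and splitting Name at the first space in the 1:n case is an assumption, since the slide only states that FirstName and LastName are derived from Name.

```python
from typing import Dict, List, Tuple


def amount_from_price(price: float) -> float:
    """1:1, element level: Amount = Price."""
    return price


def cost_from_price_and_tax(price: float, tax_percent: float) -> float:
    """n:1, element level: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax_percent / 100)


def split_name(name: str) -> Tuple[str, str]:
    """1:n, element level: derive FirstName and LastName from Name.

    Assumption: the name is "First Last", split at the first space.
    """
    first, _, last = name.partition(" ")
    return first, last


def books_with_publishers(books: List[Dict], publishers: List[Dict]) -> List[Dict]:
    """n:m: Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo."""
    name_by_puno = {p["PuNo"]: p["Name"] for p in publishers}
    return [
        {"Book": b["Title"], "Publisher": name_by_puno[b["PuNo"]]}
        for b in books
        if b["PuNo"] in name_by_puno
    ]
```

The last case shows why n:m matches are harder to discover automatically: the expression is a join across two source elements rather than a function of a single value.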

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Only a few works on complex matching have been done
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element or structure-level
• Cardinality is not limited to 1:1

Bernstein P Rahm E A survey of approaches to automatic schema matching
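A few of the similarity criteria listed above can be sketched as a simple cascade: exact match after normalization, then a synonym lookup, then a common-substring score. This is only an illustration. The tiny `SYNONYMS` table is made up, the normalization is a crude stand-in for real stemming, and the substring score uses `difflib.SequenceMatcher` from the Python standard library rather than any technique named in the survey.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus.
SYNONYMS = {
    frozenset({"car", "automobile"}),
    frozenset({"zip", "postalcode"}),
}


def normalize(name: str) -> str:
    """Lowercase and drop separators; a crude stand-in for stemming."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for two element names."""
    na, nb = normalize(a), normalize(b)
    if na == nb:                          # equality of (normalized) names
        return 1.0
    if frozenset({na, nb}) in SYNONYMS:   # equality of synonyms
        return 1.0
    # similarity based on common substrings
    return SequenceMatcher(None, na, nb).ratio()
```

With this sketch, ShipTo and Ship2 score below 1.0 but well above unrelated names, because they share the substring "ship".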

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

  S1: empn // employee name

  S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas

  e.g., in E-Commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example (figure: schemas S, S1, and S2)

  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems

1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2. Similarity values may depend on the domain, e.g., Salary and Income may be identical in a payroll application but not in a tax reporting application

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred Method: Linguistic Characterization
  2. Constraint-based Characterization, e.g., ranges
  3. Auxiliary Information
  4. Also uses both rule-based and learner-based techniques

• Main Problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-Based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey
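The example rule on this slide (same name and same number of subelements) is simple enough to write down directly. The sketch below assumes a hypothetical `Element` class for schema elements and compares names case-insensitively, which is an added assumption rather than part of the rule as stated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical schema element: a name plus its subelements."""
    name: str
    children: List["Element"] = field(default_factory=list)


def rule_match(a: Element, b: Element) -> bool:
    """Hand-crafted rule: same name and same number of subelements.

    Assumption: names are compared case-insensitively.
    """
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)
```

A rule like this is cheap and needs no training data, but it is brittle: Address and CustAddress with identical subelements would not match.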

Learner Based Solutions

• Learner-Based: exploit both schema and data

• Requires a lot of training data but can exploit data

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two Combination Methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P Rahm E A survey of approaches to automatic schema matching
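A composite matcher can be sketched as a function that runs several independent matchers and merges their scores. In the sketch below, the two toy matchers (`exact` and `prefix`) and the weighted-average combination are assumptions chosen for illustration; the survey does not prescribe a particular combination function.

```python
from typing import Callable, Sequence, Tuple

# A matcher takes two element names and returns a score in [0, 1].
Matcher = Callable[[str, str], float]


def exact(a: str, b: str) -> float:
    """Toy matcher: 1.0 if the names are equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix(a: str, b: str) -> float:
    """Toy matcher: length of the common prefix over the longer name."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    common = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        common += 1
    return common / max(len(a), len(b))


def composite_score(a: str, b: str,
                    matchers: Sequence[Tuple[Matcher, float]]) -> float:
    """Run each matcher independently and take the weighted average."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(m(a, b) * weight for m, weight in matchers) / total_weight
```

Because each matcher is executed independently, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to the composite approach.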


LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, and it learns patterns and matching rules

• Mostly instance-oriented but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations, and the user must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element and Structural-Level matches

Phase 1: Linguistic Element-Level
  - categorizes elements based on name, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs a bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching
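The weighted mean from phase 2 and the decision in phase 3 can be illustrated with a short sketch. The parameter names (`lsim`, `ssim`, `w_struct`), the default weight, and the acceptance threshold are assumptions for illustration and are not values taken from these slides.

```python
from typing import Dict, List, Tuple


def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(pairs: Dict[Tuple[str, str], Tuple[float, float]],
                   threshold: float = 0.6) -> List[Tuple[str, str]]:
    """Keep element pairs whose weighted similarity reaches the threshold.

    `pairs` maps (s1_element, s2_element) to (lsim, ssim).
    The threshold is a hypothetical acceptance cut-off.
    """
    return [pair for pair, (lsim, ssim) in pairs.items()
            if weighted_similarity(lsim, ssim) >= threshold]
```

Raising `w_struct` makes the decision depend more on how the surrounding tree structure matches and less on the names themselves.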

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three Components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: used to identify matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph Matching Algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching


MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL File

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL File with semantic annotations

MWSAF Architecture

Main Components of the System

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithms

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework
XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

Figure: SchemaNode representation of XML schema (node Direction, with hasElement edges to compass and degrees; DirectionCompass is the type of compass)

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
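The conversion shown on this slide can be approximated in a few lines. The sketch below is not the MWSAF parser: it assumes only one rule (each complexType becomes a node, and each element inside it becomes a node linked by a hasElement edge), uses the 2001 XML Schema namespace, and invents the `urn:example` namespace for the `xsd1` prefix so that the snippet is self-contained. It relies on `xml.etree.ElementTree` from the Python standard library.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The slide's complexType, wrapped in a schema element so it parses alone.
XSD_SNIPPET = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsd1="urn:example">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""

Edge = Tuple[str, str, str]


def to_schema_graph(xsd_text: str) -> Tuple[Set[str], List[Edge]]:
    """Return (nodes, edges) for a simplified SchemaGraph.

    Assumed rule: complexType -> node; each contained element -> node
    connected to its complexType by a hasElement edge.
    """
    root = ET.fromstring(xsd_text)
    nodes: Set[str] = set()
    edges: List[Edge] = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        nodes.add(parent)
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            child = elem.get("name")
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges
```

Running `to_schema_graph(XSD_SNIPPET)` yields the Direction node with two hasElement edges, mirroring the figure.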

MWSAF: METEOR-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

Figure: SchemaGraph representation of part of the ontology (node WindEvent, with hasProperty edges to windDirection and windSpeed; Speed is the range of windSpeed)

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score

  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked

  - Schema Match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
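The slides do not say how getBestMapping chooses the map set, so the following is only one plausible reading: for each WSDL element, keep the ontology concept with the highest Match Score, provided the score clears a threshold. The function name `get_best_mapping`, the score dictionary layout, and the threshold value are all hypothetical.

```python
from typing import Dict, Tuple


def get_best_mapping(scores: Dict[Tuple[str, str], float],
                     threshold: float = 0.5) -> Dict[str, Tuple[str, float]]:
    """Pick, per WSDL element, the best-scoring ontology concept.

    `scores` maps (wsdl_element, ontology_concept) to a Match Score in [0, 1].
    Pairs scoring below `threshold` (a hypothetical cut-off) are ignored.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for (element, concept), score in scores.items():
        if score < threshold:
            continue
        if element not in best or score > best[element][1]:
            best[element] = (concept, score)
    return best
```

Elements with no concept above the threshold are left unmapped, which leaves them for the user to annotate manually.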

MWSAF Matching Techniques: ElemMatch

• Name and String Matching algorithms

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter Stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
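As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share and returns a value between 0 and 1. The Dice coefficient used for normalization and the default q = 3 are assumptions; the slides do not state which formula MWSAF uses.

```python
from typing import Set


def qgrams(name: str, q: int = 3) -> Set[str]:
    """Set of lowercase q-grams of a name (the whole name if shorter than q)."""
    s = name.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over shared q-grams; result lies in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For example, windSpeed and windDirection share only the q-grams from "wind", so they get a low but non-zero score, while wind and rain share none.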

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

• Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology

  - Average Service Match: helps to categorize the service

• We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework
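The slides name the two measures without giving formulas. One plausible reading, sketched below as an assumption, is that the Average Concept Match averages the Match Scores over the mapped concepts only, while the Average Service Match divides the same sum by the total number of concepts in the WSDL, so that a service with many unmapped concepts scores lower against that ontology.

```python
from typing import Sequence


def average_concept_match(match_scores: Sequence[float]) -> float:
    """Mean Match Score over the mapped concepts (assumed definition)."""
    if not match_scores:
        return 0.0
    return sum(match_scores) / len(match_scores)


def average_service_match(match_scores: Sequence[float],
                          total_wsdl_concepts: int) -> float:
    """Sum of Match Scores over all WSDL concepts (assumed definition)."""
    if total_wsdl_concepts <= 0:
        return 0.0
    return sum(match_scores) / total_wsdl_concepts
```

Under this reading, the ontology with the highest Average Service Match would suggest the domain of the service.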


Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More Issues

• If we require user acceptance for our matches, then what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and the descriptions of the existing approaches for matching
  - Schema vs Instance-level
  - Element vs Structure-level
  - Language and Constraint based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Christophides, V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 18: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

ExampleExample

A mapping between s1 and s2 might contain these elementsbull CustC=CustomerCustIDbull Concatenate(CustFirstName CustLastName) = Customercontactbull CustCName = CustomerCompany

S1 Elements S2 Elements

Cust Customer

C CustID

CName Company

FirstName Contact

LastName Phone

Bernstein P Rahm E A survey of approaches to automatic schema matching

ExampleExample

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

bull Instance vs Schema matching approaches can consider instance data or schema-level information

bull Element vs Structure matching match can be performed for individual schema elements or combinations of elements

bull Language vs Constraint linguistic (names) or constraint-based (keys and relationships)

bull Matching Cardinality match result may relate one or more elements of one schema to one or more elements of another

bull Auxiliary Information matcher relies on other information besides the input schemas such as dictionaries user input global schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

Schema Matching Approaches

Individual Matchers Combining Matchers

Schema-only

Structure LevelElement Level

InstanceContents

ConstraintLinguistic Constraint

hellip hellip hellip

Element Level

ConstraintLinguistic

hellip hellip

Hybrid Matchers Composite Matchers

Manual Composition Automatic Composition

Further Criteria -Match Cardinality -Auxiliary information usedhellip

bullName SimilaritybullDescription SimilaritybullGlobal Namespaces

bullWord Frequency

bullGroup Matching

bullType SimilaritybullKey Properties

bullValue Pattern and Ranges

Sample Approaches

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level MatchersSchema Level Matchers

bull Consider schema information instead of instance data Name Description Data Type Relationship Types Constraints Structure

bull Often produces multiple candidates and estimates a degree of similarity for each

1 Granularity of match (element level vs structure level)2 Match Cardinality3 Linguistic Approaches Name or Description Matching4 Constraint-Based Approaches5 Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., value ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used

• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey
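The example rule on this slide can be written down directly. This is a minimal sketch; the `Element` class and the sample elements are assumptions made for illustration and do not come from any of the surveyed systems.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Element:
    """A schema element with a name and a list of child elements."""
    name: str
    subelements: List["Element"] = field(default_factory=list)

def rule_match(a: Element, b: Element) -> bool:
    """Rule: same name and same number of subelements."""
    return a.name == b.name and len(a.subelements) == len(b.subelements)

bill_to_1 = Element("BillTo", [Element("Name"), Element("Address")])
bill_to_2 = Element("BillTo", [Element("Name"), Element("Address")])
bill_to_3 = Element("BillTo", [Element("Phone")])
```

Here `rule_match(bill_to_1, bill_to_2)` holds, while `bill_to_3` fails the rule because it has a different number of subelements.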

Learner Based Solutions

• Learner-based: exploit both schema information and data

• Require a lot of training data, but can exploit data instances

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance
  2. Composite: combines the results of independently executed matchers. More flexible; can be done automatically or manually

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
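A composite matcher can be sketched as a function that merges the scores of independently run matchers. The weighted average below is only one possible combination strategy, chosen here as an assumption; the matcher names and weights are hypothetical.

```python
def composite_score(scores, weights):
    """Weighted average of per-matcher similarity scores in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical results of two independent matchers for one element pair.
scores = {"name": 0.8, "data_type": 0.5}
# Hypothetical weights: the name matcher is trusted twice as much.
weights = {"name": 2.0, "data_type": 1.0}

combined = composite_score(scores, weights)
```

Because the matchers run independently, one can be added or removed by changing only the two dictionaries, which is the flexibility the slide attributes to composite matchers.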

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, from which it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input is required: the user must provide application-specific match/mismatch relations and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (e.g., to provide a new rule)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
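The weighted mean in phase 2 and the decision in phase 3 can be illustrated with a few lines of code. This is a sketch under stated assumptions, not Cupid's implementation: the weight `w_struct`, the `threshold`, and the function names are hypothetical.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim

def accept(lsim, ssim, w_struct=0.5, threshold=0.5):
    """Phase 3 stand-in: keep the pair if the weighted mean reaches the threshold."""
    return weighted_similarity(lsim, ssim, w_struct) >= threshold

# A pair with strong linguistic but weaker structural similarity.
wsim = weighted_similarity(lsim=0.8, ssim=0.4)
```

Raising `w_struct` makes the decision depend more on structure than on names, which matters when element names are abbreviated or inconsistent.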

Clio (IBM Almaden and Univ of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions that map data in the source schema to data in the target schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps describe services semantically and aids efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain ontologies
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithms

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges, created using conversion functions

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework

MWSAF: XML to SchemaGraph Conversion Rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

Figure: SchemaGraph representation of the XML schema (labels: Direction, degrees, DirectionCompass, hasElement, compass)

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
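The conversion on this slide can be approximated in a few lines. The sketch below assumes one simplified rule, namely that every element declared under a complex type becomes a node joined to the complex type's node by a hasElement edge; MWSAF's actual conversion functions may differ. The function name `to_schema_graph`, the `xsd1` namespace URI, and the use of the 2001 XML Schema namespace are assumptions.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"  # assumed namespace URI

def to_schema_graph(xsd_text):
    """Return (nodes, edges) for the complex types in an XML Schema document."""
    root = ET.fromstring(xsd_text)
    nodes, edges = set(), []
    for ctype in root.iter(f"{{{XSD}}}complexType"):
        parent = ctype.get("name")
        nodes.add(parent)
        # Simplified rule: each nested element becomes a child node.
        for elem in ctype.iter(f"{{{XSD}}}element"):
            child = elem.get("name")
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges

SAMPLE = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example:types">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element name="compass" type="xsd1:DirectionCompass" nillable="true"/>
      <xsd:element name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

nodes, edges = to_schema_graph(SAMPLE)
```

Keeping the graph as plain node and edge collections means the same matcher can later be applied to graphs produced from ontologies.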

MWSAF: Ontology to SchemaGraph Conversion Rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

Figure: SchemaGraph representation of part of the ontology (labels: WindEvent, windDirection, Speed, hasProperty, windSpeed)

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework

Mapping

• Measures of the match score:

  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked

  - Schema match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the match scores and determines a set of mappings

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
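As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share and returns a value between 0 and 1. The Dice coefficient and the default `q=3` are assumptions chosen for the example; the slides do not state which formula MWSAF uses.

```python
def qgrams(name, q=3):
    """Set of lowercase substrings of length q."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0  # a name shorter than q has no q-grams
    return 2 * len(ga & gb) / (len(ga) + len(gb))

same = ngram_similarity("windSpeed", "windSpeed")
close = ngram_similarity("windSpeed", "wind_speed")
unrelated = ngram_similarity("abc", "xyz")
```

Identical names score 1.0 and names with no shared q-grams score 0.0; near-misses such as an inserted underscore land in between.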

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from each mapping:

  - Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology

  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
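The slide does not give formulas for the two measures, so the sketch below rests on an assumption: the concept match is averaged over the mapped concepts only, while the service match is averaged over all concepts in the WSDL, so that unmapped concepts lower it. The function names, domains, and scores are hypothetical.

```python
def average_concept_match(mapping):
    """Mean match score over the concepts that were actually mapped."""
    return sum(mapping.values()) / len(mapping) if mapping else 0.0

def average_service_match(mapping, total_concepts):
    """Mean match score over every concept in the WSDL (unmapped ones count as 0)."""
    return sum(mapping.values()) / total_concepts if total_concepts else 0.0

def best_domain(mappings, total_concepts):
    """Pick the ontology whose mapping has the highest average service match."""
    return max(mappings, key=lambda d: average_service_match(mappings[d], total_concepts))

# Hypothetical mappings of one 4-concept WSDL against two ontologies.
mappings = {
    "weather": {"windSpeed": 0.9, "windDirection": 0.7},
    "geography": {"windDirection": 0.4},
}
domain = best_domain(mappings, total_concepts=4)
```

Under this assumption, a mapping that covers few concepts very well can still lose to one that covers more of the service.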

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback

• Real-world analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches to matching:

  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf

• Christophides V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions?


Example (figure)

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Classification of Schema Matching Approaches

• Instance vs. schema: matching approaches can consider instance data or schema-level information

• Element vs. structure matching: a match can be performed for individual schema elements or for combinations of elements

• Language vs. constraint: matching can be linguistic (names) or constraint-based (keys and relationships)

• Matching cardinality: a match result may relate one or more elements of one schema to one or more elements of another

• Auxiliary information: a matcher may rely on information besides the input schemas, such as dictionaries, user input, and global schemas

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level
        Linguistic (sample approaches: name similarity, description similarity, global namespaces)
        Constraint-based (sample approaches: type similarity, key properties)
      Structure Level
        Constraint-based (sample approach: graph matching)
    Instance/Contents
      Element Level
        Linguistic (sample approach: word frequency)
        Constraint-based (sample approach: value pattern and ranges)
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition

Further criteria: match cardinality, auxiliary information used, ...

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure

• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2

• The match comparison can be based on the name, description, or data type of the element

• Example of name-based element-level matching: Address = CustomerAddress

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2

• Full structure match
  S1 elements: Address (Street, City, State, Zip)
  S2 elements: CustAddress (Street, City, USState, PostalCode)

• Partial structure match
  S1 elements: AccountOwner (Name, Address, Birthdate, TaxExempt)
  S2 elements: Customer (Cname, CAddress, CPhone)

• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
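The difference between a full and a partial structure match can be made concrete with a small check over child-element correspondences. This is a sketch only: the function name `structure_match` and the assumed correspondences (for example Name with Cname) are illustrative, not taken from the survey.

```python
def structure_match(s1_children, s2_children, pairs):
    """Classify a structure match from 1:1 correspondences between children."""
    valid = [(a, b) for a, b in pairs if a in s1_children and b in s2_children]
    matched1 = {a for a, _ in valid}
    matched2 = {b for _, b in valid}
    if not valid:
        return "none"
    if matched1 == set(s1_children) and matched2 == set(s2_children):
        return "full"  # every child on both sides has a counterpart
    return "partial"

full = structure_match(
    ["Street", "City", "State", "Zip"],
    ["Street", "City", "USState", "PostalCode"],
    [("Street", "Street"), ("City", "City"), ("State", "USState"), ("Zip", "PostalCode")],
)
partial = structure_match(
    ["Name", "Address", "Birthdate", "TaxExempt"],
    ["Cname", "CAddress", "CPhone"],
    [("Name", "Cname"), ("Address", "CAddress")],  # assumed correspondences
)
```

In the second call, Birthdate, TaxExempt, and CPhone have no counterpart, so the two structures match only partially.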

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local match cardinality      S1 element(s)                     S2 element(s)         Matching expression
1:1, element level           Price                             Amount                Amount = Price
n:1, element level           Price, Tax                        Cost                  Cost = Price * (1 + Tax/100)
1:n, element level           Name                              FirstName, LastName   FirstName, LastName = Name
n:m, element level           B.Title, B.PuNo, P.PuNo, P.Name   A.Book, A.Publisher   A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
(also n:1, structure level)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
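Two of the matching expressions in the table translate directly into code. The n:1 case follows the formula on the slide; the 1:n case needs a splitting rule that the slide does not specify, so splitting at the first space is an assumption, as are the function names.

```python
def cost(price, tax_percent):
    """n:1 match: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)

def split_name(name):
    """1:n match: derive FirstName and LastName from Name (split at the first space)."""
    first, _, last = name.partition(" ")
    return first, last

total = cost(200.0, 10)
first_name, last_name = split_name("Ada Lovelace")
```

The splitting rule shows why complex matches need domain knowledge: names with middle names or a different word order would need a different expression.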

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but the number of functions for combining attributes in a schema is unbounded

• Only a few works have addressed complex matching
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements

• Name matching: match elements with similar names
• Description matching: match comments in the schemas

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (e.g., an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element-level or structure-level
• Cardinality is not limited to 1:1

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
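A few of the similarity criteria above can be chained into a simple name matcher. The sketch covers only equality, synonyms, and common substrings; the synonym dictionary, the score values, and the function name are hypothetical, and a real matcher would draw synonyms from a thesaurus such as WordNet.

```python
# Hypothetical synonym dictionary: each entry is an unordered pair of names.
SYNONYMS = {frozenset({"car", "automobile"})}

def name_similarity(a, b):
    """Score two element names in [0, 1] using a cascade of simple criteria."""
    x, y = a.lower(), b.lower()
    if x == y:
        return 1.0  # criterion 1: equality of names (case-insensitive here)
    if frozenset({x, y}) in SYNONYMS:
        return 0.9  # criterion 3: synonyms (score chosen arbitrarily)
    if x in y or y in x:
        # criterion 5: common substring, scored by relative length
        return min(len(x), len(y)) / max(len(x), len(y))
    return 0.0

equal = name_similarity("Address", "address")
synonym = name_similarity("Car", "Automobile")
substring = name_similarity("Address", "CustomerAddress")
```

Ordering the checks from strictest to loosest keeps the highest score for the strongest evidence.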

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:

  S1: empn (comment: employee name)

  S2: name (comment: name of employee)

• Can be as simple as keyword extraction and synonym matching, or as complex as natural language understanding

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
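The simple end of that spectrum, keyword extraction, can be sketched as follows. The stop-word list and the Jaccard overlap measure are assumptions made for this example, as are the function names.

```python
# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"of", "the", "a", "an", "for"}

def keywords(comment):
    """Lowercase content words of a schema comment."""
    return {w for w in comment.lower().split() if w not in STOPWORDS}

def description_similarity(c1, c2):
    """Jaccard overlap of the keyword sets, in [0, 1]."""
    k1, k2 = keywords(c1), keywords(c2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)

score = description_similarity("employee name", "name of employee")
```

For the slide's example, both comments reduce to the keywords "employee" and "name", so empn and name are matched even though the element names differ.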






Classification of Schema Matching Approaches

• Instance vs. Schema: matching approaches can consider instance data or schema-level information
• Element vs. Structure: a match can be performed for individual schema elements or for combinations of elements
• Language vs. Constraint: linguistic (names) or constraint-based (keys and relationships)
• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another
• Auxiliary Information: the matcher relies on information besides the input schemas, such as dictionaries, user input, and global schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches

[Figure: taxonomy of schema matching approaches]

Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level: Linguistic, Constraint
      Structure Level: Constraint
    Instance/Contents
      Element Level: Linguistic, Constraint
  Combining Matchers
    Hybrid Matchers
    Composite Matchers: Manual Composition, Automatic Composition

Further criteria: match cardinality, auxiliary information used, ...

Sample approaches: name similarity, description similarity, global namespaces, word frequency, group matching, type similarity, key properties, value pattern and ranges

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure
• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-Level: identifies all elements of S1 that are the same as or similar to elements of S2
• The match comparison can be based on the name, description, or data type of the element
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
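
The name-based example above can be sketched in a few lines of Python. This is only an illustration, not a matcher from the survey: the containment rule, the `difflib` fallback, the 0.8 threshold, and the extra element names (`Phone`, `Birthdate`) are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 if one lowercased name contains the other."""
    a, b = a.lower(), b.lower()
    if a in b or b in a:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def element_level_match(s1, s2, threshold=0.8):
    """Return (s1_element, s2_element, score) for every pair at or above the threshold."""
    matches = []
    for e1 in s1:
        for e2 in s2:
            score = name_similarity(e1, e2)
            if score >= threshold:
                matches.append((e1, e2, score))
    return matches


# Hypothetical schemas built around the slide's Address = CustomerAddress example.
s1 = ["Address", "Phone"]
s2 = ["CustomerAddress", "Birthdate"]
matches = element_level_match(s1, s2)
```

Every S1 element is compared with every S2 element, so the cost grows with the product of the schema sizes.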

Structure-Level

• Structure-Level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
• Full Structure Match
• Partial Structure Match
• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Full structure match:

S1 Elements | S2 Elements
Address | CustAddress
Street | Street
City | City
State | USState
Zip | PostalCode

Partial structure match:

S1 Elements | S2 Elements
AccountOwner | Customer
Name | Cname
Address | CAddress
Birthdate | CPhone
TaxExempt |

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
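
One way to tell a full structure match from a partial one is to count how many sub-elements of the S1 structure have a counterpart in the S2 structure. The sketch below assumes the element-level correspondences are already known; the `equivalent` set is hypothetical and stands in for the output of a name matcher.

```python
def structure_match(children1, children2, equivalent):
    """Return the fraction of S1 sub-elements that have an equivalent S2 sub-element."""
    matched = sum(
        1
        for c1 in children1
        if any(c1 == c2 or (c1, c2) in equivalent for c2 in children2)
    )
    return matched / len(children1)


# Hypothetical element-level correspondences, as if produced by a name matcher.
equivalent = {
    ("State", "USState"),
    ("Zip", "PostalCode"),
    ("Name", "Cname"),
    ("Address", "CAddress"),
}

address = ["Street", "City", "State", "Zip"]
cust_address = ["Street", "City", "USState", "PostalCode"]
account_owner = ["Name", "Address", "Birthdate", "TaxExempt"]
customer = ["Cname", "CAddress", "CPhone"]

full = structure_match(address, cust_address, equivalent)       # every sub-element matched
partial = structure_match(account_owner, customer, equivalent)  # only Name and Address matched
```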

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price * (1 + Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m, element level (also n:1, structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
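
The matching expressions in the table can be read as small functions. The following sketch writes three of them in Python; the "First Last" layout of Name, the dictionary row format, and the sample values are assumptions that the slide does not specify.

```python
def cost_from_price_tax(price: float, tax_percent: float) -> float:
    """n:1 match: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)


def split_name(name: str):
    """1:n match: derive (FirstName, LastName) from Name.

    Assumes a 'First Last' layout, which the slide does not specify.
    """
    first, _, last = name.partition(" ")
    return first, last


def books_with_publishers(b_rows, p_rows):
    """n:m match: join B and P on PuNo, like the SELECT in the table."""
    publishers = {p["PuNo"]: p["Name"] for p in p_rows}
    return [
        (b["Title"], publishers[b["PuNo"]])
        for b in b_rows
        if b["PuNo"] in publishers
    ]


# Hypothetical sample data.
cost = cost_from_price_tax(200.0, 5.0)
first, last = split_name("Ada Lovelace")
rows = books_with_publishers(
    [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Orphan", "PuNo": 9}],
    [{"PuNo": 1, "Name": "Example Press"}],
)
```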

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema
• Only a few works on complex matching have been done
  - Some hard-code complex matches into rules
  - Some rely on a domain-specific ontology
• We need domain knowledge to perform complex matching accurately
• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
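
The similarity definitions above can be applied as a cascade, from the strictest test to the loosest. The sketch below is one possible arrangement, not the survey's algorithm: the toy synonym and hypernym tables, the crude `stem` helper, and the score levels (1.0, 0.9, 0.8, 0.7, scaled substring similarity) are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Toy auxiliary data; a real matcher would consult a thesaurus or dictionary.
SYNONYMS = {frozenset({"car", "automobile"})}
HYPERNYMS = {"suv": "car"}  # child -> parent ("is a type of")


def stem(word: str) -> str:
    """Very crude suffix stripping, standing in for a real stemmer."""
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def name_match(a: str, b: str) -> float:
    """Score two names; the numeric levels are arbitrary illustrative choices."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if stem(a) == stem(b):
        return 0.9
    if frozenset({a, b}) in SYNONYMS:
        return 0.8
    if HYPERNYMS.get(a) == b or HYPERNYMS.get(b) == a:
        return 0.7
    # Fall back to common-substring similarity, scaled below the exact tests.
    return 0.6 * SequenceMatcher(None, a, b).ratio()


ship_score = name_match("ShipTo", "Ship2")
```

User-provided name matches (item 6) could be checked first by looking the pair up in a user-maintained table before any of these tests run.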

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn (employee name)
  S2: name (name of employee)
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
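
The simple end of that range, keyword extraction, might look like the sketch below. The stopword list and the use of Jaccard overlap are assumptions chosen for brevity.

```python
STOPWORDS = {"of", "the", "a", "an", "for", "in"}


def keywords(description: str) -> set:
    """Lowercased words of a description, minus a small stopword list."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1: str, d2: str) -> float:
    """Jaccard overlap of the two keyword sets, a value between 0 and 1."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The slide's example: empn and name carry equivalent comments.
score = description_similarity("employee name", "name of employee")
```

Here the element names empn and name share little, but their descriptions reduce to the same keyword set.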

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
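
A constraint-based matcher can compare such constraints directly. The sketch below assumes a hypothetical `Column` record with three constraints and scores a pair by the fraction of constraints that agree; the field names and the equal weighting are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical schema element with a few constraints attached."""
    name: str
    data_type: str
    nullable: bool
    is_key: bool


def constraint_similarity(c1: Column, c2: Column) -> float:
    """Fraction of the compared constraints (type, optionality, key) that agree."""
    checks = [
        c1.data_type == c2.data_type,
        c1.nullable == c2.nullable,
        c1.is_key == c2.is_key,
    ]
    return sum(checks) / len(checks)


emp_no = Column("EmpNo", "int", nullable=False, is_key=True)
staff_id = Column("StaffId", "int", nullable=False, is_key=True)
birthdate = Column("Birthdate", "date", nullable=True, is_key=False)
```

Constraint agreement alone tends to produce many candidates (many columns are non-null integers), so it is usually combined with other matchers.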

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

[Figure: three example schemas, labeled Schema S1, Schema S2, and Schema S on the slide. Their element lists, in the order extracted:]
  - Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, ContactPhone
  - Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, Contact, Name, Address
  - POrder, Article, Payee, BillAddress, Recipient, ShipAddress

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself
  2. Similarity values may depend on the domain, e.g., Salary and Income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance Level Approaches

• Why:
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How:
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques
• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
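
A constraint-based characterization of instance data can be as simple as abstracting each value into a pattern and comparing how often each pattern occurs in two columns. The sketch below is one such characterization; the pattern alphabet (9 for digits, A for letters), the overlap measure, and the sample columns are assumptions for illustration.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Abstract a value: digits become 9, letters become A, the rest is kept."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def pattern_profile(values):
    """Relative frequency of each value pattern in a column."""
    counts = Counter(value_pattern(v) for v in values)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def instance_similarity(values1, values2) -> float:
    """Overlap of two pattern profiles (sum of the shared minimum frequencies)."""
    p1, p2 = pattern_profile(values1), pattern_profile(values2)
    return sum(min(p1[p], p2[p]) for p in p1.keys() & p2.keys())


# Hypothetical instance data for three columns.
zip_codes = ["30602", "98195", "94305"]
postal_codes = ["10001", "60601"]
phones = ["706-542-3000", "206-543-2100"]
```

Profiling each column once and comparing profiles avoids comparing individual values pairwise, which is where the large number of irrelevant combinations would otherwise arise.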

Rule Based Solutions

• Rule-Based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
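
The example rule translates almost directly into code. In this sketch the dictionary layout of an element and the case-insensitive name comparison are assumptions.

```python
def same_name_and_arity(e1: dict, e2: dict) -> bool:
    """The slide's example rule: same name and same number of subelements."""
    return (
        e1["name"].lower() == e2["name"].lower()
        and len(e1["subelements"]) == len(e2["subelements"])
    )


# Hypothetical elements.
address_a = {"name": "Address", "subelements": ["Street", "City", "State", "Zip"]}
address_b = {"name": "address", "subelements": ["Street", "City", "USState", "PostalCode"]}
address_c = {"name": "Address", "subelements": ["Line1", "Line2"]}
```

Rules like this are cheap and need no training data, but each one has to be written and tuned by hand.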

Learner Based Solutions

• Learner-Based: exploit both schema information and data
• Require a lot of training data, but can exploit data instances
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
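
A composite matcher can be sketched as a function that runs each matcher independently and then merges the scores. The weighted average used here is only one possible combination strategy, and the two toy matchers and their weights are assumptions.

```python
def composite_score(a, b, matchers, weights):
    """Weighted average of the scores of independently executed matchers."""
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


def exact_name(a, b):
    """Toy linguistic matcher: 1.0 on a case-insensitive name match."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def same_type(a, b):
    """Toy constraint matcher: 1.0 when the data types agree."""
    return 1.0 if a["type"] == b["type"] else 0.0


price = {"name": "Price", "type": "decimal"}
amount = {"name": "Amount", "type": "decimal"}
score = composite_score(price, amount, [exact_name, same_type], [0.5, 0.5])
```

Because the matchers are passed in as a list, one can be added or removed without touching the others, which is the flexibility the slide attributes to the composite method.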

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, and it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input required:
  - The user must provide application-specific match and mismatch relations
  - The user must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
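
The weighted mean in phase 2 and the decision in phase 3 can be illustrated as follows. This is a sketch of the arithmetic only, not Cupid's implementation: the weight `w_struct`, the 0.5 threshold, and the sample similarity values are assumptions.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(pairs, threshold=0.5):
    """Phase 3 sketch: keep pairs whose weighted similarity reaches the threshold."""
    return [
        (e1, e2)
        for (e1, e2, lsim, ssim) in pairs
        if weighted_similarity(lsim, ssim) >= threshold
    ]


# Hypothetical (element1, element2, lsim, ssim) tuples.
pairs = [("Zip", "PostalCode", 0.4, 0.8), ("Zip", "CPhone", 0.1, 0.2)]
mapping = decide_mapping(pairs)
```

In the first pair the names share little, but the structural evidence lifts the combined score above the threshold.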

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

• MWSAF converts both models to a common representation format called SchemaGraph
• A SchemaGraph is a set of nodes connected by edges that are created using conversion functions
• MWSAF then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
    </xsd:sequence>
  </xsd:complexType>

[Figure: SchemaNode representation of XML schema. Nodes: Direction, degrees, DirectionCompass. Edge labels: hasElement, compass.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
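
A conversion of this kind can be sketched with Python's standard XML parser. The two rules used below are a reading of the figure labels, not the framework's documented rules: an element with a built-in XSD type becomes a node reached by a hasElement edge, and an element with a user-defined type becomes an edge, labeled with the element name, to a node for that type. The `xsd1` namespace URI is hypothetical, and prefixes are compared literally rather than resolved.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The slide's complexType, wrapped in a schema element so that it parses.
# The xsd1 namespace URI is a placeholder.
XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def to_schema_graph(xsd_text: str):
    """Return a list of (source_node, edge_label, target_node) triples."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            prefix, _, local = elem.get("type", "").rpartition(":")
            if prefix == "xsd":
                # Built-in type: the element itself becomes a node.
                edges.append((ctype.get("name"), "hasElement", elem.get("name")))
            else:
                # User-defined type: the element name labels the edge.
                edges.append((ctype.get("name"), elem.get("name"), local))
    return edges


graph = to_schema_graph(XSD)
```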

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

  <daml:Class rdf:ID="WindEvent">
    <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
    <rdfs:label>Wind event</rdfs:label>
    <rdfs:subClassOf rdf:resource="#WeatherEvent" />
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:label>Wind direction</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent" />
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:label>Wind speed</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent" />
    <rdfs:range rdf:resource="#Speed" />
  </daml:Property>

[Figure: SchemaGraph representation of part of ontology. Nodes: WindEvent, windDirection, Speed. Edge labels: hasProperty, windSpeed.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the match score:
  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked
  - Schema match: structural similarity (sub-concept similarities)
• The getBestMapping function then looks at the match scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
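
The q-gram idea can be illustrated with a short sketch. It is not the MWSAF code: the choice of q = 3, the Dice coefficient, and the weighted average in `elem_match` are assumptions, since the slide does not give the actual equation.

```python
def qgrams(name: str, q: int = 3) -> set:
    """Set of overlapping substrings of length q (shorter names yield themselves)."""
    name = name.lower()
    if len(name) < q:
        return {name}
    return {name[i:i + q] for i in range(len(name) - q + 1)}


def ngram_score(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over the two q-gram sets, a value between 0 and 1."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def elem_match(scores, weights):
    """Combine per-algorithm scores; a weighted average stands in for the real equation."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# windSpeed and windDirection share only the q-grams "win" and "ind".
partial_score = ngram_score("windSpeed", "windDirection")
```

Scores from the synonym, abbreviation, and token matchers would be passed to `elem_match` alongside the q-gram score.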

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from the mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
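
The slide does not define the two measures, so the sketch below assumes one plausible reading: the average concept match is the mean score over the WSDL concepts that were mapped, and the average service match divides the same sum by all WSDL concepts, mapped or not. The domain names and scores are hypothetical.

```python
def average_concept_match(match_scores):
    """Mean match score over the WSDL concepts that were actually mapped."""
    return sum(match_scores) / len(match_scores) if match_scores else 0.0


def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of match scores divided by all WSDL concepts, mapped or not."""
    return sum(match_scores) / total_wsdl_concepts if total_wsdl_concepts else 0.0


def categorize(service_matches):
    """Pick the ontology (domain) with the highest average service match."""
    return max(service_matches, key=service_matches.get)


# Hypothetical result: 2 of 4 WSDL concepts mapped to a weather ontology.
scores = [0.5, 1.0]
concept_avg = average_concept_match(scores)
service_avg = average_service_match(scores, total_wsdl_concepts=4)
domain = categorize({"weather": service_avg, "geography": 0.1})
```

Under this reading, a service whose few mapped concepts match well still gets a low service match if most of its concepts find no counterpart in the ontology.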

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback
• Real World Analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches for matching:
  - Schema vs. instance-level
  - Element vs. structure-level
  - Language and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Christophides, V.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 21: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Classification of Schema Matching Classification of Schema Matching ApproachesApproaches

Schema Matching Approaches

Individual Matchers Combining Matchers

Schema-only

Structure LevelElement Level

InstanceContents

ConstraintLinguistic Constraint

hellip hellip hellip

Element Level

ConstraintLinguistic

hellip hellip

Hybrid Matchers Composite Matchers

Manual Composition Automatic Composition

Further Criteria -Match Cardinality -Auxiliary information usedhellip

bullName SimilaritybullDescription SimilaritybullGlobal Namespaces

bullWord Frequency

bullGroup Matching

bullType SimilaritybullKey Properties

bullValue Pattern and Ranges

Sample Approaches

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Level MatchersSchema Level Matchers

bull Consider schema information instead of instance data Name Description Data Type Relationship Types Constraints Structure

bull Often produces multiple candidates and estimates a degree of similarity for each

1 Granularity of match (element level vs structure level)2 Match Cardinality3 Linguistic Approaches Name or Description Matching4 Constraint-Based Approaches5 Reusing Schema and Matching Information

Bernstein P Rahm E A survey of approaches to automatic schema matching

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic, element-level
- categorizes elements based on names, data types, and domains
- calculates a linguistic similarity coefficient

Phase 2:
- transforms the original schema into a tree, then performs bottom-up structure matching
- calculates a similarity value
- calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
- uses the mean from Phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
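
The weighted mean of Phase 2 and the decision step of Phase 3 can be sketched in a few lines of Python. This is only an illustration: the weight name `w_struct`, the acceptance threshold, and the sample scores are assumptions chosen for the example, not values taken from Cupid.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(scores, threshold=0.5, w_struct=0.5):
    """Keep, for each source element, the best target at or above the threshold.

    scores maps (source, target) -> (lsim, ssim); both values are assumed inputs.
    """
    best = {}
    for (src, tgt), (lsim, ssim) in scores.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        current = best.get(src, (None, 0.0))[1]
        if wsim >= threshold and wsim > current:
            best[src] = (tgt, wsim)
    return best


# Hypothetical similarity values for three candidate pairs.
scores = {
    ("Zip", "PostalCode"): (0.6, 0.8),
    ("Zip", "City"): (0.1, 0.8),
    ("State", "USState"): (0.9, 0.8),
}
mapping = decide_mapping(scores)
```

With these sample values, `Zip` is paired with `PostalCode` because the `Zip`/`City` pair falls below the assumed threshold.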

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (MITRE)

• Uses attribute descriptions to determine attribute matches
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

• MWSAF converts both models to a common representation format called SchemaGraph
• A SchemaGraph is a set of nodes connected by edges that are created using conversion functions
• It then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of XML schema. Labels in the figure: Direction, compass, degrees, DirectionCompass, hasElement]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
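
As a rough illustration of the conversion idea, the following Python sketch turns each named complex type into a node and links it to its child elements with `hasElement` edges. The function name, the tuple-based edge format, and the `urn:example:weather` namespace are assumptions made for this example; the actual MWSAF conversion rules cover more cases (for instance, how element types such as DirectionCompass are handled) and are not reproduced here.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"  # assumed XML Schema namespace

SAMPLE_XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def xsd_to_schema_graph(xsd_text):
    """Return (nodes, edges); edges are (parent, "hasElement", child) triples."""
    root = ET.fromstring(xsd_text.strip())
    nodes, edges = set(), []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        if parent is None:  # anonymous types are skipped in this sketch
            continue
        nodes.add(parent)
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            child = elem.get("name")
            if child is None:
                continue
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges


nodes, edges = xsd_to_schema_graph(SAMPLE_XSD)
```

Representing the graph as plain sets and triples keeps the sketch independent of any graph library; a matcher can then iterate over the edges of two such graphs.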

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Labels in the figure: WindEvent, windDirection, windSpeed, Speed, hasProperty]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

Mapping

• Measures of the Match Score
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked
  - Schema Match: structural similarity; sub-concept similarities
• The getBestMapping function then looks at the Match Scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
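
A q-gram comparison of two names can be sketched as follows. The padding character, the default q of 3, and the use of the Dice coefficient to map the shared q-gram count into [0, 1] are assumptions for illustration; the slide does not say how MWSAF normalizes the count or how it combines the individual scores.

```python
def qgrams(name: str, q: int = 3) -> set:
    """Set of q-grams of a lowercased name, padded so edges count too."""
    padded = "#" * (q - 1) + name.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over shared q-grams: 0.0 (disjoint) to 1.0 (identical)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    shared = len(grams_a & grams_b)
    return 2.0 * shared / (len(grams_a) + len(grams_b))


same = ngram_similarity("windSpeed", "WindSpeed")
partial = ngram_similarity("Address", "CustAddress")
disjoint = ngram_similarity("Zip", "PostalCode")
```

Names that differ only in case score 1.0, names that share a long substring score in between, and names with no q-gram in common score 0.0, which is why a synonym or abbreviation check is needed alongside it.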

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework
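
The two measures can be illustrated with a small Python sketch. It assumes that Average Concept Match averages the match scores over the mapped concepts only, while Average Service Match divides the same sum by the total number of concepts in the WSDL schema; this reading, the function names, and the sample scores are assumptions, not definitions quoted from the slides.

```python
def average_concept_match(match_scores: dict) -> float:
    """Mean match score over the concepts that were actually mapped."""
    if not match_scores:
        return 0.0
    return sum(match_scores.values()) / len(match_scores)


def average_service_match(match_scores: dict, total_wsdl_concepts: int) -> float:
    """Sum of match scores divided by all concepts in the WSDL schema."""
    if total_wsdl_concepts <= 0:
        raise ValueError("total_wsdl_concepts must be positive")
    return sum(match_scores.values()) / total_wsdl_concepts


def categorize(per_ontology: dict, total_wsdl_concepts: int) -> str:
    """Pick the ontology (domain) with the highest average service match."""
    return max(
        per_ontology,
        key=lambda name: average_service_match(per_ontology[name], total_wsdl_concepts),
    )


# Hypothetical mappings of a four-concept WSDL against two ontologies.
per_ontology = {
    "weather": {"windSpeed": 0.9, "windDirection": 0.7},
    "geography": {"degrees": 0.6},
}
domain = categorize(per_ontology, total_wsdl_concepts=4)
```

Under these assumptions a mapping with a few very good matches can have a high concept match but a low service match, which is what makes the second measure useful for choosing a domain.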

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and the descriptions of the existing approaches for matching
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Vassilis, C.: Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 22: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Schema-Level Matchers

• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure
• Often produce multiple candidates and estimate a degree of similarity for each

1. Granularity of match (element level vs. structure level)
2. Match cardinality
3. Linguistic approaches: name or description matching
4. Constraint-based approaches
5. Reusing schema and matching information

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Element-Level

• Element-level: identifies all elements of S1 that are the same as or similar to elements of S2
• The match comparison can be based on the name, description, or data type of the element
• Example of name-based element-level matching: Address = CustomerAddress

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
• Full structure match

  S1 Elements      S2 Elements
  Address          CustAddress
    Street           Street
    City             City
    State            USState
    Zip              PostalCode

• Partial structure match

  S1 Elements      S2 Elements
  AccountOwner     Customer
    Name             Cname
    Address          CAddress
    Birthdate        CPhone
    TaxExempt

• Equivalence patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements:

  Local match cardinality      S1 element(s)    S2 element(s)        Matching expression
  1:1, element level           Price            Amount               Amount = Price
  n:1, element level           Price, Tax       Cost                 Cost = Price*(1+Tax/100)
  1:n, element level           Name             FirstName, LastName  FirstName, LastName = Name
  n:m, element level           B.Title, B.PuNo, A.Book, A.Publisher  A.Book, A.Publisher = Select B.Title, P.Name
  (also n:1, structure level)  P.PuNo, P.Name                        From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
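
The matching expressions in the cardinality examples can be read as small functions from S1 data to S2 data. The Python sketch below is one way to write them down; the dictionary-based row format, the function names, splitting `Name` at the first space, and treating `PuNo` as unique in P are all assumptions made for this example.

```python
def amount_from_price(s1_row: dict) -> float:
    """1:1 case: Amount = Price."""
    return s1_row["Price"]


def cost_from_price_and_tax(s1_row: dict) -> float:
    """n:1 case: Cost = Price * (1 + Tax / 100)."""
    return s1_row["Price"] * (1 + s1_row["Tax"] / 100)


def split_name(s1_row: dict) -> dict:
    """1:n case: FirstName and LastName derived from Name (split at first space)."""
    first, _, last = s1_row["Name"].partition(" ")
    return {"FirstName": first, "LastName": last}


def books_with_publishers(b_rows: list, p_rows: list) -> list:
    """n:m case: join B and P on PuNo to produce A.Book and A.Publisher."""
    name_by_puno = {p["PuNo"]: p["Name"] for p in p_rows}
    return [
        {"Book": b["Title"], "Publisher": name_by_puno[b["PuNo"]]}
        for b in b_rows
        if b["PuNo"] in name_by_puno
    ]


# Hypothetical sample rows.
cost = cost_from_price_and_tax({"Price": 200.0, "Tax": 10.0})
names = split_name({"Name": "Ada Lovelace"})
joined = books_with_publishers(
    [{"Title": "Schema Matching", "PuNo": 1}, {"Title": "Orphan", "PuNo": 9}],
    [{"PuNo": 1, "Name": "Example Press"}],
)
```

Writing the expressions as code makes the difference between the cases visible: the 1:1 case is a rename, while the others need arithmetic, string splitting, or a join that a name-based matcher alone cannot discover.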

Complex Matches

• The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema
• Little work has been done on complex matching
  - Some approaches hard-code complex matches into rules
  - Some rely on a domain-specific ontology
• We need domain knowledge to perform complex matching accurately
• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements
• Name matching: match elements with similar names
• Description matching: match comments in the schemas

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
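
A few of these similarity criteria can be combined into one name matcher, sketched below in Python. The tiny synonym table, the fixed score of 0.9 for synonyms, and the use of `difflib.SequenceMatcher` as the common-substring measure are assumptions for illustration; stemming, hypernyms, and soundex are left out.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs; a real matcher would consult a thesaurus.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"zip", "postalcode"})}


def normalize(name: str) -> str:
    """Lowercase the name and drop separators such as '_' and '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Score in [0, 1]: equality first, then synonyms, then shared substrings."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0
    if frozenset({na, nb}) in SYNONYMS:
        return 0.9  # assumed score for a synonym hit
    return SequenceMatcher(None, na, nb).ratio()


equal = name_similarity("Ship_To", "shipto")
synonym = name_similarity("Car", "Automobile")
similar = name_similarity("ShipTo", "Ship2")
unrelated = name_similarity("ShipTo", "Birthdate")
```

The ordering of the checks mirrors the list: cheap exact tests run first, and the fuzzy substring comparison is only a fallback.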

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn  // employee name
  S2: name  // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
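
The simple end of that range, keyword extraction followed by set comparison, might look like the Python sketch below. The stopword list and the choice of Jaccard overlap are assumptions for illustration.

```python
STOPWORDS = {"a", "an", "the", "of", "for", "in"}  # assumed minimal list


def keywords(description: str) -> set:
    """Lowercased words of a comment with stopwords removed."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1: str, d2: str) -> float:
    """Jaccard overlap of the two keyword sets, in [0, 1]."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 and not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# Comments from the slide's example: empn vs. name.
score = description_similarity("employee name", "name of employee")
```

For the example above, the element names `empn` and `name` look different, but their comments reduce to the same keyword set.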

Constraint-Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  - e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example

  [Figure: three example schemas, labeled Schema S1, Schema S2, and Schema S in the figure]
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself
  2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance-Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques
• Main problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
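
As one illustration of constraint-based characterization at the instance level, the Python sketch below compares the value ranges of two numeric columns. The overlap measure (intersection length over union length) and the sample columns are assumptions for this example; they are not taken from the survey.

```python
def value_range(values):
    """Return (min, max) of a non-empty list of numeric instance values."""
    return min(values), max(values)


def range_overlap(a_values, b_values) -> float:
    """Length of the shared range divided by the combined range, in [0, 1]."""
    a_lo, a_hi = value_range(a_values)
    b_lo, b_hi = value_range(b_values)
    union = max(a_hi, b_hi) - min(a_lo, b_lo)
    if union == 0:  # both columns hold one identical value
        return 1.0
    shared = min(a_hi, b_hi) - max(a_lo, b_lo)
    return max(shared, 0) / union


# Hypothetical instance data for three columns.
age_s1 = [18, 35, 64]
age_s2 = [21, 40, 70]
salary_s2 = [30000, 52000, 91000]

age_vs_age = range_overlap(age_s1, age_s2)
age_vs_salary = range_overlap(age_s1, salary_s2)
```

A range check like this is cheap and can prune many of the irrelevant match combinations before more expensive linguistic or learned matchers are applied.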

Rule-Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
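
The example rule can be written directly as a predicate. In this Python sketch the `Element` class and the case-insensitive name comparison are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Minimal schema element: a name plus its subelements (hypothetical type)."""
    name: str
    children: list = field(default_factory=list)


def same_name_and_arity(a: Element, b: Element) -> bool:
    """Rule: match if the names are equal and the subelement counts agree."""
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)


s1_address = Element("Address", [Element("Street"), Element("City")])
s2_address = Element("address", [Element("Road"), Element("Town")])
s2_short = Element("Address", [Element("Street")])

full_match = same_name_and_arity(s1_address, s2_address)
arity_mismatch = same_name_and_arity(s1_address, s2_short)
```

Rules of this kind need no training data, but each one encodes a single hand-written heuristic, so a realistic system needs many of them.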

Learner-Based Solutions

• Learner-based: exploit both schema and data
• Require a lot of training data, but can exploit data instances
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
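
A composite combination can be sketched as independent matchers whose scores are merged afterwards. In the Python example below, the two matchers, the dictionary-based element format, and the weighted average used to merge the scores are illustrative assumptions; other merge strategies (maximum, voting, manual selection) fit the same shape.

```python
from difflib import SequenceMatcher


def name_matcher(a: dict, b: dict) -> float:
    """String similarity of the element names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a: dict, b: dict) -> float:
    """1.0 if the declared data types are equal, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_score(a: dict, b: dict, weighted_matchers) -> float:
    """Run each matcher independently and merge with a weighted average."""
    total_weight = sum(weight for _, weight in weighted_matchers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(m(a, b) * w for m, w in weighted_matchers) / total_weight


# Hypothetical weights and schema elements.
matchers = [(name_matcher, 0.7), (type_matcher, 0.3)]
zip_code = {"name": "Zip", "type": "string"}
zip_long = {"name": "ZipCode", "type": "string"}
price = {"name": "Price", "type": "decimal"}

good = composite_score(zip_code, zip_long, matchers)
poor = composite_score(zip_code, price, matchers)
```

Because each matcher only sees the two elements and returns a score, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the composite approach is credited with.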


Page 23: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Element-LevelElement-Level

bull Element-Level Identifies all elements of S1 that are the same or similar to elements of S2

bull The match comparison can be based on name description or data type of the element

bull Example of name-based element-level matching Address = CustomerAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Structure-Level Structure-Level bull Structure-Level Matches combinations of elements that appear together in S1

with combinations of elements that appear together in S2bull Full Structure Match

bull Partial Structure Match

bull Equivalence Patterns Can enhance structure matching by considering known equivalence patterns stored in a library

S1 Elements S2 Elements

Address CustAddress

Street Street

City City

State USState

Zip PostalCode

S1 Elements S2 Elements

AccountOwner Customer

Name Cname

Address CAddress

Birthdate CPhone

TaxExempt

Bernstein P Rahm E A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element or structure-level
• Cardinality is not limited to 1:1

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
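A minimal sketch of how a few of these similarity criteria could be chained. Everything here is an assumption made for illustration: the synonym table is hand-made, the suffix stripping is a crude stand-in for real stemming, the 0.9 synonym score is arbitrary, and the substring criterion is approximated with the standard library's difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs; a real matcher would consult a thesaurus.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"cost", "price"})}


def normalize(name: str) -> str:
    """Lower-case and strip a few common suffixes (crude stemming stand-in)."""
    name = name.lower()
    if name.endswith("ss"):  # keep words such as "address" intact
        return name
    for suffix in ("ing", "es", "s"):
        if name.endswith(suffix) and len(name) > len(suffix) + 2:
            return name[: -len(suffix)]
    return name


def name_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for two element names."""
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:  # criteria 1 and 2: equality, possibly after stemming
        return 1.0
    if frozenset({a_norm, b_norm}) in SYNONYMS:  # criterion 3: synonyms
        return 0.9  # arbitrary score chosen for this sketch
    # criterion 5: similarity based on common substrings
    return SequenceMatcher(None, a_norm, b_norm).ratio()
```

With this sketch, ShipTo and Ship2 score well above unrelated names because they share the substring "ship", but below an exact or stemmed match.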

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn // employee name
  S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
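The simple end of that spectrum, keyword extraction, might look like the sketch below. The stop-word list and the use of Jaccard overlap as the score are assumptions made for this example; the survey does not fix a particular measure.

```python
# Hypothetical stop-word list; real systems use a much larger one.
STOPWORDS = {"of", "the", "a", "an", "for", "to"}


def keywords(description: str) -> set:
    """Extract lower-cased keywords from a schema comment."""
    return {w for w in description.lower().split() if w not in STOPWORDS}


def description_similarity(d1: str, d2: str) -> float:
    """Jaccard overlap of the keyword sets, in [0, 1]."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)
```

Applied to the example, "employee name" and "name of employee" reduce to the same keyword set, so empn and name are proposed as a match even though the element names differ.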

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example
  Schema S1, Schema S2, Schema S
  Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, ContactPhone
  Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, Contact, Name, Address
  POrder, Article, Payee, BillAddress, Recipient, ShipAddress
• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself
  2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data
• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques
• Main problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules to exploit schema information
  - element names, data types, structures and subelements
• e.g., two elements match if they have the same name and the same number of subelements

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey
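The example rule can be written down almost literally. In this sketch the Element class and the function name are hypothetical, introduced only to show the shape of a hand-crafted rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical schema element: a name plus its subelements."""
    name: str
    children: List["Element"] = field(default_factory=list)


def same_name_and_subelement_count(a: Element, b: Element) -> bool:
    """Rule: match if the names are equal and the subelement counts are equal."""
    return a.name == b.name and len(a.children) == len(b.children)
```

A rule like this needs no training data and is cheap to evaluate, but it only looks at schema information and misses matches such as Zip and PostalCode.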

Learner Based Solutions

• Learner-based: exploits both schema and data
• Requires a lot of training data, but can exploit data
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
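A minimal sketch of the composite method, assuming the independent matcher results are combined with a weighted average. The two toy matchers and the weights are hypothetical; a composite system could equally pick the maximum score or let the user choose.

```python
from typing import Callable, List, Tuple

Matcher = Callable[[str, str], float]


def exact_name(a: str, b: str) -> float:
    """Toy matcher: 1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix(a: str, b: str) -> float:
    """Toy matcher: length of the common prefix over the longer name's length."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def composite(a: str, b: str, matchers: List[Tuple[Matcher, float]]) -> float:
    """Run each matcher independently and combine the scores by weight."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(m(a, b) * weight for m, weight in matchers) / total_weight
```

Because each matcher runs on its own, matchers can be added, removed or reweighted without touching the others, which is the flexibility the composite method is credited with.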

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

LSD (Univ of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• System is trained with sample user inputs, and it learns patterns and matching rules
• Mostly instance-oriented but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input required:
  - The user must provide application-specific match/mismatch relations
  - The user must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (e.g., to provide a new rule)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• User must specify synonyms, homonyms and inclusion properties

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on name, data types and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
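The weighted mean computed in phase 2 can be sketched as follows. The parameter names and the default weight of 0.5 are assumptions made for illustration; the slides do not state which weight Cupid uses.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity.

    w_struct is the share given to structural similarity; the remainder
    goes to linguistic similarity. The default is an arbitrary choice.
    """
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim
```

Setting w_struct to 0 reduces the result to the purely linguistic coefficient from phase 1, while 1 relies on structure alone.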

Clio (IBM Almaden and Univ of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution
• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  A tool for semi-automatically marking up web service descriptions with ontologies
  It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main Components of the System

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithm

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find the mappings between them

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>

[Figure: SchemaNode representation of XML schema. Nodes: Direction, degrees, DirectionCompass. Edge labels: hasElement, compass]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
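A rough sketch of one such conversion step, assuming a single simplified rule: every element declared inside a complex type becomes a node linked to the type's node by a hasElement edge. This is an illustration, not MWSAF's actual rule set (the figure also shows an edge labeled compass, which this sketch does not reproduce). The sample schema and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Simplified stand-in for the Direction example on the slide.
SAMPLE_XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element name="compass" type="DirectionCompass"/>
      <xsd:element name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def complex_types_to_edges(xsd_text: str) -> list:
    """Return (parent, edge_label, child) triples for each complex type."""
    root = ET.fromstring(xsd_text.strip())
    edges = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            # Assumed rule: contained element -> hasElement edge.
            edges.append((parent, "hasElement", elem.get("name")))
    return edges
```

Once the XML schema and the ontology are both reduced to node and edge triples of this kind, one matching algorithm can compare them regardless of which model they came from.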

MWSAF: Meteor-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

  <daml:Class rdf:ID="WindEvent">
    <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
    <rdfs:label>Wind event</rdfs:label>
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:label>Wind direction</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:label>Wind speed</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>

[Figure: SchemaGraph representation of part of ontology. Nodes: WindEvent, windDirection, Speed. Edge labels: hasProperty, windSpeed]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked
  - Schema Match: structural similarity, based on sub-concept similarities
• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter Stemmer, tokenization and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
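A minimal sketch of an NGram-style measure that returns a value between 0 and 1. The choice of q = 3 and the Dice coefficient over q-gram sets are assumptions made for this example; the slides do not say which normalization MWSAF uses.

```python
def qgrams(s: str, q: int = 3) -> set:
    """Set of lower-cased substrings of length q (the whole string if shorter)."""
    s = s.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Names that share a long substring, such as windSpeed and wind, get a partial score, while names with no q-gram in common get 0.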

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology
  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework
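One plausible reading of the two measures, sketched under stated assumptions: Average Concept Match is taken as the mean match score over the concepts that were mapped, and Average Service Match as the sum of those scores divided by the total number of concepts in the WSDL. These definitions and the function names are assumptions for illustration; the slides do not give the formulas.

```python
from typing import Dict, List, Tuple


def average_concept_match(scores: List[float]) -> float:
    """Assumed: mean match score over the mapped concepts only."""
    return sum(scores) / len(scores) if scores else 0.0


def average_service_match(scores: List[float], total_concepts: int) -> float:
    """Assumed: sum of match scores over all concepts in the WSDL."""
    return sum(scores) / total_concepts if total_concepts else 0.0


def best_domain(per_ontology: Dict[str, Tuple[List[float], int]]) -> str:
    """Pick the ontology with the highest (assumed) Average Service Match."""
    return max(
        per_ontology,
        key=lambda name: average_service_match(*per_ontology[name]),
    )
```

Under this reading, a WSDL with a few very good matches but many unmapped concepts would show a high Average Concept Match and a low Average Service Match, which is why the second measure is the one suited to categorization.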

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and the descriptions of the existing approaches for matching
  - Schema vs. instance-level
  - Element vs. structure-level
  - Language and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Christophides V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions


Structure-Level

• Structure-level: matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2
• Full Structure Match

  S1 Elements | S2 Elements
  Address | CustAddress
  Street | Street
  City | City
  State | USState
  Zip | PostalCode

• Partial Structure Match

  S1 Elements | S2 Elements
  AccountOwner | Customer
  Name | Cname
  Address | CAddress
  Birthdate | CPhone
  TaxExempt |

• Equivalence Patterns: structure matching can be enhanced by considering known equivalence patterns stored in a library

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Match CardinalityMatch Cardinalitybull One or more S1 elements can match one or

more S2 elementsbull Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities

S1 Element(s) S2 Element(s) Matching Expression

11 element level Price Amount Amount = Price

n1 element level Price Tax Cost Cost = Price(1+Tax100)

1n element level Name FirstName

LastName

FirstName LastName = Name

nm element level

also

n1 structure level

BTitle

BPuNo

PPuNo

PName

ABook

APublisher

ABook APublisher = Select BTitle PName From B P

Where BPuNo = PPuNo

Bernstein P Rahm E A survey of approaches to automatic schema matching

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation FrameworkMWSAF Meteor-S Web Service Annotation FrameworkOntology to SchemaGraph conversion rulesOntology to SchemaGraph conversion rules

ltdamlClass rdfID=WindEventgt ltrdfscommentgtSuperclass for all events dealing with windltrdfscommentgt ltrdfslabelgtWind eventltrdfslabelgt ltrdfssubClassOf rdfresource=WeatherEvent gt ltdamlClassgtltdamlProperty rdfID=windDirectiongt ltrdfslabelgtWind directionltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource = httpwwww3org200010XMLSchemastring gt ltdamlPropertygtltdamlProperty rdfID=windSpeedgt ltrdfslabelgtWind speedltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource=Speed gt ltdamlPropertygt

WindEvent

windDirection Speed

hasProperty windSpeed



Match Cardinality

• One or more S1 elements can match one or more S2 elements
• Complex matches

Examples of the four local cardinality cases for individual mapping elements

Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price * (1 + Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m, element level (also n:1, structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
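The matching expressions in the cardinality table can be read as small transformation functions from S1 values to S2 values. The following is a minimal Python sketch of the four cases. It assumes that Tax is a percentage and that a name splits at its first space; the function names and record layout are hypothetical and are not taken from the survey.

```python
def amount_from_price(price):
    """1:1 match: Amount = Price."""
    return price


def cost_from_price_and_tax(price, tax_percent):
    """n:1 match: Cost = Price * (1 + Tax/100), with Tax assumed to be a percentage."""
    return price * (1 + tax_percent / 100)


def split_name(name):
    """1:n match: Name -> (FirstName, LastName).

    Splitting at the first space is an assumption made for this example.
    """
    first, _, last = name.partition(" ")
    return first, last


def books_with_publishers(books, publishers):
    """n:m match: join B and P on PuNo, mirroring the SQL expression in the table."""
    publisher_names = {p["PuNo"]: p["Name"] for p in publishers}
    return [
        (b["Title"], publisher_names[b["PuNo"]])
        for b in books
        if b["PuNo"] in publisher_names
    ]
```

Only the last case needs more than one source record, which is why the survey also classifies it as an n:1 match at the structure level.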

Complex Matches

• 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema

• Little work has been done on complex matching
  - Some approaches hard-code complex matches into rules
  - Some rely on a domain-specific ontology

• We need domain knowledge to perform complex matching accurately

• The best match isn't always the top match returned by the matcher, so human involvement is still needed

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Linguistic Approaches

• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements

• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches

• Can be element-level or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
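A name matcher can apply these criteria in order and return the score of the first one that fires. The Python sketch below covers only three of them: equality after normalization, synonym lookup, and common-substring similarity. The synonym table and the 0.9 synonym score are assumptions made for illustration, and `difflib.SequenceMatcher` stands in for whichever substring measure a real matcher would use.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus or dictionary.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"cost", "price"})}


def normalize(name):
    """Lower-case and drop separators so 'Ship_To' and 'shipto' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Return a score in [0, 1] from the first criterion that applies."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0  # equality of names
    if frozenset({na, nb}) in SYNONYMS:
        return 0.9  # equality of synonyms (the score is an arbitrary choice)
    # Fall back to similarity based on common substrings.
    return SequenceMatcher(None, na, nb).ratio()
```

With this sketch, `name_similarity("ShipTo", "Ship2")` scores higher than `name_similarity("ShipTo", "Amount")` because the first pair shares the substring "ship".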

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements

• Example:
  S1: empn // employee name
  S2: name // name of employee

• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
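The simple end of that range, keyword extraction followed by a set comparison, can be sketched in a few lines of Python. The stop-word list and the use of Jaccard overlap are assumptions made for this example; the survey does not prescribe a particular measure.

```python
# Hypothetical stop-word list; real systems use a much larger one.
STOPWORDS = {"of", "the", "a", "an", "for", "to"}


def keywords(comment):
    """Extract the set of non-stop-word tokens from a schema comment."""
    return {word for word in comment.lower().split() if word not in STOPWORDS}


def description_similarity(comment1, comment2):
    """Jaccard overlap of the keyword sets of two comments, in [0, 1]."""
    k1, k2 = keywords(comment1), keywords(comment2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)
```

For the example on the slide, "employee name" and "name of employee" reduce to the same keyword set, so the two elements match even though their names (empn, name) differ.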

Constraint Based

• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
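One way to use such constraints is to count how many of them two elements agree on. The following Python sketch compares data type, optionality and key status. The `Column` record, the type families and the equal weighting of the three checks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical description of a schema element and its constraints."""
    name: str
    data_type: str
    nullable: bool
    is_key: bool


# Hypothetical groups of data types that are treated as compatible.
TYPE_FAMILIES = [
    {"int", "integer", "smallint"},
    {"varchar", "char", "text", "string"},
    {"decimal", "float", "double"},
]


def same_family(type1, type2):
    """True if the two data types are equal or belong to the same family."""
    t1, t2 = type1.lower(), type2.lower()
    return t1 == t2 or any(t1 in family and t2 in family for family in TYPE_FAMILIES)


def constraint_similarity(a, b):
    """Fraction of the compared constraints on which the two columns agree."""
    checks = [
        same_family(a.data_type, b.data_type),
        a.nullable == b.nullable,
        a.is_key == b.is_key,
    ]
    return sum(checks) / len(checks)
```

A score like this is usually too weak to decide a match by itself, but it can narrow down the candidates produced by a name or description matcher.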

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas
  - e.g. in e-commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example: three purchase-order schemas (labelled S1, S2 and S on the slide)

  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem in itself
  2. Similarity values may depend on the domain, e.g. salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
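Reuse of a stored mapping amounts to composing two mappings. If a new schema has been matched to a known schema, and the known schema was previously matched to a target, then a candidate mapping from the new schema to the target follows. The Python sketch below shows that composition. The element correspondences are hypothetical values chosen for illustration and are not read off the slide.

```python
def compose(first, second):
    """Compose mappings A->B and B->C into A->C, skipping elements with no partner."""
    return {a: second[b] for a, b in first.items() if b in second}


# Hypothetical correspondences between a new schema and a known schema.
new_to_known = {
    "Product": "Product",
    "BillTo": "BillTo",
    "ShipTo": "ShipTo",
    "Contact": "ContactPhone",
}

# Hypothetical, previously confirmed mapping from the known schema to a target.
known_to_target = {
    "Product": "Article",
    "BillTo": "Payee",
    "ShipTo": "Recipient",
}

new_to_target = compose(new_to_known, known_to_target)
```

The element `Contact` drops out because the stored mapping has no partner for `ContactPhone`. Composed matches are only candidates: as the second problem on the slide notes, a correspondence that held in one domain may not hold in another.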

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g. value ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques

• Main problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules exploit schema information
  - Element names, data types, structures and subelements
  - e.g. two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey
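The example rule on this slide translates directly into code. The following Python sketch uses a hypothetical `Element` record; comparing names case-insensitively is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element with a name and a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(e1, e2):
    """Rule from the slide: same name and same number of subelements.

    Ignoring case when comparing names is an assumption of this sketch.
    """
    return e1.name.lower() == e2.name.lower() and len(e1.children) == len(e2.children)
```

Rules like this are cheap and need no training data, but each one has to be written and tuned by hand.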

Learner Based Solutions

• Learner-based: exploit both schema and data

• Requires a lot of training data, but can exploit data

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
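A composite matcher can be sketched as a weighted average over matchers that each run independently. In the Python sketch below, the two component matchers and the choice of a weighted average are assumptions made for illustration; the survey leaves the combination strategy open.

```python
def exact_match(a, b):
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def common_prefix(a, b):
    """Length of the shared prefix divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b), 1)


def composite_score(a, b, matchers, weights):
    """Weighted average of the scores of independently executed matchers."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * matcher(a, b) for matcher, w in zip(matchers, weights)) / total
```

Because each matcher is a plain function, matchers can be added, removed or reweighted without touching the others. That is the flexibility the slide attributes to the composite method.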

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, and it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required:
  - The user must provide application-specific match/mismatch relations
  - The user must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e. to provide a new rule)

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms and inclusion properties

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element-level and structure-level matches

Phase 1: Linguistic, element-level
  - categorizes elements based on name, data types and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the mean from Phase 2 to decide on a mapping

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
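The weighted mean in Phase 2 and the decision in Phase 3 can be sketched as follows. The default weight of 0.5, the acceptance threshold and the function names are assumptions made for this Python example; they are not values taken from Cupid.

```python
def weighted_similarity(linguistic, structural, structural_weight=0.5):
    """Weighted mean of a linguistic and a structural similarity, both in [0, 1]."""
    if not 0.0 <= structural_weight <= 1.0:
        raise ValueError("structural_weight must lie in [0, 1]")
    return structural_weight * structural + (1.0 - structural_weight) * linguistic


def choose_mapping(pair_scores, threshold=0.5, structural_weight=0.5):
    """Keep the element pairs whose weighted similarity reaches the threshold.

    pair_scores maps (element1, element2) to (linguistic, structural) scores.
    """
    accepted = {}
    for pair, (linguistic, structural) in pair_scores.items():
        score = weighted_similarity(linguistic, structural, structural_weight)
        if score >= threshold:
            accepted[pair] = score
    return accepted
```

Raising `structural_weight` favours pairs whose surrounding structure agrees even when their names differ. Lowering it makes the result depend mostly on the names.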

Clio (IBM Almaden and Univ. of Toronto)

• Aims at the semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithms

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

MWSAF then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>

[Figure: SchemaNode representation of XML schema, with nodes Direction, compass, degrees and DirectionCompass connected by hasElement edges]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
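The conversion of a complex type into graph edges can be sketched with Python's standard XML parser. This is an illustration only, not the MWSAF parser library. It handles just the `hasElement` rule shown on the slide, and the namespace URI bound to `xsd1` is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The complex type from the slide, wrapped in a schema element so that it parses.
# The xsd1 namespace URI is a hypothetical placeholder.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def complex_types_to_edges(schema_text):
    """Return (parent, 'hasElement', child) triples for each complex type."""
    root = ET.fromstring(schema_text.strip())
    edges = []
    for complex_type in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = complex_type.get("name")
        for element in complex_type.iter(f"{{{XSD_NS}}}element"):
            edges.append((parent, "hasElement", element.get("name")))
    return edges
```

Calling `complex_types_to_edges(SCHEMA)` yields one edge from `Direction` to `compass` and one from `Direction` to `degrees`. A fuller converter would also follow the `type` attribute to link `compass` to `DirectionCompass`.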

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

  <daml:Class rdf:ID="WindEvent">
    <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
    <rdfs:label>Wind event</rdfs:label>
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:label>Wind direction</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:label>Wind speed</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>

[Figure: SchemaGraph representation of part of ontology, with nodes WindEvent, windDirection, windSpeed and Speed connected by hasProperty edges]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked
  - Schema Match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the Match Scores and determines a map set

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
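The slide does not say how getBestMapping chooses the map set. One plausible reading, sketched here in Python, keeps the highest-scoring ontology concept for each WSDL element. The function name `best_mapping`, the input layout and the threshold are assumptions; this is not the MWSAF implementation.

```python
def best_mapping(match_scores, threshold=0.0):
    """Pick the highest-scoring ontology concept for each WSDL element.

    match_scores maps a WSDL element name to a dict of
    {ontology_concept: match_score}. Elements whose best score does not
    exceed the threshold are left unmapped.
    """
    mapping = {}
    for element, candidates in match_scores.items():
        if not candidates:
            continue
        concept, score = max(candidates.items(), key=lambda item: item[1])
        if score > threshold:
            mapping[element] = (concept, score)
    return mapping
```

A greedy choice like this lets several WSDL elements map to the same concept. Whether that is acceptable depends on the annotation task.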

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses Porter stemming, tokenization and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
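A q-gram comparison that returns a value between 0 and 1 can be sketched as below. The slide only says that NGram counts shared q-grams. The use of q = 3 and of the Dice coefficient for normalization are assumptions of this Python example.

```python
def qgrams(name, q=3):
    """Set of overlapping substrings of length q, ignoring case."""
    text = name.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over the q-gram sets of two names, in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

Names that differ only in case score 1.0, names with no shared q-gram score 0.0, and partial overlaps such as `windSpeed` and `windDirection` fall in between.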

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service

• We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework
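The slide names the two measures without giving formulas. The Python sketch below assumes that both are averages of the match scores in a mapping, divided by the number of mapped concepts in one case and by the total number of WSDL concepts in the other. Treat these definitions as an interpretation rather than as the framework's exact equations.

```python
def average_concept_match(match_scores):
    """Mean match score over the WSDL concepts that were actually mapped."""
    if not match_scores:
        return 0.0
    return sum(match_scores) / len(match_scores)


def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of match scores divided by the number of all WSDL concepts.

    Unmapped concepts contribute zero, so an ontology that covers only a
    small part of the service scores low even if its few matches are strong.
    """
    if total_wsdl_concepts <= 0:
        return 0.0
    return sum(match_scores) / total_wsdl_concepts
```

Under this reading, the service is categorized into the domain of the ontology with the highest average service match, while the average concept match tells the user how good the individual matches are.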

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback

• Real-world analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches to matching:
  - Schema-level vs. instance-level
  - Element-level vs. structure-level
  - Language-based and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Christophides, V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 26: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Complex MatchesComplex Matches

bull 11 matches are bounded by the sizes of the schemas but there are an unbounded number of functions for combining attributes in a schema

bull Only a few works on complex matching have been donebull Some hard code complex matches into rulesbull Some rely on a domain specific ontology

bull We need domain knowledge to accurately perform complex matching

bull The best match isnrsquot always the top match returned by the matcher ndash so human involvement is still needed

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Linguistic ApproachesLinguistic Approaches

bull Language based matchers use names and text (ie words or sentences) to find semantically similar schema elements

bull Name Matching match elements with similar namesbull Description Matching match comments in the schemas

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesName MatchingName Matching

bull Matches schema elements with equal or similar namesbull How similarity is defined 1 Equality of names 2 Equality of names after stemming deals with prefixessuffixes 3 Equality of synonyms 4 Equality of hypernyms (suv is a type of car) 5 Similarity of names based on common substrings soundex pronunciation

(ShipTo = Ship2) 6 User provided name matches

bull Can be element or structure-levelbull Cardinality is not limited to 11

Bernstein P Rahm E A survey of approaches to automatic schema matching

Linguistic ApproachesLinguistic ApproachesDescription MatchingDescription Matching

bull Schemas can contain comments in natural language that express the intended semantics of the schema elements

bull Example

S1 empn employee name

S2 name name of employee

bull Can be as simple as keyword extraction and synonym matching or as complex as using natural language understanding technology

Bernstein P Rahm E A survey of approaches to automatic schema matching

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of ontology. Node labels: WindEvent, windDirection, windSpeed, Speed. Edge label: hasProperty.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework

Mapping

• Measures of the Match Score
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema Match: structural similarity, based on sub-concept similarities.

• The getBestMapping function then looks at the Match Scores and determines a map set.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework
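One plausible reading of the selection step is sketched below. The slide only names getBestMapping; the signature, the score dictionary, and the 0.5 acceptance threshold here are assumptions for illustration. For each WSDL element the sketch keeps the ontology concept with the highest match score, provided the score clears the threshold.

```python
def get_best_mapping(scores, threshold=0.5):
    """Pick the best-scoring concept per element.

    scores maps (wsdl_element, ontology_concept) -> match score in [0, 1].
    The threshold value is an assumption, not taken from the slides.
    """
    best = {}
    for (element, concept), score in scores.items():
        current_best = best.get(element, (None, 0.0))[1]
        if score >= threshold and score > current_best:
            best[element] = (concept, score)
    return best


# Hypothetical match scores for two WSDL elements.
scores = {
    ("windSpeed", "WindSpeed"): 0.9,
    ("windSpeed", "Speed"): 0.6,
    ("degrees", "Direction"): 0.3,
}
mapping = get_best_mapping(scores)
print(mapping)
```

Elements whose best candidate falls below the threshold are left unmapped, which matches the slides' point that the user reviews the proposed matches rather than receiving a forced assignment.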

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework
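As an illustration of the NGram idea, the sketch below scores two names by the character trigrams they share, using the Dice coefficient so that the result stays between 0 and 1. The choice of trigrams, lower-casing, and the Dice formula are assumptions; the slide does not say which variant MWSAF uses.

```python
def ngrams(name, n=3):
    """Return the set of character n-grams of a lower-cased name."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over shared n-grams; 0.0 if either name is shorter than n."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("windSpeed", "speed"))
print(ngram_similarity("windSpeed", "humidity"))
```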

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

Two measures are then derived from each mapping:

  - Average Concept Match: tells the user the degree of similarity between the matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework
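The slide does not give formulas for the two measures. The sketch below assumes one natural definition: both are sums of the match scores of the mapped elements, with Average Concept Match divided by the number of mapped elements and Average Service Match divided by the total number of elements in the WSDL. The function name and both denominators are assumptions.

```python
def average_matches(mapped_scores, total_wsdl_elements):
    """Return (average concept match, average service match).

    mapped_scores: match scores of the WSDL elements that were mapped.
    total_wsdl_elements: number of all elements in the WSDL, mapped or not.
    """
    if not mapped_scores or total_wsdl_elements <= 0:
        return 0.0, 0.0
    total = sum(mapped_scores)
    return total / len(mapped_scores), total / total_wsdl_elements


# Two of four WSDL elements were mapped, with scores 0.9 and 0.7.
avg_concept, avg_service = average_matches([0.9, 0.7], 4)
print(avg_concept, avg_service)
```

Under this reading, a WSDL with a few very good matches scores high on concept match but low on service match, which is why the second measure is the more useful one for choosing a domain.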

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback
• Real World Analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens when the matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and the descriptions of the existing approaches for matching
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Vassilis, C. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions


Linguistic Approaches

• Language-based matchers use names and text (i.e., words or sentences) to find semantically similar schema elements
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching

• Matches schema elements with equal or similar names
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming (deals with prefixes/suffixes)
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, or pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level
• Cardinality is not limited to 1:1

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
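A layered name matcher along the lines of the list above might look like the sketch below. It covers only three of the criteria (equality after normalization, a synonym lookup, and common-substring similarity). The tiny synonym table and the fixed 0.9 synonym score are placeholders chosen for the example; a real matcher would consult a thesaurus such as WordNet.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs; stands in for a real thesaurus lookup.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"article", "product"})}


def normalize(name):
    """Lower-case a name and drop separators such as '_' or '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Return a similarity in [0, 1] using equality, synonyms, then substrings."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0
    if frozenset({na, nb}) in SYNONYMS:
        return 0.9  # assumed score for a synonym hit
    return SequenceMatcher(None, na, nb).ratio()


print(name_similarity("Ship_To", "shipto"))
print(name_similarity("Article", "Product"))
print(name_similarity("ShipTo", "Ship2"))
```

Ordering the checks from cheapest and most certain to most approximate keeps exact and synonym hits from being diluted by the string-similarity fallback.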

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements
• Example:
  S1: empn // employee name
  S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
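The simple end of that spectrum, keyword extraction, can be sketched as follows. The stop-word list and the use of Jaccard overlap are assumptions for the example; the slide names the general technique only.

```python
# Small illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"of", "the", "a", "an", "for", "to", "in"}


def keywords(text):
    """Extract the set of non-stop-word tokens from a comment."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def description_similarity(d1, d2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(d1), keywords(d2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The two comments from the slide's example.
print(description_similarity("employee name", "name of employee"))
```

With this measure the element names empn and name, which look unrelated on their own, are matched through their comments.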

Constraint Based

• Schemas often contain constraints that define data types and value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings
• Many schemas are very similar to each other and to previously matched schemas
  e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)
• A schema library should be created, and the schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example (schemas S, S1 and S2; structures as recovered from the slide figure):
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  - Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  - POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems:
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself
  2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., value ranges
  3. Auxiliary information
  4. Both rule-based and learner-based techniques are also used

• Main problem: when comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements
  - e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey
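The example rule on the slide is simple enough to write down directly. In the sketch below the `Element` class is a hypothetical stand-in for a schema element, and the case-insensitive name comparison is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus a list of subelements."""
    name: str
    children: list = field(default_factory=list)


def same_name_and_arity(a, b):
    """Rule: same name (ignoring case) and the same number of subelements."""
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)


bill_to = Element("BillTo", [Element("Name"), Element("Address")])
billto = Element("billto", [Element("Name"), Element("Street")])
ship_to = Element("ShipTo", [Element("Name")])

print(same_name_and_arity(bill_to, billto))
print(same_name_and_arity(bill_to, ship_to))
```

The rule is cheap and needs no training data, but it also shows the limit of hand-crafted rules: it accepts the first pair even though the subelements Address and Street differ.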

Learner Based Solutions

• Learner-based: exploit both schema information and data
• Require a lot of training data, but can exploit data instances
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
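The composite method can be illustrated with a weighted average over independently executed matchers. The two toy matchers, the function names, and the weights below are assumptions for the example; any function returning a score in [0, 1] could be plugged in.

```python
def exact_match(a, b):
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix_match(a, b):
    """Length of the common prefix relative to the longer name."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def composite_score(a, b, matchers, weights):
    """Weighted average of the scores returned by independent matchers."""
    total = sum(w * m(a, b) for m, w in zip(matchers, weights))
    return total / sum(weights)


score = composite_score("ShipTo", "Ship2", [exact_match, prefix_match], [1, 3])
print(score)
```

Because each matcher runs on its own, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to composite systems.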


LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, and it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input required: the user must provide application-specific match/mismatch relations, and must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic element-level matching
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the weighted mean from Phase 2 to decide on a mapping

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching
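The weighted mean in Phase 2 and the decision in Phase 3 can be sketched as below. The parameter names, the default weight, and the acceptance threshold are assumptions for illustration; the slide states only that a weighted mean of the two similarities is used to decide on a mapping.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def accept(lsim, ssim, w_struct=0.5, threshold=0.5):
    """Phase 3 decision: keep the pair if the weighted mean clears the threshold."""
    return weighted_similarity(lsim, ssim, w_struct) >= threshold


print(weighted_similarity(0.8, 0.4))
print(accept(0.8, 0.4), accept(0.2, 0.4))
```

Raising `w_struct` makes the matcher trust the surrounding schema structure more than the element names, which helps when names are abbreviated or inconsistent.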

Clio (IBM Almaden and Univ. of Toronto)

• Aims at the semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions that map data in the source schema to data in the target schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching


MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  A tool for semi-automatically marking up web service descriptions with ontologies.
  It helps describe services semantically and aids efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithms

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework



Current and Future IssuesCurrent and Future Issuesbull User Interaction minimize user input but maximize impact of the

feedback

bull Real World Analysis can the current matching techniques be used in real world situations

bull P2P data management

bull Mapping Maintenance what happens when you map between two schemas and then one changes

bull Developing global schemas (or ontologies) for domains

bull Dealing with inconsistent data values for a schema elementDoan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More IssuesMore Issues

bull If we require user acceptance for our matches then what happens if our matcher returns thousands or hundreds of matches

bull Is it unrealistic to think that we will eventually perfect our matchers

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

ConclusionConclusionbull It is necessary to automate the matching process

bull Schema matching is very difficult and expensive

bull We have looked at a taxonomy and the descriptions of the existing approaches for matching

-Schema vs Instance-level

-Element vs Structure-level

-Language and Constraint based matchers

bull We also discussed several implementations of the matching techniques

ReferencesReferencesbull Bernstein P Rahm E A survey of approaches to automatic schema matching

wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

bull Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey httpanhaicsuiucedupublicdb-review14pdf

bull Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework POSV-WWW2004pdf

bull Vassilis C Integrating XML Data Sources using RDFS Schemas The ICS-FORTH Semantic Web Integration Middleware (SWIM) Dagsthul SeminarftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

QuestionsQuestions

Page 29: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Linguistic Approaches: Description Matching

• Schemas can contain comments in natural language that express the intended semantics of the schema elements.

• Example:

  S1: empn // employee name

  S2: name // name of employee

• Matching can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
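The simple end of that range (keyword extraction) can be sketched as follows. This is only an illustration: the stop-word list, the function names, and the choice of Jaccard overlap are assumptions, not something taken from the survey.

```python
import re

# Hypothetical stop-word list; a real matcher would use a fuller list or a thesaurus.
STOP_WORDS = {"of", "the", "a", "an", "for"}


def keywords(description: str) -> set[str]:
    """Lower-case the comment, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    return {w for w in words if w not in STOP_WORDS}


def description_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two keyword sets, a value in [0, 1]."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)
```

With the comments from the example, "employee name" and "name of employee" reduce to the same keyword set, so empn and name would be proposed as a match even though the element names differ.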

Constraint Based

• Schemas often contain constraints that define data types, value ranges, optionality, relationship types, cardinalities, etc.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings.

• Many schemas are very similar to each other and to previously matched schemas.

  e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and schema editors should access the library to use predefined terms and definitions.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Schema Mapping Reuse

• Example: three purchase-order schemas, labeled Schema S1, Schema S2 and Schema S on the slide:

  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone

  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)

  POrder: Article, Payee, BillAddress, Recipient, ShipAddress

• Problems

  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself.

  2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Instance Level Approaches

• Why?

  1. Little or no schema information is available.
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements.
  3. To match instance-level data.

• How?

  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques

• Main problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Rule Based Solutions

• Rule-based: hand-crafted rules that exploit schema information

  • element names, data types, structures, and subelements
  • e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
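The example rule (same name, same number of subelements) is small enough to write out. The Element class below is a hypothetical stand-in for a schema element, and the case-insensitive name comparison is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus its direct subelements."""
    name: str
    subelements: list["Element"] = field(default_factory=list)


def rule_match(a: Element, b: Element) -> bool:
    """Hand-crafted rule: equal names (ignoring case) and equal subelement counts."""
    same_name = a.name.lower() == b.name.lower()
    same_arity = len(a.subelements) == len(b.subelements)
    return same_name and same_arity
```

A rule like this is cheap and easy to explain, but it fails as soon as the two schemas use different names for the same concept, which is where the learner-based techniques come in.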

Learner Based Solutions

• Learner-based: exploits both schema and data

• Requires a lot of training data, but can exploit the data itself

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy.

• More match candidates will be produced if the previous approaches are combined.

• Two combination methods:

  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
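A composite matcher can be sketched as a weighted average over independently executed matchers. The two toy matchers, the weights, and the function names here are all illustrative assumptions; the survey does not prescribe a particular combination function.

```python
from typing import Callable

Matcher = Callable[[str, str], float]  # returns a similarity in [0, 1]


def exact_name(a: str, b: str) -> float:
    """1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix_name(a: str, b: str) -> float:
    """Length of the common prefix divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def composite(matchers: list[tuple[Matcher, float]], a: str, b: str) -> float:
    """Run every matcher independently and combine the scores by weighted average."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * m(a, b) for m, weight in matchers) / total_weight
```

Because each matcher is called separately, matchers can be added, removed, or reweighted without touching the others, which is the flexibility the composite approach is credited with above.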

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

LSD (Univ of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, and it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations, and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (e.g., to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ of Reggio Calabria, Univ of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher

• Element- and structure-level matching

Phase 1: linguistic, element-level
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient

Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
  - uses the weighted mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
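The weighted mean in phase 2 and the decision in phase 3 can be sketched in a few lines. The parameter names, the default weight of 0.5, and the acceptance threshold are assumptions chosen for illustration, not values from Cupid.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity.

    w_struct is the weight given to the structural score; the remainder
    goes to the linguistic score.
    """
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def accept(wsim: float, threshold: float = 0.5) -> bool:
    """Phase 3 stand-in: keep a pair when its weighted similarity reaches the threshold."""
    return wsim >= threshold
```

For a pair with lsim = 0.8 and ssim = 0.4, equal weights give a weighted similarity of about 0.6, so the pair would be kept at a threshold of 0.5 and dropped at 0.7.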

Clio (IBM Almaden and Univ of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:

  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

  A tool for semi-automatically marking up web service descriptions with ontologies.

  It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file

  1. Individual elements of the WSDL are matched to concepts in the domain.
  2. The WSDL is classified into a domain.
  3. The matches are given to the user to accept or reject.
  4. Upon the user's acceptance, the annotations are written to the WSDL.

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

  1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.

  2. Parser Library: consists of the parsers used to generate the SchemaGraphs.

  3. Matcher Library: provides the schema matching algorithm.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges, created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between them.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Nodes: Direction, compass, DirectionCompass, degrees. Edge label: hasElement.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
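One conversion rule can be read off the Direction example: a complex type becomes a node, and each element it contains becomes a node attached by a hasElement edge. The sketch below implements only that one rule with the Python standard library; the function name, the tuple-based edge format, and the xsd1 namespace URI are assumptions, and the actual MWSAF conversion functions cover more cases.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# The Direction type from the slide; the xsd1 namespace URI is a placeholder.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def to_schema_graph(xsd_text: str) -> tuple[set[str], list[tuple[str, str, str]]]:
    """Return (nodes, edges); each edge is (parent, 'hasElement', child)."""
    root = ET.fromstring(xsd_text)
    nodes: set[str] = set()
    edges: list[tuple[str, str, str]] = []
    for ctype in root.iter(XSD + "complexType"):
        parent = ctype.get("name")
        nodes.add(parent)
        for elem in ctype.iter(XSD + "element"):
            child = elem.get("name")
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges
```

Calling to_schema_graph(SCHEMA) yields the nodes Direction, compass and degrees, with one hasElement edge from Direction to each of the two elements.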

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

  <daml:Class rdf:ID="WindEvent">
    <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
    <rdfs:label>Wind event</rdfs:label>
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:label>Wind direction</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:label>Wind speed</rdfs:label>
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed. Edge label: hasProperty.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score:

  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked.

  - Schema Match: structural similarity, based on sub-concept similarities.

• The getBestMapping function then looks at the Match Scores and determines a map set.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
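How the map set is chosen is not spelled out on the slide. One plausible reading, shown here purely as an assumption, is a greedy selection: for each WSDL concept keep the ontology concept with the highest Match Score, provided it clears a threshold. The function signature, the dictionary layout, and the threshold value are all hypothetical.

```python
def get_best_mapping(
    scores: dict[tuple[str, str], float],
    threshold: float = 0.5,
) -> dict[str, tuple[str, float]]:
    """Pick, per WSDL concept, the best-scoring ontology concept.

    scores maps (wsdl_concept, ontology_concept) to a Match Score in [0, 1].
    Pairs scoring below the (assumed) threshold are ignored.
    """
    best: dict[str, tuple[str, float]] = {}
    for (wsdl_concept, onto_concept), score in scores.items():
        if score < threshold:
            continue
        if wsdl_concept not in best or score > best[wsdl_concept][1]:
            best[wsdl_concept] = (onto_concept, score)
    return best
```

WSDL concepts with no candidate above the threshold are simply left out of the result, which matches the idea that the user only reviews proposed matches.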

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:

  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
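The q-gram idea behind NGram can be sketched as follows. The slide does not say how the shared q-grams are normalized into a score between 0 and 1, so the Dice coefficient, the default q = 3, and the case folding used here are assumptions.

```python
def qgrams(name: str, q: int = 3) -> set[str]:
    """All substrings of length q in the lower-cased name."""
    s = name.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over the two q-gram sets, a value in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

For example, "windSpeed" has seven trigrams and "speed" has three, all of which are shared, so the score is 2 * 3 / (7 + 3) = 0.6, while "wind" and "rain" share no trigram and score 0.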

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

• Two measures are then derived from the mapping:

  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service

• We have a machine learning alternative for categorization.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
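The slide names the two measures without giving formulas. The sketch below assumes one natural reading: the concept match averages the Match Scores over the mapped concepts only, while the service match divides the same sum by the total number of concepts in the WSDL, so a service with few mapped concepts scores low for that ontology. Both definitions and all names here should be treated as assumptions.

```python
def average_concept_match(mapped_scores: list[float]) -> float:
    """Mean Match Score over the concepts that were actually mapped."""
    if not mapped_scores:
        return 0.0
    return sum(mapped_scores) / len(mapped_scores)


def average_service_match(mapped_scores: list[float], total_wsdl_concepts: int) -> float:
    """Sum of Match Scores divided by the number of all WSDL concepts."""
    if total_wsdl_concepts <= 0:
        return 0.0
    return sum(mapped_scores) / total_wsdl_concepts


def best_domain(service_match_by_ontology: dict[str, float]) -> str:
    """Categorize the service under the ontology with the highest service match."""
    return max(service_match_by_ontology, key=service_match_by_ontology.get)
```

Under these assumptions, a WSDL with four concepts of which two are mapped with scores 0.8 and 0.6 has a concept match of about 0.7 but a service match of about 0.35 for that ontology.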

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback

• Real-world analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens when our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process.

• Schema matching is very difficult and expensive.

• We have looked at a taxonomy and descriptions of the existing approaches to matching:

  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques.

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Christophides, V.: Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 30: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Constraint BasedConstraint Based

bull Schemas often contain constraints to define data types and value ranges optionality relationship types cardinalities etc

Bernstein P Rahm E A survey of approaches to automatic schema matching

Reusing Schema and Mapping Reusing Schema and Mapping InformationInformation

bull The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings

bull Many schemas are often very similar to each other and previously matched schemas

ie In E-Commerce substructures often repeat within different message formats (address fields name fields)

bull A schema library should be created and the schema editors should access the library to use predefined terms and definitions

Bernstein P Rahm E A survey of approaches to automatic schema matching

Schema Mapping ReuseSchema Mapping Reuse

bull Example

bull Problems

1 Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself

2 Similarity values may depend on the domain ie Salary and income may be identical in payroll application but not in a tax reporting application

Schema S1 Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone

Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address

POrder Article Payee BillAddress Recipient ShipAddress

Bernstein P Rahm E A survey of approaches to automatic schema matching

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation FrameworkMWSAF Meteor-S Web Service Annotation FrameworkOntology to SchemaGraph conversion rulesOntology to SchemaGraph conversion rules

ltdamlClass rdfID=WindEventgt ltrdfscommentgtSuperclass for all events dealing with windltrdfscommentgt ltrdfslabelgtWind eventltrdfslabelgt ltrdfssubClassOf rdfresource=WeatherEvent gt ltdamlClassgtltdamlProperty rdfID=windDirectiongt ltrdfslabelgtWind directionltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource = httpwwww3org200010XMLSchemastring gt ltdamlPropertygtltdamlProperty rdfID=windSpeedgt ltrdfslabelgtWind speedltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource=Speed gt ltdamlPropertygt

WindEvent

windDirection Speed

hasProperty windSpeed

SchemaGraph representation of part of ontologyPatil A Oundhakar S Sheth A Verma K METEOR-S Web service

Annotation Framework

MappingMapping

bull Measures of the Match Score

-Element Level Match linguistic similarity of two concepts based on names Uses WordNet to check for synonyms Abbreviations are even checked

-Schema Match structural similarity sub-concept similarities

bull The getBestMapping function then looks at the Match Scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching TechniquesMWSAF Matching TechniquesElemMatchElemMatch

bull Name and String Matching algorithms

-NGram considers the number of qgrams that the names have in common

-CheckSynonym uses Wordnet to find synonyms -CheckAbbreviations uses an abbreviation dictionary -TokenMatcher uses Porter Stemmer tonkenization and

substring matching techniques bull Each algorithm returns a value between 0 and 1 These

values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MatchingMatching

bull Once Each WSDL is compared against all of the ontologies in the store and a mapping has been created for each ontology

Then two measures are derived from the mapping

-Average Concept Match tells the user about the degree of similarity between matched concepts of the WSDL and ontology

-Average Service Match helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey.

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey.

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and descriptions of the existing approaches for matching
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf

• Christophides, V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions


Reusing Schema and Mapping Information

• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings

• Many schemas are very similar to each other and to previously matched schemas
  e.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields)

• A schema library should be created, and schema editors should access the library to use predefined terms and definitions

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

Schema Mapping Reuse

• Example

[Figure: three example schemas, labeled Schema S1, Schema S2, and Schema S. Element trees as extracted:
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), ContactPhone
  Purchase-order: Product, BillTo (Name, Address), ShipTo (Name, Address), Contact (Name, Address)
  POrder: Article, Payee, BillAddress, Recipient, ShipAddress]

• Problems
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem in itself
  2. Similarity values may depend on the domain, e.g., Salary and Income may be identical in a payroll application but not in a tax reporting application

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
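One way to reuse earlier results is to compose two stored mappings transitively. The sketch below assumes that similarity values multiply along the chain, and the correspondences in the usage note are hypothetical; the slide does not state which elements of the example schemas match.

```python
from typing import Dict, Tuple

# A mapping sends a source element to (target element, similarity in [0, 1]).
Mapping = Dict[str, Tuple[str, float]]


def compose(first: Mapping, second: Mapping) -> Mapping:
    """Derive A->C correspondences from stored A->B and B->C mappings."""
    derived: Mapping = {}
    for a, (b, sim_ab) in first.items():
        if b in second:
            c, sim_bc = second[b]
            # Multiplying keeps the result in [0, 1] and lets uncertainty accumulate.
            derived[a] = (c, sim_ab * sim_bc)
    return derived
```

For instance, if a stored mapping sent `BillTo` to `BillTo` with similarity 1.0 and a second one sent `BillTo` to `Payee` with similarity 0.9, `compose` would propose `BillTo` to `Payee` with similarity 0.9. Problem 2 above still applies: a derived value is only trustworthy if both stored mappings come from a compatible domain.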

Instance Level Approaches

• Why?
  1. Little or no schema information is available
  2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements
  3. To match instance-level data

• How?
  1. Preferred method: linguistic characterization
  2. Constraint-based characterization, e.g., ranges
  3. Auxiliary information
  4. Also uses both rule-based and learner-based techniques

• Main Problem: when comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
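A constraint-based characterization by value range might look like the following sketch. The profile fields and the overlap measure are assumptions chosen for illustration, not a method taken from the survey.

```python
from typing import Dict, List, Optional


def profile(values: List[str]) -> Dict[str, Optional[float]]:
    """Summarize a column of instance values by numeric ratio and value range."""
    numbers: List[float] = []
    for v in values:
        try:
            numbers.append(float(v))
        except ValueError:
            pass  # non-numeric values only affect the ratio
    return {
        "numeric_ratio": len(numbers) / len(values) if values else 0.0,
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
    }


def range_overlap(p: Dict[str, Optional[float]], q: Dict[str, Optional[float]]) -> float:
    """Shared part of two numeric ranges divided by their combined span, in [0, 1]."""
    if p["min"] is None or q["min"] is None:
        return 0.0
    low, high = max(p["min"], q["min"]), min(p["max"], q["max"])
    span = max(p["max"], q["max"]) - min(p["min"], q["min"])
    if span == 0:
        return 1.0  # both columns hold the same single value
    return max(0.0, high - low) / span
```

Comparing profiles instead of raw values also eases the main problem above, since each pair of elements needs one comparison rather than one per pair of instances.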

Rule Based Solutions

• Rule-Based: hand-crafted rules to exploit schema information
  • element names, data types, structures, and subelements
  • e.g., two elements match if they have the same name and the same number of subelements

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey.
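The example rule above can be written down directly. The `Element` class is a hypothetical stand-in for a schema element, and comparing names case-insensitively is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical schema element: a name plus its subelements."""
    name: str
    children: List["Element"] = field(default_factory=list)


def same_name_and_arity(a: Element, b: Element) -> bool:
    """Rule: two elements match if they share a name and a subelement count."""
    return a.name.lower() == b.name.lower() and len(a.children) == len(b.children)
```

A rule-based matcher would hold a list of such predicates and report a match when one (or some required combination) fires.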

Learner Based Solutions

• Learner-Based: exploit both schema and data

• Requires a lot of training data, but can exploit data instances as well as schema information

• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey.

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two Combination Methods
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
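The composite method can be sketched as a weighted average over matchers that run independently of each other. The two toy matchers, the dictionary representation of an element, and the weights are all assumptions for illustration.

```python
from typing import Callable, Dict, Sequence

Element = Dict[str, str]
Matcher = Callable[[Element, Element], float]


def name_matcher(a: Element, b: Element) -> float:
    """Toy matcher: 1.0 on a case-insensitive name match, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


def type_matcher(a: Element, b: Element) -> float:
    """Toy matcher: 1.0 when the declared data types are identical, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_match(
    a: Element, b: Element, matchers: Sequence[Matcher], weights: Sequence[float]
) -> float:
    """Run each matcher on its own, then combine the results by weighted average."""
    if not matchers or len(matchers) != len(weights):
        raise ValueError("need one weight per matcher")
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / sum(weights)
```

A hybrid matcher would instead fold both criteria into a single algorithm; the composite form keeps them separable, so matchers can be added, removed, or reweighted without touching the others.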

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• The system is trained with sample user inputs, and it learns patterns and matching rules

• Mostly instance-oriented, but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations, and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• The user must specify synonyms, homonyms, and inclusion properties

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: Linguistic, element-level
  - categorizes elements based on names, data types, and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from Phase 2 to decide on a mapping

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
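Phases 2 and 3 can be illustrated with a small sketch. The weight `w_struct`, the acceptance `threshold`, and the input similarity values are assumptions; the slide only states that a weighted mean of the two similarities drives the mapping decision.

```python
from typing import Dict, List, Tuple


def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(
    pairs: Dict[Tuple[str, str], Tuple[float, float]],
    w_struct: float = 0.5,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Keep the element pairs whose weighted similarity reaches the threshold."""
    accepted = []
    for (src, tgt), (lsim, ssim) in pairs.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold:
            accepted.append((src, tgt, wsim))
    return sorted(accepted, key=lambda item: -item[2])
```

Raising `w_struct` makes the decision lean on the tree structure, which helps when element names are uninformative; lowering it favors the linguistic evidence.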

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three Components
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
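The propagation idea can be sketched as below. This is a simplified variant, not the published algorithm: it uses uniform propagation coefficients (one over a node pair's degree), a fixed iteration count instead of a convergence test, and the basic update in which each pair keeps its score and adds what its neighbors send.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]   # (source node, edge label, target node)
Pair = Tuple[str, str]        # (node from graph 1, node from graph 2)


def similarity_flooding(
    edges1: List[Edge], edges2: List[Edge], initial: Dict[Pair, float], iterations: int = 10
) -> Dict[Pair, float]:
    """Spread initial name-based similarities along matching edges of both graphs."""
    # Two node pairs are neighbors when equally labeled edges connect them.
    neighbors: Dict[Pair, set] = defaultdict(set)
    for a, label1, a2 in edges1:
        for b, label2, b2 in edges2:
            if label1 == label2:
                neighbors[(a, b)].add((a2, b2))
                neighbors[(a2, b2)].add((a, b))
    if not neighbors:
        return {}
    sigma = {pair: initial.get(pair, 0.0) for pair in neighbors}
    for _ in range(iterations):
        updated = {}
        for pair in neighbors:
            incoming = sum(sigma[n] / len(neighbors[n]) for n in neighbors[pair])
            updated[pair] = sigma[pair] + incoming
        top = max(updated.values())
        if top == 0:
            return sigma  # nothing to propagate
        sigma = {pair: value / top for pair, value in updated.items()}  # normalize to [0, 1]
    return sigma
```

The `initial` dictionary plays the role of the name matcher's output; pairs that are both similar by name and surrounded by similar neighbors end up with the highest scores.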

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other 'documents' with matching attributes and can choose from those

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main Components of the System

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithms

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

MWSAF then applies a matching algorithm to find the mappings between them

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.

MWSAF: Meteor-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaNode representation of the XML schema. Labels: Direction, compass, degrees, DirectionCompass; edge label: hasElement.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.
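A conversion of this kind can be sketched with the Python standard library. The rule used here (each named complex type becomes a node with a hasElement edge to each element it contains) is an assumption read off the figure labels, and the `xsd1` namespace URI and the wrapping `xsd:schema` element are placeholders added so that the fragment parses.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The slide's fragment, wrapped in a schema element with placeholder namespaces.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example:placeholder">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def xsd_to_schema_graph(xsd_text: str) -> List[Tuple[str, str, str]]:
    """Return (parent, 'hasElement', child) edges for every named complex type."""
    root = ET.fromstring(xsd_text)
    edges = []
    for complex_type in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = complex_type.get("name")
        for element in complex_type.iter(f"{{{XSD_NS}}}element"):
            edges.append((parent, "hasElement", element.get("name")))
    return edges
```

Calling `xsd_to_schema_graph(SCHEMA)` yields one edge from `Direction` to `compass` and one from `Direction` to `degrees`, in document order. An ontology parser producing edges in the same triple form would let a single matcher work on both inputs.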



Page 33: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Instance Level ApproachesInstance Level Approachesbull Why 1 Little or no schema information available 2 Enhancement of schema-level matchers Instance data gives insight to

the contents and meaning of schema elements 3 To match instance-level data

bull How 1 Preferred Method Linguistic Characterization 2 Constraint-based Characterization ie Ranges 3 Auxiliary Information 4 Also uses both rule-based and learner-based techniques

bull Main Problem When comparing data at the instance-level it is likely that there will be a ton of possible match combinations a lot of which are irrelevant

Bernstein P Rahm E A survey of approaches to automatic schema matching

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based Solutions

• Learner-based: exploit both schema and data
• Require a lot of training data, but can exploit instance data
• Rule-based and learner-based techniques combined provide an effective matching solution

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey
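As an illustration of a learner-based matcher, here is a minimal sketch of a Naive Bayes classifier that learns which global-schema element a data value belongs to. The class name, the whitespace tokenization and the training data are all assumptions made for this example; it is not the implementation of any system discussed in the survey.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesMatcher:
    """Toy learner: predicts a schema element label from the tokens of a data value."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of training values
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data value, schema element label) pairs."""
        for value, label in examples:
            tokens = value.lower().split()
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, value):
        """Return the most probable label, or None if nothing was trained."""
        tokens = value.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            # Laplace smoothing with one extra slot for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            logp = math.log(n / total)
            for tok in tokens:
                logp += math.log((counts[tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


matcher = NaiveBayesMatcher()
matcher.train([
    ("12 main street", "address"),
    ("45 oak avenue", "address"),
    ("john smith", "name"),
    ("mary jones", "name"),
])
print(matcher.predict("9 elm street"))
```

The need for labelled examples in `train` is the "lot of training data" cost noted on the slide.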

Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy
• More match candidates will be produced if the previous approaches are combined
• Two combination methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
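A composite matcher can be sketched as a function that runs several independent matchers and merges their scores. The weighted average used here is one possible combination strategy chosen for illustration; the two toy matchers and all names are hypothetical.

```python
def exact_name(a, b):
    """Toy matcher: 1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def prefix_name(a, b):
    """Toy matcher: 1.0 if one name is a prefix of the other, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0


def composite_score(a, b, matchers, weights=None):
    """Run each matcher independently and combine the results by weighted average."""
    if weights is None:
        weights = [1.0] * len(matchers)
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


print(composite_score("Cust", "Customer", [exact_name, prefix_name]))
```

Because each matcher is a plain function, matchers can be added, removed or reweighted without touching the others, which is the flexibility the slide attributes to the composite approach.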

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema
• Uses a name matcher and several instance-level matchers
• The system is trained with sample user inputs, from which it learns patterns and matching rules
• Mostly instance-oriented, but can use schema information too
• Also supports user-supplied domain constraints on the global schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies
• User input required: the user must provide application-specific match/mismatch relations, and must approve or reject matches
• SKAT matching is used within the ONION architecture for ontology integration
• In ONION, an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances
• Schemas are transformed into labeled graphs
• Matching is performed node by node (element-level, 1:1), starting at the top
• Requires user intervention if no match is found (i.e. to provide a new rule)

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in
• These pairs are given a match score between 0 and 1
• The user must specify synonyms, homonyms and inclusion properties

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher
• Element- and structure-level matches

Phase 1: linguistic, element-level
  - categorizes elements based on names, data types and domains
  - calculates a linguistic similarity coefficient
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching
  - calculates a structural similarity value
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements
Phase 3:
  - uses the mean from phase 2 to decide on a mapping

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
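The weighted mean of phase 2 and the decision step of phase 3 can be sketched as follows. The weight `w_struct`, the threshold, the best-target-per-source selection and the element names are assumptions for illustration; the slide does not state how Cupid sets or applies them.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(similarities, w_struct=0.5, threshold=0.6):
    """Keep, for each source element, the best target whose weighted score passes the threshold.

    similarities: dict mapping (source, target) -> (lsim, ssim).
    """
    mapping = {}
    for (src, tgt), (lsim, ssim) in similarities.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold and wsim > mapping.get(src, (None, 0.0))[1]:
            mapping[src] = (tgt, wsim)
    return mapping


# Hypothetical similarity values for three candidate pairs.
sims = {
    ("PO.Item", "Order.Item"): (1.0, 0.5),
    ("PO.Item", "Order.Qty"): (0.25, 0.5),
    ("PO.Date", "Order.Qty"): (0.0, 0.5),
}
print(decide_mapping(sims))
```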

Clio (IBM Almaden and Univ. of Toronto)

• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema
• Three components:
  - Schema Readers: read a schema and translate it into an internal representation
  - Correspondence Engine: identifies matching parts of the schemas or databases
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

• Graph matching algorithm
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs
• Uses a name matcher to get an initial element-level match, which is then given to the structural matcher

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching
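The idea of propagating an initial name-based match through graph structure can be sketched with a simplified fixpoint iteration. This is a deliberately reduced variant written for illustration: node pairs become neighbours when both graphs have an edge with the same label between them, and each pair spreads its score evenly over its neighbours. The propagation weights, the iteration count and all names are assumptions, not the published algorithm's exact formulation.

```python
def similarity_flooding(edges_a, edges_b, initial, iterations=10):
    """Simplified similarity propagation over two labeled graphs.

    edges_a, edges_b: lists of (source, label, target) triples.
    initial: dict mapping (node_a, node_b) -> initial name-based similarity.
    """
    # Pairs (a1, b1) and (a2, b2) are neighbours when a1 -l-> a2 and b1 -l-> b2.
    neighbours = {}
    for a1, label_a, a2 in edges_a:
        for b1, label_b, b2 in edges_b:
            if label_a == label_b:
                neighbours.setdefault((a1, b1), set()).add((a2, b2))
                neighbours.setdefault((a2, b2), set()).add((a1, b1))

    pairs = set(initial) | set(neighbours)
    if not pairs:
        return {}
    sigma = {p: initial.get(p, 0.0) for p in pairs}

    for _ in range(iterations):
        nxt = {}
        for p in pairs:
            # Each neighbour q passes on an equal share of its current score.
            incoming = sum(sigma[q] / len(neighbours[q]) for q in neighbours.get(p, ()))
            nxt[p] = initial.get(p, 0.0) + sigma[p] + incoming
        top = max(nxt.values()) or 1.0
        sigma = {p: v / top for p, v in nxt.items()}  # normalize to 0..1
    return sigma


# Hypothetical schemas: both have a hasName edge.
graph_a = [("Person", "hasName", "Name")]
graph_b = [("Employee", "hasName", "FullName")]
start = {("Person", "Employee"): 1.0, ("Name", "FullName"): 0.2, ("Person", "FullName"): 0.2}
result = similarity_flooding(graph_a, graph_b, start)
print(sorted(result.items(), key=lambda kv: -kv[1]))
```

In this toy input the pair (`Name`, `FullName`) ends up ranked above (`Person`, `FullName`) even though both start with the same initial score, because only the former is structurally connected to the strong (`Person`, `Employee`) match.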

Delta (MITRE)

• Uses attribute descriptions to determine attribute matches
• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)

• System for helping to cope with schema evolution
• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema

Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?
  - A tool for semi-automatically marking up web service descriptions with ontologies
  - It helps in describing services semantically and aids efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file
  1. Individual elements of the WSDL are matched to concepts in the domain
  2. The WSDL is classified into a domain
  3. The matches are given to the user to accept or reject
  4. Upon the user's acceptance, the annotations are written to the WSDL
• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:
1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs
3. Matcher Library: provides the schema matching algorithms

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: SchemaGraphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges, created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between them.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
        <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
      </xsd:sequence>
    </xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Node labels: Direction, compass, DirectionCompass, degrees; edge label: hasElement.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
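A conversion of this kind can be sketched with the Python standard library. The sketch covers only one rule (a complex type becomes a node, and each of its elements becomes a node linked by a `hasElement` edge) and is not MWSAF's parser. The namespace URIs, including the `xsd1` one, and the function name `to_schema_graph` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"  # assumed XML Schema namespace

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def to_schema_graph(xsd_text):
    """Return (nodes, edges) where edges are (parent, 'hasElement', child) triples."""
    root = ET.fromstring(xsd_text)
    nodes, edges = set(), []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        nodes.add(parent)
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            child = elem.get("name")
            nodes.add(child)
            edges.append((parent, "hasElement", child))
    return nodes, edges


nodes, edges = to_schema_graph(SCHEMA)
print(sorted(nodes))
print(edges)
```

Once both the WSDL schema and the ontology are in this node-and-edge form, the same matching code can be applied to either side.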

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent"/>
      <rdfs:range rdf:resource="#Speed"/>
    </daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Node labels: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework

Mapping

• Measures of the match score:
  - Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema match: structural similarity, based on sub-concept similarities
• The getBestMapping function then looks at the match scores and determines a mapping set

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
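One plausible way to turn match scores into a mapping set is to keep, for each WSDL concept, the highest-scoring ontology concept above a threshold. The sketch below does exactly that; it is an assumption about the selection strategy, not the actual getBestMapping implementation, and the function name, threshold and concept names are hypothetical.

```python
def best_mapping(scores, threshold=0.5):
    """Pick the best ontology concept per WSDL concept.

    scores: dict mapping (wsdl_concept, ontology_concept) -> match score in 0..1.
    Returns a dict wsdl_concept -> (ontology_concept, score).
    """
    best = {}
    for (src, tgt), score in scores.items():
        if score >= threshold and score > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, score)
    return best


# Hypothetical match scores between WSDL elements and ontology concepts.
scores = {
    ("degrees", "windDirection"): 0.75,
    ("degrees", "windSpeed"): 0.25,
    ("compass", "windDirection"): 0.5,
    ("zip", "windSpeed"): 0.125,
}
print(best_mapping(scores))
```

WSDL concepts whose best score stays below the threshold, such as `zip` here, are left unmapped rather than forced onto a poor candidate.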

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
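A q-gram name matcher returning a value between 0 and 1 can be sketched as below. The slide does not say how MWSAF normalizes the count of shared q-grams; this sketch assumes character trigrams and the Dice coefficient, and the function names are hypothetical.

```python
def qgrams(name, q=3):
    """Return the set of lowercase character q-grams of a name."""
    s = name.lower()
    if len(s) < q:
        return {s}  # short names count as a single gram
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams: 2 * |common| / (|A| + |B|)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("windSpeed", "speed"))
```

For `windSpeed` and `speed` there are 7 and 3 trigrams respectively, 3 of them shared, giving 2 * 3 / 10 = 0.6.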

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology
• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework
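The two measures can be sketched as averages over the match scores of a mapping. The slide does not give the formulas, so the normalizations here are assumptions: the concept measure divides by the number of mapped concepts, while the service measure divides by the total number of concepts in the WSDL, so that unmapped concepts lower it. All names and data are hypothetical.

```python
def average_concept_match(mapping):
    """Mean score over the concepts that were actually mapped.

    mapping: dict wsdl_concept -> (ontology_concept, score).
    """
    if not mapping:
        return 0.0
    return sum(score for _, score in mapping.values()) / len(mapping)


def average_service_match(mapping, wsdl_concept_count):
    """Sum of mapped scores divided by the number of all WSDL concepts."""
    if wsdl_concept_count == 0:
        return 0.0
    return sum(score for _, score in mapping.values()) / wsdl_concept_count


def categorize(mappings_by_domain, wsdl_concept_count):
    """Pick the domain whose ontology gives the highest average service match."""
    return max(
        mappings_by_domain,
        key=lambda d: average_service_match(mappings_by_domain[d], wsdl_concept_count),
    )


# Hypothetical mappings of one 4-concept WSDL against two domain ontologies.
weather = {"degrees": ("windDirection", 1.0), "speed": ("windSpeed", 0.5)}
travel = {"degrees": ("heading", 0.5)}
print(average_concept_match(weather), average_service_match(weather, 4))
print(categorize({"weather": weather, "travel": travel}, 4))
```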

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches to matching:
  - Schema- vs. instance-level
  - Element- vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein, P., Rahm, E.: A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan, A., Halevy, A.: Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil, A., Oundhakar, S., Sheth, A., Verma, K.: METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Vassilis, C.: Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 34: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Rule Based SolutionsRule Based Solutions

bull Rule-Based hand crafted rules to exploit schema informationbull element names data types structures and

subelementsbull Ie two elements match if they have the same

name and the same number of subelements

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation FrameworkMWSAF Meteor-S Web Service Annotation FrameworkOntology to SchemaGraph conversion rulesOntology to SchemaGraph conversion rules

ltdamlClass rdfID=WindEventgt ltrdfscommentgtSuperclass for all events dealing with windltrdfscommentgt ltrdfslabelgtWind eventltrdfslabelgt ltrdfssubClassOf rdfresource=WeatherEvent gt ltdamlClassgtltdamlProperty rdfID=windDirectiongt ltrdfslabelgtWind directionltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource = httpwwww3org200010XMLSchemastring gt ltdamlPropertygtltdamlProperty rdfID=windSpeedgt ltrdfslabelgtWind speedltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource=Speed gt ltdamlPropertygt

WindEvent

windDirection Speed

hasProperty windSpeed

SchemaGraph representation of part of ontologyPatil A Oundhakar S Sheth A Verma K METEOR-S Web service

Annotation Framework

MappingMapping

bull Measures of the Match Score

-Element Level Match linguistic similarity of two concepts based on names Uses WordNet to check for synonyms Abbreviations are even checked

-Schema Match structural similarity sub-concept similarities

bull The getBestMapping function then looks at the Match Scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching TechniquesMWSAF Matching TechniquesElemMatchElemMatch

bull Name and String Matching algorithms

-NGram considers the number of qgrams that the names have in common

-CheckSynonym uses Wordnet to find synonyms -CheckAbbreviations uses an abbreviation dictionary -TokenMatcher uses Porter Stemmer tonkenization and

substring matching techniques bull Each algorithm returns a value between 0 and 1 These

values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MatchingMatching

bull Once Each WSDL is compared against all of the ontologies in the store and a mapping has been created for each ontology

Then two measures are derived from the mapping

-Average Concept Match tells the user about the degree of similarity between matched concepts of the WSDL and ontology

-Average Service Match helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Current and Future IssuesCurrent and Future Issuesbull User Interaction minimize user input but maximize impact of the

feedback

bull Real World Analysis can the current matching techniques be used in real world situations

bull P2P data management

bull Mapping Maintenance what happens when you map between two schemas and then one changes

bull Developing global schemas (or ontologies) for domains

bull Dealing with inconsistent data values for a schema elementDoan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More IssuesMore Issues

bull If we require user acceptance for our matches then what happens if our matcher returns thousands or hundreds of matches

bull Is it unrealistic to think that we will eventually perfect our matchers

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

ConclusionConclusionbull It is necessary to automate the matching process

bull Schema matching is very difficult and expensive

bull We have looked at a taxonomy and the descriptions of the existing approaches for matching

-Schema vs Instance-level

-Element vs Structure-level

-Language and Constraint based matchers

bull We also discussed several implementations of the matching techniques

ReferencesReferencesbull Bernstein P Rahm E A survey of approaches to automatic schema matching

wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

bull Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey httpanhaicsuiucedupublicdb-review14pdf

bull Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework POSV-WWW2004pdf

bull Vassilis C Integrating XML Data Sources using RDFS Schemas The ICS-FORTH Semantic Web Integration Middleware (SWIM) Dagsthul SeminarftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

QuestionsQuestions

Page 35: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Learner Based SolutionsLearner Based Solutions

bull Learner-Based exploit both schema and data

bull Requires a lot of training data but can exploit data

bull Rule and learner based techniques combined provide an effective matching solution

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

Combining Different MatchersCombining Different Matchersbull The ideal matching system must exploit many different types of

information and technique for maximum accuracy

bull More match candidates will be produced if the previous approaches are combined

bull Two Combination Methods 1 Hybrid integrates multiple matching criteria Better performance 2 Composite combine the results of independently executed matchers More flexible Can be done automatically or manually

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Schema Graphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges, created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: graph with nodes Direction, compass, degrees and DirectionCompass, and edges labeled hasElement]

SchemaNode representation of XML schema

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
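The conversion illustrated on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the MWSAF parser: the function name `xsd_to_edges`, the `hasType` edge label, the `example.org` namespace bound to `xsd1`, and the rule that built-in `xsd:` types do not become nodes are all assumptions made for the example. Only the `hasElement` label and the node names come from the slide.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The slide's complexType, wrapped in a schema element so it parses on its own.
# The xsd1 namespace URI is a placeholder, not taken from the source.
XSD_SNIPPET = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def xsd_to_edges(xsd_text):
    """Return (source, label, target) edges for every complexType found."""
    root = ET.fromstring(xsd_text.strip())
    edges = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            child = elem.get("name")
            edges.append((parent, "hasElement", child))
            prefix, _, local = elem.get("type", "").rpartition(":")
            # Assumed rule: only user-defined types become nodes of their own.
            if prefix and prefix != "xsd":
                edges.append((child, "hasType", local))
    return edges


for edge in xsd_to_edges(XSD_SNIPPET):
    print(edge)
```

Representing the graph as a flat list of labeled edges keeps the sketch short; a real SchemaGraph would likely carry more per-node information (cardinality, nillability) than is shown here.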

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: graph with nodes WindEvent, windDirection, windSpeed and Speed, and edges labeled hasProperty]

SchemaGraph representation of part of ontology

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
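The ontology side of the conversion can be sketched the same way. Again this is a hedged illustration rather than the MWSAF implementation: `ontology_to_edges`, `local_name`, the `hasRange` label and the rule that only locally defined range classes become nodes are assumptions; the `hasProperty` label and the class and property names come from the slide. The DAML+OIL namespace URI is used only so the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Trimmed version of the slide's ontology fragment (comments and labels omitted).
ONTOLOGY = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="WindEvent">
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>
</rdf:RDF>
"""


def local_name(uri):
    """Strip everything up to and including the last '#'."""
    return uri.rsplit("#", 1)[-1]


def ontology_to_edges(rdf_text):
    """Return (source, label, target) edges for every daml:Property."""
    root = ET.fromstring(rdf_text.strip())
    edges = []
    for prop in root.iter(f"{{{DAML}}}Property"):
        name = prop.get(f"{{{RDF}}}ID")
        domain = prop.find(f"{{{RDFS}}}domain")
        rng = prop.find(f"{{{RDFS}}}range")
        if domain is not None:
            owner = local_name(domain.get(f"{{{RDF}}}resource"))
            edges.append((owner, "hasProperty", name))
        if rng is not None:
            range_uri = rng.get(f"{{{RDF}}}resource")
            # Assumed rule: only classes defined in this ontology become nodes.
            if range_uri.startswith("#"):
                edges.append((name, "hasRange", local_name(range_uri)))
    return edges


for edge in ontology_to_edges(ONTOLOGY):
    print(edge)
```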

Mapping

• Measures of the Match Score

- Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms. Abbreviations are also checked.

- Schema Match: structural similarity, based on the similarity of sub-concepts.

• The getBestMapping function then looks at the Match Scores and determines a map set.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
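One plausible reading of the selection step is "for each WSDL element, keep the highest-scoring concept above a cutoff". The sketch below shows that reading only; the name `get_best_mapping`, the `threshold` parameter and the sample scores are hypothetical, and the slides do not say how the real getBestMapping breaks ties or sets its cutoff.

```python
def get_best_mapping(scores, threshold=0.5):
    """Pick the best concept per element.

    scores: dict mapping (wsdl_element, ontology_concept) -> match score in [0, 1].
    Returns a dict mapping wsdl_element -> (ontology_concept, score).
    """
    best = {}
    for (element, concept), score in scores.items():
        if score < threshold:
            continue  # too weak to propose to the user
        if element not in best or score > best[element][1]:
            best[element] = (concept, score)
    return best


# Made-up match scores for illustration.
scores = {
    ("windSpeed", "WindSpeed"): 0.9,
    ("windSpeed", "Speed"): 0.6,
    ("degrees", "Temperature"): 0.3,
}
mapping = get_best_mapping(scores)
print(mapping)
```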

MWSAF Matching Techniques: ElemMatch

• Name and String Matching algorithms

- NGram: considers the number of q-grams that the names have in common

- CheckSynonym: uses WordNet to find synonyms

- CheckAbbreviations: uses an abbreviation dictionary

- TokenMatcher: uses the Porter stemmer, tokenization and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
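As an illustration of the NGram idea, the following sketch scores two names by the q-grams they share. The Dice coefficient used for normalization, the default q = 3 and the function names are assumptions made for the example; the slides only say that the number of common q-grams is considered and that the result lies between 0 and 1.

```python
def qgrams(name, q=3):
    """Return the set of q-character substrings of a lower-cased name."""
    text = name.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over q-gram sets: 2 * |shared| / (|A| + |B|)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("windSpeed", "speed"))  # shares "spe", "pee", "eed"
```

Lower-casing before splitting makes the score insensitive to the camelCase conventions that WSDL element names often follow.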

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.

Two measures are then derived from the mapping:

- Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology

- Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
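The two measures can be sketched as simple averages. The definitions below are an assumption, since the slides do not give the formulas: average concept match is taken as the mean score over the concepts that were mapped, and average service match as the same sum divided by the total number of concepts in the WSDL, so unmapped concepts pull it down. Function names and sample scores are hypothetical.

```python
def average_concept_match(match_scores):
    """Mean score over the WSDL concepts that were actually mapped."""
    return sum(match_scores) / len(match_scores) if match_scores else 0.0


def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of mapped scores divided by all WSDL concepts, mapped or not."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(match_scores) / total_wsdl_concepts


# Made-up scores of the mapped concepts, one list per ontology in the store.
per_ontology = {"weather": [1.0, 0.5], "geography": [0.5]}
total_concepts = 4  # concepts in the WSDL being annotated

# Categorize the service under the ontology with the highest service match.
best_domain = max(
    per_ontology,
    key=lambda d: average_service_match(per_ontology[d], total_concepts),
)
print(best_domain, average_concept_match(per_ontology[best_domain]))
```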

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and the descriptions of the existing approaches for matching

- Schema vs. Instance-level

- Element vs. Structure-level

- Language and Constraint based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf

• Christophides V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions


Combining Different Matchers

• The ideal matching system must exploit many different types of information and techniques for maximum accuracy

• More match candidates will be produced if the previous approaches are combined

• Two Combination Methods

1. Hybrid: integrates multiple matching criteria within one matcher. Better performance.

2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
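A composite matcher can be sketched as a weighted average over independently executed matchers. The two toy matchers, the function names and the weighting scheme below are illustrative assumptions; a real system would plug in name, constraint or instance-level matchers and might combine them differently (for example by taking the maximum).

```python
def exact_name(a, b):
    """1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix(a, b):
    """Length of the common prefix divided by the longer name's length."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def composite_match(a, b, matchers, weights=None):
    """Run every matcher independently and combine the scores afterwards."""
    weights = weights or [1.0] * len(matchers)
    total = sum(w * m(a, b) for m, w in zip(matchers, weights))
    return total / sum(weights)


print(composite_match("zip", "zipcode", [exact_name, shared_prefix]))
```

Because each matcher only sees the two names and returns a score, matchers can be added, removed or reweighted without touching the others, which is the flexibility the composite approach is credited with above.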

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

LSD (Univ. of Washington)

• Learning Source Descriptions

• Uses machine learning techniques to match a new data source against a previously determined global schema

• Uses a name matcher and several instance-level matchers

• System is trained with sample user inputs, from which it learns patterns and matching rules

• Mostly instance-oriented but can use schema information too

• Also supports user-supplied domain constraints on the global schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)

• Semantic Knowledge Articulation Tool

• Follows a rule-based approach to semi-automatically determine matches between two ontologies

• User input required: the user must provide application-specific match/mismatch relations and must approve or reject matches

• SKAT matching is used within the ONION architecture for ontology integration

• In ONION an "articulation ontology" is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)

• Uses schema matching to derive an automatic data translation between schema instances

• Schemas are transformed into labeled graphs

• Matching is performed node by node (element-level, 1:1), starting at the top

• Requires user intervention if no match is found (i.e., to provide a new rule)

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)

• Compares pairs of objects by their attributes and the is-a relationships that they are involved in

• These pairs are given a match score between 0 and 1

• User must specify synonyms, homonyms and inclusion properties

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)

• Hybrid matcher

• Element and Structural-Level matches

Phase 1: Linguistic Element-Level
- categorizes elements based on names, data types and domains
- calculates a linguistic similarity coefficient

Phase 2:
- transforms the original schema into a tree, then performs a bottom-up structure matching
- calculates a similarity value
- calculates a weighted mean of the linguistic and structural similarity of pairs of elements

Phase 3:
- uses the mean from phase 2 to decide on a mapping

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
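The weighted mean of phase 2 and the decision of phase 3 can be sketched as follows. The parameter names `w_struct` and `threshold`, their default values, the element names and the sample similarities are hypothetical; the slides only state that a weighted mean of the linguistic and structural similarity is computed and then used to decide on a mapping.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(pairs, w_struct=0.5, threshold=0.6):
    """Keep the element pairs whose weighted similarity reaches the threshold.

    pairs: dict mapping (element_a, element_b) -> (lsim, ssim).
    """
    mapping = {}
    for pair, (lsim, ssim) in pairs.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold:
            mapping[pair] = wsim
    return mapping


# Made-up similarities for two candidate pairs.
pairs = {
    ("POLines", "Items"): (0.5, 1.0),
    ("City", "Street"): (0.25, 0.5),
}
print(decide_mapping(pairs))
```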

Clio (IBM Almaden and Univ. of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three Components

- Schema Readers: read a schema and translate it into an internal representation

- Correspondence Engine: used to identify matching parts of the schemas or databases

- Mapping Generator: generates view definitions to map data in the source schema to data in the target schema

Bernstein P., Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ. and Univ. of Leipzig)

• Graph Matching Algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
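The structural step can be illustrated with a deliberately simplified propagation loop: node pairs that are connected by equally labeled edges in both graphs pass similarity to each other, and the values are renormalized after every round. This is a sketch under stated assumptions, not the published algorithm: it uses uniform propagation weights and a fixed number of iterations, and all names and sample data are hypothetical.

```python
from itertools import product


def similarity_flooding(edges_a, edges_b, initial, iterations=10):
    """Propagate similarity between node pairs of two labeled graphs.

    edges_a, edges_b: lists of (source, label, target) triples.
    initial: dict mapping (node_a, node_b) -> initial similarity,
             for example from a name matcher.
    """
    nodes_a = {n for s, _, t in edges_a for n in (s, t)}
    nodes_b = {n for s, _, t in edges_b for n in (s, t)}
    # Two pairs are neighbors when both graphs link them with the same label.
    neighbors = {pair: set() for pair in product(nodes_a, nodes_b)}
    for (s1, l1, t1), (s2, l2, t2) in product(edges_a, edges_b):
        if l1 == l2:
            neighbors[(s1, s2)].add((t1, t2))
            neighbors[(t1, t2)].add((s1, s2))
    sigma0 = {pair: initial.get(pair, 0.0) for pair in neighbors}
    sigma = dict(sigma0)
    for _ in range(iterations):
        raw = {
            p: sigma0[p] + sigma[p] + sum(sigma[n] for n in neighbors[p])
            for p in neighbors
        }
        top = max(raw.values()) or 1.0  # avoid dividing by zero
        sigma = {p: v / top for p, v in raw.items()}
    return sigma


graph_a = [("Person", "has", "name")]
graph_b = [("Human", "has", "label")]
result = similarity_flooding(graph_a, graph_b, {("Person", "Human"): 1.0})
print(sorted(result.items(), key=lambda kv: -kv[1]))
```

In this toy run the pair (name, label) starts with no similarity at all and gains some only because its parents, Person and Human, were matched by name, which is the intuition behind handing an initial element-level match to a structural matcher.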

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' that have matching attributes and can choose from those.

Bernstein P., Rahm E. A survey of approaches to automatic schema matching
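The "attribute as document" idea can be sketched by flattening each attribute's metadata into a string and ranking candidates by text similarity. Bag-of-words cosine similarity is used here as a stand-in, since the slides do not say which retrieval technique Delta uses; the function names, metadata fields and sample attributes are hypothetical.

```python
import math
from collections import Counter


def attribute_document(metadata):
    """Flatten an attribute's metadata values into one lower-cased string."""
    return " ".join(str(value) for value in metadata.values()).lower()


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two documents."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate names ordered from most to least similar."""
    query_doc = attribute_document(query)
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(
            query_doc, attribute_document(candidates[name])
        ),
        reverse=True,
    )


query = {"name": "cust_phone", "description": "customer telephone number"}
candidates = {
    "phone": {"name": "phone", "description": "telephone number of the customer"},
    "salary": {"name": "salary", "description": "annual salary amount"},
}
ranked = rank_candidates(query, candidates)
print(ranked)  # the user would choose from this ranked list
```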



Page 38: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

LSD (Univ of Washington)LSD (Univ of Washington)

bull Learning Source Descriptions

bull Uses machine learning techniques to match a new data source against a previously determined global schema

bull Uses a name matcher and several instance-level matchers

bull System is trained with sample user inputs and it learns patterns and matching rules

bull Mostly instance-oriented but can use schema information too

bull Also supports user input domain constraints on the global schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

SKAT (Stanford University)SKAT (Stanford University)

bull Semantic Knowledge Articulation Toolbull Follows a rule-based approach to semi-automatically determine

matches between two ontologies

bull User input required The user must provide application specific matchmismatch relations The user must approve or reject matches

bull SKAT matching is used within the ONION architecture for ontology integration

bull In ONION an ldquoarticulation ontologyrdquo is constructed from the rules Matching is based on is-a relationships between the articulation ontology and the source ontology

Bernstein P Rahm E A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)TransScm (Tel Aviv University)

bull Uses schema matching to derive an automatic data translation between schema instances

bull Schemas are transformed into labeled graphs

bull Matching is performed node by node (element-level 11) starting at the top

bull Requires user intervention if no match is found (ie to provide a new rule)

Bernstein P Rahm E A survey of approaches to automatic schema matching

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ of Toronto)

• Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

• Three components:
 - Schema Readers read a schema and translate it into an internal representation
 - Correspondence Engine is used to identify matching parts of the schemas or databases
 - Mapping Generator generates view definitions to map data in the source schema to data in the target schema

Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ and Univ of Leipzig)

• Graph matching algorithm

• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

• Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P, Rahm E. A survey of approaches to automatic schema matching
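
The idea of handing an initial name match to a structural matcher can be illustrated with a heavily simplified propagation loop. This is a sketch under stated assumptions, not the published algorithm: uniform propagation weights, a fixed number of iterations and the function name `flood` are all choices made for the example.

```python
from collections import defaultdict


def flood(edges1, edges2, initial, iterations=10):
    """Propagate similarity between node pairs of two labeled graphs.

    edges1, edges2: lists of (source, label, target) triples.
    initial: {(node1, node2): score} from a name matcher.
    """
    # Two node pairs are neighbours when both graphs link them
    # with an edge that carries the same label.
    neighbours = defaultdict(set)
    for a, label1, a2 in edges1:
        for b, label2, b2 in edges2:
            if label1 == label2:
                neighbours[(a, b)].add((a2, b2))
                neighbours[(a2, b2)].add((a, b))

    pairs = set(neighbours) | set(initial)
    if not pairs:
        return {}

    sigma = {p: initial.get(p, 0.0) for p in pairs}
    for _ in range(iterations):
        # Each pair keeps its own score and receives its neighbours' scores.
        nxt = {
            p: initial.get(p, 0.0) + sigma[p] + sum(sigma[q] for q in neighbours[p])
            for p in pairs
        }
        top = max(nxt.values()) or 1.0  # normalise so the best pair scores 1.0
        sigma = {p: v / top for p, v in nxt.items()}
    return sigma
```

Here a pair with no name similarity of its own, such as two parent nodes, gains a score because its children were matched by name.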

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches

• The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other "documents" with matching attributes and can choose from those

Bernstein P, Rahm E. A survey of approaches to automatic schema matching
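
Treating attribute metadata as a text document and ranking other documents against it can be sketched with a bag-of-words cosine similarity. This is an assumed stand-in for whatever text retrieval Delta actually uses; the function names and the metadata fields are hypothetical.

```python
import math
import re
from collections import Counter


def to_document(metadata: dict) -> str:
    """Concatenate all metadata values of an attribute into one text string."""
    return " ".join(str(value) for value in metadata.values())


def term_vector(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(query_meta: dict, candidates: dict) -> list:
    """Return (attribute_name, score) pairs, best match first, for the user to choose from."""
    query = term_vector(to_document(query_meta))
    scored = [(name, cosine(query, term_vector(to_document(meta))))
              for name, meta in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The ranked list is only a suggestion; as on the slide, the final choice stays with the user.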

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution

• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

MWSAF: Meteor-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain
2. The WSDL is classified into a domain
3. The matches are given to the user to accept or reject
4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main Components of the System

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithm

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF: Schema Graphs

PROBLEM: The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find the mappings between them

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="compass" nillable="true"
                 type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1"
                 name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaGraph with nodes Direction, compass, degrees and DirectionCompass; edge label hasElement]

SchemaNode representation of XML schema

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
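
A conversion of this kind can be sketched in a few lines. The rule used here (each complexType becomes a node, and each child element becomes a node linked to it by a hasElement edge) is read off the figure labels and is an assumption, as are the function name, the wrapping `xsd:schema` element, the XML Schema namespace URI and the placeholder `urn:example:weather` namespace.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the xsd prefix; the slide does not show it.
XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def complex_types_to_edges(xsd_text: str) -> list:
    """Return (parent, 'hasElement', child) triples for every complexType."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            edges.append((parent, "hasElement", elem.get("name")))
    return edges


edges = complex_types_to_edges(SCHEMA)
```

The resulting triples are the node and edge set that a matcher can then compare against the graph built from the ontology.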

MWSAF: Meteor-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent" />
  <rdfs:range rdf:resource="#Speed" />
</daml:Property>

[Figure: SchemaGraph with nodes WindEvent, windDirection, windSpeed and Speed; edge label hasProperty]

SchemaGraph representation of part of ontology

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score

- Element Level Match: linguistic similarity of two concepts based on names. Uses WordNet to check for synonyms. Abbreviations are also checked

- Schema Match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
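
One way to picture the selection step is a function that keeps, for each WSDL element, the ontology concept with the highest match score. This is a sketch, not the MWSAF getBestMapping code: the signature, the `threshold` parameter and the score table format are assumptions.

```python
def get_best_mapping(scores: dict, threshold: float = 0.5) -> dict:
    """Pick the best-scoring ontology concept for each WSDL element.

    scores maps (wsdl_element, ontology_concept) -> match score in [0, 1].
    Elements whose best score is below the threshold stay unmapped.
    """
    best = {}
    for (element, concept), score in scores.items():
        current = best.get(element, (None, -1.0))
        if score >= threshold and score > current[1]:
            best[element] = (concept, score)
    return best
```

The threshold keeps weak candidates out of the map set, so the user is only asked to accept or reject plausible matches.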

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms

- NGram: considers the number of q-grams that the names have in common
- CheckSynonym: uses WordNet to find synonyms
- CheckAbbreviations: uses an abbreviation dictionary
- TokenMatcher: uses the Porter stemmer, tokenization and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
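
The NGram measure can be illustrated with a q-gram overlap score. The slides do not give the formula, so the Dice coefficient over sets of character trigrams used here is an assumption, as are the function names.

```python
def qgrams(name: str, q: int = 3) -> set:
    """Return the set of character q-grams of a lower-cased name."""
    s = name.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient of the two q-gram sets, a value between 0 and 1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

Names that differ only in case or separators, such as windSpeed and wind_speed, share most of their trigrams and score well above unrelated names.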

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology

Two measures are then derived from each mapping:

- Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology

- Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
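
The slides do not define the two measures, so the sketch below assumes simple definitions: Average Concept Match as the mean score over the mapped concepts, and Average Service Match as the summed score divided by the total number of concepts in the WSDL. The function names and the categorization rule (pick the ontology with the highest Average Service Match) are likewise assumptions.

```python
def average_concept_match(mapping: dict) -> float:
    """Mean match score over the concepts that were actually mapped."""
    if not mapping:
        return 0.0
    return sum(mapping.values()) / len(mapping)


def average_service_match(mapping: dict, total_wsdl_concepts: int) -> float:
    """Summed match score divided by all WSDL concepts, mapped or not."""
    if total_wsdl_concepts <= 0:
        return 0.0
    return sum(mapping.values()) / total_wsdl_concepts


def classify(mappings_by_ontology: dict, total_wsdl_concepts: int) -> str:
    """Return the name of the ontology with the highest Average Service Match."""
    return max(
        mappings_by_ontology,
        key=lambda name: average_service_match(mappings_by_ontology[name], total_wsdl_concepts),
    )
```

Under these definitions an ontology that matches one concept very well can have a high Average Concept Match and still lose the categorization to an ontology that covers more of the WSDL.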

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and the descriptions of the existing approaches for matching

- Schema vs. Instance-level

- Element vs. Structure-level

- Language- and Constraint-based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P, Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Christophides V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions?



Page 41: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

DIKE (Univ of Reggio DIKE (Univ of Reggio Calabria Univ of Calabria)Calabria Univ of Calabria)

bull Compares pairs of objects by their attributes and the is-a relationships that they are involved in

bull These pairs are given a match score between 0 and 1

bull User must specify synonyms homonyms and inclusion properties

Bernstein P Rahm E A survey of approaches to automatic schema matching

Cupid (Microsoft Research)Cupid (Microsoft Research)bull Hybrid matcherbull Element and Structural-Level matches

Phase 1 Linguistic Element-Level - categorizes elements based on name data types and domains - calculates a linguistic similarity coefficient Phase 2 - transform the original schema into a tree then perform a bottom-up structure

matching - calculates a similarity value - calculates a weighted mean of linguistic and structural similarity of pairs of

elements

Phase 3 - uses the mean from phase 2 to decide on a mapping

Bernstein P Rahm E A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ Clio (IBM Almaden and Univ of Toronto)of Toronto)

bull Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema

bull Three Components Schema Readers read schema and translate it into an

internal representation Correspondence Engine is used to identify matching parts

of the schemas or databases Mapping Generator generates view definitions to map data

in the source schema to data in the target schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution

• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain

2. The WSDL is classified into a domain

3. The matches are given to the user to accept or reject

4. Upon the user's acceptance, the annotations are written to the WSDL

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main Components of the System

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain

2. Parser Library: consists of the parsers used to generate the SchemaGraphs

3. Matcher Library: provides the schema matching algorithm

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF: SchemaGraphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly

MWSAF converts both models to a common representation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find the mappings between them

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Nodes: Direction, compass, DirectionCompass, degrees; edge label: hasElement]

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
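One way to picture the conversion is a small parser that turns each complex type into a node and each of its child elements into a hasElement edge. This is a rough sketch under assumptions, not the MWSAF Parser Library: the function name, the edge tuple format, the 2001 XML Schema namespace URI, and the `http://example.org/weather` namespace bound to `xsd1` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the xsd prefix; the slide does not show the declaration.
XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="http://example.org/weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def complex_types_to_edges(xsd_text):
    """Return (complex type, 'hasElement', element name) edges for a schema."""
    root = ET.fromstring(xsd_text.strip())
    edges = []
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        parent = ctype.get("name")
        # Every element declared inside the complex type becomes a child node.
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            edges.append((parent, "hasElement", elem.get("name")))
    return edges


edges = complex_types_to_edges(SCHEMA)
```

For the Direction type this produces one hasElement edge to compass and one to degrees, which is the shape the figure on this slide describes.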

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Nodes: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

Mapping

• Measures of the Match Score

- Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked

- Schema Match: structural similarity, based on sub-concept similarities

• The getBestMapping function then looks at the Match Scores and determines a set of mappings

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
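The slide names getBestMapping but does not describe how it selects mappings. A plausible minimal sketch, assuming it keeps the highest-scoring concept for each WSDL element above a cut-off, might look like this; the signature, the dictionary layout, and the 0.5 threshold are assumptions for illustration only.

```python
def get_best_mapping(match_scores, threshold=0.5):
    """For each WSDL element, keep the highest-scoring ontology concept.

    match_scores maps (wsdl_element, ontology_concept) -> score in [0, 1].
    Pairs scoring below the (hypothetical) threshold are ignored.
    """
    best = {}
    for (element, concept), score in match_scores.items():
        if score < threshold:
            continue
        if element not in best or score > best[element][1]:
            best[element] = (concept, score)
    return best


# Made-up match scores between WSDL elements and ontology concepts.
scores = {
    ("windSpd", "windSpeed"): 0.9,
    ("windSpd", "windDirection"): 0.4,
    ("temp", "Speed"): 0.2,
}
mapping = get_best_mapping(scores)
```

Here `temp` stays unmapped because none of its candidates reaches the assumed threshold, which leaves the decision to the user instead of forcing a weak match.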

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms

- NGram: considers the number of q-grams that the names have in common

- CheckSynonym: uses WordNet to find synonyms

- CheckAbbreviations: uses an abbreviation dictionary

- TokenMatcher: uses Porter stemming, tokenization, and substring matching techniques

• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
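As an illustration of the NGram matcher, the following sketch scores two names by the q-grams they share. The Dice coefficient, the default q = 3, and the lowercasing step are assumptions for this example; the slides do not say how MWSAF normalizes the count.

```python
def qgrams(name, q=3):
    """Return the set of q-character substrings of a lowercased name."""
    text = name.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, giving a value between 0 and 1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    return 2.0 * shared / (len(grams_a) + len(grams_b))


same = ngram_similarity("windSpeed", "windSpeed")
close = ngram_similarity("windSpeed", "wind_speed")
unrelated = ngram_similarity("abc", "xyz")
```

Identical names score 1, names with no q-gram in common score 0, and near-matches such as `windSpeed` and `wind_speed` fall in between.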

Matching

• Once each WSDL has been compared against all of the ontologies in the store and a mapping has been created for each ontology, two measures are derived from the mapping:

- Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology

- Average Service Match: helps to categorize the service

We have a machine learning alternative for categorization

Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
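The slide does not give formulas for the two measures. The sketch below assumes that Average Concept Match averages the scores over the mapped concepts only, while Average Service Match divides the same sum by all concepts in the WSDL, and that the service is assigned to the domain of the ontology with the highest Average Service Match. These definitions, the function names, and the sample data are assumptions for illustration.

```python
def average_concept_match(mapping):
    """Mean score over the WSDL concepts that were actually mapped."""
    if not mapping:
        return 0.0
    return sum(score for _, score in mapping.values()) / len(mapping)


def average_service_match(mapping, total_wsdl_concepts):
    """Sum of mapped scores divided by all WSDL concepts, mapped or not."""
    if total_wsdl_concepts <= 0:
        return 0.0
    return sum(score for _, score in mapping.values()) / total_wsdl_concepts


def classify(mappings_by_ontology, total_wsdl_concepts):
    """Pick the ontology whose mapping has the highest average service match."""
    return max(
        mappings_by_ontology,
        key=lambda name: average_service_match(
            mappings_by_ontology[name], total_wsdl_concepts
        ),
    )


# Made-up mappings of a four-concept WSDL against two hypothetical ontologies.
mappings = {
    "weather": {"windSpeed": ("windSpeed", 0.9), "temp": ("Temperature", 0.7)},
    "geo": {"windSpeed": ("Speed", 0.5)},
}
domain = classify(mappings, total_wsdl_concepts=4)
```

Under these assumptions a mapping that covers few concepts with high scores gets a high concept match but a low service match, which is why the second measure is the more natural one for choosing a domain.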

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directories
• Conclusion

Current and Future Issues

• User Interaction: minimize user input but maximize the impact of the feedback

• Real World Analysis: can the current matching techniques be used in real-world situations?

• P2P data management

• Mapping Maintenance: what happens when you map between two schemas and then one changes?

• Developing global schemas (or ontologies) for domains

• Dealing with inconsistent data values for a schema element

Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process

• Schema matching is very difficult and expensive

• We have looked at a taxonomy and the descriptions of the existing approaches for matching

- Schema vs Instance-level

- Element vs Structure-level

- Language and Constraint based matchers

• We also discussed several implementations of the matching techniques

References

• Bernstein P, Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf

• Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf

• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf

• Christophides V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 44: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Similarity flooding (Stanford Similarity flooding (Stanford Univ and Univ of Leipzig)Univ and Univ of Leipzig)

bull Graph Matching Algorithm

bull Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs

bull Uses a name matcher to get an initial element-level match that is then given to the structural matcher

Bernstein P Rahm E A survey of approaches to automatic schema matching

Delta (Mitre)Delta (Mitre)

bull Uses attribute descriptions to determine attribute matches

bull The method is to group the metadata about an attribute into a text string which is presented as a document The user is then presented with other lsquodocumentsrsquo with matching attributes and can chose from those

Bernstein P Rahm E A survey of approaches to automatic schema matching

Tess (Univ of Massachusetts Tess (Univ of Massachusetts Amherst)Amherst)

bull System for helping to cope with schema evolution

bull Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema

Bernstein P Rahm E A survey of approaches to automatic schema matching

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

LSDIS Lab UGALSDIS Lab UGAbull What is it

A tool for semi-automatically marking up web service descriptions with ontologies

It helps in describing services semantically and aids in efficient web service discovery and composition

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation FrameworkMWSAF Meteor-S Web Service Annotation FrameworkOntology to SchemaGraph conversion rulesOntology to SchemaGraph conversion rules

ltdamlClass rdfID=WindEventgt ltrdfscommentgtSuperclass for all events dealing with windltrdfscommentgt ltrdfslabelgtWind eventltrdfslabelgt ltrdfssubClassOf rdfresource=WeatherEvent gt ltdamlClassgtltdamlProperty rdfID=windDirectiongt ltrdfslabelgtWind directionltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource = httpwwww3org200010XMLSchemastring gt ltdamlPropertygtltdamlProperty rdfID=windSpeedgt ltrdfslabelgtWind speedltrdfslabelgt ltrdfsdomain rdfresource=WindEvent gt ltrdfsrange rdfresource=Speed gt ltdamlPropertygt

WindEvent

windDirection Speed

hasProperty windSpeed

SchemaGraph representation of part of ontologyPatil A Oundhakar S Sheth A Verma K METEOR-S Web service

Annotation Framework

MappingMapping

bull Measures of the Match Score

-Element Level Match linguistic similarity of two concepts based on names Uses WordNet to check for synonyms Abbreviations are even checked

-Schema Match structural similarity sub-concept similarities

bull The getBestMapping function then looks at the Match Scores and determines a map set

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Matching TechniquesMWSAF Matching TechniquesElemMatchElemMatch

bull Name and String Matching algorithms

-NGram considers the number of qgrams that the names have in common

-CheckSynonym uses Wordnet to find synonyms -CheckAbbreviations uses an abbreviation dictionary -TokenMatcher uses Porter Stemmer tonkenization and

substring matching techniques bull Each algorithm returns a value between 0 and 1 These

values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MatchingMatching

bull Once Each WSDL is compared against all of the ontologies in the store and a mapping has been created for each ontology

Then two measures are derived from the mapping

-Average Concept Match tells the user about the degree of similarity between matched concepts of the WSDL and ontology

-Average Service Match helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Current and Future IssuesCurrent and Future Issuesbull User Interaction minimize user input but maximize impact of the

feedback

bull Real World Analysis can the current matching techniques be used in real world situations

bull P2P data management

bull Mapping Maintenance what happens when you map between two schemas and then one changes

bull Developing global schemas (or ontologies) for domains

bull Dealing with inconsistent data values for a schema elementDoan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens when the matcher returns hundreds or thousands of matches?

• Is it unrealistic to think that we will eventually perfect our matchers?

Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey.

Conclusion

• It is necessary to automate the matching process.

• Schema matching is very difficult and expensive.

• We have looked at a taxonomy and descriptions of the existing approaches to matching:

- Schema- vs. instance-level

- Element- vs. structure-level

- Language- and constraint-based matchers

• We also discussed several implementations of the matching techniques.

References

• Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

• Doan, A., Halevy, A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf

• Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework. POSV-WWW2004pdf

• Vassilis, C. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions

Page 45: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

Delta (Mitre)

• Uses attribute descriptions to determine attribute matches.

• The method groups the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other 'documents' with matching attributes and can choose from those.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.
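The document-based idea behind Delta can be illustrated with a toy ranking. This is only a sketch under stated assumptions: it uses a plain bag-of-words cosine similarity rather than whatever text retrieval Delta actually relies on, and the attribute fields (`name`, `type`, `description`) and helper names are hypothetical.

```python
import math
import re
from collections import Counter


def as_document(attribute):
    """Group an attribute's metadata into one text string."""
    fields = ("name", "type", "description")
    return " ".join(str(attribute[f]) for f in fields if f in attribute)


def tokens(text):
    """Count lowercase alphanumeric tokens in a text string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(doc_a, doc_b):
    """Cosine similarity between two bag-of-words documents."""
    ta, tb = tokens(doc_a), tokens(doc_b)
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return (score, name) pairs, best match first, for the user to review."""
    query_doc = as_document(query)
    scored = [(cosine(query_doc, as_document(c)), c["name"]) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

The ranked list is what would be shown to the user, who then chooses the correct match from the top candidates.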

Tess (Univ of Massachusetts Amherst)

• System for helping to cope with schema evolution.

• Takes definitions of the old and new schemas and produces a program that transforms data conforming to the old schema into data conforming to the new schema.

Bernstein, P., Rahm, E. A survey of approaches to automatic schema matching.


MWSAF: METEOR-S Web Service Annotation Framework

LSDIS Lab, UGA

• What is it?

A tool for semi-automatically marking up web service descriptions with ontologies.

It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool

• Input: WSDL file

1. Individual elements of the WSDL are matched to concepts in the domain.

2. The WSDL is classified into a domain.

3. The matches are given to the user to accept or reject.

4. Upon the user's acceptance, the annotations are written to the WSDL.

• Output: WSDL file with semantic annotations

MWSAF Architecture

Main components of the system:

1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.

2. Parser Library: consists of the parsers used to generate the SchemaGraphs.

3. Matcher Library: provides the schema matching algorithms.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.

MWSAF: SchemaGraphs

PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match the two models directly.

MWSAF converts both models to a common representation format called SchemaGraph.

A SchemaGraph is a set of nodes connected by edges, created using conversion functions.

MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.

MWSAF: METEOR-S Web Service Annotation Framework

XML to SchemaGraph conversion rules

<xsd:complexType name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
  </xsd:sequence>
</xsd:complexType>

[Figure: SchemaGraph representation of the XML schema. Labels: Direction, degrees, DirectionCompass, hasElement, compass.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.
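The conversion of the Direction type can be sketched as follows. The rules used here are an assumption, read off the labels in the figure rather than taken from the MWSAF paper: an element with a built-in type becomes a node attached by a `hasElement` edge, and an element with a user-defined type becomes an edge, named after the element, that points at the type's node. The sample namespace `urn:example:weather` and the convention that the `xsd` prefix marks built-in types are also assumptions.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Sample schema modelled on the Direction type above; the xsd1 namespace
# URI is a placeholder.
SAMPLE_XSD = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsd1="urn:example:weather">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                   nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees"
                   type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def xsd_to_schema_graph(xsd_text):
    """Return (nodes, edges); edges are (source, label, target) triples."""
    root = ET.fromstring(xsd_text)
    nodes, edges = set(), set()
    for ctype in root.iter(f"{{{XSD_NS}}}complexType"):
        owner = ctype.get("name")
        if owner is None:
            continue  # anonymous types are ignored in this sketch
        nodes.add(owner)
        for elem in ctype.iter(f"{{{XSD_NS}}}element"):
            name = elem.get("name")
            prefix, _, local_type = elem.get("type", "").rpartition(":")
            if prefix == "xsd" or not local_type:
                # Built-in (or untyped) element: a node hanging off the owner.
                nodes.add(name)
                edges.add((owner, "hasElement", name))
            else:
                # User-defined type: the element name labels the edge.
                nodes.add(local_type)
                edges.add((owner, name, local_type))
    return nodes, edges
```

Running `xsd_to_schema_graph(SAMPLE_XSD)` yields the three nodes named in the figure and one edge per element.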

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Labels: WindEvent, windDirection, Speed, hasProperty, windSpeed.]

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.
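The ontology side can be sketched in the same spirit. Again the rules are an assumption inferred from the figure labels, not the documented MWSAF conversion functions: a class becomes a node, a property whose range is an XML Schema datatype becomes a node attached by a `hasProperty` edge, and a property whose range is another class becomes an edge named after the property. The DAML+OIL namespace URI in the sample is the commonly used one and is included only to make the snippet parse.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"
XSD_PREFIX = "http://www.w3.org/2000/10/XMLSchema#"

# Trimmed version of the WindEvent fragment above.
SAMPLE_ONTOLOGY = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="WindEvent"/>
  <daml:Property rdf:ID="windDirection">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="{XSD_PREFIX}string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>
</rdf:RDF>"""


def local_name(reference):
    """Strip everything up to and including the last '#'."""
    return reference.rsplit("#", 1)[-1]


def ontology_to_schema_graph(text):
    """Return (nodes, edges); edges are (source, label, target) triples."""
    root = ET.fromstring(text)
    nodes, edges = set(), set()
    for cls in root.iter(f"{{{DAML}}}Class"):
        nodes.add(cls.get(f"{{{RDF}}}ID"))
    for prop in root.iter(f"{{{DAML}}}Property"):
        name = prop.get(f"{{{RDF}}}ID")
        domain = prop.find(f"{{{RDFS}}}domain")
        rng = prop.find(f"{{{RDFS}}}range")
        if domain is None or rng is None:
            continue  # properties without domain or range are skipped here
        source = local_name(domain.get(f"{{{RDF}}}resource"))
        target_ref = rng.get(f"{{{RDF}}}resource")
        nodes.add(source)
        if target_ref.startswith(XSD_PREFIX):
            # Datatype-valued property: a node hanging off the class.
            nodes.add(name)
            edges.add((source, "hasProperty", name))
        else:
            # Class-valued property: the property name labels the edge.
            target = local_name(target_ref)
            nodes.add(target)
            edges.add((source, name, target))
    return nodes, edges
```

With both models reduced to the same node and edge form, a single matcher can compare a WSDL schema against an ontology without knowing which source format each graph came from.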

Mapping

• Measures of the match score:

- Element-level match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.

- Schema match: structural similarity, based on sub-concept similarities.

• The getBestMapping function then looks at the match scores and determines a map set.

Patil, A., Oundhakar, S., Sheth, A., Verma, K. METEOR-S Web Service Annotation Framework.
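The slide does not say how getBestMapping chooses among candidate pairs. One simple possibility, shown here purely as an assumption, is a greedy one-to-one selection: take pairs in descending score order, skip any pair whose WSDL or ontology concept is already used, and stop at a cutoff. The function name `get_best_mapping_sketch` and the `threshold` parameter are hypothetical.

```python
def get_best_mapping_sketch(scores, threshold=0.5):
    """Greedy one-to-one mapping from pairwise match scores.

    scores maps (wsdl_concept, ontology_concept) pairs to a score in [0, 1].
    Returns {wsdl_concept: (ontology_concept, score)}.
    """
    mapping, used_targets = {}, set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (wsdl_concept, onto_concept), score in ranked:
        if score < threshold:
            break  # every remaining pair scores lower still
        if wsdl_concept in mapping or onto_concept in used_targets:
            continue  # keep the mapping one-to-one
        mapping[wsdl_concept] = (onto_concept, score)
        used_targets.add(onto_concept)
    return mapping
```

A greedy pass is easy to follow but not guaranteed to maximize the total score; an assignment algorithm would be the natural replacement if that mattered.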





Page 49: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

MWSAF Annotation ToolMWSAF Annotation Tool

bull Input WSDL File

1 Individual elements of the WSDL are matched to concepts in the domain

2 The WSDL is classified into a domain3 The Matches are given to the user to accept or reject4 Upon the userrsquos acceptance the annotations are written

to the WSDL

bull Output WSDL File with semantic annotations

MWSAF ArchitectureMWSAF Architecture

Main Components of the System

1 Ontology Store stores the DAML and RDF ontologies that will be used to annotate the WSDL files Ontologies are categorized by domain

2 Parser Library consists of the parsers used to generate the SchemaGraphs

3 Matcher Library provides schema matching algorithm

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAFMWSAFSchema GraphsSchema Graphs

PROBLEM The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly

MWSAF converts both models to a commonrepresentation format called SchemaGraph

A SchemaGraph is a set of nodes connected by edges that are created using conversion functions

Then it applies a matching algorithm to find themappings between them

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF Meteor-S Web Service Annotation MWSAF Meteor-S Web Service Annotation FrameworkFramework

XML to SchemaGraph conversion rulesXML to SchemaGraph conversion rules

ltxsdcomplexType name=Directiongt

ltxsdsequencegt

ltxsdelement maxOccurs=1 minOccurs=1

name=compass nillable=true

type=xsd1DirectionCompass gt

ltxsdelement maxOccurs=1 minOccurs=1

name=degrees type=xsdint gt

ltxsdsequencegt

ltxsdcomplexTypegt Direction

degreesDirectionCompass

hasElementcompass

SchemaNode representation of XML schema

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MWSAF: METEOR-S Web Service Annotation Framework

Ontology to SchemaGraph conversion rules

<daml:Class rdf:ID="WindEvent">
  <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
  <rdfs:label>Wind event</rdfs:label>
  <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
</daml:Class>
<daml:Property rdf:ID="windDirection">
  <rdfs:label>Wind direction</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
</daml:Property>
<daml:Property rdf:ID="windSpeed">
  <rdfs:label>Wind speed</rdfs:label>
  <rdfs:domain rdf:resource="#WindEvent"/>
  <rdfs:range rdf:resource="#Speed"/>
</daml:Property>

[Figure: SchemaGraph representation of part of the ontology. Node labels: WindEvent, windDirection, windSpeed, Speed; edge label: hasProperty]

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
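The ontology-to-SchemaGraph conversion can be sketched the same way. This is a minimal illustration, assuming that each daml:Property becomes a node attached to its rdfs:domain class by a hasProperty edge, as the figure suggests. The function names, the edge tuple format, and the enclosing rdf:RDF element are assumptions made for this example, not MWSAF's actual code.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Part of the ontology from the slide, wrapped in rdf:RDF so that the
# namespace prefixes are declared (an assumption for self-containment).
ONTOLOGY = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="WindEvent">
    <rdfs:subClassOf rdf:resource="#WeatherEvent"/>
  </daml:Class>
  <daml:Property rdf:ID="windDirection">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>
</rdf:RDF>
"""


def local_name(reference):
    """Strip everything up to and including '#' from an rdf:resource value."""
    return reference.rsplit("#", 1)[-1]


def ontology_to_schema_graph(xml_text):
    """Return (class, 'hasProperty', property) edges (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    edges = []
    for prop in root.iter(f"{{{DAML}}}Property"):
        domain = prop.find(f"{{{RDFS}}}domain")
        if domain is None:
            continue  # a property without a domain is not attached to a class
        owner = local_name(domain.get(f"{{{RDF}}}resource"))
        edges.append((owner, "hasProperty", prop.get(f"{{{RDF}}}ID")))
    return edges


ontology_edges = ontology_to_schema_graph(ONTOLOGY)
for edge in ontology_edges:
    print(edge)
```

Once both inputs are expressed as the same kind of edge list, a single matcher can compare them, which is the point of the common SchemaGraph format.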

Mapping

• Measures of the Match Score
  - Element Level Match: linguistic similarity of two concepts based on their names. Uses WordNet to check for synonyms; abbreviations are also checked.
  - Schema Match: structural similarity, based on sub-concept similarities
• The getBestMapping function then looks at the Match Scores and determines a map set

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
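The slide does not give the formula that combines the two measures or the selection rule inside getBestMapping. The sketch below therefore assumes a weighted average of the element-level and schema-level scores and a simple "highest score above a threshold" rule. The weights, the threshold, and the Python function names are all assumptions made for illustration.

```python
def match_score(elem_match, schema_match, w_elem=0.5, w_schema=0.5):
    """Combine the two measures as a weighted average (assumed formula)."""
    return (w_elem * elem_match + w_schema * schema_match) / (w_elem + w_schema)


def get_best_mapping(scores, threshold=0.5):
    """Pick, for each WSDL element, the ontology concept with the best score.

    `scores` maps (wsdl_element, ontology_concept) pairs to match scores.
    Pairs scoring below `threshold` are dropped (assumed rule).
    """
    best = {}
    for (source, target), score in scores.items():
        if score >= threshold and score > best.get(source, (None, -1.0))[1]:
            best[source] = (target, score)
    return best


# Hypothetical scores for two WSDL elements against ontology concepts.
scores = {
    ("windSpeed", "WindEvent.windSpeed"): match_score(1.0, 0.5),
    ("windSpeed", "Speed"): match_score(0.5, 0.5),
    ("degrees", "Speed"): match_score(0.25, 0.25),
}
mapping = get_best_mapping(scores)
print(mapping)
```

With these inputs only windSpeed receives a mapping; degrees falls below the threshold and would be left for the user to annotate by hand.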

MWSAF Matching Techniques: ElemMatch

• Name and string matching algorithms
  - NGram: considers the number of q-grams that the names have in common
  - CheckSynonym: uses WordNet to find synonyms
  - CheckAbbreviations: uses an abbreviation dictionary
  - TokenMatcher: uses the Porter stemmer, tokenization, and substring matching techniques
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
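As an illustration of the NGram idea, the sketch below scores two names by the q-grams they share and normalizes the count to a value between 0 and 1 with the Dice coefficient. The choice of q = 3, the lowercasing step, and the Dice normalization are assumptions; the slide only says that shared q-grams are counted.

```python
def qgrams(name, q=3):
    """Return the set of lowercase character q-grams of a name."""
    text = name.lower()
    if len(text) < q:
        return {text}  # short names are treated as a single gram
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, in the range 0 to 1."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("windSpeed", "WindSpeed"))
print(ngram_similarity("wind", "windy"))
print(ngram_similarity("compass", "degrees"))
```

Names that differ only in letter case score 1.0, while names with no common trigram score 0.0, so the result fits the 0-to-1 range that each ElemMatch algorithm is expected to return.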

Matching

• Each WSDL is compared against all of the ontologies in the store, and a mapping is created for each ontology.
• Two measures are then derived from each mapping:
  - Average Concept Match: tells the user about the degree of similarity between the matched concepts of the WSDL and the ontology
  - Average Service Match: helps to categorize the service
• We have a machine learning alternative for categorization

Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework
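The two measures can be sketched as follows. The slide does not define them, so this example assumes that Average Concept Match averages the scores over the mapped concepts only, and that Average Service Match divides the same sum by the total number of concepts in the WSDL. The function names and the sample scores are hypothetical.

```python
def average_concept_match(mapping):
    """Mean score over the WSDL concepts that were mapped (assumed definition)."""
    return sum(mapping.values()) / len(mapping) if mapping else 0.0


def average_service_match(mapping, total_wsdl_concepts):
    """Sum of scores divided by all WSDL concepts (assumed definition)."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(mapping.values()) / total_wsdl_concepts


def categorize(mappings_by_ontology, total_wsdl_concepts):
    """Choose the ontology (domain) with the highest average service match."""
    return max(
        mappings_by_ontology,
        key=lambda name: average_service_match(
            mappings_by_ontology[name], total_wsdl_concepts
        ),
    )


# Hypothetical mappings of one WSDL (4 concepts) against two ontologies.
mappings = {
    "weather": {"windSpeed": 0.75, "windDirection": 0.5},
    "geography": {"degrees": 0.5},
}
print(average_concept_match(mappings["weather"]))
print(average_service_match(mappings["weather"], 4))
print(categorize(mappings, 4))
```

Under these assumptions, a WSDL with a few strong matches can have a high concept match but a low service match, which is why the second measure is the one used to pick the domain.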

Outline

• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Current and Future Issues

• User interaction: minimize user input but maximize the impact of the feedback
• Real-world analysis: can the current matching techniques be used in real-world situations?
• P2P data management
• Mapping maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains
• Dealing with inconsistent data values for a schema element

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

More Issues

• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?

Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey

Conclusion

• It is necessary to automate the matching process
• Schema matching is very difficult and expensive
• We have looked at a taxonomy and descriptions of the existing approaches for matching
  - Schema vs. instance-level
  - Element vs. structure-level
  - Language- and constraint-based matchers
• We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework. POSV-WWW2004.pdf
• Christophides V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions

Page 55: Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

MWSAF Matching TechniquesMWSAF Matching TechniquesElemMatchElemMatch

bull Name and String Matching algorithms

-NGram considers the number of qgrams that the names have in common

-CheckSynonym uses Wordnet to find synonyms -CheckAbbreviations uses an abbreviation dictionary -TokenMatcher uses Porter Stemmer tonkenization and

substring matching techniques bull Each algorithm returns a value between 0 and 1 These

values are used in an equation for the final match score

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

MatchingMatching

bull Once Each WSDL is compared against all of the ontologies in the store and a mapping has been created for each ontology

Then two measures are derived from the mapping

-Average Concept Match tells the user about the degree of similarity between matched concepts of the WSDL and ontology

-Average Service Match helps to categorize the service

We have a machine learning alternative for categorization

Patil A Oundhakar S Sheth A Verma K METEOR-S Web service Annotation Framework

OutlineOutline

bull Introductionbull Application Domainsbull Classification of Schema Matching Approachesbull Current Workbull MWSAF Matchingbull Open Research Directoriesbull Conclusion

Current and Future IssuesCurrent and Future Issuesbull User Interaction minimize user input but maximize impact of the

feedback

bull Real World Analysis can the current matching techniques be used in real world situations

bull P2P data management

bull Mapping Maintenance what happens when you map between two schemas and then one changes

bull Developing global schemas (or ontologies) for domains

bull Dealing with inconsistent data values for a schema elementDoan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

More IssuesMore Issues

bull If we require user acceptance for our matches then what happens if our matcher returns thousands or hundreds of matches

bull Is it unrealistic to think that we will eventually perfect our matchers

Doan A Halevy A Semantic Integration Research in the Database Community A Brief Survey

ConclusionConclusionbull It is necessary to automate the matching process

bull Schema matching is very difficult and expensive

bull We have looked at a taxonomy and the descriptions of the existing approaches for matching

-Schema vs Instance-level

-Element vs Structure-level

-Language and Constraint based matchers

bull We also discussed several implementations of the matching techniques

References

• Bernstein P., Rahm E. A Survey of Approaches to Automatic Schema Matching. wwwresearchmicrosoftcom~philbeVLDBJ-Dec2001pdf

• Doan A., Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. httpanhaicsuiucedupublicdb-review14pdf

• Patil A., Oundhakar S., Sheth A., Verma K. METEOR-S Web Service Annotation Framework. POSV-WWW2004pdf

• Christophides V. Integrating XML Data Sources using RDFS Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftpftpdagstuhldepubProceedings040439104391ChristophidesVassilisSlidespdf

Questions
