iMAP: Discovering Complex Semantic Matches Between Database Schemas. Ohad Edry, January 2009, Seminar in Databases

Page 1: iMAP: Discovering Complex Semantic Matches Between Database Schemas

iMAP: Discovering Complex Semantic Matches Between Database Schemas

Ohad Edry

January 2009

Seminar in Databases

Page 2: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Motivation

Consider a union of the databases of two banks. We need to generate a mapping between their schemas.

Bank A tables:
  • (Id, Name, City, Street, House Number)
  • (Id, Account number, Account status)

Bank B tables:
  • (Id, First name, Last name, Address, Account, Account status)

Page 3: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Introduction

  • Semantic mappings specify the relationships between data stored in disparate sources.
  • A mapping relates an attribute of the target schema to attributes of the source schema according to their semantics.

Page 5: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Motivation – Example, continued

[Figure: the Bank A and Bank B tables from the Motivation slide, labeled "Semantic Mapping!"]

Page 6: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Introduction

  • Most of the work in this field has focused on the matching process.
  • Matches fall into two types:
      • 1-1 matching.
      • Complex matching: a combination of attributes in one schema corresponds to a combination of attributes in the other schema.
  • Match candidate: any proposed matching between attributes of the source and target schemas.

Page 7: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Motivation – Example, continued

[Figure: the Bank A and Bank B tables from the Motivation slide, labeled "Semantic Mapping!", with a 1-1 matching candidate and a complex matching candidate marked.]

Page 8: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Introduction - examples

Example 1:

  Name    Address    Phone
  Ohad    Haifa      12345
  David   Tel-Aviv   13579

  Student   Location   Cellular
  Eyal      Haifa      1234567
  Miri      Tel-Aviv   2345678

Example 2:

  Company A: (Product ID, Product name, Price); (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)

Page 12: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Introduction - examples

Example 1:

  Name    Address    Phone
  Ohad    Haifa      12345
  David   Tel-Aviv   13579

  Student   Location   Cellular
  Eyal      Haifa      1234567
  Miri      Tel-Aviv   2345678

  1-1 matching: Name = Student, Address = Location, Phone = Cellular.

Example 2:

  Company A: (Product ID, Product name, Price); (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)

  Complex matching: Product Price = Price*(1-Discount)
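To make Example 2 concrete, here is a minimal Python sketch of what applying the complex match Product Price = Price*(1-Discount) involves: the two Company A tables are joined on Product ID and the expression is evaluated per row. The row values, the function name `to_company_b`, and the rule that a missing discount row means no discount are assumptions made for illustration; they are not part of the slides.

```python
# Hypothetical rows for Company A's two tables (values invented for illustration).
products = [
    {"product_id": 1, "product_name": "Lamp", "price": 100.0},
    {"product_id": 2, "product_name": "Desk", "price": 250.0},
]
discounts = [
    {"product_id": 1, "discount": 0.5},
    {"product_id": 2, "discount": 0.25},
]


def to_company_b(products, discounts):
    """Join the two tables on product_id and apply
    Product Price = Price * (1 - Discount) to every row."""
    discount_by_id = {row["product_id"]: row["discount"] for row in discounts}
    result = []
    for row in products:
        # Assumption: a product without a discount row has discount 0.
        discount = discount_by_id.get(row["product_id"], 0.0)
        result.append({
            "product_id": row["product_id"],
            "name": row["product_name"],  # assumed 1-1 match: Name = Product name
            "product_price": row["price"] * (1 - discount),
        })
    return result


company_b_rows = to_company_b(products, discounts)
```

The sketch only applies a match that is already known. Discovering it automatically is the hard part that the rest of the talk addresses.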

Page 13: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Difficulties in Generating Matchings

Matches are difficult to find because:

  • Finding complex matches is far from trivial. How should the system know that Product Price = Price*(1-Discount)?
  • The number of candidates for complex matches is large.
  • Sometimes tables must be joined:

      (Product ID, Product name, Price) and (Product ID, Discount)
      (Product ID, Product Name, Product Price)

      Product Price = Price*(1-Discount)

Page 14: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Main Parts of the iMAP System

  • Generating matching candidates.
  • Pruning matching candidates by exploiting domain knowledge.
  • Explaining match predictions:
      • Provides an explanation for selected predicted matches.
      • Makes the system semi-automatic.

Page 15: iMAP: Discovering Complex Semantic Matches Between Database Schemas

iMAP System Architecture

iMAP consists of three main modules:

  • Match Generator: generates the match candidates, using specialized searchers over the target schema and the source schema.
  • Similarity Estimator: generates a matrix that stores the similarity score of each pair (target attribute, match candidate).
  • Match Selector: examines the score matrix and outputs the best matches under certain conditions.

Page 16: iMAP: Discovering Complex Semantic Matches Between Database Schemas

iMAP System Architecture – cont.

  • Match Generator: for each attribute t of the target schema T, iMAP generates match candidates from the source schema S.
  • Similarity Estimator: receives the match candidates and outputs a similarity matrix.
  • Match Selector: receives the similarity matrix and outputs the final match candidates.

Page 17: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Part 1: Match Generation - searchers

  • The key to match generation is to search through the space of possible match candidates.
      • Search space: all attributes and data in the source schemas.
  • Searchers rely on knowledge of operators and attribute types, such as numeric and textual, and on some heuristic methods.

Page 18: iMAP: Discovering Complex Semantic Matches Between Database Schemas

The Internals of Searchers

  • Search strategy: the large search space is handled with standard beam search.
  • Match evaluation: each candidate is given a score that approximates the distance between the candidate and the target.
  • Termination condition: because the search space is large, the search must be stopped at some point.

Page 19: iMAP: Discovering Complex Semantic Matches Between Database Schemas

The Internals of Searchers – Example

The search runs for i iterations, each limited to k results:

  Company A: (Product ID, Product name, Price); (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)

  Candidates kept in an iteration:
    1. Product Price = Price*(1-Discount)
    2. Product Price = Product ID
    k. …

  • MAX_i and MAX_(i+1): the maximum scores in iterations i and i+1.
  • Stop when MAX_i - MAX_(i+1) < delta.
  • Return the first k candidates.
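The search loop described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not iMAP's implementation: the helper names (`beam_search`, `expand`, `score`), the sample rows, the exact-match scoring function, and the use of the absolute difference between consecutive best scores as the stopping test are all assumptions. The toy task is to find a concatenation of source attributes that reproduces a target Name column.

```python
def beam_search(seeds, expand, score, k=2, delta=0.01, max_iterations=10):
    """Keep the k best candidates per iteration; stop once the best score
    changes by less than delta between two consecutive iterations."""
    def top_k(candidates):
        # Sort by descending score; ties are broken by the candidate itself
        # so that the result is deterministic.
        return sorted(set(candidates), key=lambda c: (-score(c), c))[:k]

    beam = top_k(seeds)
    best_previous = score(beam[0])
    for _ in range(max_iterations):
        expansions = [new for cand in beam for new in expand(cand)]
        beam = top_k(beam + expansions)
        best_now = score(beam[0])
        if abs(best_now - best_previous) < delta:
            break
        best_previous = best_now
    return beam


# Hypothetical source rows and target values (invented for illustration).
source_rows = [
    {"id": "1", "first_name": "Ohad", "last_name": "Edry"},
    {"id": "2", "first_name": "Miri", "last_name": "Levi"},
]
target_names = ["Ohad Edry", "Miri Levi"]
attributes = ["id", "first_name", "last_name"]


def expand(candidate):
    """Extend a concatenation by one attribute it does not use yet."""
    return [candidate + (a,) for a in attributes if a not in candidate]


def score(candidate):
    """Fraction of target values reproduced exactly by the concatenation."""
    produced = {" ".join(row[a] for a in candidate) for row in source_rows}
    return len(produced & set(target_names)) / len(target_names)


seeds = [(a,) for a in attributes]
best = beam_search(seeds, expand, score, k=2, delta=0.01)
```

Because only k candidates survive each iteration, the search can miss a good candidate whose prefixes all score poorly. That is the usual trade-off of beam search against exhaustive enumeration.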

Page 20: iMAP: Discovering Complex Semantic Matches Between Database Schemas

The Internals of Searchers – Join Paths

Matches over join paths are found in two steps:

  Company A: (Product ID, Product name, Price); (Product ID, Discount)
  Company B: (Product ID, Name, Product Price)

  Product Price = Price*(1-Discount)

  • First step: find the join paths between tables, Join(T1, T2).
  • Second step: the search process uses the join paths.

Page 21: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Implemented searchers in iMAP

iMAP contains the following searchers:

  • Text
  • Numeric
  • Category
  • Schema Mismatch
  • Unit Conversion
  • Date
  • Overlap versions of Text, Numeric, Category, Schema Mismatch, Unit Conversion

Page 22: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Implemented searchers – Text Searcher example

Text searcher:

  • Purpose: finds match candidates that are concatenations of text attributes.
  • Method:
      • Target attribute: Name.
      • Search space: attributes in the source schemas that have textual properties.
      • The searcher looks in the search space for attributes, or concatenations of attributes, that match the target.

  (Id, Name)
  (Id, First name, Last name)

Page 23: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Implemented searchers – Numeric Searcher example

Numeric searcher:

  • Purpose: finds the best matches for numeric attributes.
  • Issues:
      • Computing the similarity score of a complex match: based on value distribution.
      • Types of matches: +, -, *, / over 2 columns.

  dim1   dim2
  1      3
  2      4
  1      2
  4      1
  3      1

  size
  3
  7
  2
  4
  4

  dim1*dim2 = size
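The numeric example can be sketched in Python: each arithmetic combination of the two columns is evaluated, and its value distribution is compared with that of the target column. The column values come from the slide. The distance function used here (mean absolute gap between the sorted samples) is an assumption chosen for simplicity; the slide only says that value distributions are compared and does not name the measure. The function names are hypothetical as well.

```python
import operator

# Column values as shown on the slide.
dim1 = [1, 2, 1, 4, 3]
dim2 = [3, 4, 2, 1, 1]
size = [3, 7, 2, 4, 4]

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,  # assumes the right-hand column contains no zeros
}


def distribution_distance(xs, ys):
    """Mean absolute gap between two equally sized samples after sorting.
    A crude stand-in for a real distribution-similarity measure."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def rank_two_column_candidates(name_a, col_a, name_b, col_b, target):
    """Score every 'a op b' candidate; a smaller distance is a better match."""
    scored = []
    for symbol, op in OPERATORS.items():
        values = [op(a, b) for a, b in zip(col_a, col_b)]
        expression = f"{name_a} {symbol} {name_b}"
        scored.append((distribution_distance(values, target), expression))
    return sorted(scored)


ranking = rank_two_column_candidates("dim1", dim1, "dim2", dim2, size)
```

The rows of the two tables are not aligned with each other, so the comparison looks at the distribution of values rather than at row-by-row equality.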

Page 24: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Implemented searchers in iMAP – cont.

  • Category Searcher: finds matches between categorical attributes in the source and target schemas.
  • Schema Mismatch Searcher: relates the data of one schema to the schema of the other. This situation occurs very often.
  • Unit Conversion Searcher: finds matches between attributes expressed in different units.
  • Date Searcher: finds complex matches for date attributes.

Page 25: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Part 2: Similarity Estimator

  • Receives candidate matches from the Match Generator, chosen on the basis of the score that each searcher assigns.
  • Problem: each searcher can assign a different score.
  • Solution: compute a more accurate final score for each match by using additional types of information. iMAP uses evaluator modules:
      • Name-based evaluator: computes a score based on the similarity of names.
      • Naive Bayes evaluator.

Why not perform this phase during the search phase? It would be very expensive!

Page 26: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Module example - Naive Bayes evaluator

Consider the match agent-address = location.

  • Building the model: data instances of the target attribute are positive examples; all other data is negative.
  • A Naive Bayes classifier learns the model.
  • The trained classifier is applied to the data of the source attribute.
  • Each data instance receives a score.
  • The average of all scores is returned as the result.

  Agent Address (Target)   Location (Source)
  Haifa                    T.A.
  T.A.                     Eilat
  Jerusalem                Nahariya
  Eilat                    Nesher
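The steps above can be sketched with a small hand-written Naive Bayes classifier in Python. Everything beyond the slide's outline is an assumption: the character-bigram features, the add-one smoothing, the class name `NaiveBayes`, and the negative examples (values that might belong to other target attributes, invented here). The positive and source values are taken from the slide.

```python
import math
from collections import Counter


def tokens(value):
    """Character bigrams of a lowercased value (a hypothetical feature choice)."""
    text = value.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class NaiveBayes:
    def fit(self, positives, negatives):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: len(positives), False: len(negatives)}
        for label, values in ((True, positives), (False, negatives)):
            for value in values:
                self.counts[label].update(tokens(value))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def prob_positive(self, value):
        """Posterior probability that value belongs to the target attribute."""
        total_docs = self.docs[True] + self.docs[False]
        log_scores = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            log_p = math.log(self.docs[label] / total_docs)
            for tok in tokens(value):
                # Add-one smoothing, with one extra bucket for unseen bigrams.
                log_p += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab) + 1)
                )
            log_scores[label] = log_p
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


agent_address = ["Haifa", "T.A.", "Jerusalem", "Eilat"]     # target values (slide)
other_target_values = ["Ohad", "David", "12345", "13579"]   # hypothetical negatives
location = ["T.A.", "Eilat", "Nahariya", "Nesher"]          # source values (slide)

model = NaiveBayes().fit(agent_address, other_target_values)
# Score of the match agent-address = location: average over source instances.
match_score = sum(model.prob_positive(v) for v in location) / len(location)
```

With so few training values the score is only indicative; the point is the flow of train on the target, apply to the source, and average.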

Page 27: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Part 3: Match Selector

  • Receives the scored match candidates from the Similarity Estimator.
  • Problem: these matches may violate certain domain integrity constraints.
      • For example: mapping 2 source attributes to the same target attribute.
  • Solution: a set of domain constraints, defined by domain experts or users.

Page 28: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Constraint Example

  (Product ID, Product name, Price); (Product ID, Club members price)
  (Product ID, Product Name, Product Price)

  • The Match Selector receives a list of candidates, including:
      k. Product Price = Price + Club members price
  • Constraint: Price and Club members price are unrelated.
  • The Match Selector deletes this match candidate.
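A minimal Python sketch of this pruning step follows, assuming that each candidate records the set of source attributes it uses and that an "unrelated" constraint is stored as a pair of attribute names. The `Candidate` type, the function names, and the scores are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    target: str
    expression: str
    source_attributes: frozenset
    score: float  # invented scores, for illustration only


# Domain constraint from the slide: Price and Club members price are unrelated.
UNRELATED = [frozenset({"Price", "Club members price"})]


def violates(candidate, unrelated_pairs):
    """True if the candidate combines two attributes declared unrelated."""
    return any(pair <= candidate.source_attributes for pair in unrelated_pairs)


def select_matches(candidates, unrelated_pairs):
    """Drop violating candidates, then rank the rest by score."""
    kept = [c for c in candidates if not violates(c, unrelated_pairs)]
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [
    Candidate("Product Price", "Price + Club members price",
              frozenset({"Price", "Club members price"}), 0.8),
    Candidate("Product Price", "Price", frozenset({"Price"}), 0.7),
]
selected = select_matches(candidates, UNRELATED)
```

Note that the deleted candidate had the higher score: the constraint overrides the similarity estimate.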

Page 29: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Exploiting Domain Knowledge

  • iMAP uses 4 different types of knowledge:
      • Domain constraints
      • Past matches
      • Overlap data
      • External data
  • iMAP uses this knowledge at all levels of the system, and as early as it can in match generation.

Page 30: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Types of knowledge

Domain constraints: three cases, each listed with the module that uses it:

  • Attributes from the source schema are unrelated (example: Name and ID are unrelated). Used by: Searchers.
  • Constraint on a single attribute t (example: Account < 10000). Used by: Similarity Estimator and Searchers.
  • Attributes from the target schema are unrelated (example: Account and ID are unrelated). Used by: Match Selector.

  Source: (Id, Name); (Id, Account number, Account status)
  Target: (Id, First name, Last name, Account, Account status)

Page 31: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Types of knowledge – cont.

  • Past complex matches: the Numeric Searcher can reuse past expressions as templates.
      • Price = Price*(1-Discount) generates the template VARIABLE*(1-VARIABLE).
  • External data: using external sources to learn about attributes and their data.
      • Given a target attribute and a useful feature of that attribute, iMAP learns about its value distribution.
      • Example: the number of cities in a state.
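The past-match template idea can be illustrated in Python: the attribute names of a past expression are replaced by a placeholder, and the resulting template is then filled with attributes of a new schema to produce fresh candidates. The regular expression, the function names, and the attribute names `price` and `monthly_fee_rate` are assumptions for illustration.

```python
import re
from itertools import permutations


def to_template(expression):
    """Replace every attribute name in a past expression with VARIABLE.
    Numeric constants such as 1 are left untouched."""
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", "VARIABLE", expression)


def instantiate(template, attributes):
    """Fill the template's slots with every ordered choice of distinct attributes."""
    slots = template.count("VARIABLE")
    results = []
    for combo in permutations(attributes, slots):
        expression = template
        for name in combo:
            expression = expression.replace("VARIABLE", name, 1)
        results.append(expression)
    return results


template = to_template("Price*(1-Discount)")
new_candidates = instantiate(template, ["price", "monthly_fee_rate"])
```

Each instantiated expression would still have to be scored by the numeric searcher; the template only narrows down which expressions are worth trying.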

Page 32: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Types of knowledge – cont.

  • Overlap data provides information for the mapping process. iMAP contains searchers that can exploit overlap data.
  • Overlap Text, Category and Schema Mismatch searchers:
      • S and T share a state listing.
      • Matches: city = state, country = state.
      • Re-evaluating results: city = state scores 0 and country = state scores 1.
  • Overlap Numeric Searcher: using the overlap data and an equation discovery system (LAGRAMGE), the best arithmetic expression for t is found.

Page 33: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations

  • One goal is to provide a design environment in which the user inspects the matches predicted by the system and modifies them manually, and the system receives this feedback.
  • The system uses complex algorithms, so it needs to explain its matches to the user.
  • Explanations benefit the user as well:
      • The user can correct matches quickly.
      • The user can tell the system where it made a mistake.

Page 34: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations – so, what do you want to know about the matches?

  • iMAP defines 3 main questions:
      • Explain an existing match: why is a certain match X present in the output of iMAP? Why did the match survive the whole process?
      • Explain an absent match: why is a certain match Y not present in the output of iMAP?
      • Explain match ranking: why is match X ranked higher than match Y?
  • Each of these questions can be asked about each module of iMAP.
  • A question can be reformulated recursively for the underlying components.

Page 35: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations - Example

Suppose we have 2 real-estate schemas:

  (List-price, Month-posted, …)
  (Price, Monthly-fee-rate, …)

iMAP produces the ranked matches:

  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price

iMAP explanation: both matches were generated by the numeric searcher, and the similarity estimator also agreed with the ranking.

Page 36: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations - Example

Suppose we have 2 real-estate schemas: (List-price, Month-posted, …) and (Price, Monthly-fee-rate, …).

The current order:

  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price

The match selector has 2 constraints: (1) month-posted = monthly-fee-rate, (2) month-posted and price don't share common attributes.

The List-price = price match is selected by the match generator.

Page 37: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations - Example

Suppose we have 2 real-estate schemas: (List-price, Month-posted, …) and (Price, Monthly-fee-rate, …).

The current order:

  (1) List-price = price
  (2) List-price = price*(1+monthly-fee-rate)

iMAP explains that the source of month-posted = monthly-fee-rate is the date searcher.

The user corrects iMAP: monthly-fee-rate is not of type date.

Page 38: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Generating Explanations - Example

Suppose we have 2 real-estate schemas: (List-price, Month-posted, …) and (Price, Monthly-fee-rate, …).

List-price = price*(1+monthly-fee-rate) is again the chosen match.

The final order:

  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price

Page 39: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Example cont. – generated dependency graph

The dependency graph is small:

  • Searchers produce only the k best matches.
  • iMAP goes through three stages.

Page 40: iMAP: Discovering Complex Semantic Matches Between Database Schemas

What do you want to know about the matches?

  • Why is a certain match X present in the output of iMAP?
      • iMAP returns the part of the graph that describes the match.

Page 41: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Example cont. – generated dependency graph

Page 42: iMAP: Discovering Complex Semantic Matches Between Database Schemas

What do you want to know about the matches?

  • Why is a certain match X present in the output of iMAP?
      • iMAP returns the part of the graph that describes the match.
  • Why is match X ranked higher than match Y?
      • iMAP returns the part of the graph that compares the 2 matches.

Page 43: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Example cont. – generated dependency graph

Page 44: iMAP: Discovering Complex Semantic Matches Between Database Schemas

What do you want to know about the matches?

  • Why is a certain match X present in the output of iMAP?
      • iMAP returns the part of the graph that describes the match.
  • Why is match X ranked higher than match Y?
      • iMAP returns the part of the graph that compares the 2 matches.
  • Why is a certain match Y not present in the output of iMAP?
      • If the match was eliminated during the process, the part responsible for eliminating it explains why.
      • Otherwise, iMAP asks the searchers to check whether they can generate the match, and to explain why it was not generated.

Page 45: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Example cont. – generated dependency graph

Page 46: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Evaluating iMAP on real-world domains

  • iMAP was evaluated on 4 real-world domains.
  • For the Cricket domain, 2 independently developed databases were used.
  • For the other 3 domains, one real-world source database was used, with a target schema created by volunteers.
  • The evaluation covers both databases with overlap domains and databases with disjoint domains.

Page 47: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Evaluating iMAP on real-world domains – cont.

  • Data processing: removing data such as "unknown" and adding the most obvious constraints.
  • Experiments: there are actually 8 experimental domains, 2 for each domain: an overlap domain and a disjoint domain.
  • Performance measures:
      • Top-1 matching accuracy
      • Top-3 matching accuracy
      • Complex match
      • Partial complex match
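As an illustration of the first two measures, here is one common way to compute top-k matching accuracy in Python, assuming that the "1" and "3" measures refer to top-1 and top-3 accuracy: the fraction of target attributes whose correct match appears among the first k ranked predictions. The function name, the data layout, and the sample predictions are assumptions; the slides do not give the exact definition used in the article.

```python
def top_k_accuracy(ranked_predictions, gold, k):
    """Fraction of target attributes whose correct match is in the top k."""
    hits = sum(
        1
        for target, correct in gold.items()
        if correct in ranked_predictions.get(target, [])[:k]
    )
    return hits / len(gold)


# Hypothetical ranked output for two target attributes.
ranked_predictions = {
    "list-price": ["price", "price*(1+monthly-fee-rate)"],
    "name": ["concat(first-name, last-name)"],
}
gold = {
    "list-price": "price*(1+monthly-fee-rate)",
    "name": "concat(first-name, last-name)",
}

top1 = top_k_accuracy(ranked_predictions, gold, k=1)
top3 = top_k_accuracy(ranked_predictions, gold, k=3)
```

In this toy data the correct list-price match is ranked second, so it counts toward top-3 accuracy but not toward top-1.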

Page 48: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Results (1)

Overall and 1-1 matching accuracy:

  • (a) Exploiting domain constraints and overlap data improves accuracy.
  • (b) Disjoint domains achieve lower accuracy than overlap data domains.
  • Not shown in the figure, but according to the article, the top-3 accuracy is even higher, and iMAP also achieves top-1 and top-3 accuracy of 77%-100% for 1-1 matching.

Page 49: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Results (2)

Complex matching accuracy – Top 1 and Top 3:

Page 50: iMAP: Discovering Complex Semantic Matches Between Database Schemas

Results (2) – Cont.

Complex matching accuracy – Top 1:

  • Default iMAP gives low results (for example: inventory = 9%), both in overlap domains and in disjoint domains.
  • (a) Exploiting domain constraints and overlap data improves accuracy.
  • (b) In disjoint domains, iMAP achieves lower accuracy than in overlap data domains.
      • Without overlap data, the accuracy of the Numeric Searcher and the Text Searcher decreases.

Results (2) – complex matches low results

Smaller components – example: apt-number
Suggested solution: adding format learning techniques

Small noise components – example: agent-id
Suggested solution: more aggressive match cleaning and more constraints.

Disjoint databases – difficult for numeric searcher
Suggested solution: using past numeric matches

Top-k – many results are not in top 1
Increasing k to 10 will increase accuracy

Results (2)

Complex matching accuracy – Top 1 and Top 3:

Results (2) – Cont.

Complex matching accuracy – Top 3:

Low results for default iMAP (for example: inventory=9%) both in overlap domains and disjoint domains
Same reasons as in Top 1

(c) Improvement in accuracy compared to (a) when using overlap and constraints

This is an outcome of correct complex matches in the top 3 matches

Results (3)

Partial Complex matching accuracy – Top 1 and Top 3:

Results (3) – cont.

Partial Complex matching accuracy – Top 1 and Top 3:

The accuracy is measured only on finding the right attributes

For example: wrong numeric template but right attributes

Much higher accuracy than full complex matching accuracy.

Partial Complex Matches can be very useful when the user wants to fix wrong matches
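
The difference between a full and a partial complex match can be sketched as follows. This is only an illustration under the assumption that a match is stored as an expression string plus the set of source attributes it uses; the `Match` type, the helper names, and the example expressions are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Match(NamedTuple):
    expression: str             # the full formula, e.g. a numeric template
    attributes: FrozenSet[str]  # source attributes used by the formula


def is_full_match(predicted: Match, gold: Match) -> bool:
    """Full correctness: the whole expression must agree (simplified to string equality)."""
    return predicted.expression == gold.expression


def is_partial_match(predicted: Match, gold: Match) -> bool:
    """Partial correctness: only the set of source attributes must agree."""
    return predicted.attributes == gold.attributes


# Hypothetical example: right attributes, wrong numeric template.
gold = Match("price * (1 + tax-rate)", frozenset({"price", "tax-rate"}))
predicted = Match("price + tax-rate", frozenset({"price", "tax-rate"}))

full_ok = is_full_match(predicted, gold)
partial_ok = is_partial_match(predicted, gold)
```

Here the prediction counts as a partial match but not as a full match, which is the case a user could repair by correcting only the formula.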

Performance & Efficiency

Performance:

iMAP is stable after 100 data tuples
If we run it on fewer examples first we can reduce iMAP running time

[Figure: accuracy as a function of the number of data tuples]
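
Running on fewer examples can be sketched as drawing a fixed-size random sample of tuples before matching. This is a minimal sketch, not the article's procedure: the helper `sample_tuples`, the sample size default, and the generated rows are assumptions made for illustration.

```python
import random


def sample_tuples(rows, n=100, seed=0):
    """Return at most n rows chosen at random; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    if len(rows) <= n:
        return list(rows)
    return rng.sample(rows, n)


# Hypothetical table with 10,000 tuples of (id, price).
rows = [(i, 100.0 + i) for i in range(10_000)]
sample = sample_tuples(rows, n=100)
```

The matcher would then be run on `sample` instead of `rows`; whether 100 tuples is enough for a given domain would have to be checked against the accuracy curve.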

Performance & Efficiency – Cont.

Efficiency:

Unoptimized iMAP version ran for 5 – 20 minutes on the experimental domains

Several techniques are suggested in the article to improve this time:

For example, breaking the schemas into independent chunks

Explaining match predictions

Example for explaining match prediction:

Conclusion: the Name Based evaluator has more influence – last line

The user can use this information to reduce the influence of the Name Based evaluator

Searcher Level: Concat(first-name,last-name) was ranked higher than last-name

Similarity Estimator:
• Name based was wrong
• Naïve Bayes was right

Match Selector: didn’t influence
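
The effect of reducing one evaluator's influence can be sketched with a weighted average of evaluator scores. This is an assumption made for illustration: the article's actual combination scheme may differ, and the scores, weights, and function names below are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted average of the per-evaluator scores for one candidate match."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical evaluator scores for two candidate matches of a name attribute.
candidates = {
    "concat(first-name, last-name)": {"name_based": 0.4, "naive_bayes": 0.9},
    "last-name": {"name_based": 0.9, "naive_bayes": 0.5},
}


def rank(weights):
    """Order the candidates from best to worst under the given evaluator weights."""
    return sorted(
        candidates,
        key=lambda c: combined_score(candidates[c], weights),
        reverse=True,
    )


# Equal weights: the name-based evaluator pushes "last-name" to the top.
equal = rank({"name_based": 1.0, "naive_bayes": 1.0})
# Reduced name-based weight: the Naive Bayes evaluator now decides the ranking.
reduced = rank({"name_based": 0.25, "naive_bayes": 1.0})
```

With these numbers, lowering the weight of the name-based evaluator moves the concatenation back to the first position.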

Related work

L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements:

Mapping the schema to a domain ontology and searching in this domain.

Can be added to iMAP as an additional searcher

Clio System:
Sophisticated set of user-interface techniques to improve matches

Conclusions

Most of the work in this field until now was about 1-1 matching

This article focused on complex matching.

The key to iMAP is the use of:
Searchers
Domain knowledge

Giving the user the possibility to affect the matches

Any Questions?

Thank you!

Bibliography

Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos. iMAP: Discovering Complex Semantic Matches between Database Schemas.

http://en.wikipedia.org/wiki/Beam_search