Methods for Data Integration Amit Shvarchenberg and Rafi Sayag



iMAP: Discovering Complex Semantic Matches Between Database Schemas
by Rafi Sayag and Amit Shvarchenberg

Based on a paper by:
Robin Dhamankar, Yoonkyong Lee, AnHai Doan
Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
Alon Halevy, Pedro Domingos
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA

Today there are many databases around the world, and it is often necessary to combine two or more similar databases into a single database.
In the past, many of these integrations were done manually.
The iMAP system offers a semi-automatic method of matching information from different sources.

A simple example

The Real-Estate-Agents Example

Schema S

HOUSES
location      price     Agent-id
Raleigh, NC   360,000   32
Atlanta, GA   430,000   15

AGENTS
Id   Name         city      State   Fee-rate
32   Mike Brown   Athens    GA      0.03
15   Jean Laup    Raleigh   NC      0.04

Schema T

LISTING
area          List-price   Agent-address   Agent-name
Denver, CO    550000       Boulder, CO     Laura Smith
Atlanta, GA   370800       Athens, GA      Mike Brown

The Big Merge

Making Tuples Using SQL
    area          = SELECT location FROM HOUSES
    agent-address = SELECT concat(city, state) FROM AGENTS
    list-price    = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id

How Do We Match?
The process of creating mappings typically proceeds in two steps.
  • First step: schema matching, in which we find matches between elements of the two schemas.
  • Second step: we elaborate the matches to create query expressions that enable automated data translation or exchange.

Schema Matches
There are two kinds of schema matches.
  • 1-1 matches (for example, area = location).

Complex Matches

Complex matches specify that some combination of attributes in one schema corresponds to a combination in the other. For example, list-price = price * (1 + fee-rate) combines two source attributes into one target attribute.
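As a minimal sketch of what such matches compute, the following Python applies the example mappings to the sample HOUSES and AGENTS rows. The dictionary layout, the function name `build_listing`, the ", " separator used for the concatenation, and the agent-name = name match are assumptions made for illustration; they are not taken from iMAP.

```python
# Sample rows from schema S (values taken from the example tables).
HOUSES = [
    {"location": "Raleigh, NC", "price": 360_000, "agent_id": 32},
    {"location": "Atlanta, GA", "price": 430_000, "agent_id": 15},
]
AGENTS = [
    {"id": 32, "name": "Mike Brown", "city": "Athens", "state": "GA", "fee_rate": 0.03},
    {"id": 15, "name": "Jean Laup", "city": "Raleigh", "state": "NC", "fee_rate": 0.04},
]


def build_listing(houses, agents):
    """Translate schema S rows into schema T rows using the example matches."""
    agents_by_id = {agent["id"]: agent for agent in agents}
    listing = []
    for house in houses:
        agent = agents_by_id[house["agent_id"]]  # join condition: agent-id = id
        listing.append({
            # 1-1 match: area = location
            "area": house["location"],
            # Complex match: list-price = price * (1 + fee-rate)
            "list_price": round(house["price"] * (1 + agent["fee_rate"])),
            # Complex match: agent-address = concat(city, state);
            # the ", " separator is an assumption.
            "agent_address": f'{agent["city"]}, {agent["state"]}',
            # Assumed 1-1 match (not listed in the slides): agent-name = name
            "agent_name": agent["name"],
        })
    return listing


for row in build_listing(HOUSES, AGENTS):
    print(row)
```

For the first HOUSES row this yields a list price of 370800 and the agent address "Athens, GA", the same values that appear in the second LISTING row of the example.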


The Solution: The iMAP System
We will describe the iMAP system, which semi-automatically discovers complex matches for relational data in a single table.
In some cases iMAP is able to find matches that combine attributes from multiple tables.
We will also show how iMAP copes with the problem that the space of possible match candidates is unbounded.

The iMAP Architecture

Match Generator
Input: target schema and source schema.
Output: match candidates.

How the Match Generator Works
The match generator uses a search method that explores the space of possible match candidates.
The searchers use prior knowledge of possible match types, together with heuristic methods.

The Internals of a Searcher
Applying search to candidate generation involves three major issues:
  • Search strategy
  • Evaluation of candidate matches
  • Termination condition

Search Strategy

The search space can be very large or even unbounded.
We need to search such spaces efficiently.
iMAP addresses this problem using a search technique called beam search.
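A rough sketch of beam search in Python follows. The idea is to score every candidate and carry only the k highest-scoring ones forward to the next level of the search. The `expand` and `score` callables, the fixed `max_depth` termination condition, and the toy attribute example are assumptions for illustration, not iMAP's actual searchers or scoring functions.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

C = TypeVar("C")


def beam_search(
    start: Iterable[C],
    expand: Callable[[C], Iterable[C]],
    score: Callable[[C], float],
    k: int,
    max_depth: int,
) -> List[C]:
    """Return the k best candidates seen, keeping only k candidates per level."""
    beam = heapq.nlargest(k, start, key=score)
    best = list(beam)
    for _ in range(max_depth):
        # Generate the next level from the surviving candidates only.
        children = [child for cand in beam for child in expand(cand)]
        if not children:
            break
        beam = heapq.nlargest(k, children, key=score)
        best = heapq.nlargest(k, best + beam, key=score)
    return best


# Toy example: candidates are tuples of source attributes to combine.
ATTRIBUTES = ["city", "state", "name"]
WANTED = {"city", "state"}  # hypothetical "ideal" combination


def expand(candidate):
    # Extend a candidate with each attribute it does not use yet.
    return [candidate + (a,) for a in ATTRIBUTES if a not in candidate]


def score(candidate):
    # Reward wanted attributes, penalize unwanted ones.
    used = set(candidate)
    return len(used & WANTED) - len(used - WANTED)


result = beam_search([(a,) for a in ATTRIBUTES], expand, score, k=2, max_depth=2)
print(result)
```

In this toy run the two orderings of city and state score highest and survive in the final beam, while every candidate containing name is pruned.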

Beam Search
Beam search uses a scoring function to evaluate each match candidate.
At each level of the search tree, it keeps only the k highest-scoring match candidates.
In this way the searcher can search efficiently even in a very large search space.

Implemented Searchers in iMAP

Example: Unit Conversion Searcher
The unit conversion searcher can identify a conversion between two different types of measurement unit.
It does so by looking at the names and data of the attributes (e.g., "hours", "kg", "$", etc.).

The searcher finds the best conversion from a set of conversion functions between the units.
In this case weight_pounds = 2.2 * weight_kg.

product   pounds
apple     10

Fruits and vegetables   kg
banana                  5
apple
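The following Python sketch shows one way a searcher of this kind could work: guess units from attribute names, then pick the conversion whose converted source values best fit the target values. The `CONVERSIONS` table, the `UNIT_KEYWORDS` hints, the function names, and the mean-based fit are all assumptions for illustration; the slides do not say how iMAP scores conversions.

```python
from statistics import mean

# Hypothetical conversion table: (source unit, target unit) -> factor.
CONVERSIONS = {
    ("lb", "kg"): 1 / 2.2046,
    ("kg", "lb"): 2.2046,
    ("h", "min"): 60.0,
    ("min", "h"): 1 / 60.0,
}
# Hypothetical keywords used to guess a unit from an attribute name.
UNIT_KEYWORDS = {"pound": "lb", "kg": "kg", "hour": "h", "minute": "min"}


def detect_unit(attribute_name):
    """Guess a unit from keywords in the attribute name, or return None."""
    lowered = attribute_name.lower()
    for keyword, unit in UNIT_KEYWORDS.items():
        if keyword in lowered:
            return unit
    return None


def find_conversion(source_name, source_values, target_name, target_values):
    """Return ((src_unit, tgt_unit), factor) for the best-fitting conversion, or None."""
    src_unit = detect_unit(source_name)
    tgt_unit = detect_unit(target_name)
    # Keep only conversions compatible with whatever the names revealed.
    candidates = {
        units: factor
        for units, factor in CONVERSIONS.items()
        if src_unit in (None, units[0]) and tgt_unit in (None, units[1])
    }
    if not candidates:
        return None
    source_mean, target_mean = mean(source_values), mean(target_values)
    # Crude data-based fit: the converted source mean should be near the target mean.
    return min(
        candidates.items(),
        key=lambda item: abs(item[1] * source_mean - target_mean),
    )


units, factor = find_conversion("weight_pounds", [10, 22, 4.4], "weight_kg", [4.5, 10, 2])
print(units, round(factor, 4))
```

Comparing means only makes sense when both columns describe similar populations of items, which is itself an assumption of this sketch.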

Example: Unit Conversion Searcher (cont.)

Similarity Estimator
Input: match candidates.
Output: similarity matrix.

The similarity matrix stores the similarity score of each (target attribute, match candidate) pair.

Similarity Estimator
The similarity estimator receives the results from all the searchers.
It then gathers the data and calculates a final score for each match.

Similarity Estimator (cont.)
The similarity estimator uses two methods to score match pairs:
  • Name-based evaluator: returns a score based on the names of the source and target attributes and the names of the tables.
  • Naive Bayes evaluator: returns a score based on the contents of the attributes.

Match Selector
Input: similarity matrix.
Output: 1-1 and complex matches.

Match Selector
The match selector examines the similarity matrix and outputs the best matches under certain conditions.
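One simple reading of "best matches under certain conditions" is to pick, for each target attribute, the highest-scoring candidate, provided its score clears a threshold. The Python sketch below makes that assumption. The matrix layout, the `threshold` parameter, and the sample scores are hypothetical and do not come from iMAP.

```python
def select_matches(similarity, threshold=0.5):
    """Pick the top-scoring candidate per target attribute, if it clears the threshold.

    `similarity` maps each target attribute to {candidate expression: score}.
    """
    selected = {}
    for target, candidates in similarity.items():
        if not candidates:
            continue
        best, best_score = max(candidates.items(), key=lambda item: item[1])
        if best_score >= threshold:
            selected[target] = best
    return selected


# Hypothetical similarity matrix for the real-estate example.
similarity = {
    "area": {"location": 0.9, "city": 0.4},
    "list-price": {"price": 0.6, "price * (1 + fee-rate)": 0.8},
    "agent-name": {"location": 0.2},
}
print(select_matches(similarity))
```

Here agent-name receives no match because its only candidate scores below the threshold. A fuller selector would also apply domain constraints at this stage.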

Exploiting Domain Knowledge
Exploiting domain knowledge has been shown to be beneficial for 1-1 matching.
For complex matching it can be even more crucial, since early detection of unlikely matches saves valuable processing.

Domain Constraints
Constraints are either present in the schema or provided by an expert or the user.
iMAP considers three kinds of constraints:
  • Two attributes are unrelated
  • A constraint on a single attribute
  • Multiple schema attributes are unrelated

  • Any searcher can use the "two attributes are unrelated" constraint.
  • Any searcher can use the single-attribute constraint, but if the check is expensive it may be moved to the similarity estimator level.
  • The "multiple attributes are unrelated" constraint can only be used at the match selector level.

Sources for Domain Constraints
  • Past complex matches
  • Overlap data
  • External data

Past Complex Matches
We often find that we map the same or similar schemas repeatedly.
iMAP can extract a template expression from such matches.
Example: given the past match price = pr * (1 + 0.6), iMAP will extract VAR * (1 + CONST) and ask the numeric searcher to look for matches for that template.

Overlap Data
In some cases, both the source and the target share the same data.
This can be used as information for the matching process.
Searchers that exploit overlap data:
  • Overlap text searcher
  • Overlap numeric searcher
  • Overlap category and schema mismatch searcher

External Data
External data is used as additional constraints on the attributes of a schema.
It is usually provided by experts.
It can be very useful in schema matching.

Explanations in iMAP
Why do we need them?

Generating Explanations in iMAP
iMAP's goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas.
For a user to know which match to choose, it is necessary to supply an explanation for each of the matches.

User Questions
iMAP considers three questions that a user might ask:
  • Why does the match exist?
  • Why doesn't the match exist?
  • Why is one match better than the other?

Explanation Generation
iMAP keeps track of the decision-making process as a dependency graph:
  • Each node is a schema attribute, an assumption, a candidate match, or a piece of domain knowledge.
  • An edge between two nodes means that one node led to the other.
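A dependency graph of this kind can be sketched in a few lines of Python. The class name `DependencyGraph`, its methods, and the node labels are hypothetical; the sketch only illustrates how following edges backwards from a match gives the chain of reasons behind it, not how iMAP actually stores or renders explanations.

```python
from collections import defaultdict


class DependencyGraph:
    """Records which nodes led to which, so a decision can be traced back."""

    def __init__(self):
        self._causes = defaultdict(list)  # node -> nodes that led to it

    def add_edge(self, cause, effect):
        """Record that `cause` led to `effect`."""
        self._causes[effect].append(cause)

    def explain(self, node, depth=0, _seen=None):
        """Return indented lines listing everything that led to `node`."""
        seen = set() if _seen is None else _seen
        lines = ["  " * depth + node]
        if node in seen:
            return lines  # already expanded elsewhere; avoid looping
        seen.add(node)
        for cause in self._causes.get(node, []):
            lines.extend(self.explain(cause, depth + 1, seen))
        return lines


graph = DependencyGraph()
candidate = "candidate: price * (1 + fee-rate)"
match = "match: list-price = price * (1 + fee-rate)"
graph.add_edge("attribute: price", candidate)
graph.add_edge("attribute: fee-rate", candidate)
graph.add_edge("domain knowledge: template VAR * (1 + CONST)", candidate)
graph.add_edge(candidate, match)

print("\n".join(graph.explain(match)))
```

Answering "why does this match exist?" then amounts to printing the nodes reachable backwards from the match node.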

Explanation Generation Example