31
CS742 – Distributed & Parallel DBMS Page 9.1 M. Tamer Özsu Outline Introduction & architectural issues Data distribution Distributed query processing Distributed query optimization Distributed transactions & concurrency control Distributed reliability Data replication Parallel database systems Database integration & querying Schema matching Schema mapping Peer-to-Peer data management Stream data management MapReduce-based distributed data management

Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

Embed Size (px)

Citation preview

Page 1: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.1 M. Tamer Özsu

Outline n  Introduction & architectural issues n Data distribution n Distributed query processing n Distributed query optimization n Distributed transactions & concurrency control n Distributed reliability n Data replication n Parallel database systems q Database integration & querying

q Schema matching q Schema mapping

q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Page 2: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.2 M. Tamer Özsu

Problem Definition

n Given existing databases with their Local Conceptual Schemas (LCSs), how to integrate the LCSs into a Global Conceptual Schema (GCS) l GCS is also called mediated schema

n Bottom-up design process

Page 3: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.3 M. Tamer Özsu

Integration Alternatives

n Physical integration l Source databases integrated and the integrated database is

materialized l Data warehouses

n Logical integration l Global conceptual schema is virtual and not materialized l Enterprise Information Integration (EII)

Page 4: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.4 M. Tamer Özsu

Data Warehouse Approach

Page 5: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.5 M. Tamer Özsu

Bottom-up Design

n GCS (also called mediated schema) is defined first l Map LCSs to this schema l As in data warehouses

n GCS is defined as an integration of parts of LCSs l Generate GCS and map LCSs to this GCS

Page 6: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.6 M. Tamer Özsu

GCS/LCS Relationship

n Local-as-view l The GCS definition is assumed to exist, and each LCS is

treated as a view definition over it

n Global-as-view l The GCS is defined as a set of views over the LCSs

Page 7: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.7 M. Tamer Özsu

Database Integration Process

Page 8: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.8 M. Tamer Özsu

Recall Access Architecture

Page 9: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.9 M. Tamer Özsu

Database Integration Issues

n Schema translation l Component database schemas translated to a common

intermediate canonical representation

n Schema generation l  Intermediate schemas are used to create a global

conceptual schema

Page 10: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.10 M. Tamer Özsu

Schema Translation

n What is the canonical data model? l Relational l Entity-relationship

u DIKE l Object-oriented

u ARTEMIS l Graph-oriented

u DIPE, TranScm, COMA, Cupid u Preferable with emergence of XML u No common graph formalism

n Mapping algorithms l These are well-known

Page 11: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.11 M. Tamer Özsu

Schema Generation

n Schema matching l Finding the correspondences between multiple schemas

n Schema integration l Creation of the GCS (or mediated schema) using the

correspondences

n Schema mapping l How to map data from local databases to the GCS

n Important: sometimes the GCS is defined first and schema matching and schema mapping is done against this target GCS

Page 12: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.12 M. Tamer Özsu

Running Example

EMP(ENO, ENAME, TITLE) PROJ(PNO, PNAME, BUDGET, LOC, CNAME) ASG(ENO, PNO, RESP, DUR) PAY(TITLE, SAL)

Relational

E-R Model

Page 13: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.13 M. Tamer Özsu

Schema Matching

n Schema heterogeneity l Structural heterogeneity

u Type conflicts u Dependency conflicts u Key conflicts u Behavioral conflicts

l Semantic heterogeneity u More important and harder to deal with u Synonyms, homonyms, hypernyms u Different ontology u  Imprecise wording

Page 14: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.14 M. Tamer Özsu

Schema Matching (cont’d)

n Other complications l  Insufficient schema and instance information l Unavailability of schema documentation l Subjectivity of matching

n Issues that affect schema matching l Schema versus instance matching l Element versus structure level matching l Matching cardinality

Page 15: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.15 M. Tamer Özsu

Schema Matching Approaches

Page 16: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.16 M. Tamer Özsu

Linguistic Schema Matching

n Use element names and other textual information (textual descriptions, annotations)

n May use external sources (e.g., Thesauri) n 〈SC1.element-1 ≈ SC2.element-2, p,s〉

l Element-1 in schema SC1 is similar to element-2 in schema SC2 if predicate p holds with a similarity value of s

n Schema level l Deal with names of schema elements l Handle cases such as synonyms, homonyms, hypernyms,

data type similarities

n Instance level l Focus on information retrieval techniques (e.g., word

frequencies, key terms) l  “Deduce” similarities from these

Page 17: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.17 M. Tamer Özsu

Linguistic Matchers

n Use a set of linguistic (terminological) rules n Basic rules can be hand-crafted or may be

discovered from outside sources (e.g., WordNet) n Predicate p and similarity value s

l hand-crafted ⇒ specified, l discovered ⇒ may be computed or specified by an expert

after discovery

n Examples l  〈uppercase names ≈ lower case names, true, 1.0〉 l  〈uppercase names ≈ capitalized names, true, 1.0〉 l  〈capitalized names ≈ lower case names, true, 1.0〉 l  〈DB1.ASG ≈ DB2.WORKS_IN, true, 0.8〉

Page 18: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.18 M. Tamer Özsu

Automatic Discovery of Name Similarities

n Affixes l Common prefixes and suffixes between two element name strings

n N-grams l Comparing how many substrings of length n are common between

the two name strings

n Edit distance l Number of character modifications (additions, deletions,

insertions) that needs to be performed to convert one string into the other

n Soundex code l Phonetic similarity between names based on their soundex codes

n Also look at data types l Data type similarity may suggest relationship

Page 19: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.19 M. Tamer Özsu

N-gram Example

n 3-grams of string “Responsibility” are the following:

lRes l sib

libi l esp

lbip l spo

lili l pon

llit l ons

lity l nsi

n 3-grams of string “Resp” are l Res

l  esp

n 3-gram similarity: 2/12 = 0.17

Page 20: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.20 M. Tamer Özsu

Edit Distance Example

n Again consider “Responsibility” and “Resp”

n To convert “Responsibility” to “Resp”

l Delete characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”

n To convert “Resp” to “Responsibility”

l Add characters “o”, “n”, “s”, “i”, “b”, “i”, “l”, “i”, “t”, “y”

n The number of edit operations required is 10

n Similarity is 1 − (10/14) = 0.29

Page 21: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.21 M. Tamer Özsu

Constraint-based Matchers

n Data always have constraints – use them l Data type information l Value ranges l …

n Examples l RESP and RESPONSIBILITY: n-gram similarity = 0.17,

edit distance similarity = 0.19 (low) l  If they come from the same domain, this may increase their

similarity value l ENO in relational, WORKER.NUMBER and

PROJECT.NUMBER in E-R l ENO and WORKER.NUMBER may have type INTEGER

while PROJECT.NUMBER may have STRING

Page 22: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.22 M. Tamer Özsu

Constraint-based Structural Matching

n If two schema elements are structurally similar, then there is a higher likelihood that they represent the same concept

n Structural similarity: l Same properties (attributes)

l  “Neighborhood” similarity

u Using graph representation

u The set of nodes that can be reached within a particular path length from a node are the neighbors of that node

u  If two concepts (nodes) have similar set of neighbors, they are likely to represent the same concept

Page 23: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.23 M. Tamer Özsu

Learning-based Schema Matching

n Use machine learning techniques to determine schema matches

n Classification problem: classify concepts from various schemas into classes according to their similarity. Those that fall into the same class represent similar concepts

n Similarity is defined according to features of data instances

n Classification is “learned” from a training set

Page 24: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.24 M. Tamer Özsu

Learning-based Schema Matching

Page 25: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.25 M. Tamer Özsu

Combined Schema Matching Approaches

n Use multiple matchers

l Each matcher focuses on one area (name, etc)

n Meta-matcher integrates these into one prediction

n Integration may be simple (take average of similarity values) or more complex (see Fagin’s work)

Page 26: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.26 M. Tamer Özsu

Schema Integration

n Use the correspondences to create a GCS n Mainly a manual process, although rules can help

Page 27: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.27 M. Tamer Özsu

Binary Integration Methods

Page 28: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.28 M. Tamer Özsu

N-ary Integration Methods

Page 29: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.29 M. Tamer Özsu

Schema Mapping

n Mapping data from each local database (source) to GCS (target) while preserving semantic consistency as defined in both source and target.

n Data warehouses ⇒ actual translation

n Data integration systems ⇒ discover mappings that can be used in the query processing phase

n Mapping creation

n Mapping maintenance

Page 30: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.30 M. Tamer Özsu

Mapping Creation

Given l A source LCS

l A target GCS

l A set of value correspondences discovered during schema matching phase

Produce a set of queries that, when executed, will create GCS data instances from the source data.

We are looking, for each Tk, a query Qk that is defined on a (possibly proper) subset of the relations in S such that, when executed, will generate data for Ti from the source relations

[S = {Si}][T = {Ti}]

[V = {Vi}]

Page 31: Outline - University of Waterlootozsu/courses/CS742/Course Notes/9a... · "Database integration & querying "Schema matching ... Schema generation ... PROJECT.NUMBER in E-R

CS742 – Distributed & Parallel DBMS Page 9.31 M. Tamer Özsu

Mapping Creation Algorithm

General idea: n Consider each Tk in turn. Divide Vk into

subsets such that each specifies one possible way that values of Tk can be computed.

n Each can be mapped to a query that, when executed, would generate some of Tk’s data.

n Union of these queries gives

{V 1k , . . . , V

nk } V j

k

V jk qjk

Qk(= [jqjk)