Distributed Database Concepts, 8:30-10:00AM Thursday, July 21st 2005, CSIG05, Chaitan Baru

1

Distributed Database Concepts

8:30-10:00AM, Thursday, July 21st 2005

CSIG05

Chaitan Baru

2

What is the issue?

• Ability to access data stored in multiple, different databases using a single request, e.g.
  – Get geologic information from multiple geologic databases
  – Get employee information from all branches
• Ability to update data stored in multiple databases, e.g.
  – Transfer salary amount from University to my bank account
  – Transfer funds from Visa account to vendor’s account

3

Distributed data access

[Diagram: a Client accesses Database 1, Database 2, and Database 3]

Homogeneous: mySQL, mySQL, mySQL
Heterogeneous: mySQL, Oracle, DB2

How about creating a “cached” local copy?

mySQL, Excel, ASCII flat file

4

Data Warehousing

[Diagram: the Client queries the Data Warehouse (common schema); Data Source 1, Data Source 2, and Data Source 3 each feed the warehouse through ETL]

ETL
  – Extract
  – Transform
  – Load

1. Load data from sources to warehouse
2. Query processing interaction only between client and warehouse

But, warehouse data could be “stale”, i.e. out of sync with source data…
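The two warehousing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach of any particular warehouse product: the source records, their field names (`sample`, `rock`, `SampleID`, `RockType`), and the `warehouse` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source data: each source uses its own field names.
SOURCE_1 = [{"sample": "S1-001", "rock": "Granite"}]
SOURCE_2 = [{"SampleID": "S2-001", "RockType": "basalt"}]


def extract(source):
    """Extract: read all records from a source (here, simply a list)."""
    return list(source)


def transform(record, field_map, source_name):
    """Transform: rename fields to the common schema and normalize values."""
    row = {common: record[local] for common, local in field_map.items()}
    row["Rock_Type"] = row["Rock_Type"].capitalize()
    row["Source"] = source_name
    return row


def load(conn, rows):
    """Load: insert the transformed rows into the warehouse table."""
    conn.executemany(
        "INSERT INTO warehouse (SampleID, Rock_Type, Source) "
        "VALUES (:SampleID, :Rock_Type, :Source)",
        rows,
    )
    conn.commit()


def build_warehouse():
    """Step 1: run ETL once per source into a common schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE warehouse (SampleID TEXT, Rock_Type TEXT, Source TEXT)"
    )
    jobs = [
        (SOURCE_1, {"SampleID": "sample", "Rock_Type": "rock"}, "source1"),
        (SOURCE_2, {"SampleID": "SampleID", "Rock_Type": "RockType"}, "source2"),
    ]
    for source, field_map, name in jobs:
        load(conn, [transform(r, field_map, name) for r in extract(source)])
    return conn


def query_rock_types(conn):
    """Step 2: the client talks only to the warehouse, never to the sources."""
    rows = conn.execute(
        "SELECT DISTINCT Rock_Type FROM warehouse ORDER BY Rock_Type"
    )
    return [r[0] for r in rows]
```

In this sketch, records added to `SOURCE_1` after `build_warehouse()` has run are not visible to `query_rock_types` until the ETL is run again, which is the staleness issue noted above.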

5

Data integration via middleware

[Diagram: the Client connects to Data integration Middleware (aka Mediator), which connects to Database 1, Database 2, and Database 3]

1. Each client request goes to sources, via middleware
2. Result collected by middleware and returned to client
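The request flow above can be sketched as a small fan-out-and-collect loop. This is only an illustration: the `Mediator` class, the `make_wrapper` helper, and the dictionary-based request format are hypothetical and do not correspond to any specific middleware product.

```python
class Mediator:
    """Forwards each client request to every source and merges the results."""

    def __init__(self, wrappers):
        # wrappers: source name -> callable(request) returning a list of rows
        self.wrappers = wrappers

    def query(self, request):
        results = []
        for name, wrapper in self.wrappers.items():
            # Step 1: the request goes to each source via the middleware.
            for row in wrapper(request):
                # Step 2: results are collected and tagged with their origin.
                results.append({**row, "source": name})
        return results


def make_wrapper(rows):
    """Build a toy source that answers equality filters over in-memory rows."""

    def wrapper(request):
        return [
            r for r in rows
            if all(r.get(field) == value for field, value in request.items())
        ]

    return wrapper
```

Unlike the warehouse approach, nothing is copied ahead of time here: every call to `query` reads the sources as they are at that moment.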

6

Warehousing vs Mediation

• Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema
• Mediation: Modify user query to match schemas exported by each source
  – But, which schema does the user query?
  – The Integrated View Schema
  – Sources “export” a view (the export schema)
• Federated databases
  – Local sources belong to different “administrative domains”, i.e. different owners.
  – Local autonomy

7

The Canonical Mediator / Wrapper Architecture

[Diagram, top to bottom:
  Client Application
  Mediator (Integrated view in mediator data model, e.g. relational, XML), with Cached data
  Four Wrappers, each exposing an Export view in mediator data model over a Local view in local data model
  Data source 1, Data source 2, Data source 3, Data source 4, each with its own Local schema
  Query labels: Q1; Q11, Q12, Q13, Q14; q14]

Wrapper processes could execute at sources, at mediator, or elsewhere

8

Example: A Relational Mediator

[Diagram: Client Application; Mediator (Relational data model); two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]

9

Example: A Shape-file Based Mediator

[Diagram: Client Application; Mediator (Shape file-based data model); two Wrappers, one over a Relational DBMS (e.g. PostGIS) and one over a Shape file]

10

Example: An XML Mediator

[Diagram: User / Applications; Mediator (XML-based data model, e.g. GML); three Wrappers, over a Relational DBMS (e.g. PostGIS), a Shape file, and an XML file (e.g. ArcXML)]

11

User Authentication and Access Control

[Diagram: Client Application; Mediator; two Wrappers; Data source 1 and Data source 2]

1. User authenticates to system
2. User connects to mediator (passes credentials to mediator)
3. Mediator connects to sources
   a) Using original user credentials
   b) Or, mapped credentials (role-based access)
4. Need to define users or roles in sources
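Step 3 above offers two options for how the mediator connects to a source. A minimal sketch of both follows; the user names, roles, account names, and passwords are all invented for illustration, and a real deployment would keep such secrets in a credential store rather than in code.

```python
# Hypothetical user-to-role mapping held by the mediator.
USER_ROLES = {"alice": "geologist", "bob": "guest"}

# Hypothetical per-source accounts, one per role (step 4: roles must be
# defined in each source before the mediator can map to them).
SOURCE_ROLE_ACCOUNTS = {
    "source1": {
        "geologist": ("geo_rw", "pw-geo-1"),
        "guest": ("guest_ro", "pw-guest-1"),
    },
    "source2": {"geologist": ("geologist", "pw-geo-2")},
}


def source_credentials(user, password, source, mapped=True):
    """Return the (account, password) pair the mediator presents to a source."""
    if not mapped:
        # Option (a): pass the original user credentials through unchanged.
        return (user, password)
    # Option (b): map the user to a role, then to that role's source account.
    role = USER_ROLES.get(user)
    accounts = SOURCE_ROLE_ACCOUNTS.get(source, {})
    if role not in accounts:
        raise PermissionError(f"no account for role {role!r} in {source}")
    return accounts[role]
```

With option (a), every user must exist in every source; with option (b), each source only needs one account per role.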

12

Different types of heterogeneity in data integration

• Platform heterogeneity: different OS platforms

• DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2

• Data type heterogeneity

• Schema heterogeneity

• Heterogeneity in units, accuracy, resolution

• Semantic heterogeneity

13

Schema Integration

• A long standing Computer Science problem
• Simple case
  – Mediator View: (SampleID varchar, Rock_Type varchar, Age int)
  – In Source 2 Table, map Age to int

Source 1 Table
  Sample ID   Rock type   Age   …
  varchar     varchar     int

Source 2 Table
  Sample ID   Rock type   Age       …
  varchar     varchar     varchar

Wrapper: convert between int and varchar for Age
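The wrapper's type conversion for the simple case can be sketched as below. The function names and the dictionary row format are assumptions made for illustration; how a non-numeric Age should be handled is not stated in the slides, so this sketch simply returns `None` for it.

```python
def age_to_int(age_text):
    """Convert Source 2's varchar Age to the mediator's int Age.

    Returns None when the text is missing or not a whole number.
    """
    try:
        return int(age_text.strip())
    except (AttributeError, ValueError):
        return None


def age_to_varchar(age):
    """Convert a mediator int Age back to Source 2's varchar form,
    e.g. when pushing a predicate down to the source."""
    return str(age)


def wrap_source2_row(row):
    """Map one Source 2 row (hypothetical dict layout) to the mediator view."""
    return {
        "SampleID": row["Sample ID"],
        "Rock_Type": row["Rock type"],
        "Age": age_to_int(row["Age"]),
    }
```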

14

Another integration scenario

– Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
– In Source 2 Table, parse Age to obtain sub-components of the field

Source 1 Table
  Sample ID   Rock type   Eon       Era       Period
  varchar     varchar     varchar   varchar   varchar
  Example values: Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic

Source 2 Table
  Sample ID   Rock type   Age
  varchar     varchar     varchar
  Example value: Age = “Phanerozoic/mesozoic;jur”
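Parsing the Source 2 Age field into its sub-components might look like the sketch below. It assumes the field always follows the pattern of the single example value shown (Eon, then "/", then Era, then ";", then an abbreviated Period), and the `PERIODS` lookup list used to expand abbreviations is a hypothetical stand-in for a fuller vocabulary.

```python
import re

# Assumed vocabulary for expanding abbreviated period names.
PERIODS = ["Triassic", "Jurassic", "Cretaceous"]


def expand(token, vocabulary):
    """Expand an abbreviation to the single vocabulary entry it prefixes.

    Falls back to the capitalized token when there is no unique match.
    """
    matches = [n for n in vocabulary if n.lower().startswith(token.lower())]
    return matches[0] if len(matches) == 1 else token.capitalize()


def parse_age(age):
    """Split an Age string such as 'Phanerozoic/mesozoic;jur' into parts."""
    parts = [p.strip() for p in re.split(r"[/;]", age) if p.strip()]
    parts += [None] * (3 - len(parts))  # pad when components are missing
    eon, era, period = parts[:3]
    return {
        "Eon": eon.capitalize() if eon else None,
        "Era": era.capitalize() if era else None,
        "Period": expand(period, PERIODS) if period else None,
    }
```

Calling `parse_age("Phanerozoic/mesozoic;jur")` yields the same Eon, Era, and Period values as the Source 1 example row, so both sources can be presented through one view.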

15

A more advanced integration scenario

• Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
  – Same as Source 1 table schema
• Query: Get rock types for all rocks from the Jurassic period

Source 1 Table
  Sample ID   Rock type   Eon       Era       Period
  varchar     varchar     varchar   varchar   varchar
  Example values: Eon = Phanerozoic, Era = Mesozoic, Period = Jurassic

Source 2 Table
  Sample ID   Rock type   Age
  varchar     varchar     int
  Example value: Age = 150

16

Doing the integration

• Query sent to mediator:

  SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period = 'Jurassic'

• Query to Source 1:

  SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period = 'Jurassic'

• For Source 2, need to map Period = 'Jurassic' to Age values

Source 2 Table
  Sample ID   Rock type   Age
  varchar     varchar     int

Geologic_Time Table
  Eon       Era       Period    Min   Max
  varchar   varchar   varchar   int   int

17

Query “fragment” sent to Source 2

  SELECT DISTINCT (S2.Rock_Type)
  FROM Source2_Table S2, Geologic_Time_Table GT
  WHERE GT.Period = 'Jurassic'
    AND (S2.Age >= GT.Min)
    AND (S2.Age <= GT.Max)

Where is the Geologic_Time_Table stored?
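To see the fragment run end to end, the sketch below executes it with Python's built-in `sqlite3` module. It assumes one possible answer to the question above, namely that the geologic time table is stored alongside Source 2; the alternative is for the mediator to look up the range itself and send plain numeric bounds. The sample rows and rock types are invented, ages are taken to be in millions of years, and the period boundaries are rounded approximations.

```python
import sqlite3


def build_source2():
    """Create an in-memory stand-in for Source 2 plus the time lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER);
        CREATE TABLE Geologic_Time_Table
            (Eon TEXT, Era TEXT, Period TEXT, Min INTEGER, Max INTEGER);
        """
    )
    # Approximate, rounded period boundaries (illustrative values only).
    conn.executemany(
        "INSERT INTO Geologic_Time_Table VALUES (?, ?, ?, ?, ?)",
        [
            ("Phanerozoic", "Mesozoic", "Jurassic", 145, 201),
            ("Phanerozoic", "Mesozoic", "Cretaceous", 66, 145),
        ],
    )
    # Invented sample rows; Age = 150 matches the example value in the slides.
    conn.executemany(
        "INSERT INTO Source2_Table VALUES (?, ?, ?)",
        [("S2-001", "Sandstone", 150), ("S2-002", "Limestone", 100)],
    )
    return conn


# The query fragment from the slide, with the period as a bound parameter.
FRAGMENT = """
    SELECT DISTINCT (S2.Rock_Type)
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = ?
      AND (S2.Age >= GT.Min)
      AND (S2.Age <= GT.Max)
"""


def rock_types_for_period(conn, period):
    """Return the distinct rock types whose Age falls in the named period."""
    return [row[0] for row in conn.execute(FRAGMENT, (period,))]
```

Because both bounds are inclusive, a sample dated exactly on a boundary (for example 145) would match two adjacent periods in this sketch.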

18

Another complex query

• Query: Get rock types for all rocks from the Mesozoic era
  – Easy to do for Source 1: Era = “Mesozoic”
  – For Source 2:
    • Need to find numeric age range for Mesozoic
      – Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
    • Select all Source 2 Table records whose Age falls within the Mesozoic age range
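The two Source 2 steps above can be sketched with `sqlite3` as follows: first derive the era's age range from its periods, then filter on that range. As before, the table contents are assumptions: ages are taken to be in millions of years, the period boundaries are rounded approximations, and the sample rows are invented.

```python
import sqlite3

# Approximate, rounded boundaries (illustrative values only).
GEOLOGIC_TIME = [
    ("Phanerozoic", "Paleozoic", "Permian", 252, 299),
    ("Phanerozoic", "Mesozoic", "Triassic", 201, 252),
    ("Phanerozoic", "Mesozoic", "Jurassic", 145, 201),
    ("Phanerozoic", "Mesozoic", "Cretaceous", 66, 145),
]

# Invented Source 2 rows.
SAMPLES = [
    ("S2-001", "Sandstone", 150),
    ("S2-002", "Limestone", 100),
    ("S2-003", "Shale", 280),
]


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Geologic_Time_Table "
        "(Eon TEXT, Era TEXT, Period TEXT, Min INTEGER, Max INTEGER)"
    )
    conn.execute(
        "CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Geologic_Time_Table VALUES (?, ?, ?, ?, ?)", GEOLOGIC_TIME
    )
    conn.executemany("INSERT INTO Source2_Table VALUES (?, ?, ?)", SAMPLES)
    return conn


def era_age_range(conn, era):
    """Find the age range spanned by all periods (subclasses) of an era."""
    row = conn.execute(
        "SELECT MIN(GT.Min), MAX(GT.Max) FROM Geologic_Time_Table GT "
        "WHERE GT.Era = ?",
        (era,),
    ).fetchone()
    return (row[0], row[1])


def rock_types_for_era(conn, era):
    """Select rock types of Source 2 records whose Age lies in the era range."""
    youngest, oldest = era_age_range(conn, era)
    rows = conn.execute(
        "SELECT DISTINCT Rock_Type FROM Source2_Table "
        "WHERE Age >= ? AND Age <= ? ORDER BY Rock_Type",
        (youngest, oldest),
    )
    return [r[0] for r in rows]
```

The range is computed from the lookup table rather than hard-coded, so adding or correcting a period row changes the era's bounds without touching the query.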

19

Data Integration Carts©

• Integrating data sets without explicitly creating views
• An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
  – Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region”
    • Need gazetteer / spatial ontology to determine Rocky Mountain region
    • Need to know classification of datasets (as gravity and geology)
    • Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
  – Plot gravity point data that fall within polygons of rocks of given type
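The spatial part of the example request can be sketched in plain Python: a bounding-box intersection test for the metadata-level filtering, and a point-in-polygon test for selecting gravity points. This is not GEONsearch's implementation; the function names and data layout are hypothetical, coordinates are treated as planar, and the point-in-polygon test is the standard ray-casting method.

```python
def bboxes_intersect(a, b):
    """Boxes are (min_x, min_y, max_x, max_y); touching boxes count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bbox_of(polygon):
    """Bounding box of a polygon given as a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def point_in_bbox(point, box):
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def point_in_polygon(point, polygon):
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def gravity_points_in_rock(points, rock_polygons, rock_type, region_bbox):
    """Gravity points inside polygons of the given rock type, within the region.

    rock_polygons is a list of (rock_type, polygon) pairs.
    """
    selected = []
    for polygon_type, polygon in rock_polygons:
        if polygon_type != rock_type:
            continue
        if not bboxes_intersect(bbox_of(polygon), region_bbox):
            continue  # cheap extent check before the per-point test
        selected.extend(
            p for p in points
            if point_in_bbox(p, region_bbox) and point_in_polygon(p, polygon)
        )
    return selected
```

The extent check mirrors the metadata-level intersection in the list above: whole datasets or polygons are discarded by bounding box first, and only the survivors get the more expensive per-point test.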

20

Ad hoc integration

[Diagram components: GEONsearch; Search Metadata Catalog (“Geologic and gravity data in Rocky Mountains”); Data Integration Cart© Query; Plot map; Map]

21

Data Registration

[Diagram components:
  Rock Classification Ontology: Igneous, with Granite and Quartz monzonite beneath it
  Spatial Ontology: Location (Latitude, Longitude); Point; Polygon
  Gravity dataset: (X, Y), Metadata
  Geologic dataset: Lat, Long, RockType, Metadata
  Item Registration (Schema registration)
  Item Detail Registration]

22

Data Registration is Important!