Upload
ali-usman
View
473
Download
2
Embed Size (px)
Citation preview
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/1
Outline• Introduction
➡ What is a distributed DBMS
➡ Distributed DBMS Architecture
• Background
• Distributed Database Design
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multidatabase query processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/2
File Systems
program 1
data description 1
program 2
data description 2
program 3
data description 3
File 1
File 2
File 3
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/3
Database Management
database
DBMS
Applicationprogram 1(with datasemantics)
Applicationprogram 2(with datasemantics)
Applicationprogram 3(with datasemantics)
descriptionmanipulation
control
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/4
Motivation
DatabaseTechnology
ComputerNetworks
integration distribution
integration
integration ≠ centralization
DistributedDatabaseSystems
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/5
Distributed Computing
•A number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks.
•What is being distributed?
➡ Processing logic
➡ Function
➡ Data
➡ Control
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/6
What is a Distributed Database System?A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A distributed database management system (D–DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
Distributed database system (DDBS) = DDB + D–DBMS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/7
What is not a DDBS?
•A timesharing computer system
•A loosely (separate primary memory and shared secondary memory) or tightly coupled (shared memory) multiprocessor system
•A database system which resides at one of the nodes of a network of computers - this is a centralized database on a network node
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/8
Centralized DBMS on a Network
Site 5
Site 1
Site 2
Site 3Site 4
CommunicationNetwork
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/9
Distributed DBMS Environment
Site 5
Site 1
Site 2
Site 3Site 4
CommunicationNetwork
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/10
Implicit Assumptions
•Data stored at a number of sites each site logically consists of a single processor.
•Processors at different sites are interconnected by a computer network not a multiprocessor system
➡ Parallel database systems
•Distributed database is a database, not a collection of files data logically related as exhibited in the users’ access patterns
➡ Relational data model
•D-DBMS is a full-fledged DBMS
➡ Not remote file system, not a TP system
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/11
Data Delivery Alternatives
•We characterize the data delivery alternatives along three orthogonal dimensions:
•Delivery modes
•Frequency
•Communication Methods
•Note: not all combinations make sense
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/12
Data delivery•Delivery modes
➡ Pull-only {the transfer of data from servers to clients is initiated by a client pull}
•Push-only {the transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. periodic, irregular, or conditional}
➡ Hybrid (mix of pull and push)
•Frequency
•Periodic (A client request for IBM’s stock price every week is an example of a periodic pull.)
•Conditional (An application that sends out stock prices only when they change is an example of conditional push.)
➡ Ad-hoc or irregular
•Communication Methods {Unicast, One-to-many}
Ch.1/12
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/13
Distributed DBMS Promises
Transparent management of distributed, fragmented, and replicated data
Improved reliability/availability through distributed transactions
Improved performance
Easier and more economical system expansion
Ch.x/13
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/14
Transparency
•Transparency is the separation of the higher level semantics of a system from the lower level implementation issues.
•Fundamental issue is to providedata independence
in the distributed environment
➡ Network (distribution) transparency
➡ Replication transparency
➡ Fragmentation transparency✦ horizontal fragmentation: selection✦ vertical fragmentation: projection✦ hybrid
Ch.x/14
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/15
Example
SELECT ENAME,SAL
FROM EMP,ASG,PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/16
Transparent Access
SELECT ENAME,SAL
FROM EMP,ASG,PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE
Paris projectsParis employees
Paris assignmentsBoston employees
Montreal projectsParis projects
New York projects with budget > 200000
Montreal employeesMontreal assignments
Boston
CommunicationNetwork
Montreal
Paris
NewYork
Boston projectsBoston employees
Boston assignments
Boston projectsNew York employees
New York projectsNew York assignments
Tokyo
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/17
Distributed Database - User View
Distributed Database
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/18
Distributed DBMS - Reality
CommunicationSubsystem
DBMSSoftware
UserApplicationUser
Query
DBMSSoftware
DBMSSoftware
DBMSSoftware
UserQuery
DBMSSoftware
UserQuery
UserApplication
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/19
Types of Transparency
•Data independence {It refers to the immunity of user applications to changes in the definition and organization of data, and vice versa.
•Logical data independence and physical data independence}
•Network transparency (or distribution transparency)
➡ Location transparency
➡ Fragmentation transparency
•Replication transparency
•Fragmentation transparency
{global queries to fragment gueries}
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/20
Who Should Provide Transparency?•Nevertheless, the level of transparency is inevitably a
compromise between ease of use and the difficulty and overhead cost of providing high levels of transparency.
•Gray argues that full transparency makes the management of distributed data very difficult and claims that “applications coded with transparent access to geographically distributed databases have: poor manageability, poor modularity, and poor message performance”.
•He proposes a remote procedure call mechanism between the requestor
•users and the server DBMSs whereby the users would direct their queries to a specific DBMS.
•Application level {code of application, little transperancy}
•Operating system {device drivers within the operating system}
•DBMS Ch.1/20
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/21
Reliability Through Transactions•Replicated components and data should make distributed DBMS
more reliable. {eliminate single points of failure}
•Distributed transactions provide
•Concurrency transparency {sequence of database operations executed as an atomic action. consistent db transformed to another consistent db state}
➡ Failure atomicity {update salary by 10%}
• Distributed transaction support requires implementation of
➡ Distributed concurrency control protocols
➡ Commit protocols
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/22
Potentially Improved Performance•Proximity of data to its points of use {data localization}
➡ Requires some support for fragmentation and replication
1. Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases.
2. Localization reduces remote access delays that are usually involved in wide area networks (for example, the minimum round-trip message propagation delay in satellite-based systems is about 1 second).
•Parallelism in execution
➡ Inter-query parallelism
➡ Intra-query parallelism
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/23
•Parallelism Requirements
• read only queries
Have as much of the data required by each application at the site where the application executes
➡ Full replication
•How about updates?
➡ Mutual consistency
➡ Freshness of copies
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/24
System Expansion
• Issue is database scaling
•Emergence of microprocessor and workstation technologies
➡ Demise of Grosh's law
➡ Client-server model of computing
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/25
Complications Introduced by Distribution•data may be replicated, the distributed database system is
responsible for
(1) choosing one of the stored copies of the requested data for access in case of retrievals, and
(2) making sure that the effect of an update is reflected on each and every copy of that data item.
• if some sites fail or communication fail,
DBMS will ensure update for fail site as soon as system recovers
• the synchronization of transactions on multiple sites is considerably harder than for a centralized system.
Ch.1/25
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/26
Distributed DBMS Issues
•Distributed Database Design {chapter 3}➡ How to distribute the database {portioned and replicated}
➡ Replicated (partial dupliacated or fully duplicated) & non-replicated database distribution
➡ Fragmentation
➡ {research area to minimize cost of storing, processing transactions and communication is NP hard. Proposed solution are based on heuristics}
•Distributed directory management {chapter 3}
➡ Contains information such as description and location about items in db.
➡ Global directory to entire DDBS or local to each site, centralized or distributed, single copy or multiple copy
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/27
Distributed DBMS Issues
•Query Processing {chapter 6-8}➡ Convert user transactions to data manipulation instructions
➡ Optimization problem✦ min{cost = data transmission + local processing}
➡ General formulation is NP-hard
•Concurrency Control {chapter 11}
➡ Synchronization of concurrent accesses
➡ The condition that requires all the values of multiple copies of every data item to converge to the same value is called mutual consistency.
➡ Consistency and isolation of transactions' effects
➡ Deadlock management
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/28
• Distributed deadlock management
•The deadlock problem in DDBSs is similar in nature to that encountered in operating systems. The competition among users for access to a set of resources (data, in this case) can result in a deadlock if the synchronization mechanism is based on locking.
•The well-known alternatives of prevention, avoidance, and detection/recovery also apply to DDBSs.
•Reliability and availability
➡ How to make the system resilient to failures
➡ Atomicity and durability
Ch.1/28
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/29
DirectoryManagement
Relationship Between Issues
Reliability
DeadlockManagement
QueryProcessing
ConcurrencyControl
DistributionDesign
The design ofdistributed databases affects many areas. It affects
directory management, because thedefinition of fragments and their placement
determine the contents of the directory(or directories) as well as the strategies that may
be employed to manage them.
The same information (i.e., fragment structure and placement) is used by the
queryprocessor to determine the query
evaluation strategy.
On the other hand, the accessand usage patterns that are determined
by the query processor are used as inputs to
the data distribution and fragmentation algorithms. Similarly, directory placementand contents influence the processing of
queries.
The replication of fragments
when they are distributed affects the
concurrencycontrol
strategies that might be
employed.
There is a strong relationship among the concurrency control problem, the deadlock
management problem, and reliability issues. This is to be expected, since together
they are usually called the transaction management problem. The concurrency
control algorithm that is employed will determine whether or not a separate deadlock
management facility is required. If a locking-based algorithm is used, deadlocks will
occur, whereas they will not if timestamping is the chosen alternative.
Finally, the need for replication protocols arise if data distribution involves
replicas.As indicated above, there is a strong
relationship between replication protocols and
concurrency control techniques, since both deal with the consistency of data,
but fromdifferent perspectives.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/30
Architecture
•Defines the structure of the system
➡ components identified
➡ functions of each component defined
➡ interrelationships and interactions between components defined
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/31
ANSI/SPARC Architecture
ExternalSchema
ConceptualSchema
InternalSchema
Internal view
Users
External view
Conceptual view
External view
External view
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/32
32
Differences between Three Levels of ANSI-SPARC Architecture
© Pearson Education Limited 1995, 2005
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/33
33
Data Independence
•Logical Data Independence
➡ Refers to immunity of external schemas to changes in conceptual schema.
➡ Conceptual schema changes (e.g. addition/removal of entities).
➡ Should not require changes to external schema or rewrites of application programs.
© Pearson Education Limited 1995, 2005
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/34
34
Data Independence
•Physical Data Independence
➡ Refers to immunity of conceptual schema to changes in the internal schema.
➡ Internal schema changes (e.g. using different file organizations, storage structures/devices).
➡ Should not require change to conceptual or external schemas.
© Pearson Education Limited 1995, 2005
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/35
35
Data Independence and the ANSI-SPARC Three-Level Architecture
© Pearson Education Limited 1995, 2005
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/36
Generic DBMS ArchitectureThe interface layer manages the
interface to the applications.View management consists of
translating the user query from external data toconceptual data.The control layer controls the
query by adding semantic integrity predicates andauthorization predicates.The query processing (or
compilation) layer maps the query into an optimizedsequence of lower-level
operations.
decomposes the query into a tree of algebra operations and tries to
find the “optimal”ordering of the operations. The
result is stored in an access plan. The output of this
layer is a query expressed in lower-level code (algebra
operations).The execution layer directs the execution of the access plans,
including transactionmanagement (commit, restart) and synchronization of algebra
operations
The data access layer manages the data structures
that implement the files, indices,
etc. It also manages the buffers by caching the most frequently accessed data.
Finally, the consistency layer manages concurrency control and logging for
updaterequests. This layer allows transaction, system, and
media recovery after failure.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/37
DBMS Implementation Alternatives
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/38
38
Autonomy
Autonomy is a function of a number of factors such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently
execute transactions, and whether one is allowed to modify them.
Degree to which member databases can operate independently
1. The local operations of the individual DBMSs are not affected by their participation in the distributed system.
2. The manner in which the individual DBMSs process queries and optimize them should not be affected by the
execution of global queries that access multiple databases.3. System consistency or operation should not be
compromised when individual DBMSs join or leave the distributed system.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/39
Dimensions of the Problem
• Distribution➡ Whether the components of the system are located on the same
machine or not
• Heterogeneity➡ Various levels (hardware, communications, operating system)➡ DBMS important one
✦ data model, query language, transaction management algorithms
• Autonomy➡ Not well understood and most troublesome➡ Various versions
✦ Design autonomy: Ability of a component DBMS to decide on issues related to its own design.
✦ Communication autonomy: Ability of a component DBMS to decide whether and how to communicate with other DBMSs.
✦ Execution autonomy: Ability of a component DBMS to execute local operations in any manner it wants to.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/40
Ch.1/40
In Figure 1.10, we have identified three alternative architectures that are the focus of this book and that we discus in more detail in the next three subsections: (A0,D1, H0) that corresponds to client/server distributed DBMSs, (A0, D2, H0) that is a peer-to-peer distributed DBMS and (A2, D2, H1) which represents a (peer-topeer)distributed, heterogeneous multidatabase system. Note that we discuss the heterogeneity issues within the context of one system architecture, although the issuearises in other models as well.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/41
Client/Server Architecture
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/42
Advantages of Client-Server Architectures•More efficient division of labor
•Horizontal and vertical scaling of resources
•Better price/performance on client machines
•Ability to use familiar tools on client machines
•Client access to remote data (via standards)
•Full DBMS functionality provided to client workstations
•Overall better system price/performance
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/43
Database Server
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/44
Distributed Database Servers
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/45
Datalogical Distributed DBMS Architecture
...
...
...
ES1 ES2 ESn
GCS
LCS1 LCS2 LCSn
LIS1 LIS2 LISn
We first note that the physical data organization on each
machine may be, andprobably is, different. This
means that there needs to be an individual internal schema
definition at each site, which we call the local internal schema
(LIS). The enterpriseview of the data is described by the global conceptual schema
(GCS), which is globalbecause it describes the logical structure of the data at all the
sites.
To handle data fragmentation and replication, the logical
organization of dataat each site needs to be
described. Therefore, there needs to be a third layer in the
architecture, the local conceptual schema (LCS). In the
architectural model we havechosen, then, the global
conceptual schema is the union of the local conceptual
schemas.
Finally, user applications and user access to the database is
supported byexternal schemas (ESs), defined
as being above the global conceptual schema.
Data independence is supported since the model is an
extension of ANSI/SPARC, which provides such
independence naturally. Location
and replication transparencies are supported by the definition
of the local and globalconceptual schemas and the
mapping in between. Network transparency, on the
other hand, is supported by the definition of the global
conceptual schema.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/46
Peer-to-Peer Component Architecture
Database
DATA PROCESSORUSER PROCESSOR
USER
Userrequests
Systemresponses
ExternalSchema
User
Inte
rface
Han
dle
r
GlobalConceptual
Schema
Sem
an
tic D
ata
Con
troller
Glo
bal
Execu
tion
Mon
itor
SystemLog
Local R
ecovery
Man
ag
er
LocalInternalSchema
Ru
nti
me
Su
pp
ort
Pro
cessor
Local Q
uery
Pro
cessor
LocalConceptual
Schema
Glo
bal Q
uery
Op
tim
izer
GD/D
The detailed components of a distributed DBMS are shown. One
component handles the interaction with users, and another deals with the storage.
Thefirst major component, which we call the user processor, consists of four elements:
The user interface handler is responsible for interpreting
user commands asthey come in, and formatting the result data as it is sent to
the user.
The semantic data controller uses the integrity constraints and
authorizationsthat are defined as part of the global
conceptual schema to check if the user
query can be processed.
The global query optimizer and decomposer determines an execution strategy to minimize a cost function, and translates the global queries into local ones using the
global and local conceptual schemas as well as the global directory.
The distributed execution monitor coordinates the distributed execution of the user request. The execution monitor is also called the distributed transaction manager. In executing queries in a
distributed fashion, the execution monitorsat various sites may, and usually do, communicate
with one another.
The second major component of a
distributed DBMS is the data processor and
consists of three elements:
The local query optimizer, which actually acts as the access path
selector,is responsible for choosing the best access path5 to access any
data item
The local recovery manager is responsible for making
sure that the localdatabase remains consistent
even when failures occur
The run-time support processor physically accesses the database according to the physical
commands in the schedule generated by the query optimizer. The run-time support processor
is the interface to the operating system and contains the database buffer (or cache) manager,
which is responsible for maintaining the main memory buffers and managing the data accesses.
It is important to note, at this point, that our use of the terms “user processor”
and “data processor” does not imply a functional division similar to client/server
systems. These divisions are merely organizational and there is no suggestion thatthey should be placed on different machines.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/47
Ch.1/47
Multidatabase systems (MDBS)•Multidatabase systems (MDBS) represent the case where
individual DBMSs (whether distributed or not) are fully autonomous and have no concept of cooperation; they may not even “know” of each other’s existence or how to talk to each other.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/48
Datalogical Multi-DBMS Architecture
...
GCS… …
GES1
LCS2 LCSn…
…LIS2 LISn
LES11 LES1n LESn1 LESnm
GES2 GESn
LIS1
LCS1
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/49
MDBS Components & Execution
Multi-DBMSLayer
DBMS1 DBMS3DBMS2
GlobalUser
Request
LocalUser
RequestGlobal
SubrequestGlobal
SubrequestGlobal
Subrequest
LocalUser
Request
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.1/50
Mediator/Wrapper Architecture