58
Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Alexandria Dodd Janelle Toungett

Embed Size (px)

Citation preview

Page 1: Distributed Database Alexandria Dodd Janelle Toungett

Distributed DatabaseAlexandria DoddJanelle Toungett

Page 2: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database• What is a distributed database?

• Distributed database concepts

• Types of distributed databases

• Basic architecture

• Data fragmentation, replication, and allocation techniques

• Query processing and optimization

• Transaction management

Page 3: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database• Concurrency control and recovery

• Distributed catalog management

• Current trends

• Distributed databases in Oracle

• Advantages of distributed databases

• Disadvantages of distributed databases

Page 4: Distributed Database Alexandria Dodd Janelle Toungett

What is a Distributed Database?A major motivation behind the development of database systems is the desire to integrate the operational data of an organization and provide controlled access to the data.

• Computer networks promote a decentralized mode of work • Logically distributed• Divisions, departments, projects

• Physically distributed• Offices, plants, factories

A distributed database system makes all the data in each unit accessible and stores it to the location where it is most frequently used.

Page 5: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Concepts• Distributed database (DDB) – A collection of multiple, logically

interrelated databases distributed over a computer network.

• Distributed database management system (DDBMS) – A software system that manages a distributed database while making the distribution transparent to the user.

Page 6: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Concepts• A Distributed Database

Management System (DDBMS) consists of a single logical database that is split into a number of fragments. • Each fragment is stored on

computers that are under the control of a separate DBMS connected by a network

• Each site is capable of processing data stored on other computers in the network.

• Users access the distributed database on applications

Page 7: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Concepts• A Distributed Database Management System (DDBMS) has the

following characteristics• A collection of logically related shared data

• The data is split into a number of fragments

• Fragments may be replicated

• Fragments are allocated to sites

• The sites are linked by a communications network

• The data at each site is under the control of a DBMS

• The DBMS at each site can handle the local applications autonomously

• Each DBMS participates in at least one global application

Page 8: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Concepts• Distributed processing - The operation that occurs when an

application distributes its tasks among different computers in a network

• Distributed database system vs distributed processing

• Replicated databases – The operation of copying and maintaining database objects in multiple databases belonging to a distributed system

• Distributed database system vs Replicated database

Page 9: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Concepts• Transparency – the idea of hiding implementation details from

end users• Data organization

• Location

• Naming

• Replication

• Fragmentation

• Autonomy – the extent to which individual nodes in a connected DDB can operate independently• Design

• Communication

• Execution

Page 10: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed Databases• Homogeneous

• all servers use identical software

• all users use identical software• Ex. each database is an Oracle Database

• Otherwise it is heterogeneous• Ex. at least one of the databases is not an Oracle Database

Page 11: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed Databases• Local Autonomy

• There is a provision for the local site to function as a standalone DBMS.

• Otherwise the system has no local autonomy• It can have some degree of local autonomy, if direct access by local

transactions to a server is permitted

Page 12: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed Databases

Page 13: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed Databases• Federated database system (FDBS)

• There is some global view or schema of the federation of databases that is shared by the applications

• Multi-database system• Full local autonomy

• Does not have a global schema but constructs one as needed• This is not very feasible in practice for enterprise databases

Page 14: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed Databases• Federated database system (FDBS) and Multi-database system

• In these systems, each server is an independent and autonomous centralized DBMS.

• They have their own local users, local transactions, and DBA, and a high degree of local autonomy.

• Both are hybrids between distributed and centralized systems

Page 15: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Differences in data models• In these systems, each server is an independent and autonomous

centralized DBMS.

• They have their own local users, local transactions, and DBA, and a high degree of local autonomy.

• Both are hybrids between distributed and centralized systems

• Differences in constraints• Different systems have different limitations

• Global schema must deal with potential conflicts among constraints

Page 16: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Differences in query languages• Even with same data model, the languages and their versions may

vary

• Ex. SQL has multiple versions and each has its own set of data types, operators, manipulation features, etc.

• Semantic Heterogeneity• Occurs when there are differences in the meaning, interpretation, ad

intended use of the same or related data.

• This is the biggest problem in designing global schemas of heterogeneous databases.

Page 17: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Freedom of choosing from these design parameters(below) which could affect the complexity of the FDBS• The universe of discourse from which the data is drawn• Ex. US and Japan have different sets of attributes about customer accounts• Currency rate fluctuations could also be a problem• Relations in these two DBs that have identical names may have some

common and some entirely distinct information

Page 18: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Representation and naming• Elements and structure of the model may be pre-specified for each local

database

• The understanding, meaning, and subjective interpretation of data

• Transaction and policy constraints

• Derivation of summaries• data processing features and operations supported by the system

Page 19: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Semantic heterogeneity problems are being faced by all major multinational and governmental organizations in all application areas

• Most enterprises are resorting to heterogeneous FDBSs, having heavily invested in the development of individual database systems

Page 20: Distributed Database Alexandria Dodd Janelle Toungett

Types of Distributed DatabasesIssues with FDBS

• Local component databases strive to preserve autonomy. • Communication autonomy • Ability to decide whether to communicate with another component DBS

• Execution autonomy• Ability of a component DBS to execute local operations without interference

from external operations by another component DBSs• Ability to decide order in which to execute

• Association autonomy• Ability to decide whether and how much to share its functionality and

resources with other component DBSs

Page 21: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database Architectures• Distributed database architecture strives to provide the

ultimate transparency

Page 22: Distributed Database Alexandria Dodd Janelle Toungett

• Shared memory – Multiple processors within a single system share system memory

• Shared disk – each processor accesses all disks, but each has its own memory

• Shared nothing – each processor is part of a complete system with its own memory and disk storage

Distributed Database Architectures

Page 23: Distributed Database Alexandria Dodd Janelle Toungett

• This shows the logical structure of underlying data across all nodes

• It is represented by the global conceptual schema (GCS) which provides network transparency

• Each node has its own local internal schema (LIS) to accommodate potential heterogeneity

• The local conceptual schema (LCS) specifies the logical organization of data at each site

Distributed Database Architectures

Page 24: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database ArchitecturesThree-tier Client-Server Architecture

• A database server is the Oracle software managing a database

• A client is an application that requests information from a server.

• Each computer in a network is a node that can host one or more databases.

• Each node in a distributed database system can act as a client, a server, or both, depending on the situation.

Page 25: Distributed Database Alexandria Dodd Janelle Toungett

1. Presentation layer (client)• Provides the user interface

• Handles user input, output, and navigation by accepting user commands and displaying the information in the form of static or dynamic web pages

2. Application Layer (business logic)• Programs the application logic

• Queries formulated from client – results formatted and sent to client

Security checks, verification, and other functions

3. Database server• Handles query and update requests from the

application layer, processes requests, and sends results

• Query results may be formatted into XML

Distributed Database Architectures

Page 26: Distributed Database Alexandria Dodd Janelle Toungett

Data Fragmentation, Replication, and Allocation TechniquesA relation may be divided into a number of sub-relations called fragments which are then distributed. There are three types of fragmentation:

• Horizontal• Subsets of tuples

• Vertical• Subsets of attributes

• Mixed (Hybrid)

Page 27: Distributed Database Alexandria Dodd Janelle Toungett

Data Fragmentation, Replication, and Allocation TechniquesReplication – the DDBMS may maintain a copy of a fragment at several different sites, or in extreme cases, replicate the whole database at every site

• Advantages• Improves availability and performance

• Disadvantages• Slows update operations and is expensive

Page 28: Distributed Database Alexandria Dodd Janelle Toungett

Data Fragmentation, Replication, and Allocation TechniquesAllocation – each fragment is stored at the site with ‘optimal’ distribution

• Centralized

• Fragmented

• Complete replication

• Selective replication

Page 29: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization 1. Query mapping

• Input query is specified formally using a query language

• Then translated into an algebraic query on global relations

• Does not take into account actual distribution and replication of data

• This process is identical to a centralized DBMS

• It is normalized, analyzed for semantic errors, simplified, and restructured into an algebraic query

2. Localization• Fragmentation results in relations being stored in separate sites,

possibly replicated

• Maps the distributed query on the global scheme to separate queries on individual fragments using data distribution and replication information

Page 30: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization 3. Global Query Optimization

• Selects a strategy that is closest to optimal

• Time is the preferred unit for measuring cost

• Total cost is a weighted combination of costs such as CPU cost, I/O costs, and communication costs

• Communication costs over the network are the most significant, especially when the sites are connected through a wide area network (WAN) .

4. Local Query Optimization• Common to all sites in the DDB

• Techniques similar to those in centralized systems

• Stages 1, 2, and 3 are performed at a central control site, while this one is performed locally

Page 31: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization • Data transfer costs

• Cost of transferring data over the network • This is a factor that complicates query processing• Data includes intermediate files transferred to other sites for processing and final

result files to be transferred to the site where the query result is needed

• Costs may not be very high if sites are connected by a high performance local area network

DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as criterion in choosing a distributed query execution strategy

Page 32: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization

Page 33: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization

Consider the query: for each department, retrieve the department name and the name of the department manager

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. A total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred

2. Transfer the EMPLOYEE relation to site 2, execute join at 2, and send result to 3. The size of the query result is 40 * 100 = 4,000 bytes, so 4,000 + 1,000,000 = 1,004,000 bytes must be transferred

3. Transfer the DEPARTMENT relation to site 1, execute join at 1, and send result to 3. 4,000 + 3,500 = 7,500 bytes must be transferred

In this case we would choose strategy number 3

Page 34: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization Using Semi-join

• More complex, works better than the previous strategies

• The idea is to reduce the number of tuples in a relation before transferring it to another site

In other words: Send the joining column of one relation R to the site where the other relation S is located – column joined with S – join attributes along with attributes required for result are projected out and shipped back to original site and joined with R

Page 35: Distributed Database Alexandria Dodd Janelle Toungett

Query Processing and Optimization

Considering our previous example:

1. Project the join attributes of DEPARTMENT at site 2, transfer them to site 1.

We transfer F = Mgr_ssn(DEPARTMENT), sizes 9 * 100 = 900 bytes

2. Join the transferred file with the employee at site 1, and transfer the required attributes to site 2.

We transfer R = Mgr_ssn, Fname, Lname(F ⋈Mgr_ssn=Ssn EMPLOYEE), size is 39 * 100 = 3,900 bytes

3. Execute the query by joining the transferred file R with DEPARTMENT, and present the result to the user at site 2

Using semijoin we transferred 4,800 bytes. We limited the attributes and tuples transmitted to only those that will actually be joined.

Page 36: Distributed Database Alexandria Dodd Janelle Toungett

Transaction Management• The transaction manager coordinates transactions on behalf of

the application programs and communicates to the scheduler

• The recover manager ensures the database is restored to the state it was in before the start of the transaction if there is a failure

• There is a global transaction manager at each site to coordinate the execution of both global and local transactions

Page 37: Distributed Database Alexandria Dodd Janelle Toungett

Transaction Management

Page 38: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Problems occur in a distributed DBMS environment for

concurrency control and recovery that are not encountered in a centralized DBMS environment. These are:• Dealing with multiple copies of the data items• Concurrency control is responsible for maintaining consistency among

these copies• Recovery is responsible for making a copy consistent with other copies if

the site on which the copy is stored fails and recovers later

• Failure of individual sites• The DDBMS should continue to operate with its running sites when one or

more individual sites fail• When site recovers it must be brought up to date with the rest of the sites

before it rejoins the system

Page 39: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery

• Failure of communication links• Must be able to deal with failure of 1+ communication links• Network partitioning may occur – this breaks up the sites into 2+ partitions,

the sites within each partition can only communicate with one another and not with sites in other partitions

• Distributed commit• Problems arise with committing a transaction that is accessing databases

stored on multiple sites if some sites fail during process – use two-phase commit protocol

• Distributed deadlock• May occur among several sites• Must take this into account

Page 40: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and RecoveryDistinguished copy of a data item

• This deals with replicated data items

• Idea is to designate a particular copy of each data item as a distinguished copy

• Locks for this data item are associated with the distinguished copy and all locking and unlocking requests are sent to the site that contains that copy

Page 41: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Primary site technique

• All distinguished copies are kept at the same site

• A single primary site is designated to be the coordinator site for all database items

• All locks kept at that site, all requests for locking/unlocking are sent there

• Advantage: not overly complex, simple extension of the centralized approach

• Disadvantages: (1) All locking requests are sent to a single site, chance for overloading and causing a system bottleneck and (2) Failure of the primary site paralyzes the system since all locking info is kept at that site. This can limit system reliability and availability.

Page 42: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Primary Site with Backup Site

• This fixes second disadvantage of the primary site by designating a second site to be a backup site. All locking information is maintained at both sites.

• If primary site failure, backup site takes over as primary site, new backup site chosen

• This technique slows down the process of acquiring locks because all requests/granting of locks must be recorded at both sites before a response is sent to the requesting transaction

Page 43: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Primary Copy Technique

• Attempts to distribute the load of lock coordination among various sites by having the distinguished copies of different data items stored at different sites

• Can also use backup sites to enhance reliability and availability

Page 44: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Choosing A New Coordinator Site in Case of Failure

• When a coordinator site fails, the sites still running must choose a new coordinator

• Primary site with no backup – executing transactions must be aborted and restarted

• With backup sites – transaction processing is suspended while backup becomes new primary and new backup chosen and locking information copies are sent to new primary

Page 45: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Election - If no backup site existed or if the primary and backup

sites are down this is used to choose new coordinator site

• Any site that attempts to communicate with the coordinator site continuously and fails starts the election process.

• Messages are sent to all running sites proposing which site should become the new coordinator

Page 46: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Voting method

• No distinguished copy

• Lock request is sent to all sites that includes a copy of the data item

• Each copy maintains its own lock and can grant or deny the request for it

• Lock requests are granted/denied by a majority vote from the sites involved

Page 47: Distributed Database Alexandria Dodd Janelle Toungett

Concurrency Control and Recovery• Recovery process quite involved.

• It is difficult to determine whether a site is down without exchanging numerous messages with other sites.

• Another problem is with distributed commit• When transaction is updating data at several sites, it cannot commit

until it is sure that the effect of the transaction on every site cannot be lost

Page 48: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Catalog ManagementCatalogs are databases containing metadata about the DDS

• Centralized Catalog – The entire catalog is stored on a single site• Easy to implement but loses reliability, availability, and autonomy

• Fully Replicated Catalog – Identical copies of the complete catalog are present at each site• Faster reads but updates must be broadcasted to each site

• Partially Replicated Catalogs – Each site is permitted to cache entries received from remote sites• Autonomy not restricted but cached copies will not be the most recent

Page 49: Distributed Database Alexandria Dodd Janelle Toungett

Current Trends• Cloud Computing

• Infrastructure, platforms, and software over the internet

• Peer-to-Peer Database Systems• Nodes are autonomous and are linked only to a small number of

peers

Page 50: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in Oracle• Homogeneous Distributed Database

Page 51: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in Oracle• Connectivity

• Oracle Net Services supports communications between clients and servers

• Global database names• Forms by prefixing the database’s network domain name

Page 52: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in Oracle• Database links

• Communication from one Oracle database to another database (heterogeneous) to make remote data available for queries and updates

• Heterogeneous Services (HS) is an integrated component within the Oracle Database server and the enabling technology for the current version of Oracle

• Distributed query optimization• Decomposed by the local Oracle DBMS and sent to the remote

DBMS

Page 53: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in OracleRemote SQL statement:

SELECT * FROM [email protected];

Update SQL statement:

UPDATE [email protected]

SET loc = “DESOTO”

WHERE deptno = 10;

Page 54: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in OracleDistributed SQL statement:

SELECT fname, lname

FROM mis.teacher e, [email protected] d

WHERE e.deptno = d.deptno;

Page 55: Distributed Database Alexandria Dodd Janelle Toungett

Distributed Database in OracleDistributed Update PL/SQL:

BEGIN

UPDATE [email protected]

SET loc = ‘DESOTO’

WHERE deptno = 10;

UPDATE mis.teacher

SET deptno = 11

WHERE deptno = 10;

END;

COMMIT;

Page 56: Distributed Database Alexandria Dodd Janelle Toungett

Advantages of Distributed Databases• Improved shareability and local autonomy

• Improved availability

• Improved reliability

• Improved performance

• Modular growth

• Integration

Page 57: Distributed Database Alexandria Dodd Janelle Toungett

Disadvantages of Distributed Databases• Complexity

• Cost

• Security

• Integrity control is difficult

Page 58: Distributed Database Alexandria Dodd Janelle Toungett

Questions?