Physical Database Design The last phase of database design. It is to determine how to store the database. RDBMSs usually support a number of alternative

Physical Database Design

The last phase of database design.

It is to determine how to store the database.

RDBMSs usually support a number of alternative physical representations.

The designer need to know the advantage and disadvantage of each representation.

Physical Database Design

Objectives:

Data may be accessed with acceptable speed;

The data does not use up too much computer’s storage;

The database is reasonably robust:

Be possible to recover a damaged database system;

Still be possible to run the reminder if part of system fails.

Physical Design Process

Design decisions should be based on the following knowledge:

The logical database design: which relations are to be included.

Quantities and volatility of data: the number of tuples, the frequency with which each relation will be altered, the rate at which each relation will grow.

The way in which the data is to be used: the frequency of runing the application; then longest time for the application to execute.

Costs for storing and accessing data: how the representation affects the speed to access, insert and delete the records.

Physical Design Process

The design process is a ‘implement-test-improve’ process:

Step 1: analysis of the database to generate a initial design;

Step 2: using test database to test the initial design.

Step 3: modification of initial design by removing Bottle-necks.

Step 4: monitoring the performance of database systems. Modification should be made to correct inappropriate design.

Physical Representation

The requirements of any strategy for representation:

a) It must make it possible to access all data without having to specify where tuples are stored;

b) It must be possible to apply the relational algebra operators restrict, project, join etc.

c) It must be possible to display the relation as a table of values.

Physical Representation

Example of file representation for data

P_No WareHouse Bin_No Quantity

P1 WH1 B1 100

P1 WH1 B3 200

P4 WH3 B2 3000

P2 WH4 B9 50

P5 WH4 B10 50

P5 WH4 B11 50

P1 WH1 B1 100 P1 WH1 B3 200 P4 WH3 B2 3000P2 WH4 B9 50 P5 WH4 B10 50P5 WH4 B11 50

Filepage

records

File Structure and access techniques

1. Heap file, serial search;

2. Sorted file, binary search;

3. Hash file, hashing function;

4. Index: B-tree;

4. Clustering: attribute grouping.

Heap Files

The simplest file structures.

A heap file is constructed as a list of pages.

A new record is inserted in the last page of the file.

Advantage of using heap files:

1) Fast record insertion: just insert news record in the last page.

2) Economic use of store: to store the data records only.

Heap Files

Disadvantage of using heap files:

1) Can only use the serial searching method, which is the slowest searching method;

2) Unable to reclaim the space of deleted records.

When to use heap files:

1)batch records is to be inserted;

2)a few pages long only;

3)used as a part of some other structures.

Access keys

Select *

FROM STOCK

WHERE P_NO= ‘p1’ AND Quantity > 100;

When a file is organized to provide direct access

to records on the basis of values of specific

attributes, then those attributes are called

access keys.

P_NO and Quantity is used as the access keys

Access Key

The most appropriate access keys can be selected on the basis of how tight they are.

A tight access key is one where there are relatively few tuples containing specific access key values.

If there are many such tuples, the access key is said to be loose.

Extreme examples: a primary key is an extremely tight; attribute ‘SEX’ is an extremely loose access key.

Sorted Files

In sorted files, the records are sorted in some order.For sorted file, the binary search can be used to access

the specific tuples.

SELECT *

FROM CUSTOMER

WHERE CUSTOMER_NO=‘C9’;

C1 C2 C3 C5 C7 C9 C12 C15 C19

Sorted Files

When to use sorted files: When tuples are normally accessed in some specific s

equence;

The main issues:

How to maintain the sequence when new records are inserted.

Hash Files

Hashing is the process of calculating the location of a record (page address) from the value of an access key.

The access key is also called the hash key.

Hash files are sometimes called random files as the records appear to be randomly distributed across the file space.

Hashing potentially provides the fastest access to a record via an access key – a record may be retrieved by reading just one page.

Hash Functions

Example:

Hushing function

Modulo-5:

Access key values:

k=12;

l =13;

m=24

<Record Key> [Hushing Function] <Home Address>k ----- f(k) ---- 2l ----- f(l) ---- 3m---- f(m) ---- 4

0

1

234

i Record(k)

j

When to use Hash Files

When retrieval is always on the basis of the value of a single access key.

Inappropriate situations to use the hash files:

1) When retrieval is on the basis of pattern matching.

2) When retrieval is on the basis of range value;

3) When access is on the basis of values only PART of the access key.

Indexes

Indexes are an alternative to hashing as a mechanism for direct access to records.

An index is a table of access key values, along with the address of the records to store the associating value.

P_No_Index

P_No Address

P1 *

P1 **

P2 ***

P4 ****

P5 *****

P5 ******

P6 *******

P_No_Index

P_No Warehouse Bin_No Quantity

P1 WH1 B1 100

P1 WH1 B3 200

P4 WH3 B2 3000

P2 WH4 B9 50

P5 WH4 B10 50

P5 WH4 B11 50

P6 WH5 B1 20

Indexes

Advantage of using indexes:

1) Indexes provide access to sequences of records;

2) Indexes may be used to implement many access keys for one relation, whereas there can be only one hash key.

Index are managed by:

CREATE INDEX

DROP INDEX

Multi-level Indexes

The shorter the index, the faster the search. When an index is large the search time can become significant.

A solution to this problem is to split the index up into a number of shorter indexes and to provide an index to the indexes.

Data Records

1st_Index

2nd_Index

P1 *

P5 **

P1 *

P2 **

P5 *

P6 **

P1 WH1 B1 100

P1 WH1 B3 200

P4 WH3 B2 3000

P2 WH4 B9 50

P5 WH4 B10 50

P5 WH4 B11 50

P6 WH5 B1 20

B-trees

A B-tree is a type of multi-level full index.

B-trees are widely used because they are largely self-maintaining.

A B-tree keeps itself balanced such that it always takes approximately the same time to access any data record.

B-trees

Example:

C_No Name Area

C1 Nippers Ltd W Yorks

C2 Tot-Gear Middl

C9 Kid-Naps Middl

C10 Boys Hats London

C11 Play Time London

C15 School Kit London

C19 Smart Kids Anglia

C23 Bed Socks London

C25 Slugs Anglia

C27 Kids Stuff Middl

C32 Play Ground Middl

C34 Way In W Yorks

* C19 *

* C9 * C11 * * C25 * C32 *

*C

1*

C2

*

*C

9*

C10

*

*C

11*

C15

*

*C

19*

C23

*

*C

25*

C27

*

*C

25*

C27

*

When to use indexes and B-tree

Indexes are suitable for the following situations:

1) Pattern matching based retrieval;

2) Retrieval with rang of access key;

3) Retrieval based on multi-attribute access key.

B-tree is more suitable for the following situation:

1) When the relation is frequently updated;

2) When the relation is so large that costly to recreate;

3) When sorting the access key is required.

Clustering

Clustering is the technique of storing related records physically close together.

The advantage of doing this is that it reduces the number of page accesses necessary to process a group of related records.

When to use clustering:

When an application accesses groups of tuples which have some common attribute value.

Summary

Physical database design is the process of deciding how to store relations;

The decision is based on the logical database design, the volume and volatility of data, the ways to manipulate the data, and the costs of representation.

A physical database design is performed by ‘implement-test-improve’ process.

Typical methods: heap file, sorted files, hash files, and indexes (like B-tree).

Summary

A heap file is a chain of pages where new records are simply added to the end.

Other file organization make it possible to access records speedily, given the value of some access key.

Sorting files could improve the access speed. Binary searching may be used to access records with a specific value of the access key.

Hashing is the process of locating a record by calculating its address from the access key value.

Summary

An index is a table of access key values and associated record addresses. Indexed are slower than hashing but more flexible.

B-tree is a typical index method. Clustering is the technique of storing related

physically close together.

Documents

Physical Database Design The last phase of database design. It is to determine how to store the database. RDBMSs usually support a number of alternative