antony-ramsey
Physical Database Design
Physical database design is the last phase of database design.
Its purpose is to determine how the database is to be stored.
RDBMSs usually support a number of alternative physical representations.
The designer needs to know the advantages and disadvantages of each representation.
Objectives:
Data can be accessed with acceptable speed;
The data does not use up too much of the computer's storage;
The database is reasonably robust:
It should be possible to recover a damaged database system;
It should still be possible to run the remainder if part of the system fails.
Physical Design Process
Design decisions should be based on the following knowledge:
The logical database design: which relations are to be included.
Quantities and volatility of data: the number of tuples, the frequency with which each relation will be altered, and the rate at which each relation will grow.
The way in which the data is to be used: the frequency of running each application, and the longest acceptable time for the application to execute.
Costs for storing and accessing data: how the representation affects the speed to access, insert and delete the records.
The design process is an ‘implement-test-improve’ process:
Step 1: analyse the database to generate an initial design;
Step 2: use a test database to test the initial design;
Step 3: modify the initial design to remove bottlenecks;
Step 4: monitor the performance of the database system, and make modifications to correct any inappropriate design.
Physical Representation
The requirements of any strategy for representation:
a) It must make it possible to access all data without having to specify where tuples are stored;
b) It must be possible to apply the relational algebra operators (restrict, project, join, etc.);
c) It must be possible to display the relation as a table of values.
Example of file representation for data
P_No WareHouse Bin_No Quantity
P1 WH1 B1 100
P1 WH1 B3 200
P4 WH3 B2 3000
P2 WH4 B9 50
P5 WH4 B10 50
P5 WH4 B11 50
Stored as a file: the file is made up of pages, and each page holds a number of records.
P1 WH1 B1 100 | P1 WH1 B3 200 | P4 WH3 B2 3000 | P2 WH4 B9 50 | P5 WH4 B10 50 | P5 WH4 B11 50
File Structures and Access Techniques
1. Heap file, serial search;
2. Sorted file, binary search;
3. Hash file, hashing function;
4. Index: B-tree;
5. Clustering: attribute grouping.
Heap Files
The simplest file structure.
A heap file is constructed as a list of pages.
A new record is inserted in the last page of the file.
Advantages of using heap files:
1) Fast record insertion: a new record is simply inserted in the last page.
2) Economical use of storage: only the data records are stored.
Disadvantages of using heap files:
1) Only serial search can be used, which is the slowest search method;
2) The space of deleted records cannot be reclaimed.
When to use heap files:
1) when records are to be inserted in batches;
2) when the file is only a few pages long;
3) when the file is used as part of some other structure.
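The heap file behaviour described above can be sketched in Python. This is only an illustration, not how any particular RDBMS stores data: the class name HeapFile, the page_capacity parameter and the page size of 3 are assumptions made for the example. The records are the stock tuples from the earlier file representation example.

```python
class HeapFile:
    """A heap file: a list of pages; new records always go on the last page."""

    def __init__(self, page_capacity=3):  # page_capacity is a hypothetical page size
        self.page_capacity = page_capacity
        self.pages = [[]]

    def insert(self, record):
        # Fast insertion: only the last page is touched.
        if len(self.pages[-1]) == self.page_capacity:
            self.pages.append([])
        self.pages[-1].append(record)

    def serial_search(self, predicate):
        # Serial search: every page is read, from the start of the file.
        matches, pages_read = [], 0
        for page in self.pages:
            pages_read += 1
            matches.extend(r for r in page if predicate(r))
        return matches, pages_read


stock = HeapFile(page_capacity=3)
for rec in [("P1", "WH1", "B1", 100), ("P1", "WH1", "B3", 200),
            ("P4", "WH3", "B2", 3000), ("P2", "WH4", "B9", 50),
            ("P5", "WH4", "B10", 50), ("P5", "WH4", "B11", 50)]:
    stock.insert(rec)

# Find stock of part P1 with quantity above 100: all pages must be read.
found, pages_read = stock.serial_search(lambda r: r[0] == "P1" and r[3] > 100)
print(found, pages_read)
```

Even though the only matching record sits on the first page, the search cannot stop early, because a heap file gives no information about where other matching records might be.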
Access keys
SELECT *
FROM STOCK
WHERE P_NO = 'P1' AND Quantity > 100;
When a file is organized to provide direct access to records on the basis of the values of specific attributes, those attributes are called access keys.
In the query above, P_NO and Quantity are used as the access keys.
The most appropriate access keys can be selected on the basis of how tight they are.
A tight access key is one where there are relatively few tuples containing specific access key values.
If there are many such tuples, the access key is said to be loose.
Extreme examples: a primary key is an extremely tight access key; the attribute ‘SEX’ is an extremely loose access key.
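One rough way to compare how tight candidate access keys are is to count the average number of tuples per distinct key value. The Python sketch below does this; the function name tuples_per_key_value and the four sample rows are made up for illustration and are not taken from the slides.

```python
from collections import Counter


def tuples_per_key_value(rows, attribute):
    """Average number of tuples sharing one value of `attribute`.

    A result close to 1 suggests a tight access key; a large result a loose one.
    """
    counts = Counter(row[attribute] for row in rows)
    return len(rows) / len(counts)


# Hypothetical sample relation (illustration only).
customers = [
    {"C_NO": "C1", "SEX": "F"},
    {"C_NO": "C2", "SEX": "M"},
    {"C_NO": "C3", "SEX": "F"},
    {"C_NO": "C4", "SEX": "M"},
]

tight = tuples_per_key_value(customers, "C_NO")  # primary key: one tuple per value
loose = tuples_per_key_value(customers, "SEX")   # few distinct values: many tuples each
print(tight, loose)
```

On a realistic relation the gap would be far larger: the primary key stays at one tuple per value, while the figure for SEX grows with the size of the relation.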
Sorted Files
In a sorted file, the records are stored in order of some key. Binary search can then be used to access specific tuples.
SELECT *
FROM CUSTOMER
WHERE CUSTOMER_NO = 'C9';
C1 C2 C3 C5 C7 C9 C12 C15 C19
When to use sorted files: when tuples are normally accessed in some specific sequence.
The main issue:
How to maintain the sequence when new records are inserted.
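Binary search over the sorted customer numbers, and the cost of keeping the sequence on insertion, can be sketched in Python as follows. As an assumption for the example, the keys are held as the numeric part of the customer number (9 for C9), since plain string comparison would place 'C12' before 'C9'. The function name binary_search is illustrative.

```python
import bisect

# Numeric parts of C1 C2 C3 C5 C7 C9 C12 C15 C19, in sorted order.
customer_nos = [1, 2, 3, 5, 7, 9, 12, 15, 19]


def binary_search(sorted_keys, key):
    """Return (position, probes); position is None if the key is absent."""
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == key:
            return mid, probes
        if sorted_keys[mid] < key:
            lo = mid + 1   # key can only be in the upper half
        else:
            hi = mid - 1   # key can only be in the lower half
    return None, probes


position, probes = binary_search(customer_nos, 9)  # WHERE CUSTOMER_NO = 'C9'
print(position, probes)

# Maintaining the sequence: a new key must go in its sorted position,
# which means every later record has to be shifted along.
bisect.insort(customer_nos, 10)
print(customer_nos)
```

C9 is found after three probes instead of the six reads a serial search would need. The insertion step is where sorted files become expensive: in a real file, shifting records means rewriting pages.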
Hash Files
Hashing is the process of calculating the location of a record (page address) from the value of an access key.
The access key is also called the hash key.
Hash files are sometimes called random files as the records appear to be randomly distributed across the file space.
Hashing potentially provides the fastest access to a record via an access key – a record may be retrieved by reading just one page.
Hash Functions
Example:
Hashing function: modulo 5, f(x) = x mod 5
Access key values: k = 12; l = 13; m = 24
<Record Key> [Hashing Function] <Home Address>
k ----- f(k) ---- 2
l ----- f(l) ---- 3
m ----- f(m) ---- 4
File pages (home addresses): 0, 1, 2, 3, 4
Record(k) is stored in the page at home address 2.
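The modulo-5 example can be sketched in Python. The names home_address, store and fetch are illustrative, and letting each page hold a list of records is a simplification assumed for the sketch.

```python
NUM_PAGES = 5  # page addresses 0..4, as in the modulo-5 example


def home_address(key):
    """Hash function: the page address is the key modulo the number of pages."""
    return key % NUM_PAGES


pages = [[] for _ in range(NUM_PAGES)]


def store(key, record):
    pages[home_address(key)].append((key, record))


def fetch(key):
    # Only the page at the home address is read.
    return [rec for k, rec in pages[home_address(key)] if k == key]


store(12, "Record(k)")
store(13, "Record(l)")
store(24, "Record(m)")

addresses = [home_address(x) for x in (12, 13, 24)]
print(addresses)
print(fetch(12))
```

Retrieval reads a single page whatever the size of the file, which is the source of hashing's speed. It also shows the limitation: the function needs the complete key value, so it cannot help with ranges or partial matches.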
When to use Hash Files
When retrieval is always on the basis of the value of a single access key.
Situations in which hash files are inappropriate:
1) When retrieval is on the basis of pattern matching;
2) When retrieval is on the basis of a range of values;
3) When access is on the basis of the value of only PART of the access key.
Indexes
Indexes are an alternative to hashing as a mechanism for direct access to records.
An index is a table of access key values, each paired with the address of a record that holds that value.
P_No_Index
P_No Address
P1 *
P1 **
P2 ***
P4 ****
P5 *****
P5 ******
P6 *******
Data Records
P_No Warehouse Bin_No Quantity
P1 WH1 B1 100
P1 WH1 B3 200
P4 WH3 B2 3000
P2 WH4 B9 50
P5 WH4 B10 50
P5 WH4 B11 50
P6 WH5 B1 20
Advantages of using indexes:
1) Indexes provide access to sequences of records;
2) Indexes may be used to implement many access keys for one relation, whereas there can be only one hash key.
Indexes are managed with the SQL statements:
CREATE INDEX
DROP INDEX
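Both advantages can be illustrated with a small Python sketch in which an index is a sorted list of (key value, record address) pairs. The record address is taken to be the record's position in the data file, and the names build_index, lookup and lookup_range are assumptions made for the example, not part of any SQL system.

```python
import bisect

# Data records in heap (unsorted) order; the list position acts as the address.
records = [
    ("P1", "WH1", "B1", 100),
    ("P1", "WH1", "B3", 200),
    ("P4", "WH3", "B2", 3000),
    ("P2", "WH4", "B9", 50),
    ("P5", "WH4", "B10", 50),
    ("P5", "WH4", "B11", 50),
    ("P6", "WH5", "B1", 20),
]


def build_index(records, column):
    """An index: (key value, record address) pairs sorted by key value."""
    return sorted((rec[column], addr) for addr, rec in enumerate(records))


def lookup_range(index, low, high):
    """Addresses of all records whose key lies between low and high inclusive."""
    keys = [k for k, _ in index]
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [addr for _, addr in index[start:end]]


def lookup(index, key):
    return lookup_range(index, key, key)


# Two access keys on the same relation, each with its own index.
p_no_index = build_index(records, 0)
quantity_index = build_index(records, 3)

print(lookup(p_no_index, "P5"))
print(lookup_range(p_no_index, "P2", "P5"))
print(lookup_range(quantity_index, 100, 3000))
```

Because the index entries are kept in key order, a range of keys is a contiguous slice of the index, something a hash file cannot offer. The data records themselves are never reordered, which is why several indexes can coexist on one relation.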
Multi-level Indexes
The shorter the index, the faster the search. When an index is large the search time can become significant.
A solution to this problem is to split the index up into a number of shorter indexes and to provide an index to the indexes.
1st_Index
P1 *
P5 **
2nd_Index
P1 *
P2 **
P5 *
P6 **
Data Records
P1 WH1 B1 100
P1 WH1 B3 200
P4 WH3 B2 3000
P2 WH4 B9 50
P5 WH4 B10 50
P5 WH4 B11 50
P6 WH5 B1 20
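A two-level index of this kind can be sketched in Python: the sorted key values are split into short blocks, and a top-level index holds the first key of each block. The function names and the block size of 2 are assumptions chosen so that the blocks match the P1/P2 and P5/P6 grouping shown above.

```python
import bisect


def build_two_level(sorted_keys, block_size):
    """Split a sorted index into short blocks plus an index to those blocks."""
    blocks = [sorted_keys[i:i + block_size]
              for i in range(0, len(sorted_keys), block_size)]
    top = [block[0] for block in blocks]  # first key of each block
    return top, blocks


def search(top, blocks, key):
    """Return (block number, position in block), or None if the key is absent."""
    b = bisect.bisect_right(top, key) - 1  # last block whose first key <= key
    if b < 0:
        return None
    block = blocks[b]
    i = bisect.bisect_left(block, key)
    if i < len(block) and block[i] == key:
        return b, i
    return None


top, blocks = build_two_level(["P1", "P2", "P5", "P6"], block_size=2)
print(top, blocks)
print(search(top, blocks, "P5"))
print(search(top, blocks, "P4"))
```

Each search examines the short top-level index and then exactly one short block, rather than the whole index. Repeating the same split on the top level gives further levels when the index is very large.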
B-trees
A B-tree is a type of multi-level full index.
B-trees are widely used because they are largely self-maintaining.
A B-tree keeps itself balanced such that it always takes approximately the same time to access any data record.
Example:
C_No Name Area
C1 Nippers Ltd W Yorks
C2 Tot-Gear Middl
C9 Kid-Naps Middl
C10 Boys Hats London
C11 Play Time London
C15 School Kit London
C19 Smart Kids Anglia
C23 Bed Socks London
C25 Slugs Anglia
C27 Kids Stuff Middl
C32 Play Ground Middl
C34 Way In W Yorks
Root:     * C19 *
Level 2:  * C9 * C11 *     * C25 * C32 *
Leaves:   * C1 * C2 *   * C9 * C10 *   * C11 * C15 *   * C19 * C23 *   * C25 * C27 *   * C32 * C34 *
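Searching the example tree can be sketched in Python. The sketch covers lookup only (insertion and node splitting are not shown). It assumes that the keys are the numeric parts of the customer numbers, that every key appears in a leaf, and that the last leaf holds C32 and C34, as the customer table suggests. The Node class and search function are illustrative names.

```python
import bisect


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty list means this is a leaf


def search(node, key):
    """Descend from the root to a leaf; return (found, nodes visited)."""
    visited = 1
    while node.children:
        # Keys below keys[0] go to child 0, keys from keys[0] up to keys[1] to child 1, ...
        node = node.children[bisect.bisect_right(node.keys, key)]
        visited += 1
    return key in node.keys, visited


leaves = [Node([1, 2]), Node([9, 10]), Node([11, 15]),
          Node([19, 23]), Node([25, 27]), Node([32, 34])]
root = Node([19], [Node([9, 11], leaves[:3]),
                   Node([25, 32], leaves[3:])])

print(search(root, 9))
print(search(root, 34))
print(search(root, 20))
```

Every lookup, successful or not, visits three nodes, because all leaves are at the same depth. That is the balance property that makes access time roughly the same for any record.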
When to use indexes and B-trees
Indexes are suitable for the following situations:
1) Retrieval based on pattern matching;
2) Retrieval over a range of access key values;
3) Retrieval based on a multi-attribute access key.
A B-tree is more suitable in the following situations:
1) When the relation is frequently updated;
2) When the relation is so large that the index would be costly to recreate;
3) When the access key values must be kept in sorted order.
Clustering
Clustering is the technique of storing related records physically close together.
The advantage of doing this is that it reduces the number of page accesses necessary to process a group of related records.
When to use clustering:
When an application accesses groups of tuples which have some common attribute value.
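The saving in page accesses can be illustrated with a small Python sketch that uses the customer relation from the B-tree example. The page size of two records per page and the function name pages_touched are assumptions made for the illustration.

```python
customers = [
    ("C1", "W Yorks"), ("C2", "Middl"), ("C9", "Middl"), ("C10", "London"),
    ("C11", "London"), ("C15", "London"), ("C19", "Anglia"), ("C23", "London"),
    ("C25", "Anglia"), ("C27", "Middl"), ("C32", "Middl"), ("C34", "W Yorks"),
]

PAGE_SIZE = 2  # hypothetical: two records per page


def pages_touched(records, area):
    """Number of distinct pages that hold at least one record for `area`."""
    return len({i // PAGE_SIZE for i, rec in enumerate(records) if rec[1] == area})


# Clustered on Area: records with the same area are stored next to each other.
clustered = sorted(customers, key=lambda rec: rec[1])

unclustered_cost = pages_touched(customers, "Middl")
clustered_cost = pages_touched(clustered, "Middl")
print(unclustered_cost, clustered_cost)
```

With the records stored in customer number order, the four 'Middl' customers are spread over four pages; once clustered on Area they occupy two. The benefit applies only to the attribute the file is clustered on.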
Summary
Physical database design is the process of deciding how to store relations;
The decision is based on the logical database design, the volume and volatility of data, the ways to manipulate the data, and the costs of representation.
A physical database design is produced by an ‘implement-test-improve’ process.
Typical methods: heap file, sorted files, hash files, and indexes (like B-tree).
A heap file is a chain of pages where new records are simply added to the end.
Other file organizations make it possible to access records speedily, given the value of some access key.
Sorting a file can improve access speed: binary search may be used to access records with a specific value of the access key.
Hashing is the process of locating a record by calculating its address from the access key value.
An index is a table of access key values and associated record addresses. Indexes are slower than hashing but more flexible.
The B-tree is a typical index method.
Clustering is the technique of storing related records physically close together.