Module 1.1
So far we have studied the DBMS at the level of the logical model. The logical model of a database system is the correct level for database users to focus on, because the goal of a database system is to simplify and facilitate access to data. As members of the development staff and as potential database administrators, however, we need to understand the physical level better than a typical user does.
Overview of Physical Storage Media

Storage media are classified by speed of access, by cost per unit of data, and by reliability. Unfortunately, as speed and cost go up, reliability goes down.
1. Cache is the fastest and most costly form of storage. The type of cache referred to here is the type typically built into the CPU chip, usually 256KB, 512KB, or 1MB. This cache is used by the operating system and hardware and has no direct application to databases, per se.
2. Main memory is the volatile memory in the computer system that is used to hold programs and data. While prices have been dropping at a staggering rate, demand for memory has been growing even faster. Today's 32-bit computers have a limit of 4GB of memory. This may not be sufficient to hold the entire database and all the associated programs, but the more memory available, the better the response time of the DBMS. There are attempts under way to build cost-effective systems with as much main memory as possible, and to reduce the functionality of the operating system so that only the DBMS is supported, in order to improve system response. However, the contents of main memory are lost if a power failure or system crash occurs.
3. Flash memory is also referred to as electrically erasable programmable read-only memory (EEPROM). Since it is small (1 to 32 GB) and expensive, it has little or no application to the DBMS.
4. Magnetic-disk storage is the primary medium for long-term on-line storage today. Prices have been dropping significantly with a corresponding increase in capacity. New disks today are in excess of 1TB. Unfortunately, demand has been increasing, and the volume of data has been growing even faster, so organizations using a DBMS are always trying to keep up with the demand for storage. This medium is the most cost-effective for on-line storage of large databases.
5. Optical storage is very popular, especially CD-ROM systems. It is limited to read-only data, can be reproduced at very low cost, and is expected to grow in popularity, especially as a replacement for written manuals. Recently, a new optical format, the digital video disk (DVD), has become standard. These disks hold between 4.7 and 17 GB of data.
6. Magnetic-tape storage is used for backup and archival data. It is cheaper and slower than all of the other forms, but there is no limit on the amount of data that can be stored, since more tapes can simply be purchased. As tape capacities increase, however, restoration of data takes longer and longer, especially when only a small amount of data is to be restored, because retrieval is sequential, the slowest possible access method. The 8mm tape drive has the highest density; a 350-foot tape can store 5GB of data.
Storage Strategies
Disks are actually relatively simple. There is normally a collection of platters on a spindle. Each platter is coated with a magnetic material on both sides, and the data is stored on the surfaces. There is a read-write head for each surface, mounted on an arm assembly that moves back and forth. A motor spins the platters at a high constant speed (60, 90, or 120 revolutions per second).
The surface is divided into a set of tracks (circles). These tracks are divided into a set of sectors, the smallest unit of data that can be written or read at one time. Sectors range in size from 32 bytes to 4096 bytes, with 512 bytes being the most common. The set of a specific track from both surfaces of all the platters is called a cylinder.
Platters can range in size from 1.8 inches to 14 inches. Today, 5 1/4 inches and 3 1/2 inches are the most common, because they offer the best combination of low seek time and low cost.
A disk controller interfaces between the computer system and the actual hardware of the disk drive. The controller accepts high-level commands to read or write sectors and converts them into the necessary specific low-level commands. The controller also attempts to protect the integrity of the data by computing and storing a checksum for each sector. When reading the data back, the controller recalculates the checksum and makes several attempts to correctly read the data and get matching checksums. If the controller is unsuccessful, it notifies the operating system of the failure.
The controller can also handle bad sectors. Should a sector go bad, the controller logically remaps it to one of the extra unused sectors that disk vendors provide, so that the reliability of the disk system is higher. It is cheaper to produce disks with more sectors than advertised and then map out bad sectors than to produce disks with no bad sectors or an extremely low chance of sectors going bad.
Magnetic Disks
One other characteristic of disks that affects performance is the distance from the read-write head to the surface of the platter, often measured in microns. The smaller this gap, the smaller the area in which data can be written, so tracks can be closer together and the disk has a greater capacity. However, a smaller gap also increases the chance of the head touching the surface. When the head touches the surface while it is spinning at high speed, the result is called a "head crash", which scratches the surface and damages the head. The bottom line is that the disk must then be replaced.
Storage Access

Seek time is the time to reposition the head; it increases with the distance the head must move. Seek times range from 2 to 30 milliseconds. The average seek time is the average over all seeks and is normally one-third of the worst-case seek time.
Rotational latency time is the time from when the head is over the correct track until the desired data rotates around and passes under the head. At 120 rotations per second, one rotation takes about 8.33 milliseconds. The average rotational latency is one-half of the rotation time.
Access time is the time from when a read or write request is issued to when the data transfer begins. It is the sum of the seek time and latency time.
Data-transfer rate is the rate at which data can be retrieved from the disk and sent to the controller, measured in megabytes per second.
Mean time to failure is the number of hours (on average) until a disk fails. Typical times today range from 30,000 to 800,000 hours (or 3.4 to 91 years).
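The access-time metrics above combine by simple arithmetic. A minimal sketch, assuming an 8 ms average seek time and a 120 rev/s spindle (illustrative figures, not measurements of any particular drive):

```python
def average_access_time_ms(avg_seek_ms, revs_per_second):
    """Access time = seek time + rotational latency.

    Average rotational latency is half of one full rotation.
    """
    rotation_time_ms = 1000.0 / revs_per_second
    avg_latency_ms = rotation_time_ms / 2.0
    return avg_seek_ms + avg_latency_ms

# At 120 rev/s one rotation takes ~8.33 ms, so average latency is ~4.17 ms
# and total average access time is ~12.17 ms.
print(average_access_time_ms(8.0, 120.0))
```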
Magnetic Disks
Redundant Array of Independent (or Inexpensive) Disks (RAID) is a category of disk organization that employs two or more drives in combination for fault tolerance and performance. RAID is used frequently on servers but is not generally necessary for personal computers. RAID allows you to store the same data redundantly (in multiple places) in a balanced way to improve overall performance.
There are a number of different RAID levels:

Level 0 – Striped disk array without fault tolerance: Provides data striping (spreading out the blocks of each file across multiple disk drives) but no redundancy. This improves performance but does not deliver fault tolerance: if one drive fails, all data in the array is lost.
Level 1 – Mirroring and duplexing (provides disk mirroring): Level 1 provides twice the read transaction rate of single disks.
Level 2 – Error-Correcting Coding: Not a typical implementation and rarely used. It stripes data at the bit level rather than the block level.
Level 3 – Byte-level Distribution, Single Parity Drive: Provides byte-level striping with a dedicated parity disk. Level 3, which cannot service multiple simultaneous requests, is also rarely used.
Level 4 – Block-level Distribution, Single Parity Drive: A commonly used implementation of RAID, level 4 provides block-level striping (like Level 0) with a parity disk. If a data disk fails, the parity data is used to create a replacement disk. A disadvantage to Level 4 is that the parity disk can create write bottlenecks.
RAID
Level 5 – Block-level Distribution, Distributed Parity: Provides data striping at the block level and also stripes the error-correction (parity) information across all disks. This results in excellent performance and good fault tolerance. Level 5 is one of the most popular implementations of RAID.
Level 6 – Independent Data Disks with Double Parity: Provides block-level striping with parity data distributed across all disks.
Level 0+1 – A Mirror of Stripes: Not one of the original RAID levels. Two RAID 0 stripes are created, and a RAID 1 mirror is created over them. Used for both replicating and sharing data among disks.
Level 10 – A Stripe of Mirrors: Not one of the original RAID levels, multiple RAID 1 mirrors are created, and a RAID 0 stripe is created over these.
Level 7 – A trademark of Storage Computer Corporation that adds caching to Levels 3 or 4.
RAID S – (also called Parity RAID) EMC Corporation's proprietary striped-parity RAID system used in its Symmetrix storage systems.
Need for RAID

An array of multiple disks accessed in parallel gives greater throughput than a single disk, and redundant data on multiple disks provides fault tolerance.
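The fault-tolerance idea behind the parity-based levels (3 through 6) can be made concrete: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A minimal sketch, with made-up block contents for illustration:

```python
def parity(blocks):
    """XOR data blocks together byte by byte to form the parity block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def rebuild(surviving_blocks, parity_block):
    """Recover one failed block: XOR the parity block with the survivors."""
    return parity(surviving_blocks + [parity_block])

stripe = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
p = parity(stripe)                     # parity stored on a fourth disk
# Simulate losing the second disk, then rebuild its block:
assert rebuild([stripe[0], stripe[2]], p) == b"BBBB"
```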
RAID
An index is a small table having only two columns. The first column contains a copy of the primary or candidate key of a table and the second column contains a set of pointers holding the address of the disk block where that particular key value can be found.
The advantage of an index lies in the fact that it makes search operations very fast. Suppose a table has many rows of data, each row 20 bytes wide. To search for record number 100, the management system must read each and every row in turn, and will have read 99 x 20 = 1980 bytes before it finds record number 100. With an index, the management system searches not the table but the index. The index, containing only two columns, may be just 4 bytes wide in each of its rows. After reading only 99 x 4 = 396 bytes of data from the index, the management system finds the entry for record number 100, reads the address of the disk block where the record is stored, and goes directly to the record on the physical storage device. The result is much quicker access to the record (a speed advantage of 1980:396).
The only minor disadvantage of an index is that it takes up additional space beyond the main table. The index also needs to be updated whenever records are inserted into or deleted from the main table. However, the advantages are so large that these disadvantages can be considered negligible.
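The record-100 arithmetic can be written out directly; the 20-byte rows and 4-byte index entries follow the example above:

```python
ROW_WIDTH = 20     # bytes per row in the main table
ENTRY_WIDTH = 4    # bytes per entry in the index

def bytes_scanned(record_number, width):
    """Bytes read before reaching the target in a sequential scan."""
    return (record_number - 1) * width

table_cost = bytes_scanned(100, ROW_WIDTH)    # 99 * 20 = 1980 bytes
index_cost = bytes_scanned(100, ENTRY_WIDTH)  # 99 * 4  = 396 bytes
assert table_cost == 1980
assert index_cost == 396                      # the 1980:396 advantage
```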
Index
In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. The search key of a primary index is usually, but not necessarily, the primary key.

Secondary index: an index whose search key specifies an order different from the sequential order of the file.

Index-sequential file: an ordered sequential file with a primary index.
Ordered Indices
Types of Index
In a primary index, there is a one-to-one relationship between the entries in the index table and the records in the main table. A primary index can be of two types:
Dense primary index: the number of entries in the index table is the same as the number of entries in the main table. In other words, each and every record in the main table has an entry in the index.
Primary Index
Sparse or Non-Dense Primary Index: For large tables, the dense primary index itself begins to grow in size. To keep the index smaller, instead of pointing to each and every record in the main table, the index points to records in the main table at intervals (gaps).
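A sparse-index lookup can be sketched as a binary search for the last index entry whose key is less than or equal to the search key, followed by a scan within the block it points to. The keys and block names below are made up for illustration:

```python
import bisect

# One index entry per block: (lowest key in the block, block address).
sparse_index = [(10, "block0"), (50, "block1"), (90, "block2")]

def locate_block(search_key):
    """Return the block that could hold search_key, or None."""
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, search_key) - 1
    if pos < 0:
        return None          # key is smaller than every indexed key
    return sparse_index[pos][1]

assert locate_block(57) == "block1"   # falls back to the preceding entry
assert locate_block(90) == "block2"
assert locate_block(5) is None
```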
Primary Index
Sometimes we are asked to create an index on a non-unique key, such as Dept-id; there could be several employees in each department. Here we use a clustering index: all employees belonging to the same Dept-id are considered to be within a single cluster, and the index pointers point to the cluster as a whole.
Clustering Index
The previous scheme can become confusing because one disk block might be shared by records belonging to different clusters. A better scheme is to use separate disk blocks for separate clusters.
Clustering Index
When an index is created, the index table is generally kept in primary memory (RAM) while the main table, because of its size, is kept in secondary memory (hard disk). A table may contain millions of records (like the telephone directory of a large city), for which even a sparse index becomes so large that we cannot keep it in primary memory; and if we cannot keep the index in primary memory, we lose the speed advantage. For very large tables, it is better to organize the index in multiple levels.

Multi-Level Index
We can also use tree-like structures as indexes. For example, a binary search tree can be used as an index: to find a particular record, we have the added advantage of the binary search procedure, which makes searching even faster. A binary tree can be considered a 2-way search tree, because it has two pointers in each node and can therefore guide the search in two distinct directions. For every node storing 2 pointers, the number of values stored is one less than the number of pointers, i.e., each node contains 1 value.
Index in a Tree-like Structure
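A minimal sketch of such a 2-way search tree used as an index, storing one key and one record pointer per node (keys and block addresses are illustrative):

```python
class Node:
    """One index entry: a key plus a pointer to the record's disk block."""
    def __init__(self, key, record_ptr):
        self.key, self.record_ptr = key, record_ptr
        self.left = self.right = None

def insert(root, key, record_ptr):
    if root is None:
        return Node(key, record_ptr)
    if key < root.key:
        root.left = insert(root.left, key, record_ptr)
    else:
        root.right = insert(root.right, key, record_ptr)
    return root

def search(root, key):
    """Follow the binary-search rule down the tree."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return None if root is None else root.record_ptr

root = None
for k, ptr in [(50, "blk3"), (20, "blk1"), (70, "blk5")]:
    root = insert(root, k, ptr)
assert search(root, 70) == "blk5"
assert search(root, 99) is None
```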
B-Tree Index

The standard index in relational databases is the B-Tree index. It allows rapid searching by traversal of an upside-down tree structure. Reading a single record from a very large table using a B-Tree index often requires only a few block reads, even when the index and table are millions of blocks in size.

Any index structure other than a B-Tree index is subject to overflow. Overflow means that changes made to tables do not have their records added into the original index structure, but are instead tacked on at the end.
What is a B-Tree?

A B-tree is a specialized multiway tree designed especially for use on disk. A B-tree consists of a root node, branch nodes, and leaf nodes, with the indexed field values contained in the ending (or leaf) nodes of the tree.
B-Tree Characteristics

In a B-tree, each node may contain a large number of keys. A B-tree is designed to branch out in a large number of directions and to hold many keys in each node, so that the height of the tree is relatively small. The tree is constrained to always be balanced, and space wasted by deletion, if any, never becomes excessive. Insertions and deletions are simple processes, complicated only under special circumstances: insertion into a node that is already full, or a deletion that makes a node less than half full.
Characteristics of a B-Tree of Order P
Within each node, K1 < K2 < ... < Kp–1.
Each node has at most p tree pointers.
Each node, except the root and leaf nodes, has at least ceil(p/2) tree pointers. The root node has at least two tree pointers unless it is the only node in the tree.
All leaf nodes are at the same level. Leaf nodes have the same structure as internal nodes, except that all of their tree pointers Pi are null.
B-Tree Insertion

1) A B-tree starts with a single root node (which is also a leaf node) at level 0.
2) Once the root node is full with p – 1 search-key values and we attempt to insert another entry, the root node splits into two nodes at level 1.
3) Only the middle value is kept in the root node; the rest of the values are split evenly between the other two nodes.
4) When a non-root node is full and a new entry is inserted into it, that node is split into two nodes at the same level, and the middle entry is moved to the parent node along with two pointers to the new split nodes.
5) If the parent node is full, it is also split.
6) Splitting can propagate all the way to the root node, creating a new level if the root is split.
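Steps 2 through 4 hinge on one operation: splitting an overfull node. A sketch, assuming order p = 5, so a node overflows when a fifth key arrives; the middle key moves up and the remainder split evenly:

```python
P = 5  # illustrative order: at most P children and P - 1 keys per node

def split(keys):
    """Split an overfull sorted key list into (left, middle, right)."""
    assert len(keys) == P          # node held p - 1 keys, one more arrived
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting H into a node already holding A, C, G, N:
left, middle, right = split(["A", "C", "G", "H", "N"])
assert middle == "G"               # G moves up to the parent
assert left == ["A", "C"]
assert right == ["H", "N"]
```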
B-Tree Deletion

1) If deletion of a value causes a node to become less than half full, it is combined with its neighboring nodes; this can also propagate all the way to the root and can reduce the number of tree levels.

Analysis and simulation show that, after numerous random insertions and deletions on a B-tree, the nodes are approximately 69 percent full once the number of values in the tree stabilizes. When this is the case, node splitting and combining occur only rarely, so insertion and deletion become quite efficient.
B-tree of Order 5 Example
All internal nodes other than the root have at least ceil(5/2) = ceil(2.5) = 3 children (and hence at least 2 keys).
The maximum number of children a node can have is 5 (so 4 is the maximum number of keys).
Each leaf node must contain at least 2 keys.
B-Tree Order 5 Insertion
Originally we have an empty B-tree of order 5, and we want to insert C N G A H E K Q M F W L T Z D P R X Y S. Order 5 means that a node can have a maximum of 5 children and 4 keys, and all nodes other than the root must have a minimum of 2 keys. The first 4 letters are inserted into the same node.
B-Tree Order 5 Insertion Cont.
When we try to insert the H, we find no room in this node, so we split it into 2 nodes, moving the median item G up into a new root node.
B-Tree Order 5 Insertion Cont.
Inserting E, K, and Q proceeds without requiring any splits
B-Tree Order 5 Insertion Cont.
Inserting M requires a split
B-Tree Order 5 Insertion Cont.
The letters F, W, L, and T are then added without needing any split
B-Tree Order 5 Insertion Cont.
When Z is added, the rightmost leaf must be split. The median item T is moved up into the parent node
B-Tree Order 5 Insertion Cont.
The insertion of D causes the leftmost leaf to be split. D happens to be the median key and so is the one moved up into the parent node.
The letters P, R, X, and Y are then added without any need of splitting
B-Tree Order 5 Insertion Cont.

Finally, when S is added, the node with N, P, Q, and R splits, sending the median Q up to the parent. The parent node is then full, so it splits too, sending the median M up to form a new root node.
B-Tree Order 5 Deletion

Initial B-Tree
B-Tree Order 5 Deletion Cont.

Delete H. Since H is in a leaf and the leaf has more than the minimum number of keys, we simply remove it.
B-Tree Order 5 Deletion Cont.

Delete T. Since T is not in a leaf, we find its successor (the next item in ascending order), which happens to be W, and move W up to replace T. That way, what we really have to do is delete W from the leaf.
B+-Tree Characteristics

Data records are stored only in the leaves; internal nodes store just keys. The keys are used to direct a search to the proper leaf:
If a target key is less than a key in an internal node, the pointer just to its left is followed.
If a target key is greater than or equal to the key in the internal node, the pointer to its right is followed.
The B+-tree combines features of ISAM (Indexed Sequential Access Method) and B-trees.
Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created, and periodic reorganization of the entire file is required.

Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions, so reorganization of the entire file is not required to maintain performance.

(Minor) disadvantage of B+-trees: extra insertion and deletion overhead, and space overhead. The advantages of B+-trees outweigh the disadvantages, and B+-trees are used extensively.

B+-Tree Index Files
Typical node:

The Ki are the search-key values.
The Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search keys in a node are ordered: K1 < K2 < K3 < ... < Kn–1.
Usually the size of a node is that of a disk block.
B+-Tree Node Structure
Example of a B+-tree
All paths from the root to a leaf are of the same length.
Each node that is neither a root nor a leaf has between ceil(n/2) and n children.
A leaf node has between ceil((n–1)/2) and n–1 values.
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and n–1 values.
B+-Tree Index File (Properties)
For i = 1, 2, . . ., n–1, pointer Pi points either to a file record with search-key value Ki or to a bucket of pointers to file records, each record having search-key value Ki. The bucket structure is needed only if the search key does not form a primary key.
If Li and Lj are leaf nodes and i < j, then Li's search-key values are less than Lj's search-key values.
Pn points to the next leaf node in search-key order.
Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
All the search keys in the subtree to which P1 points are less than K1.
For 2 ≤ i ≤ m – 1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki.
All the search keys in the subtree to which Pm points have values greater than or equal to Km–1.
Non-Leaf Nodes in B+-Trees
Leaf nodes must have between 2 and 4 values (ceil((n–1)/2) and n–1, with n = 5).
Non-leaf nodes other than the root must have between 3 and 5 children (ceil(n/2) and n, with n = 5).
The root must have at least 2 children.
Example of B+-tree
B+-tree Insertion
When splitting a leaf, the lowest value in the right part gets inserted into the parent; the value also stays in the leaf.
When splitting an internal node, the lowest value in the right part gets inserted into the parent; the value is removed from the right part.
Trick: to avoid cascading splits, simply redistribute values among neighboring leaves when possible.
B+-tree Insertion
B+-tree Deletion
The operation is the same on leaves and internal nodes, but we need to think a bit about the correct value for the 'middle' separating key.
B+-tree Deletion
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
The hash function is used to locate records for access, insertion, and deletion.
Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
Static Hashing
Hash file organization of account file, using branch_name as key
There are 10 buckets. The binary representation of the ith letter of the alphabet is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10. E.g.:
h(Perryridge) = 5
h(Round Hill) = 3
h(Brighton) = 3
Example of Hash File Organization
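The example hash function above can be written out directly: map each letter to its position in the alphabet, sum, and take the result modulo the number of buckets. Skipping non-letters (the space in "Round Hill") is an assumption needed to reproduce the figures above:

```python
def h(key, buckets=10):
    """Sum the alphabet positions of the letters in key, mod buckets."""
    total = sum(ord(c) - ord('a') + 1 for c in key.lower() if c.isalpha())
    return total % buckets

assert h("Perryridge") == 5
assert h("Round Hill") == 3
assert h("Brighton") == 3
```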
Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the search-key.
For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets returned.
Hash Functions
Bucket overflow can occur because of:
insufficient buckets, or
skew in the distribution of records. Skew can occur for two reasons:
multiple records have the same search-key value, or
the chosen hash function produces a non-uniform distribution of key values.
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
Handling of Bucket Overflows
Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
The above scheme is called closed hashing. An alternative, called open hashing, does not use overflow buckets and is not suitable for database applications. In open hashing:
The number of buckets is fixed.
Overflow is handled by using the next bucket in cyclic order that has space; this is known as linear probing.
Handling of Bucket Overflows (Cont.)
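Linear probing, as described above, can be sketched with a fixed table and integer keys; the table size of 5 and the simple key-mod-buckets hash are illustrative:

```python
BUCKETS = 5
table = [None] * BUCKETS

def insert(key):
    """Place key in its home bucket, or the next free bucket in cyclic order."""
    home = key % BUCKETS
    for i in range(BUCKETS):
        probe = (home + i) % BUCKETS
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("table full")

assert insert(7) == 2    # 7 % 5 = 2, bucket 2 is free
assert insert(12) == 3   # 12 % 5 = 2 is taken, so probe bucket 3
assert insert(3) == 4    # 3 % 5 = 3 is taken, so probe bucket 4
```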
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file structure
Hash Indices
In static hashing, the function h maps search-key values to a fixed set B of bucket addresses, but databases grow or shrink with time.
If the initial number of buckets is too small and the file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull)
If database shrinks, again space will be wasted.
One solution: periodic re-organization of the file with a new hash function
Expensive, disrupts normal operations
Better solution: allow the number of buckets to be modified dynamically.
Deficiencies of Static Hashing
Dynamic hashing is good for databases that grow and shrink in size; it allows the hash function to be modified dynamically. Examples: extendable/extensible hashing and linear hashing.
Dynamic Hashing