Copyright © 2017 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 16
Disk Storage, Basic File Structures, Hashing, and Modern Storage Architectures
Chapter Outline
Secondary Storage Devices
Buffering of Blocks
Placing File Records on Disk
Operations on Files
Files of Unordered Records (Heap Files)
Files of Ordered Records (Sorted Files)
Hashing Techniques
Chapter Outline (cont’d)
Other Primary File Organizations
Parallelizing Disk Access Using RAID
Technology
New Storage Systems
Introduction
The collection of data that makes up a computerized database must be stored physically on some computer storage medium.
Categories of storage media
Primary storage: can be operated on directly by the CPU; most expensive, highest speed, smallest capacity
Secondary storage: online storage of the database (magnetic disk, solid-state drive [SSD])
Tertiary storage: offline storage for archiving the database; lower cost, slower access, larger capacity
Memory hierarchies and storage devices
Primary storage level: cache memory, main memory
Secondary and tertiary storage level: flash memory, magnetic disks, SSD, CD-ROM, DVD disks, magnetic tapes
Main memory databases: the entire database is kept in main memory (e.g., telephone switching); particularly useful for real-time applications
Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer Speed), and Commodity Cost
Storage of Databases
Databases typically store large amounts of data that must persist over long periods of time (in contrast to transient data).
Databases are stored permanently on magnetic disk secondary storage.
Secondary storage is referred to as nonvolatile storage, whereas main memory is often called volatile storage.
The process of physical database design involves choosing the particular data organization that best suits the given application requirements with acceptable performance.
The data stored on disk is organized as files of records.
There are several primary file organizations, which determine how the records of a file are physically placed on the disk (e.g., heap, sorted, hashed, B-tree) and how the records can be accessed.
A secondary organization or auxiliary access structure (i.e., an index) allows more efficient access to the records of a file than the primary file organization alone provides.
Disk Storage Devices
Magnetic disks are used for storing large amounts of data. The device that holds the disks is referred to as a hard disk drive, or HDD.
The preferred secondary storage device for high storage capacity and low cost.
Data is stored as magnetized areas on magnetic disk surfaces.
A disk pack contains several magnetic disks connected to a rotating spindle.
Disks are divided into concentric circular tracks on each disk surface.
Track capacities vary, typically from tens of kilobytes to 150 Kbytes or more.
In disk packs, tracks with the same diameter on the various surfaces are called a cylinder because of the shape they would form if connected in space.
Data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders.
Figure 16.1 (a) A single-sided disk with read/write hardware. (b) A disk pack with read/write hardware.
Disk Storage Devices (cont’d)
A track is divided into smaller blocks or sectors because a track usually contains a large amount of information.
The division of a track into sectors is hard-coded on the disk surface and cannot be changed.
One type of sector organization calls a portion of a track that subtends a fixed angle at the center a sector (Figure 16.2a).
Other sector organizations are possible; one is to have the sectors subtend smaller angles at the center as one moves away from it, maintaining a uniform recording density (Figure 16.2b).
Figure 16.2 Different sector organizations on disk. (a) Sectors subtending a fixed angle. (b) Sectors maintaining a uniform recording density.
Disk Storage Parameters
The division of a track into equal-sized disk
blocks (or pages) is set by the operating system
during disk formatting (or initialization)
A track is divided into blocks
The block size B is fixed for each system
Typical block sizes range from B=512 bytes to B=8192 bytes
Blocks are separated by fixed-size interblock gaps,
which include specially coded control information
written during disk initialization
Whole blocks are transferred between disk and main
memory for processing
(Tables of disk storage parameter definitions and typical values omitted.)
Disk Storage Devices (cont’d.)
A read-write head moves to the track that contains the
block to be transferred.
Disk rotation moves the block under the read-write head for
reading or writing
Fixed-head disks (as many heads as tracks): fast but costly
Moveable-head disks
Disk controller: controls the disk drive and interfaces it to
the computer system
The disk controller accepts high-level I/O commands and takes
appropriate action to position the arm and cause read/write action
to take place
Standard interfaces used for disk drives
SCSI, SATA, SAS, NL-SAS
Disk Storage Devices (cont’d.)
Transfer of data between main memory and disk takes
place in units of disk blocks
A physical disk block (hardware) address consists of:
a cylinder number (an imaginary collection of tracks of the same radius from all recorded surfaces)
the track number (or surface number) within the cylinder
a block number within the track
Reading or writing a disk block is time consuming because of the seek time s and rotational delay (latency) rd.
Double buffering can be used to speed up the transfer of contiguous disk blocks (called a cluster).
Hardware Description of Disk Devices
The total time needed to locate and transfer an arbitrary block consists of:
Seek time: position the read/write head on the correct track
Rotational delay or latency: the beginning of the desired block rotates
into position under the read/write head
Block transfer time: transfer the data of the desired block
The seek time and rotational delay are usually much larger than the
block transfer time
It is common to transfer several blocks on the same track or cylinder. (eliminate the seek time and rotational delay except for the first block)
Bulk transfer rate: time required to transfer consecutive blocks
Locating data on disk is a major bottleneck in database applications.
It is important to minimize the number of block transfers needed to locate
and transfer the required data from disk to main memory
Placing “related information” on contiguous blocks is the basic goal of
any storage organization on disk
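As a back-of-the-envelope model of the costs above: seek time and rotational delay are paid once per contiguous run, and each block then adds one transfer time. A minimal sketch (parameter values are illustrative assumptions, not specs of any real drive):

```c
#include <assert.h>

/* Rough cost model: seek time and rotational delay are paid once,
 * then each of k contiguous blocks adds one block transfer time.
 * All times are in milliseconds; the values passed in are assumed. */
double access_time_ms(double seek_ms, double rot_delay_ms,
                      double block_transfer_ms, int k) {
    return seek_ms + rot_delay_ms + k * block_transfer_ms;
}
```

For example, with an assumed 8 ms seek, 4 ms average rotational delay, and 0.1 ms per block, fetching 100 contiguous blocks costs about 22 ms, versus roughly 1,210 ms if every block required its own seek and rotation.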
Making Data Access More Efficient on Disk
The commonly used techniques to make accessing data
more efficient on HDDs
Buffering of data: double buffer strategy
Proper organization of data on disk:
keep related data on contiguous blocks
Reading data ahead of request:
When a block is read into the buffer, blocks from the rest of the
track can also be read
Proper scheduling of I/O requests: elevator algorithm
Use of log disks to temporarily hold writes
A single disk may be assigned for logging of writes (eliminate seek time)
Use of SSDs or flash memory for recovery purposes
Writing the updates to an SSD buffer first, and then updating the data from the buffer during idle time or when the buffer becomes full
Solid State Device (SSD) Storage
The recent trend is to use flash memories (SSDs) as an
intermediate layer between main memory and HDDs
As opposed to HDD, where related data from the same
relation must be placed on contiguous blocks, preferably
on contiguous cylinders, there is no restriction on
placement of data on an SSD since any address is
directly addressable.
As a result, the data is less likely to be fragmented; hence
no reorganization is needed
Typically, when a write to disk occurs on an HDD, the same block is overwritten with new data.
In SSDs, the data is written to different NAND cells to attain wear-leveling, which prolongs the life of the SSD.
Magnetic Tape Storage Devices
Disks are random access; magnetic tapes are sequential access.
Data records on tape are also stored in blocks.
Blocks may be substantially larger than those for disks, and interblock gaps are also quite large.
Tapes serve a very important function: backing up the database.
Tapes can also be used to store excessively large database files and to archive historical records.
Tapes can also be used to store images and system libraries.
Buffering of Blocks
When several blocks need to be transferred and their
block addresses are known, several buffers can be
reserved in main memory to speed up the transfer.
Concurrency of processes (see Figure 16.3):
interleaved
parallel
Buffering is most useful when processes can run
concurrently in a parallel fashion, either because a
separate disk I/O processor is available or because
multiple CPU processors exist
Interleaved Concurrency versus Parallel Execution
Double Buffering
The CPU can start processing a block once its transfer to main memory is completed; at the same time, the disk I/O processor can be reading and transferring the next block into a different buffer.
Can also be used to write continuous stream of blocks from memory to the disk
Permits continuous reading or writing of data on consecutive disk blocks, which eliminates the seek time and rotational delay for all but the first block transfer
Data is kept ready for processing, thus reducing the waiting time in the programs
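The saving can be quantified with a toy two-stage pipeline model, a sketch under the assumption that reading and processing overlap fully with two buffers (R = time to read one block, P = time to process one block, n = number of blocks):

```c
#include <assert.h>

/* Toy timing model for block buffering (all units arbitrary). */
double single_buffer_time(double R, double P, int n) {
    return n * (R + P);                /* read and process strictly alternate */
}

double double_buffer_time(double R, double P, int n) {
    double slower = (R > P) ? R : P;
    return R + P + (n - 1) * slower;   /* two-stage pipeline: only the slower
                                          stage limits steady-state throughput */
}
```

With R = P = 1 ms per block and n = 10 blocks, single buffering takes 20 ms while double buffering takes about 11 ms.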
Double Buffering (cont’d)
(Figure illustrating the use of two buffers, A and B, over time omitted.)
Buffer Management
For most large databases, with files containing millions of pages (or blocks), it is not possible to bring all of the data into main memory at the same time.
The actual management of buffers (a part of main memory), and deciding which buffer to use for a newly read page, is a more complex process.
The buffer manager is a software component of a DBMS that responds to requests for data and decides what buffer to use and what pages to replace in the buffer to accommodate newly requested blocks.
The buffer manager views the available main memory storage as a
buffer pool
The size of the shared buffer pool is typically a parameter for DBMS
controlled by DBAs
Buffer Management (cont’d.)
Two kinds of buffer managers:
one that controls main memory directly, as in most RDBMSs
one that allocates buffers in virtual memory, which transfers control to the operating system (OS)
The overall goal of the buffer manager is twofold
To maximize the probability that the requested page is found in main memory
In case of reading a new disk block from disk, to find a page to replace that will cause the least harm in the sense that it will not be required shortly again
The buffer manager keeps two types of information on hand about each page in the buffer pool:
A pin-count: the number of times that page has been requested, or the number of current users of that page. If the count falls to 0, the page is considered unpinned. Increasing the pin-count is called pinning.
A dirty bit: initially set to 0, but set to 1 whenever the page is updated.
Buffer Management (cont’d.)
When a certain page is requested, the buffer manager takes the following actions:
It checks whether the requested page is already in a buffer in the buffer pool; if so, it increments the page's pin-count and releases the page to the requester.
If the page is not in the buffer pool, the buffer manager does the
following
It chooses a page for replacement, using the replacement policy, and
increments its pin-count
If the dirty bit of the replacement page is on, the buffer manager writes that page to disk, replacing its old copy on disk. If the dirty bit is not on, the page need not be written back.
The main memory address of the new page is passed to the requesting
application
If there is no unpinned page available in the buffer pool and the requested page is not in the buffer pool, the buffer manager may have to wait until a page is released. A transaction requesting this page may go into a wait state or may even be aborted.
Buffer Replacement Strategies
Least recently used (LRU)
Throw out the page that has not been used (read or written) for the longest time
Clock policy (a round-robin variant of the LRU)
Each buffer has a flag with a 0 or 1. Buffers with a 0 are vulnerable and may be
used for replacement. Buffers with a 1 are not vulnerable
When a block is read into a buffer or the buffer is accessed, the flag is set to 1
The clock hand is positioned on a “current buffer.” When a buffer is needed for a
new block, the hand is rotated until a buffer with a 0 is found and use that buffer for
the new block (if the dirty bit is on for the page being replaced, the page will be
written to disk first)
If the clock hand passes buffers with 1s, it sets them to 0. Thus, a block is replaced from its buffer only if it is not accessed before the hand completes a full rotation and returns to it.
First-in-first-out (FIFO)
The one that has been occupied the longest by a page is used for replacement
Most recently used (MRU)
Useful when a block that was used most recently will not be needed until all the remaining blocks in the relation are processed
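The clock policy described above can be sketched in a few lines, for a hypothetical pool of four frames (pin counts and dirty-page write-back are omitted for brevity):

```c
#include <assert.h>

#define NFRAMES 4

/* Clock (second-chance) replacement over NFRAMES buffer frames.
 * use_bit[i] is set to 1 on every access; the hand sweeps forward,
 * clearing 1s, and evicts the first frame found with use bit 0. */
static int use_bit[NFRAMES];
static int hand = 0;

void access_frame(int i) { use_bit[i] = 1; }

int choose_victim(void) {
    for (;;) {
        if (use_bit[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;             /* replace this frame */
        }
        use_bit[hand] = 0;             /* give a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```

If all four frames have been accessed, the hand clears each flag in turn and, after a full rotation, evicts frame 0, matching the rule that a buffer is replaced only if it is not accessed during one complete sweep.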
Records
A record consists of a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record.
Fixed and variable length records
Records contain fields which have values of a particular
type
E.g., amount, date, time, age
Fields themselves may be fixed length or variable length
Variable length fields can be mixed into one record:
Separator characters or length fields are needed so that the
record can be “parsed.”
A collection of field names and their corresponding data
types constitutes a record type or record format definition
Records (cont’d)
Example: an employee record defined in C language
struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};
BLOB (binary large object)
used for data items consisting of large unstructured objects
stored separately from its record in a pool of disk blocks
Blocking
Blocking:
Refers to storing a number of records in one block on
the disk.
Blocking factor (bfr) refers to the number of records per block: bfr = floor(B/R), where B is the block size and R is the record size.
There may be empty space in a block if an integral number of records does not fit exactly in one block.
The unused space in each block is B − (bfr * R) bytes.
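The two formulas can be checked directly; in C, integer division of positive values already gives the floor. Using the 71-byte fixed-length record of Figure 16.5(a) with a 512-byte block as the worked example:

```c
#include <assert.h>

/* bfr = floor(B/R); unused space per block = B - bfr*R.
 * Integer division of positive operands truncates, which is the floor. */
int blocking_factor(int B, int R) { return B / R; }
int unused_bytes(int B, int R)    { return B - (B / R) * R; }
```

With B = 512 and R = 71, the blocking factor is 7 records per block and 512 − 7·71 = 15 bytes are left unused.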
Blocking (cont’d)
Spanned records:
Refers to records that exceed the size of one block and hence span a number of blocks.
A pointer is used if the rest of the record is not stored in the next consecutive block.
If records are not allowed to cross block boundaries, the organization is called unspanned.
Types of Record Organization. (a) Unspanned. (b) Spanned.
Files of Records
A file is a sequence of records, where each
record is a collection of data values (or data
items).
A file descriptor (or file header) includes
information that describes the file, such as
the field names and their data types, and
the addresses of the file blocks on disk.
Records are stored on disk blocks.
Files of Records (cont’d)
The blocking factor bfr for a file is the (average)
number of file records stored in a disk block.
The number of blocks b needed for a file of r records is b = ceil(r/bfr) blocks, where ceil is the ceiling function.
A file can have fixed-length records or variable-
length records.
File records can be unspanned or spanned
Unspanned: no record can span two blocks
Spanned: a record can be stored in more than one block
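The ceiling can be computed with integer arithmetic alone, a common idiom sketched below (the record counts in the example are made up for illustration):

```c
#include <assert.h>

/* b = ceil(r / bfr), using only integer arithmetic. */
long blocks_needed(long r, long bfr) {
    return (r + bfr - 1) / bfr;
}
```

For instance, r = 30,000 records at bfr = 7 records per block require 4,286 blocks (4,285 full blocks plus one partially filled block).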
Files, Fixed-Length Records, and Variable-Length Records (1)
Fixed-length records vs. variable-length records:
either every record in the file has exactly the same size, or different records in the file have different sizes.
The reasons to have variable-length records:
File records are of the same record type, but one or more of the fields are of varying size.
File records are of the same record type, but one or more of the fields (repeating fields) may have multiple values; a group of values for a repeating field is called a repeating group.
File records are of the same record type, but one or more of the fields (optional fields) are optional.
The file contains records of different record types (mixed file).
Files, Fixed-Length Records, and Variable-Length Records (2)
Fixed-length records (the position of each field can be identified easily)
Optional fields: include every optional field in every file record
Repeating fields: each record allocates space for the maximum number of values of the repeating field
Variable-length records
Variable-length fields:
special separator characters are used to terminate variable-length fields, or
the length of the field is stored in the record, preceding the field value
Optional fields: can use <field-name, field-value> or <field-type, field-value> pairs rather than just the field values
Repeating fields: one separator character separates the repeating values of the field, and another separator character indicates termination of the field
Records of different types: each record is preceded by a record type indicator
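As a sketch of the separator-character scheme, the hypothetical helper below extracts one field from a record whose fields are terminated by a separator ('$' in the test is an arbitrary choice, not one prescribed by the text):

```c
#include <assert.h>
#include <string.h>

/* Copies 0-based field number idx of record rec (fields terminated by
 * sep) into out, returning its length, or -1 if the record has fewer
 * fields.  This is how a variable-length record must be "parsed". */
int get_field(const char *rec, char sep, int idx, char *out, int outsz) {
    const char *p = rec;
    for (int i = 0; i < idx; i++) {        /* skip the earlier fields */
        p = strchr(p, sep);
        if (!p) return -1;
        p++;
    }
    const char *end = strchr(p, sep);      /* field ends at sep or at NUL */
    if (!end) end = p + strlen(p);
    int len = (int)(end - p);
    if (len >= outsz) len = outsz - 1;     /* truncate to fit the buffer */
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Fixed-length records need no such scan: the byte offset of each field is the same in every record.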
Figure 16.5 Three record storage formats. (a) A fixed-length record with six fields and size of 71 bytes. (b) A record with two variable-length fields and three fixed-length fields. (c) A variable-field record with three types of separator characters.
Files of Records (cont’d)
The physical disk blocks that are allocated to hold the
records of a file can be contiguous, linked, or indexed.
Combination of contiguous and linked allocation: allocate clusters of consecutive blocks and link the clusters.
Clusters are sometimes called file segments or extents.
In a file of fixed-length records, all records have the same
format. Usually, unspanned blocking is used with such
files.
Files of variable-length records require additional
information to be stored in each record, such as separator
characters and field types. Usually spanned blocking is used with such files.
File Headers
A file header or file descriptor contains information about a file that is needed by the system programs that access the file records.
The header includes information to:
determine the disk addresses of the file blocks
describe the record format, including field lengths and the order of fields for fixed-length unspanned records, and field type codes, separator characters, and record type codes for variable-length records
If the address of the block that contains the desired record is not known, a linear search through the file blocks is needed to locate the block
Operations on Files
Operations on files are usually grouped into retrieval operations and update operations.
In either case, a selection condition (or filtering condition) is required to select one or more records.
Search conditions on files are generally based on simple selection conditions.
A complex condition is decomposed to extract simple conditions.
Typical operations for locating and accessing file records:
Record-at-a-time operations: Open, Reset, Find (or Locate), Read (or Get), FindNext, Delete, Modify, Insert, Close
Set-at-a-time high-level operations: FindAll, Find (or Locate) n, FindOrdered, Reorganize
Operations on Files (cont’d)
Typical file operations :
OPEN: Readies the file for access, and associates a pointer that will refer
to a current file record at each point in time.
FIND (Locate): Searches for the first file record that satisfies a certain
condition, and makes it the current file record.
FINDNEXT: Searches for the next file record (from the current record) that
satisfies a certain condition, and makes it the current file record.
READ (Get): Reads the current file record into a program variable.
INSERT: Inserts a new record into the file & makes it the current file record.
DELETE: Removes the current file record from the file, usually by marking
the record to indicate that it is no longer valid.
MODIFY: Changes the values of some fields of the current file record.
CLOSE: Terminates access to the file.
REORGANIZE: Reorganizes the file records.
For example, the records marked deleted are physically removed from the file or
a new organization of the file records is created.
READ_ORDERED: Read the file blocks in order of a specific field of the file.
Operations on Files (cont’d)
The difference between file organization and access method
A file organization refers to the organization of the data of a file into records, blocks, and access structures
An access method provides a group of operations that can be applied to a file, such as open, reset, find, read, …
It is possible to apply several access methods to a file organization
Static vs dynamic
Some files on which update operations are rarely performed
Some files may change frequently, so update operations are constantly applied to them
A successful file organization should perform as efficiently as possible the operations expected to be applied frequently to the file.
Files of Unordered Records (Heap Files)
The simplest and most basic type of organization
Also called a heap or a pile file
New records are inserted at the end of the file
A linear search through the file records is necessary to search for a record.
This requires reading and searching half the file blocks on the average, and is hence quite expensive.
Record insertion is quite efficient
A record is deleted by setting its deletion marker to a certain value; the space of deleted records can be reused when inserting new records.
Periodic reorganization of the file reclaims the unused space.
Modifying a variable-length record may require deleting the old record
and inserting a modified record
Reading the records in order of a particular field requires sorting the
file records
Files of Ordered Records (Sorted Files)
Also called a sequential file
File records are kept sorted by the values of an ordering
field
Insertion is expensive: records must be inserted in the
correct order
It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion efficiency;
this is periodically merged with the main ordered file
Modifying a field value of a record depends on two factors
The search condition to locate the record, and
The field to be modified
Files of Ordered Records (cont’d)
A binary search can be used to search for a record on its ordering field value.
If the ordering field is also a key field, it is called the ordering key.
This requires reading and searching about log2(b) of the file blocks on the average, an improvement over linear search.
Reading the records in order of the ordering field is quite efficient.
Ordering does not provide any advantages for random or ordered access based on the values of other, nonordering fields.
Ordered files are rarely used in database applications unless an additional access path, called a primary index, is used; this results in an indexed-sequential file.
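The gap between the two search costs is easy to tabulate with an illustrative model that counts only block accesses:

```c
#include <assert.h>

/* Average block accesses: linear search on an unordered file reads
 * about b/2 blocks; binary search on the ordering field of a sorted
 * file reads about ceil(log2 b) blocks. */
long linear_cost(long b) { return b / 2; }

long binary_cost(long b) {               /* ceil(log2 b), b >= 1 */
    long cost = 0, span = 1;
    while (span < b) { span *= 2; cost++; }
    return cost;
}
```

For a file of b = 1,000 blocks, linear search averages 500 block accesses while binary search needs about 10, which is why ordering pays off so dramatically on the ordering field.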
Files of Ordered Records (cont’d)
(Figure showing an example of an ordered file omitted.)
Files of Ordered Records (cont’d)
Average access times to access a specific record, for a given type of file (table omitted).
Hashing Techniques
Hash file (or direct file)
Provides very fast access to records on certain search conditions
The search condition must be an equality condition on a single
field, called the hash field
In most cases, the hash field is also a key field of the file, in which
case it is called the hash key
The idea behind hashing is to provide a function h, called a hash function or randomizing function, that is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
Search is very efficient on the hash key. For most records, only
a single-block access is needed to retrieve the record
Internal Hashing
For internal files, hashing is typically implemented as a
hash table through the use of an array of records.
Suppose that the array index ranges over 0..M−1; a common hash function is h(K) = K mod M.
The hash field space – the number of possible values a
hash field can take – is usually much larger than the
address space – the number of available addresses for
records.
A collision occurs when two or more hash field values map to the same address (position).
Internal Hashing (cont’d)
Numerous methods for collision resolution:
Open addressing: Proceeding from the occupied position
specified by the hash address, the program checks the
subsequent positions in order until an unused (empty) position is
found.
Chaining: For this method, various overflow locations are kept,
usually by extending the array with a number of overflow positions.
In addition, a pointer field is added to each record location. A
collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.
Multiple hashing: The program applies a second hash function if
the first results in a collision. If another collision results, the
program uses open addressing or applies a third hash function
and then uses open addressing if necessary.
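The first of these methods can be sketched in a few lines; M = 11 is an arbitrary illustrative table size, not a value from the text:

```c
#include <assert.h>

#define M 11            /* table size; a prime is a common choice */
#define EMPTY (-1)

/* Internal hashing with h(K) = K mod M, collisions resolved by open
 * addressing: probe the subsequent positions in order until an
 * unused (empty) position is found. */
static int table[M];

void table_init(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
}

int hash_insert(int key) {             /* returns the slot used, -1 if full */
    int pos = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[pos] == EMPTY) { table[pos] = key; return pos; }
        pos = (pos + 1) % M;           /* open addressing: next position */
    }
    return -1;
}
```

Inserting 22 and then 33 (both hash to 0 since 22 mod 11 = 33 mod 11 = 0) places 33 in slot 1; a subsequent key hashing to slot 1 is pushed along to slot 2, showing how clusters of occupied slots form.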
Internal Hashing (cont’d)
To reduce overflow records, a hash file is
typically kept 70-90% full
The hash function h should distribute the
records uniformly among the buckets
Otherwise, search time will be increased
because many overflow records will exist
Figure 16.8 Internal hashing data structures. (a) Array of M positions for use in internal hashing. (b) Collision resolution by chaining records.
External Hashing Files
Hashing for disk files is called external hashing.
The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M−1, each of which holds multiple records.
Typically, a bucket corresponds to one (or a fixed number of contiguous) disk blocks.
The hashing function maps a key into a relative bucket number.
A table maintained in the file header converts the bucket number into the corresponding disk block address.
Collisions occur when a new record hashes to a bucket that is already
full
An overflow file is kept for storing such records
Overflow records that hash to each bucket can be linked together
The pointers in the linked list should be record pointers, including both a
block address and a relative record position within the block
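The header's bucket-to-block mapping can be sketched as a small lookup table; the block addresses below are made-up placeholders for what the file header would actually store:

```c
#include <assert.h>

#define M 4   /* number of buckets; an illustrative choice */

/* External hashing: h yields a relative bucket number, and a table
 * kept in the file header converts it to a disk block address. */
static const long bucket_to_block[M] = { 7010, 7040, 7060, 7220 };

long block_address(long key) {
    return bucket_to_block[key % M];   /* hash, then translate */
}
```

Retrieving a record by hash key thus usually costs a single block access: one hash computation, one in-memory table lookup, one block read (plus overflow chasing only when the bucket has overflowed).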
Hashed Files
Figure 16.9 Matching bucket numbers to disk block addresses.
Figure 16.10 Handling overflow for buckets by chaining.
External Hashing Files (cont’d)
Static hashing: a fixed number of buckets M is allocated.
Main disadvantages of static external hashing:
Fixed number of buckets M is a problem if the number
of records in the file grows or shrinks
Can result in either a lot of unused space or an inefficient
record access due to the long lists of overflow records
Ordered access on the hash key is quite inefficient
(requires sorting the records)
A dynamic file organization allows the number of buckets to vary dynamically with only localized reorganization
Dynamic and Extendible Hashed Files
Hashing techniques are adapted to allow the
dynamic growth and shrinking of the number of
file records
These techniques include the following:
dynamic hashing, extendible hashing, and
linear hashing
Dynamic and Extendible Hashed Files (cont’d)
Both dynamic and extendible hashing use
the binary representation of the hash
value h(K) in order to access a directory
In dynamic hashing the directory is a binary
tree
In extendible hashing the directory is an array of size 2^d, where d is called the global depth.
Dynamic and Extendible Hashed Files (cont’d)
The directories can be stored on disk, and
they expand or shrink dynamically
Directory entries point to the disk blocks that
contain the stored records
An insertion in a disk block that is full
causes the block to split into two blocks
and the records are redistributed among
the two blocks
The directory is updated appropriately
Extendible Hashing
Figure 16.11 Structure of the extendible hashing scheme.
Dynamic and Extendible Hashed Files (cont’d)
Examples of extendible hashing (consider the figure in the previous slide)
Insert a record into the bucket whose hash values start with 01 (the third bucket)
The records will be distributed between two buckets, 010 and 011, both with d’ = 3
Insert a record into the bucket whose hash values start with 111 (the last bucket)
Two new buckets 1110 and 1111 (d’ = 4) need a directory with d = 4
Thus, the size of the directory is doubled, and each of the other original locations in the directory is also split into two locations, both of which have the same pointer value as the original location
The main advantages of extendible hashing
The performance of the file does not degrade as the file grows
Additional buckets can be allocated dynamically as needed
The space overhead for the directory table is negligible
Splitting causes only minor (localized) reorganization
The disadvantage of extendible hashing
Two block accesses are needed instead of one in static hashing
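The directory doubling and bucket splitting described above can be sketched in Python. This is an illustrative toy, not the textbook's implementation: it assumes integer keys, a hypothetical 32-bit multiplicative hash, and a made-up bucket capacity of two records so that splits are easy to observe.

```python
# Toy extendible hashing: a directory of 2**global_depth entries indexed by
# the high-order bits of a 32-bit hash; each bucket carries a local depth d'.
BUCKET_CAPACITY = 2  # tiny, so splits and directory doubling are visible

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []                     # list of (key, value) pairs

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _hash(self, key):
        # Knuth multiplicative hash: spreads integer keys over the top bits
        return (key * 2654435761) & 0xFFFFFFFF

    def _index(self, key):
        return self._hash(key) >> (32 - self.global_depth)

    def _bit(self, key, depth):
        # depth-th bit of the hash, counted from the most significant end
        return (self._hash(key) >> (32 - depth)) & 1

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.records) < BUCKET_CAPACITY:
                bucket.records.append((key, value))
                return
            self._split(bucket)               # bucket full: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # double the directory; each entry is duplicated, pointers kept
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        old, bucket.records = bucket.records, []
        for k, v in old:                      # redistribute on the new bit
            target = new_bucket if self._bit(k, bucket.local_depth) else bucket
            target.records.append((k, v))
        # redirect the directory entries whose distinguishing bit is 1
        shift = self.global_depth - bucket.local_depth
        for j, b in enumerate(self.directory):
            if b is bucket and (j >> shift) & 1:
                self.directory[j] = new_bucket

    def search(self, key):
        for k, v in self.directory[self._index(key)].records:
            if k == key:
                return v
        return None

demo = ExtendibleHash()
for k in range(1, 17):
    demo.insert(k, k * 10)
```

Note that inserting into a full bucket splits only that bucket; the directory doubles only when the full bucket's local depth already equals the global depth, matching the behavior described in the example above.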
Slide 16- 62
Dynamic
Hashing
Figure 16.12 Structure of the dynamic hashing scheme.
Slide 16- 63
Dynamic and Extendible Hashed
Files (cont’d)
Dynamic and extendible hashing do not
require an overflow area
Linear hashing does require an overflow
area but does not use a directory
Blocks are split in linear order as the file
expands
Slide 16- 64
Linear Hashing
The idea behind linear hashing is to expand and shrink the
number of buckets dynamically without needing a
directory.
Suppose that the file starts with M buckets numbered
from 0, 1, …, M-1
Let hash function h(K) = K mod M be the initial hash
function hi
An overflow chain for each bucket is still required for
collision resolution
When a collision leads to an overflow record in any file
bucket, the first bucket in the file – bucket 0 – is split into
two buckets: bucket 0 and bucket M
Slide 16- 65
Linear Hashing (cont’d.)
The records originally in bucket 0 are distributed between new buckets 0 and M based on the hash function hi+1(K) = K mod 2M
As further collisions lead to overflow records, additional buckets are split in the linear order 1, 2, 3, …, and eventually the file has 2M instead of M buckets, and all the buckets use the hash function hi+1(K) = K mod 2M
Hence, the records in overflow are eventually redistributed into regular buckets, using hi+1 via a delayed split of their buckets
A value n, initially set to 0 and incremented by 1 whenever a split occurs, is needed to determine which buckets have been split.
To retrieve a record with hash key K, first apply hi to K; if hi(K) < n, then apply hi+1 to K
n grows linearly as buckets are split; when n = M, reset n to 0
Any new collisions that cause overflow lead to the use of a new hash function hi+2(K) = K mod 4M
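The split-in-linear-order scheme above can be sketched as a small Python class. This is an illustrative sketch, not the textbook's code: keys are integers, a plain Python list stands in for each bucket together with its overflow chain, and M and the bucket capacity are made-up parameters.

```python
# Toy linear hashing: no directory; bucket n is split whenever any bucket
# overflows, and buckets below n are addressed with h_{i+1}(K) = K mod 2M.
class LinearHash:
    def __init__(self, M=4, capacity=2):
        self.M = M                      # h_i(K) = K mod M for this round
        self.n = 0                      # next bucket to split
        self.capacity = capacity
        self.buckets = [[] for _ in range(M)]  # bucket + its overflow chain

    def _address(self, key):
        b = key % self.M                # h_i
        if b < self.n:                  # bucket b was already split,
            b = key % (2 * self.M)      # so use h_{i+1}
        return b

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:     # overflow: split bucket n,
            self._split()                   # which need not be this bucket

    def _split(self):
        self.buckets.append([])             # new bucket number n + M
        old, self.buckets[self.n] = self.buckets[self.n], []
        for key in old:                     # redistribute with h_{i+1}
            self.buckets[key % (2 * self.M)].append(key)
        self.n += 1
        if self.n == self.M:                # all original buckets split:
            self.M *= 2                     # start the next round, with
            self.n = 0                      # h_{i+1} as the new h_i

    def search(self, key):
        return key in self.buckets[self._address(key)]

lh = LinearHash(M=4, capacity=2)
for k in [3, 7, 11, 15, 19, 23, 2, 6, 10]:
    lh.insert(k)
```

In this run, the six keys congruent to 3 mod 4 force four splits, so n wraps back to 0 and the file enters the next round with 2M = 8 buckets, exactly the progression described above.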
Slide 16- 66
Linear Hashing (cont’d.)
In general, a sequence of hash functions hi+j(K) = K mod (2^j M) is used, where j = 0, 1, 2, …; a new hash function hi+j+1 is needed whenever all the buckets 0, 1, …, (2^j M) - 1 have been split and n is reset to 0.
Splitting can be controlled by monitoring the file load factor instead of splitting whenever an overflow occurs.
The file load factor l can be defined as l = r/(bfr*N), where r is the current number of file records, bfr is the blocking factor (records per bucket), and N is the current number of file buckets
Buckets that have been split can also be recombined linearly if the load of the file falls below a certain threshold.
The file load can be used to trigger both splits and combinations so that the file load can be kept within a desired range, say 0.7~0.9
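As a quick worked example of the load factor formula (the numbers here are made up for illustration):

```python
# Hypothetical file: r records, bfr records per bucket, N buckets
r, bfr, N = 3000, 10, 400
load = r / (bfr * N)            # l = r / (bfr * N) = 3000 / 4000 = 0.75
in_range = 0.7 <= load <= 0.9   # within the desired range: no split/combine
```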
Slide 16- 67
Parallelizing Disk Access using RAID
Technology
Secondary storage technology must take steps to
keep up in performance and reliability with
processor technology
A major advance in secondary storage technology is
represented by the development of RAID, which
originally stood for Redundant Arrays of
Inexpensive Disks
The main goal of RAID is to even out the widely
different rates of performance improvement of disks
against those in memory and microprocessors
Slide 16- 68
Parallelizing Disk Access using RAID
Technology (cont’d.)
A natural solution is a large array of small independent disks acting as a single higher-performance logical disk.
A concept called data striping is used, which utilizes parallelism to improve disk performance.
Distribute data transparently over multiple disks to make them appear as a single large, fast disk
Provide high overall transfer rate and also accomplish load balancing
Improve reliability by storing redundant information on disks using parity or some other error correction code
Slide 16- 69
Improving Reliability with RAID
Solutions to disk reliability problem
Employ redundancy of data so that disk failure can be tolerated
For an array of n disks, the likelihood of failure is n times as much as that for one disk. An obvious solution is to employ redundant data
The disadvantages include additional I/O, extra computation to maintain redundancy and to do recovery from errors, additional disk capacity to store redundant information
Use mirroring or shadowing technique to introduce redundancy
Store extra information that can be used to reconstruct the lost information in case of disk failure
The incorporation of redundancy must consider two problems:
selecting a technique, such as error correction codes involving parity bits, for computing the redundant information
selecting a method of distributing the redundant information across the disk array (store the redundant information on a small number of disks or distribute it uniformly across all disks)
Slide 16- 70
Improving Performance with RAID
The disk array employs the technique of data striping to
achieve a higher transfer rate.
Data striping can be applied at different granularities
Bit-level data striping
Splitting a byte of data and writing bit j to the jth disk
Can be generalized to a number of disks that is either a
multiple or a factor of eight
Block-level data striping
Multiple independent small requests (that access single
blocks) can be serviced in parallel by separate disks, reducing
the queuing time of I/O requests
Large requests (accessing multiple blocks) can be parallelized
to reduce the response time
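Round-robin block-level striping can be sketched with a one-line mapping (an illustrative convention; actual disk and block numbering varies by implementation):

```python
# Block-level striping: logical block b lives on disk b mod n at offset b // n,
# so a "large request" for one stripe of n consecutive blocks touches every
# disk exactly once and the transfers can proceed in parallel.
def locate(logical_block, n_disks):
    return logical_block % n_disks, logical_block // n_disks

# logical blocks 0..7 across 4 disks: blocks 0-3 form stripe 0, 4-7 stripe 1
placement = [locate(b, 4) for b in range(8)]
```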
Slide 16- 71
Parallelizing Disk Access using RAID
Technology (cont’d.)
Slide 16- 72
Figure 16.13 Striping of data across multiple disks. (a) Bit-level striping across four disks. (b) Block-level striping across four disks.
RAID Organization and Levels
Different RAID organizations were defined based on
different combinations of two factors: the granularity of
data interleaving (striping) and the pattern used to compute
redundant information
RAID level 0 has no redundant data and hence has the best write
performance, at the risk of data loss
RAID level 1 uses mirrored disks
RAID level 2 uses memory-style redundancy by using Hamming codes,
which contain parity bits for distinct overlapping subsets of components.
Level 2 includes both error detection and correction.
RAID level 3 uses a single parity disk, relying on the disk controller to
figure out which disk has failed
RAID levels 4 and 5 use block-level data striping, with level 5 distributing
data and parity information across all disks
RAID level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by using
just two redundant disks
Slide 16- 73
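The parity-based redundancy used by levels 3 through 5 can be illustrated with bitwise XOR (a sketch using made-up two-byte "blocks"; real arrays compute parity over full disk blocks):

```python
from functools import reduce

def parity(blocks):
    # XOR corresponding bytes of all blocks; with data blocks d0..dk and
    # parity p = d0 ^ ... ^ dk, any single lost block is the XOR of the rest
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x0f\x10", b"\xf0\x01", b"\x33\x44"
p = parity([d0, d1, d2])            # stored on the parity disk (level 3)
rebuilt = parity([d0, d2, p])       # disk holding d1 fails: rebuild it
```

The same XOR identity is why the controller only needs to know *which* disk failed: the surviving blocks of each stripe determine the missing one.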
Use of RAID Technology
Different RAID organizations are used in different
situations
RAID level 1 (mirrored disks) is the easiest for rebuilding a
disk from other disks
• It is used for critical applications like logs
RAID level 2 uses memory-style redundancy by using
Hamming codes, which contain parity bits for distinct
overlapping subsets of components
• Level 2 includes both error detection and correction
RAID level 3 (single parity disk relying on the disk
controller to figure out which disk has failed) and level 5
(block-level data striping) are preferred for large-volume
storage, with level 3 giving higher transfer rates
Slide 16- 74
Use of RAID Technology (cont’d.)
The most popular uses of RAID technology currently are:
Level 0 (with striping), Level 1 (with mirroring), and Level
5 with an extra drive for parity
Design Decisions for RAID include:
Level of RAID, number of disks, choice of parity schemes,
and grouping of disks for block-level striping
Slide 16- 75
Use of RAID Technology (cont’d.)
Slide 16- 76
Use of RAID Technology (cont’d.)
Figure 16.14 Some popular levels of RAID. (a) RAID level 1: Mirroring of data on two disks. (b) RAID level 5: Striping of data with distributed parity across four disks.
Slide 16- 77
Storage Area Networks
The demand for higher storage has risen considerably in
recent times
The growth of e-commerce, ERP, and data warehousing makes the
demand for data storage increase substantially
The total cost of managing all data is growing so rapidly that in many
cases the cost of managing server-attached storage exceeds the
cost of the server itself
Organizations have a need to move from a static fixed
data center oriented operation to a more flexible and
dynamic infrastructure for information processing
Many users of RAID systems cannot use the capacity
effectively because it has to be attached in a fixed manner
to one or more servers
Slide 16- 78
Storage Area Networks (cont’d.)
Thus, organizations are moving to the concept of Storage
Area Networks (SANs)
In a SAN, online storage peripherals are configured as
nodes on a high-speed network and can be attached
and detached from servers in a very flexible manner
This allows storage systems to be placed at longer
distances from the servers and provide different
performance and connectivity options
Slide 16- 79
Storage Area Networks (cont’d.)
Advantages of SANs are:
Flexible many-to-many connectivity among servers and
storage devices using fiber channel hubs and switches
Up to 10km separation between a server and a storage
system using appropriate fiber optic cables
Better isolation capabilities allowing non-disruptive
addition of new peripherals and servers
SANs face the problem of combining storage options from
multiple vendors and dealing with evolving standards of
storage management software and hardware
Slide 16- 80
Network-Attached Storage
With the phenomenal growth in digital data, particularly generated from multimedia and other enterprise applications, the need for high-performance storage solutions at low cost has become extremely important
Network-Attached Storage (NAS)
NAS allows the addition of storage for file sharing. NAS devices allow vast amounts of hard disk storage space to be added to a network and can make that space available to multiple servers without shutting them down for maintenance and upgrades
NAS devices are being deployed as a replacement to traditional file servers
NAS systems strive for reliable operation and easy administration
Slide 16- 81
Network-Attached Storage (cont’d.)
SAN vs. NAS
SANs often utilize Fiber Channel rather than
Ethernet
A SAN often incorporates multiple network devices
or endpoints on a self-contained or private LAN
NAS relies on individual devices connected
directly to the existing public LAN
NAS systems claim greater operating system
independence of clients
Slide 16- 82
iSCSI Storage Systems
iSCSI (Internet SCSI)
A new protocol
When a DBMS needs to access data, the operating system generates the
appropriate SCSI commands and data request. A packet header is added
before the resulting IP packets are transmitted over Ethernet. On receipt, the
packet is disassembled, separating the SCSI commands and request.
Allows clients (called initiators) to send SCSI commands to SCSI storage
devices on remote channels
Does not require the special cabling needed by Fiber Channel and can run
over longer distances using existing network infrastructure
By carrying SCSI commands over IP networks, iSCSI facilitates data
transfer over intranets and manages storage over long distances
iSCSI storage has mainly impacted small- and medium-sized
businesses because of its combination of simplicity, low cost, and the
functionality of iSCSI devices.
Slide 16- 83
iSCSI Storage Systems (cont’d.)
Two approaches to transmitting storage data over IP networks
iSCSI
Fiber Channel over IP (FCIP)
Translates Fiber Channel control codes and data into IP packets for
transmission
The protocol is also known as Fiber Channel tunneling or storage
tunneling
Fiber Channel over Ethernet (FCoE)
Can be thought of as iSCSI without the IP
Uses many elements of SCSI and FC (just like iSCSI), but does not
include the TCP/IP components
Promises excellent performance, especially on 10 Gigabit Ethernet
(10GbE), and is relatively easy for vendors to add to their products
Slide 16- 84
Summary
Disk Storage Devices
Files of Records
Operations on Files
Unordered Files
Ordered Files
Hashed Files
Dynamic and Extendible Hashing Techniques
RAID Technology
Slide 16- 85