Copyright © 2017 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 16
Disk Storage, Basic File Structures, Hashing, and Modern Storage Architectures
Chapter Outline
Secondary Storage Devices
Buffering of Blocks
Placing File Records on Disk
Operations on Files
Files of Unordered Records (Heap Files)
Files of Ordered Records (Sorted Files)
Hashing Techniques
Chapter Outline (cont’d)
Other Primary File Organizations
Parallelizing Disk Access Using RAID
Technology
New Storage Systems
Introduction
The collection of data that makes up a computerized database must be stored physically on some computer storage medium.
Categories of storage media
Primary storage: can be operated on directly by the CPU; most expensive, highest speed, smallest capacity
Secondary storage: online storage of the database (magnetic disk, solid-state drive [SSD])
Tertiary storage: offline storage for archiving the database; lower cost, slower access, larger capacity
Memory hierarchies and storage devices
Primary storage level: cache memory, main memory
Secondary and tertiary storage level: flash memory, magnetic disks, SSD, CD-ROM, DVD disks, magnetic tapes
Main memory databases: the entire database is kept in main memory (e.g., telephone switching); particularly useful for real-time applications
Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer Speed), and Commodity Cost
Storage of Databases
Databases typically store large amounts of data that must persist over long periods of time (in contrast to transient data).
Databases are stored permanently on magnetic disk secondary storage.
Secondary storage is referred to as nonvolatile storage, whereas main memory is often called volatile storage.
The process of physical database design involves choosing the particular data organization that best suits the given application requirements with acceptable performance.
The data stored on disk is organized as files of records.
There are several primary file organizations, which determine how the records of a file are physically placed on the disk (e.g., heap, sorted, hashed, B-tree) and how the records can be accessed.
A secondary organization or auxiliary access structure (i.e., an index) allows more efficient access to the records of a file than the primary file organization alone provides.
Disk Storage Devices
Magnetic disks are used for storing large amounts of data. The device that holds the disks is referred to as a hard disk drive, or HDD.
The preferred secondary storage device for high storage capacity and low cost.
Data is stored as magnetized areas on magnetic disk surfaces.
A disk pack contains several magnetic disks connected to a rotating spindle.
Disks are divided into concentric circular tracks on each disk surface.
Track capacities vary, typically from tens of kilobytes to 150 Kbytes or more.
In disk packs, tracks with the same diameter on the various surfaces are called a cylinder because of the shape they would form if connected in space.
Data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders.
Figure 16.1 (a) A single-sided disk with read/write hardware. (b) A disk pack with read/write hardware.
Disk Storage Devices (cont’d)
A track is divided into smaller blocks or sectors because a track usually contains a large amount of information.
The division of a track into sectors is hard-coded on the disk surface and cannot be changed.
One type of sector organization calls a portion of a track that subtends a fixed angle at the center a sector (Figure 16.2a).
Other sector organizations are possible; one is to have the sectors subtend smaller angles at the center as one moves away from it, maintaining a uniform recording density (Figure 16.2b).
Figure 16.2 Different sector organizations on disk. (a) Sectors subtending a fixed angle. (b) Sectors maintaining a uniform recording density.
Disk Storage Parameters
The division of a track into equal-sized disk
blocks (or pages) is set by the operating system
during disk formatting (or initialization)
A track is divided into blocks
The block size B is fixed for each system
Typical block sizes range from B=512 bytes to B=8192 bytes
Blocks are separated by fixed-size interblock gaps,
which include specially coded control information
written during disk initialization
Whole blocks are transferred between disk and main
memory for processing
(Tables of disk storage parameter definitions and typical values omitted.)
Disk Storage Devices (cont’d.)
A read-write head moves to the track that contains the
block to be transferred.
Disk rotation moves the block under the read-write head for
reading or writing
Fixed-head disks (as many heads as tracks): fast but costly
Moveable-head disks
Disk controller: controls the disk drive and interfaces it to
the computer system
The disk controller accepts high-level I/O commands and takes
appropriate action to position the arm and cause read/write action
to take place
Standard interfaces used for disk drives
SCSI, SATA, SAS, NL-SAS
Disk Storage Devices (cont’d.)
Transfer of data between main memory and disk takes
place in units of disk blocks
A physical disk block (hardware) address consists of:
a cylinder number (an imaginary collection of tracks of the same radius from all recorded surfaces)
the track number (or surface number) within the cylinder
a block number within the track
Reading or writing a disk block is time consuming because of the seek time s and rotational delay (latency) rd.
Double buffering can be used to speed up the transfer of contiguous disk blocks (called a cluster).
Hardware Description of Disk Devices
The total time needed to locate and transfer an arbitrary block consists of:
Seek time: position the read/write head on the correct track
Rotational delay or latency: the beginning of the desired block rotates
into position under the read/write head
Block transfer time: transfer the data of the desired block
The seek time and rotational delay are usually much larger than the
block transfer time
It is common to transfer several blocks on the same track or cylinder. (eliminate the seek time and rotational delay except for the first block)
Bulk transfer rate: time required to transfer consecutive blocks
Locating data on disk is a major bottleneck in database applications.
It is important to minimize the number of block transfers needed to locate
and transfer the required data from disk to main memory
Placing “related information” on contiguous blocks is the basic goal of
any storage organization on disk
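As a back-of-the-envelope model of the costs above: seek time and rotational delay are paid once per contiguous run, and each block then adds one transfer time. A minimal sketch (parameter values are illustrative assumptions, not specs of any real drive):

```c
#include <assert.h>

/* Rough cost model: seek time and rotational delay are paid once,
 * then each of k contiguous blocks adds one block transfer time.
 * All times are in milliseconds; the values passed in are assumed. */
double access_time_ms(double seek_ms, double rot_delay_ms,
                      double block_transfer_ms, int k) {
    return seek_ms + rot_delay_ms + k * block_transfer_ms;
}
```

For example, with an assumed 8 ms seek, 4 ms average rotational delay, and 0.1 ms per block, fetching 100 contiguous blocks costs about 22 ms, versus roughly 1,210 ms if every block required its own seek and rotation.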
Making Data Access More Efficient on Disk
The commonly used techniques to make accessing data
more efficient on HDDs
Buffering of data: double buffer strategy
Proper organization of data on disk:
keep related data on contiguous blocks
Reading data ahead of request:
When a block is read into the buffer, blocks from the rest of the
track can also be read
Proper scheduling of I/O requests: elevator algorithm
Use of log disks to temporarily hold writes
A single disk may be assigned for logging of writes (eliminate seek time)
Use of SSDs or flash memory for recovery purposes
Writing the updates to an SSD buffer first, and then updating the data from the buffer during idle time or when the buffer becomes full
Solid State Device (SSD) Storage
The recent trend is to use flash memories (SSDs) as an
intermediate layer between main memory and HDDs
As opposed to HDD, where related data from the same
relation must be placed on contiguous blocks, preferably
on contiguous cylinders, there is no restriction on
placement of data on an SSD since any address is
directly addressable.
As a result, the data is less likely to be fragmented; hence
no reorganization is needed
Typically, when a write to disk occurs on an HDD, the same block is overwritten with new data.
In SSDs, the data is written to different NAND cells to attain wear-leveling, which prolongs the life of the SSD.
Magnetic Tape Storage Devices
Disks are random access; magnetic tapes are sequential access.
Data records on tape are also stored in blocks.
Blocks may be substantially larger than those for disks, and interblock gaps are also quite large.
Tapes serve a very important function: backing up the database.
Tapes can also be used to store excessively large database files and to archive historical records.
Tapes can also be used to store images and system libraries.
Buffering of Blocks
When several blocks need to be transferred and their
block addresses are known, several buffers can be
reserved in main memory to speed up the transfer.
Concurrency of processes (see Figure 16.3):
interleaved
parallel
Buffering is most useful when processes can run
concurrently in a parallel fashion, either because a
separate disk I/O processor is available or because
multiple CPU processors exist
Interleaved Concurrency versus Parallel Execution
Double Buffering
The CPU can start processing a block once its transfer to main memory is completed; at the same time, the disk I/O processor can be reading and transferring the next block into a different buffer.
Can also be used to write continuous stream of blocks from memory to the disk
Permits continuous reading or writing of data on consecutive disk blocks, which eliminates the seek time and rotational delay for all but the first block transfer
Data is kept ready for processing, thus reducing the waiting time in the programs
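The saving can be quantified with a toy two-stage pipeline model, a sketch under the assumption that reading and processing overlap fully with two buffers (R = time to read one block, P = time to process one block, n = number of blocks):

```c
#include <assert.h>

/* Toy timing model for block buffering (all units arbitrary). */
double single_buffer_time(double R, double P, int n) {
    return n * (R + P);                /* read and process strictly alternate */
}

double double_buffer_time(double R, double P, int n) {
    double slower = (R > P) ? R : P;
    return R + P + (n - 1) * slower;   /* two-stage pipeline: only the slower
                                          stage limits steady-state throughput */
}
```

With R = P = 1 ms per block and n = 10 blocks, single buffering takes 20 ms while double buffering takes about 11 ms.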
Double Buffering (cont’d)
(Figure illustrating the use of two buffers, A and B, over time omitted.)
Buffer Management
For most large databases, with files containing millions of pages (or blocks), it is not possible to bring all of the data into main memory at the same time.
The actual management of buffers (a part of main memory), and deciding which buffer to use for a newly read page, is a more complex process.
The buffer manager is a software component of a DBMS that responds to requests for data and decides what buffer to use and what pages to replace in the buffer to accommodate newly requested blocks.
The buffer manager views the available main memory storage as a
buffer pool
The size of the shared buffer pool is typically a parameter for DBMS
controlled by DBAs
Buffer Management (cont’d.)
Two kinds of buffer managers:
one that controls main memory directly, as in most RDBMSs
one that allocates buffers in virtual memory, which transfers control to the operating system (OS)
The overall goal of the buffer manager is twofold
To maximize the probability that the requested page is found in main memory
In case of reading a new disk block from disk, to find a page to replace that will cause the least harm in the sense that it will not be required shortly again
The buffer manager keeps two types of information on hand about each page in the buffer pool:
A pin-count: the number of times that page has been requested, or the number of current users of that page. If the count falls to 0, the page is considered unpinned. Increasing the pin-count is called pinning.
A dirty bit: initially set to 0, but set to 1 whenever the page is updated.
Buffer Management (cont’d.)
When a certain page is requested, the buffer manager takes the following actions:
It checks whether the requested page is already in a buffer in the buffer pool; if so, it increments the page's pin-count and releases the page to the requester.
If the page is not in the buffer pool, the buffer manager does the
following
It chooses a page for replacement, using the replacement policy, and
increments its pin-count
If the dirty bit of the replacement page is on, the buffer manager writes that page to disk, replacing its old copy on disk. If the dirty bit is not on, the page need not be written back.
The main memory address of the new page is passed to the requesting
application
If there is no unpinned page available in the buffer pool and the requested page is not in the buffer pool, the buffer manager may have to wait until a page is released. A transaction requesting this page may go into a wait state or may even be aborted.
Buffer Replacement Strategies
Least recently used (LRU)
Throw out the page that has not been used (read or written) for the longest time
Clock policy (a round-robin variant of the LRU)
Each buffer has a flag with a 0 or 1. Buffers with a 0 are vulnerable and may be
used for replacement. Buffers with a 1 are not vulnerable
When a block is read into a buffer or the buffer is accessed, the flag is set to 1
The clock hand is positioned on a “current buffer.” When a buffer is needed for a
new block, the hand is rotated until a buffer with a 0 is found and use that buffer for
the new block (if the dirty bit is on for the page being replaced, the page will be
written to disk first)
If the clock hand passes buffers with 1s, it sets them to 0. Thus, a block is replaced from its buffer only if it is not accessed before the hand completes a full rotation and returns to it.
First-in-first-out (FIFO)
The one that has been occupied the longest by a page is used for replacement
Most recently used (MRU)
Useful when a block that was used most recently will not be needed until all the remaining blocks in the relation are processed
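The clock policy described above can be sketched in a few lines, for a hypothetical pool of four frames (pin counts and dirty-page write-back are omitted for brevity):

```c
#include <assert.h>

#define NFRAMES 4

/* Clock (second-chance) replacement over NFRAMES buffer frames.
 * use_bit[i] is set to 1 on every access; the hand sweeps forward,
 * clearing 1s, and evicts the first frame found with use bit 0. */
static int use_bit[NFRAMES];
static int hand = 0;

void access_frame(int i) { use_bit[i] = 1; }

int choose_victim(void) {
    for (;;) {
        if (use_bit[hand] == 0) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;             /* replace this frame */
        }
        use_bit[hand] = 0;             /* give a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```

If all four frames have been accessed, the hand clears each flag in turn and, after a full rotation, evicts frame 0, matching the rule that a buffer is replaced only if it is not accessed during one complete sweep.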
Records
A record consists of a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record.
Fixed and variable length records
Records contain fields which have values of a particular
type
E.g., amount, date, time, age
Fields themselves may be fixed length or variable length
Variable length fields can be mixed into one record:
Separator characters or length fields are needed so that the
record can be “parsed.”
A collection of field names and their corresponding data
types constitutes a record type or record format definition
Records (cont’d)
Example: an employee record defined in C language
struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};
BLOB (binary large object)
used for data items consisting of large unstructured objects
stored separately from its record in a pool of disk blocks
Blocking
Blocking:
Refers to storing a number of records in one block on
the disk.
Blocking factor (bfr) refers to the number of records per block: bfr = floor(B/R), where B is the block size and R is the record size.
There may be empty space in a block if an integral number of records does not fit exactly in one block.
The unused space in each block is B − (bfr * R) bytes.
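The two formulas can be checked directly; in C, integer division of positive values already gives the floor. Using the 71-byte fixed-length record of Figure 16.5(a) with a 512-byte block as the worked example:

```c
#include <assert.h>

/* bfr = floor(B/R); unused space per block = B - bfr*R.
 * Integer division of positive operands truncates, which is the floor. */
int blocking_factor(int B, int R) { return B / R; }
int unused_bytes(int B, int R)    { return B - (B / R) * R; }
```

With B = 512 and R = 71, the blocking factor is 7 records per block and 512 − 7·71 = 15 bytes are left unused.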
Blocking (cont’d)
Spanned records:
Refers to records that exceed the size of one block and hence span a number of blocks.
A pointer is used if the rest of the record is not stored in the next consecutive block.
If records are not allowed to cross block boundaries, the organization is called unspanned.
Types of Record Organization. (a) Unspanned. (b) Spanned.
Files of Records
A file is a sequence of records, where each
record is a collection of data values (or data
items).
A file descriptor (or file header) includes
information that describes the file, such as
the field names and their data types, and
the addresses of the file blocks on disk.
Records are stored on disk blocks.
Files of Records (cont’d)
The blocking factor bfr for a file is the (average)
number of file records stored in a disk block.
The number of blocks b needed for a file of r records is b = ceil(r/bfr) blocks, where ceil is the ceiling function.
A file can have fixed-length records or variable-
length records.
File records can be unspanned or spanned
Unspanned: no record can span two blocks
Spanned: a record can be stored in more than one block
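The ceiling can be computed with integer arithmetic alone, a common idiom sketched below (the record counts in the example are made up for illustration):

```c
#include <assert.h>

/* b = ceil(r / bfr), using only integer arithmetic. */
long blocks_needed(long r, long bfr) {
    return (r + bfr - 1) / bfr;
}
```

For instance, r = 30,000 records at bfr = 7 records per block require 4,286 blocks (4,285 full blocks plus one partially filled block).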
Files, Fixed-Length Records, and Variable-Length Records (1)
Fixed-length records vs. variable-length records:
either every record in the file has exactly the same size, or different records in the file have different sizes.
The reasons to have variable-length records:
File records are of the same record type, but one or more of the fields are of varying size.
File records are of the same record type, but one or more of the fields (repeating fields) may have multiple values; a group of values for a repeating field is called a repeating group.
File records are of the same record type, but one or more of the fields (optional fields) are optional.
The file contains records of different record types (mixed file).
Files, Fixed-Length Records, and Variable-Length Records (2)
Fixed-length records (the position of each field can be identified easily)
Optional fields: include every optional field in every file record
Repeating fields: each record allocates space for the maximum number of values of the repeating field
Variable-length records
Variable-length fields:
special separator characters are used to terminate variable-length fields, or
the length of the field is stored in the record, preceding the field value
Optional fields: can use <field-name, field-value> or <field-type, field-value> pairs rather than just the field values
Repeating fields: one separator character separates the repeating values of the field, and another separator character indicates termination of the field
Records of different types: each record is preceded by a record type indicator
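As a sketch of the separator-character scheme, the hypothetical helper below extracts one field from a record whose fields are terminated by a separator ('$' in the test is an arbitrary choice, not one prescribed by the text):

```c
#include <assert.h>
#include <string.h>

/* Copies 0-based field number idx of record rec (fields terminated by
 * sep) into out, returning its length, or -1 if the record has fewer
 * fields.  This is how a variable-length record must be "parsed". */
int get_field(const char *rec, char sep, int idx, char *out, int outsz) {
    const char *p = rec;
    for (int i = 0; i < idx; i++) {        /* skip the earlier fields */
        p = strchr(p, sep);
        if (!p) return -1;
        p++;
    }
    const char *end = strchr(p, sep);      /* field ends at sep or at NUL */
    if (!end) end = p + strlen(p);
    int len = (int)(end - p);
    if (len >= outsz) len = outsz - 1;     /* truncate to fit the buffer */
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Fixed-length records need no such scan: the byte offset of each field is the same in every record.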
Figure 16.5 Three record storage formats. (a) A fixed-length record with six fields and size of 71 bytes. (b) A record with two variable-length fields and three fixed-length fields. (c) A variable-field record with three types of separator characters.
Files of Records (cont’d)
The physical disk blocks that are allocated to hold the
records of a file can be contiguous, linked, or indexed.
Combination of contiguous and linked allocation: allocate clusters of consecutive blocks and link the clusters.
Clusters are sometimes called file segments or extents.
In a file of fixed-length records, all records have the same
format. Usually, unspanned blocking is used with such
files.
Files of variable-length records require additional
information to be stored in each record, such as separator
characters and field types. Usually spanned blocking is used with such files.
File Headers
A file header or file descriptor contains information about a file that is needed by the system programs that access the file records.
The header includes information to:
determine the disk addresses of the file blocks
describe the record format, including field lengths and the order of fields for fixed-length unspanned records, and field type codes, separator characters, and record type codes for variable-length records
If the address of the block that contains the desired record is not known, a linear search through the file blocks is needed to locate the block
Operations on Files
Operations on files are usually grouped into retrieval operations and update operations.
In either case, a selection condition (or filtering condition) is required to select one or more records.
Search conditions on files are generally based on simple selection conditions.
A complex condition is decomposed to extract simple conditions.
Typical operations for locating and accessing file records:
Record-at-a-time operations: Open, Reset, Find (or Locate), Read (or Get), FindNext, Delete, Modify, Insert, Close
Set-at-a-time high-level operations: FindAll, Find (or Locate) n, FindOrdered, Reorganize
Operations on Files (cont’d)
Typical file operations :
OPEN: Readies the file for access, and associates a pointer that will refer
to a current file record at each point in time.
FIND (Locate): Searches for the first file record that satisfies a certain
condition, and makes it the current file record.
FINDNEXT: Searches for the next file record (from the current record) that
satisfies a certain condition, and makes it the current file record.
READ (Get): Reads the current file record into a program variable.
INSERT: Inserts a new record into the file & makes it the current file record.
DELETE: Removes the current file record from the file, usually by marking
the record to indicate that it is no longer valid.
MODIFY: Changes the values of some fields of the current file record.
CLOSE: Terminates access to the file.
REORGANIZE: Reorganizes the file records.
For example, the records marked deleted are physically removed from the file or
a new organization of the file records is created.
READ_ORDERED: Read the file blocks in order of a specific field of the file.
Operations on Files (cont’d)
The difference between file organization and access method
A file organization refers to the organization of the data of a file into records, blocks, and access structures
An access method provides a group of operations that can be applied to a file, such as open, reset, find, read, …
It is possible to apply several access methods to a file organization
Static vs dynamic
Some files on which update operations are rarely performed
Some files may change frequently, so update operations are constantly applied to them
A successful file organization should perform as efficiently as possible the operations expected to be applied frequently to the file.
Files of Unordered Records (Heap Files)
The simplest and most basic type of organization
Also called a heap or a pile file
New records are inserted at the end of the file
A linear search through the file records is necessary to search for a record.
This requires reading and searching half the file blocks on the average, and is hence quite expensive.
Record insertion is quite efficient
A record is deleted by setting its deletion marker to a certain value; the space of deleted records can be reused when inserting new records.
Periodic reorganization of the file reclaims the unused space.
Modifying a variable-length record may require deleting the old record
and inserting a modified record
Reading the records in order of a particular field requires sorting the
file records
Files of Ordered Records (Sorted Files)
Also called a sequential file
File records are kept sorted by the values of an ordering
field
Insertion is expensive: records must be inserted in the
correct order
It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion efficiency;
this is periodically merged with the main ordered file
Modifying a field value of a record depends on two factors
The search condition to locate the record, and
The field to be modified
Files of Ordered Records (cont’d)
A binary search can be used to search for a record on its ordering field value.
If the ordering field is also a key field, it is called the ordering key.
This requires reading and searching about log2(b) of the file blocks on the average, an improvement over linear search.
Reading the records in order of the ordering field is quite efficient.
Ordering does not provide any advantages for random or ordered access based on the values of other, nonordering fields.
Ordered files are rarely used in database applications unless an additional access path, called a primary index, is used; this results in an indexed-sequential file.
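The gap between the two search costs is easy to tabulate with an illustrative model that counts only block accesses:

```c
#include <assert.h>

/* Average block accesses: linear search on an unordered file reads
 * about b/2 blocks; binary search on the ordering field of a sorted
 * file reads about ceil(log2 b) blocks. */
long linear_cost(long b) { return b / 2; }

long binary_cost(long b) {               /* ceil(log2 b), b >= 1 */
    long cost = 0, span = 1;
    while (span < b) { span *= 2; cost++; }
    return cost;
}
```

For a file of b = 1,000 blocks, linear search averages 500 block accesses while binary search needs about 10, which is why ordering pays off so dramatically on the ordering field.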
Files of Ordered Records (cont’d)
(Figure showing an example of an ordered file omitted.)
Files of Ordered Records (cont’d)
Average access times to access a specific record, for a given type of file (table omitted).
Hashing Techniques
Hash file (or direct file)
Provides very fast access to records on certain search conditions
The search condition must be an equality condition on a single
field, called the hash field
In most cases, the hash field is also a key field of the file, in which
case it is called the hash key
The idea behind hashing is to provide a function h, called a hash function or randomizing function, that is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
Search is very efficient on the hash key. For most records, only
a single-block access is needed to retrieve the record
Internal Hashing
For internal files, hashing is typically implemented as a
hash table through the use of an array of records.
Suppose that the array index ranges over 0..M−1; a common hash function is h(K) = K mod M.
The hash field space – the number of possible values a
hash field can take – is usually much larger than the
address space – the number of available addresses for
records.
A collision occurs when two or more hash field values map to the same address (position).
Internal Hashing (cont’d)
Numerous methods for collision resolution:
Open addressing: Proceeding from the occupied position
specified by the hash address, the program checks the
subsequent positions in order until an unused (empty) position is
found.
Chaining: For this method, various overflow locations are kept,
usually by extending the array with a number of overflow positions.
In addition, a pointer field is added to each record location. A
collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.
Multiple hashing: The program applies a second hash function if
the first results in a collision. If another collision results, the
program uses open addressing or applies a third hash function
and then uses open addressing if necessary.
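The first of these methods can be sketched in a few lines; M = 11 is an arbitrary illustrative table size, not a value from the text:

```c
#include <assert.h>

#define M 11            /* table size; a prime is a common choice */
#define EMPTY (-1)

/* Internal hashing with h(K) = K mod M, collisions resolved by open
 * addressing: probe the subsequent positions in order until an
 * unused (empty) position is found. */
static int table[M];

void table_init(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
}

int hash_insert(int key) {             /* returns the slot used, -1 if full */
    int pos = key % M;
    for (int probes = 0; probes < M; probes++) {
        if (table[pos] == EMPTY) { table[pos] = key; return pos; }
        pos = (pos + 1) % M;           /* open addressing: next position */
    }
    return -1;
}
```

Inserting 22 and then 33 (both hash to 0 since 22 mod 11 = 33 mod 11 = 0) places 33 in slot 1; a subsequent key hashing to slot 1 is pushed along to slot 2, showing how clusters of occupied slots form.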
Internal Hashing (cont’d)
To reduce overflow records, a hash file is
typically kept 70-90% full
The hash function h should distribute the
records uniformly among the buckets
Otherwise, search time will be increased
because many overflow records will exist
Figure 16.8 Internal hashing data structures. (a) Array of M positions for use in internal hashing. (b) Collision resolution by chaining records.
External Hashing Files
Hashing for disk files is called external hashing.
The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M−1, each of which holds multiple records.
Typically, a bucket corresponds to one (or a fixed number of contiguous) disk blocks.
The hashing function maps a key into a relative bucket number.
A table maintained in the file header converts the bucket number into the corresponding disk block address.
Collisions occur when a new record hashes to a bucket that is already
full
An overflow file is kept for storing such records
Overflow records that hash to each bucket can be linked together
The pointers in the linked list should be record pointers, including both a
block address and a relative record position within the block
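The header's bucket-to-block mapping can be sketched as a small lookup table; the block addresses below are made-up placeholders for what the file header would actually store:

```c
#include <assert.h>

#define M 4   /* number of buckets; an illustrative choice */

/* External hashing: h yields a relative bucket number, and a table
 * kept in the file header converts it to a disk block address. */
static const long bucket_to_block[M] = { 7010, 7040, 7060, 7220 };

long block_address(long key) {
    return bucket_to_block[key % M];   /* hash, then translate */
}
```

Retrieving a record by hash key thus usually costs a single block access: one hash computation, one in-memory table lookup, one block read (plus overflow chasing only when the bucket has overflowed).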
Hashed Files
Figure 16.9 Matching bucket numbers to disk block addresses.
Figure 16.10 Handling overflow for buckets by chaining.
External Hashing Files (cont’d)
Static hashing: a fixed number of buckets M is allocated.
Main disadvantages of static external hashing:
Fixed number of buckets M is a problem if the number
of records in the file grows or shrinks
Can result in either a lot of unused space or an inefficient
record access due to the long lists of overflow records
Ordered access on the hash key is quite inefficient
(requires sorting the records)
A dynamic file organization allows the number of buckets to vary dynamically with only localized reorganization
Dynamic and Extendible Hashed Files
Hashing techniques are adapted to allow the
dynamic growth and shrinking of the number of
file records
These techniques include the following:
dynamic hashing, extendible hashing, and
linear hashing
Dynamic and Extendible Hashed Files (cont’d)
Both dynamic and extendible hashing use
the binary representation of the hash
value h(K) in order to access a directory
In dynamic hashing the directory is a binary
tree
In extendible hashing the directory is an array of size 2^d, where d is called the global depth.
Dynamic and Extendible Hashed Files (cont’d)
The directories can be stored on disk, and
they expand or shrink dynamically
Directory entries point to the disk blocks that
contain the stored records
An insertion in a disk block that is full
causes the block to split into two blocks
and the records are redistributed among
the two blocks
The directory is updated appropriately
Extendible Hashing
Figure 16.11 Structure of the extendible hashing scheme.
Dynamic and Extendible Hashed Files (cont’d)
Examples of extendible hashing (consider the figure in the previous slide)
Insert a record into the bucket whose hash values start with 01 (the third bucket)
The records will be distributed between two buckets, 010 and 011, both with d’ = 3
Insert a record into the bucket whose hash values start with 111 (the last bucket)
Two new buckets 1110 and 1111 (d’ = 4) need a directory with d = 4
Thus, the size of the directory is doubled, and each of the other original locations in the directory is also split into two locations, both of which have the same pointer value as the original location
The main advantages of extendible hashing
The performance of the file does not degrade as the file grows
Additional buckets can be allocated dynamically as needed
The space overhead for the directory table is negligible
Splitting causes only minor (localized) reorganization
The disadvantage of extendible hashing
Two block accesses are needed instead of one in static hashing
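The directory doubling and bucket splitting described above can be sketched in Python. This is an illustrative toy, not the textbook's implementation: it assumes integer keys, a hypothetical 32-bit multiplicative hash, and a made-up bucket capacity of two records so that splits are easy to observe.

```python
# Toy extendible hashing: a directory of 2**global_depth entries indexed by
# the high-order bits of a 32-bit hash; each bucket carries a local depth d'.
BUCKET_CAPACITY = 2  # tiny, so splits and directory doubling are visible

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []                     # list of (key, value) pairs

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _hash(self, key):
        # Knuth multiplicative hash: spreads integer keys over the top bits
        return (key * 2654435761) & 0xFFFFFFFF

    def _index(self, key):
        return self._hash(key) >> (32 - self.global_depth)

    def _bit(self, key, depth):
        # depth-th bit of the hash, counted from the most significant end
        return (self._hash(key) >> (32 - depth)) & 1

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.records) < BUCKET_CAPACITY:
                bucket.records.append((key, value))
                return
            self._split(bucket)               # bucket full: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # double the directory; each entry is duplicated, pointers kept
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        old, bucket.records = bucket.records, []
        for k, v in old:                      # redistribute on the new bit
            target = new_bucket if self._bit(k, bucket.local_depth) else bucket
            target.records.append((k, v))
        # redirect the directory entries whose distinguishing bit is 1
        shift = self.global_depth - bucket.local_depth
        for j, b in enumerate(self.directory):
            if b is bucket and (j >> shift) & 1:
                self.directory[j] = new_bucket

    def search(self, key):
        for k, v in self.directory[self._index(key)].records:
            if k == key:
                return v
        return None

demo = ExtendibleHash()
for k in range(1, 17):
    demo.insert(k, k * 10)
```

Note that inserting into a full bucket splits only that bucket; the directory doubles only when the full bucket's local depth already equals the global depth, matching the behavior described in the example above.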
Slide 16- 62
Dynamic
Hashing
Figure 16.12 Structure of the dynamic hashing scheme.
Slide 16- 63
Dynamic and Extendible Hashed
Files (cont’d)
Dynamic and extendible hashing do not
require an overflow area
Linear hashing does require an overflow
area but does not use a directory
Blocks are split in linear order as the file
expands
Slide 16- 64
Linear Hashing
The idea behind linear hashing is to expand and shrink the
number of buckets dynamically without needing a
directory.
Suppose that the file starts with M buckets numbered
from 0, 1, …, M-1
Let hash function h(K) = K mod M be the initial hash
function hi
An overflow chain for each bucket is still required for
collision resolution
When a collision leads to an overflow record in any file
bucket, the first bucket in the file – bucket 0 – is split into
two buckets: bucket 0 and bucket M
Slide 16- 65
Linear Hashing (cont’d.)
The records originally in bucket 0 are distributed between new buckets 0 and M based on the hash function hi+1(K) = K mod 2M
As further collisions lead to overflow records, additional buckets are split in the linear order 1, 2, 3, …, and eventually the file has 2M instead of M buckets, and all the buckets use the hash function hi+1(K) = K mod 2M
Hence, the records in overflow are eventually redistributed into regular buckets, using hi+1 via a delayed split of their buckets
A value n, initially set to 0 and incremented by 1 whenever a split occurs, is needed to determine which buckets have been split.
To retrieve a record with hash key K, first apply hi to K; if hi(K) < n, then apply hi+1 to K
n grows linearly as buckets are split; when n = M, reset n to 0
Any new collisions that cause overflow lead to the use of a new hash function hi+2(K) = K mod 4M
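The split-in-linear-order scheme above can be sketched as a small Python class. This is an illustrative sketch, not the textbook's code: keys are integers, a plain Python list stands in for each bucket together with its overflow chain, and M and the bucket capacity are made-up parameters.

```python
# Toy linear hashing: no directory; bucket n is split whenever any bucket
# overflows, and buckets below n are addressed with h_{i+1}(K) = K mod 2M.
class LinearHash:
    def __init__(self, M=4, capacity=2):
        self.M = M                      # h_i(K) = K mod M for this round
        self.n = 0                      # next bucket to split
        self.capacity = capacity
        self.buckets = [[] for _ in range(M)]  # bucket + its overflow chain

    def _address(self, key):
        b = key % self.M                # h_i
        if b < self.n:                  # bucket b was already split,
            b = key % (2 * self.M)      # so use h_{i+1}
        return b

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:     # overflow: split bucket n,
            self._split()                   # which need not be this bucket

    def _split(self):
        self.buckets.append([])             # new bucket number n + M
        old, self.buckets[self.n] = self.buckets[self.n], []
        for key in old:                     # redistribute with h_{i+1}
            self.buckets[key % (2 * self.M)].append(key)
        self.n += 1
        if self.n == self.M:                # all original buckets split:
            self.M *= 2                     # start the next round, with
            self.n = 0                      # h_{i+1} as the new h_i

    def search(self, key):
        return key in self.buckets[self._address(key)]

lh = LinearHash(M=4, capacity=2)
for k in [3, 7, 11, 15, 19, 23, 2, 6, 10]:
    lh.insert(k)
```

In this run, the six keys congruent to 3 mod 4 force four splits, so n wraps back to 0 and the file enters the next round with 2M = 8 buckets, exactly the progression described above.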
Slide 16- 66
Linear Hashing (cont’d.)
In general, a sequence of hash functions hi+j(K) = K mod (2^j M) is used, where j = 0, 1, 2, …; a new hash function hi+j+1 is needed whenever all the buckets 0, 1, …, (2^j M) - 1 have been split and n is reset to 0.
Splitting can be controlled by monitoring the file load factor instead of splitting whenever an overflow occurs.
The file load factor l can be defined as l = r/(bfr*N), where r is the current number of file records, bfr is the blocking factor (records per bucket), and N is the current number of file buckets
Buckets that have been split can also be recombined linearly if the load of the file falls below a certain threshold.
The file load can be used to trigger both splits and combinations so that the file load can be kept within a desired range, say 0.7~0.9
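As a quick worked example of the load factor formula (the numbers here are made up for illustration):

```python
# Hypothetical file: r records, bfr records per bucket, N buckets
r, bfr, N = 3000, 10, 400
load = r / (bfr * N)            # l = r / (bfr * N) = 3000 / 4000 = 0.75
in_range = 0.7 <= load <= 0.9   # within the desired range: no split/combine
```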
Slide 16- 67
Parallelizing Disk Access using RAID
Technology
Secondary storage technology must take steps to
keep up in performance and reliability with
processor technology
A major advance in secondary storage technology is
represented by the development of RAID, which
originally stood for Redundant Arrays of
Inexpensive Disks
The main goal of RAID is to even out the widely
different rates of performance improvement of disks
against those in memory and microprocessors
Slide 16- 68
Parallelizing Disk Access using RAID
Technology (cont’d.)
A natural solution is a large array of small independent disks acting as a single higher-performance logical disk.
A concept called data striping is used, which utilizes parallelism to improve disk performance.
Distribute data transparently over multiple disks to make them appear as a single large, fast disk
Provide high overall transfer rate and also accomplish load balancing
Improve reliability by storing redundant information on disks using parity or some other error correction code
Slide 16- 69
Improving Reliability with RAID
Solutions to disk reliability problem
Employ redundancy of data so that disk failure can be tolerated
For an array of n disks, the likelihood of failure is n times as much as that for one disk. An obvious solution is to employ redundant data
The disadvantages include additional I/O, extra computation to maintain redundancy and to do recovery from errors, additional disk capacity to store redundant information
Use mirroring or shadowing technique to introduce redundancy
Store extra information that can be used to reconstruct the lost information in case of disk failure
The incorporation of redundancy must consider two problems:
selecting a technique, such as error correction codes involving parity bits, for computing the redundant information
selecting a method of distributing the redundant information across the disk array (store the redundant information on a small number of disks or distribute it uniformly across all disks)
Slide 16- 70
Improving Performance with RAID
The disk array employs the technique of data striping to
achieve a higher transfer rate.
Data striping can be applied at different granularities
Bit-level data striping
Splitting a byte of data and writing bit j to the jth disk
Can be generalized to a number of disks that is either a
multiple or a factor of eight
Block-level data striping
Multiple independent small requests (that access single
blocks) can be serviced in parallel by separate disks, reducing
the queuing time of I/O requests
Large requests (accessing multiple blocks) can be parallelized
to reduce the response time
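Round-robin block-level striping can be sketched with a one-line mapping (an illustrative convention; actual disk and block numbering varies by implementation):

```python
# Block-level striping: logical block b lives on disk b mod n at offset b // n,
# so a "large request" for one stripe of n consecutive blocks touches every
# disk exactly once and the transfers can proceed in parallel.
def locate(logical_block, n_disks):
    return logical_block % n_disks, logical_block // n_disks

# logical blocks 0..7 across 4 disks: blocks 0-3 form stripe 0, 4-7 stripe 1
placement = [locate(b, 4) for b in range(8)]
```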
Slide 16- 71
Parallelizing Disk Access using RAID
Technology (cont’d.)
Slide 16- 72
Figure 16.13 Striping of data across multiple disks. (a) Bit-level striping across four disks. (b) Block-level striping across four disks.
RAID Organization and Levels
Different RAID organizations were defined based on
different combinations of two factors: the granularity of
data interleaving (striping) and the pattern used to compute
redundant information
RAID level 0 has no redundant data and hence has the best write
performance, at the risk of data loss
RAID level 1 uses mirrored disks
RAID level 2 uses memory-style redundancy by using Hamming codes,
which contain parity bits for distinct overlapping subsets of components.
Level 2 includes both error detection and correction.
RAID level 3 uses a single parity disk, relying on the disk controller to
figure out which disk has failed
RAID levels 4 and 5 use block-level data striping, with level 5 distributing
data and parity information across all disks
RAID level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by using
just two redundant disks
Slide 16- 73
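The parity-based redundancy used by levels 3 through 5 can be illustrated with bitwise XOR (a sketch using made-up two-byte "blocks"; real arrays compute parity over full disk blocks):

```python
from functools import reduce

def parity(blocks):
    # XOR corresponding bytes of all blocks; with data blocks d0..dk and
    # parity p = d0 ^ ... ^ dk, any single lost block is the XOR of the rest
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x0f\x10", b"\xf0\x01", b"\x33\x44"
p = parity([d0, d1, d2])            # stored on the parity disk (level 3)
rebuilt = parity([d0, d2, p])       # disk holding d1 fails: rebuild it
```

The same XOR identity is why the controller only needs to know *which* disk failed: the surviving blocks of each stripe determine the missing one.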
Use of RAID Technology
Different RAID organizations are used in different
situations
RAID level 1 (mirrored disks) is the easiest for rebuilding a
disk from other disks
• It is used for critical applications like logs
RAID level 2 uses memory-style redundancy by using
Hamming codes, which contain parity bits for distinct
overlapping subsets of components
• Level 2 includes both error detection and correction
RAID level 3 (single parity disk relying on the disk
controller to figure out which disk has failed) and level 5
(block-level data striping) are preferred for large-volume
storage, with level 3 giving higher transfer rates
Slide 16- 74
Use of RAID Technology (cont’d.)
The most popular uses of RAID technology currently are:
Level 0 (with striping), Level 1 (with mirroring), and Level
5 with an extra drive for parity
Design Decisions for RAID include:
Level of RAID, number of disks, choice of parity schemes,
and grouping of disks for block-level striping
Slide 16- 75
Use of RAID Technology (cont’d.)
Slide 16- 76
Use of RAID Technology (cont’d.)
Figure 16.14 Some popular levels of RAID. (a) RAID level 1: Mirroring of data on two disks. (b) RAID level 5: Striping of data with distributed parity across four disks.
Slide 16- 77
Storage Area Networks
The demand for higher storage has risen considerably in
recent times
The growth of e-commerce, ERP, and data warehousing makes the
demand for data storage increase substantially
The total cost of managing all data is growing so rapidly that in many
cases the cost of managing server-attached storage exceeds the
cost of the server itself
Organizations have a need to move from a static fixed
data center oriented operation to a more flexible and
dynamic infrastructure for information processing
Many users of RAID systems cannot use the capacity
effectively because it has to be attached in a fixed manner
to one or more servers
Slide 16- 78
Storage Area Networks (cont’d.)
Thus, organizations are moving to the concept of Storage
Area Networks (SANs)
In a SAN, online storage peripherals are configured as
nodes on a high-speed network and can be attached
and detached from servers in a very flexible manner
This allows storage systems to be placed at longer
distances from the servers and provide different
performance and connectivity options
Slide 16- 79
Storage Area Networks (cont’d.)
Advantages of SANs are:
Flexible many-to-many connectivity among servers and
storage devices using fiber channel hubs and switches
Up to 10km separation between a server and a storage
system using appropriate fiber optic cables
Better isolation capabilities allowing non-disruptive
addition of new peripherals and servers
SANs face the problem of combining storage options from
multiple vendors and dealing with evolving standards of
storage management software and hardware
Slide 16- 80
Network-Attached Storage
With the phenomenal growth in digital data, particularly generated from multimedia and other enterprise applications, the need for high-performance storage solutions at low cost has become extremely important
Network-Attached Storage (NAS)
NAS allows the addition of storage for file sharing. NAS devices allow vast amounts of hard disk storage space to be added to a network and can make that space available to multiple servers without shutting them down for maintenance and upgrades
NAS devices are being deployed as a replacement to traditional file servers
NAS systems strive for reliable operation and easy administration
Slide 16- 81
Network-Attached Storage (cont’d.)
SAN vs. NAS
SANs often utilize Fiber Channel rather than
Ethernet
A SAN often incorporates multiple network devices
or endpoints on a self-contained or private LAN
NAS relies on individual devices connected
directly to the existing public LAN
NAS systems claim greater operating system
independence of clients
Slide 16- 82
iSCSI Storage Systems
iSCSI (Internet SCSI)
A new protocol
When a DBMS needs to access data, the operating system generates the
appropriate SCSI commands and data request. A packet header is added
before the resulting IP packets are transmitted over Ethernet. On receipt, the
packet is disassembled, separating the SCSI commands and request.
Allows clients (called initiators) to send SCSI commands to SCSI storage
devices on remote channels
Does not require the special cabling needed by Fiber Channel and can run
over longer distances using existing network infrastructure
By carrying SCSI commands over IP networks, iSCSI facilitates data
transfer over intranets and manages storage over long distances
iSCSI storage has mainly impacted small- and medium-sized
businesses because of its combination of simplicity, low cost, and the
functionality of iSCSI devices.
Slide 16- 83
iSCSI Storage Systems (cont’d.)
Two approaches to transmitting storage data over IP networks
iSCSI
Fiber Channel over IP (FCIP)
Translates Fiber Channel control codes and data into IP packets for
transmission
The protocol is also known as Fiber Channel tunneling or storage
tunneling
Fiber Channel over Ethernet (FCoE)
Can be thought of as iSCSI without the IP
Uses many elements of SCSI and FC (just like iSCSI), but does not
include the TCP/IP components
Promises excellent performance, especially on 10 Gigabit Ethernet
(10GbE), and is relatively easy for vendors to add to their products
Slide 16- 84
Summary
Disk Storage Devices
Files of Records
Operations on Files
Unordered Files
Ordered Files
Hashed Files
Dynamic and Extendible Hashing Techniques
RAID Technology
Slide 16- 85