Storage and File Structure

Overview of Physical Storage Media

Redundant Arrays of Independent Disks (RAID)

Buffer Management

File Organization

Properties of Physical Storage Media

Types of storage media differ in terms of:

Speed of data access

Cost per unit of data

Reliability:
• Data loss on power failure or system (software) crash
• Physical failure of the storage device

=> What is the value of data (e.g., salon example)?

Classification of Storage Media

Storage can be classified as:

Volatile: content is lost when the power is switched off
• Includes primary storage (cache, main memory)

Non-volatile: contents persist even when the power is switched off
• Includes secondary and tertiary storage, plus battery-backed RAM

Storage Hierarchy, Cont.

Storage can also be classified as:

Primary storage: fastest media
• Volatile
• Includes cache and main memory

Secondary storage: moderately fast access time
• Non-volatile
• Includes flash memory and magnetic disks
• Also called on-line storage

Tertiary storage: slow access time
• Non-volatile
• Includes magnetic tape and optical storage
• Also called off-line storage

Primary Storage

Cache: Volatile

Fastest and most costly form of storage

Managed by the computer system hardware

Proprietary

Main memory (also known as RAM): Volatile

Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)

Generally too small (or too expensive) to store the entire database

=> The above comment notwithstanding, main memory databases do exist, especially for clusters.

Physical Storage Media, Cont.

Flash memory: Non-volatile

Widely used in embedded devices such as digital cameras, and in SD cards, USB drives, and solid-state drives

Data can be written at a location only once, but location can be erased and written to again

Can support only a limited number of write/erase cycles

Reads are roughly as fast as main memory

Writes are slow (a few microseconds); erase is slower

Cost per unit of storage roughly similar to main memory

Physical Storage Media, Cont.

Magnetic disk: Non-volatile
• Survives power failures and system (software) crashes
• Failure can destroy data, but such failures are very rare (or are they?)

Primary medium for database storage
• Typically stores the entire database
• Capacities of individual disks are currently in the 100s of GBs to TBs
• Much larger capacity than main or flash memory

Direct/random-access, unlike magnetic tape, which is sequential access.

Physical Storage Media, Cont.

Optical storage: Non-volatile

Data is read optically from a spinning disk using a laser

CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms

Write-once, read-many (WORM) optical disks used for archival storage (CD-R and DVD-R)

Multiple write versions available (CD-RW, DVD-RW, and DVD-RAM)

Reads and writes are slower than with magnetic disk

Physical Storage Media, Cont.

Tape storage: Non-volatile

Data is accessed sequentially, hence known as sequential-access storage

Much slower than a disk

Very high capacity (40 to 300 GB tapes available)

Storage costs are much lower than for disk, but high-quality drives can be very expensive

Used mainly for backup (recover from disk failure), archive and transfer of very large amounts of data

Physical Storage Media, Cont.

Jukeboxes: for storing massive amounts of data

• Hundreds of terabytes (1 terabyte = 10^12 bytes), petabytes (1 petabyte = 10^15 bytes), or even exabytes (1 exabyte = 10^18 bytes)

Large number of removable disks or tapes

A mechanism for automatic loading/unloading of disks or tapes

Storage Hierarchy

(Diagram: the storage hierarchy, ranked by speed, cost, and volatility.)

Magnetic Hard Disk Mechanism

NOTE: Diagram is only a simplification of actual disk drives

Magnetic Disks

The disk assembly consists of:
• A single spindle that spins continually (typically at 7,500 or 10,000 RPM)
• Multiple disk platters (typically 2 to 5)

The surface of each platter is divided into circular tracks:
• Over 16,000 tracks per platter on typical hard disks

Each track is divided into sectors:
• A sector is the smallest unit of data that can be read or written
• Typically 512 bytes
• Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks)

Head-disk assemblies:
• One head per platter, mounted on a common arm, each flying very close to its platter
• Reads or writes magnetically encoded information

"Cylinder i" consists of the ith track of all the platters.

Magnetic Disks, Cont.

Disks are the primary performance bottleneck in a database system, in part because of the need for physical movement.

To read/write a sector:
• The disk arm swings to position the head over the right track: seek time (4-10 ms)
• The sector rotates under the read/write head: rotational latency (4-11 ms)
• Data is read/written as the sector passes under the head: transfer rate (25-100 MB/s)

Access time - The time it takes from when a read or write request is issued to when data transfer begins, i.e., seek time + rotational latency.

Performance Measures of Disks, Cont.

Multiple disks may share an interface controller, so the rate at which the interface controller can handle data is also important:
• ATA-5: 66 MB/s, SCSI-3: 40 MB/s, Fibre Channel: 256 MB/s

Mean time to failure (MTTF): the average time the disk is expected to run continuously without any failure
• Typically 3 to 5 years
• The probability of failure of new disks is quite low; MTTFs of 30,000 to 1,200,000 hours are claimed
• An MTTF of 1,200,000 hours for a new disk means that given 1,000 relatively new disks, on average one will fail every 1,200 hours
• MTTF decreases as the disk ages

Magnetic Disks (Cont.)

The term “controller” is used (primarily) in two ways, both in the book and other literature; the book does not distinguish between the two very well.

Disk controller (use #1):
• Packaged within the disk
• Accepts high-level commands to read or write a sector
• Initiates actions such as moving the disk arm to the right track and actually reading or writing the data
• Computes and attaches checksums to each sector for reliability
• Performs re-mapping of bad sectors

Disk Subsystem

Interface controller (use #2):
• Multiple disks are typically connected to the system bus through an interface controller (i.e., a host adapter)
• Many functions (checksums, bad-sector re-mapping) are often carried out by individual disk controllers, which reduces the load on the interface controller

Disk Subsystem

The distribution of work between the disk controller and interface controller depends on the interface standard.

Disk interface standard families:
• ATA (AT Attachment) / IDE
• SCSI (Small Computer System Interface)
• Fibre Channel, etc.
• Several variants of each standard exist (different speeds and capabilities)

One computer can have many interface controllers, of the same or different type

=> Like disks, interface controllers are also a performance bottleneck.

Techniques for Optimization of Disk-Block Access

Several techniques are employed to minimize the negative effects of disk and controller bottlenecks.

File organization: organize data on disk based on how it will be accessed
• Store related information (e.g., data in the same file) on the same or nearby cylinders.

A disk may become fragmented over time:
• As data is inserted into and deleted from the disk, free "slots" tend to become scattered
• Data items (e.g., files) are then broken up and scattered over the disk
• Sequential access to fragmented data results in increased disk-arm movement

Most DBMSs have utilities to de-fragment the file system and, more generally, to manipulate the file organization of a database.

Techniques for Optimization of Disk-Block Access

Block size adjustment: A block is a contiguous sequence of sectors from a single track

A DBMS transfers data between disk and main memory in blocks

Sizes range from 512 bytes to several kilobytes

• Smaller blocks: more transfers from disk

• Larger blocks: may waste space due to partially filled blocks

• Typical block sizes range from 4 to 16 kilobytes

Block size can be adjusted to accommodate workload

Disk-arm-scheduling algorithms: Order pending accesses so that disk arm movement is minimized

One example is the "elevator algorithm," sketched below.
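A minimal Python sketch of the elevator (SCAN) idea, assuming pending requests are given simply as cylinder numbers; the function and its arguments are illustrative, not a real driver API:

def elevator_order(pending, head, direction=1):
    # Serve requests in the current direction of arm travel, then reverse.
    # direction is +1 (toward higher cylinders) or -1 (toward lower ones).
    ahead = sorted(c for c in pending if (c - head) * direction >= 0)
    behind = sorted((c for c in pending if (c - head) * direction < 0),
                    reverse=True)
    if direction < 0:                  # sweeping inward: descending first
        ahead, behind = ahead[::-1], behind[::-1]
    return ahead + behind

# Head at cylinder 53, moving outward: 65, 67, 98, 122, then back to 37, 14.
print(elevator_order([98, 122, 14, 65, 67, 37], head=53))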


Techniques for Optimization of Disk-Block Access, Cont.

Log disk: a disk devoted to the transaction log
• Writes to a log disk are very fast, since no seeks are required
• Often provided with battery back-up

Non-volatile write buffers/RAM: battery-backed RAM or flash memory
• Typically associated with a disk or storage device
• Blocks to be written are first written to the non-volatile RAM buffer
• The disk controller writes the data to disk whenever the disk has no other requests
• Allows processes to continue without waiting on write operations
• Data is safe even in the event of power failure
• Allows writes to be reordered to minimize disk-arm movement

RAID

Redundant Arrays of Independent Disks (RAID): disk organization techniques that exploit a collection of disks
• Speed, reliability, or both are improved
• The collection appears as a single disk to the system
• Originally, the "I" stood for "inexpensive" disks

Ironically, using multiple disks increases the risk of disk failure (though not necessarily of data loss): a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1,000 hours (approximately 41 days).

Improvement of Reliability via Redundancy

RAID improves reliability by using two techniques:
• Parity information
• Mirroring

Parity Information (basic): Makes use of the exclusive-or operator

1 ⊕ 1 ⊕ 0 ⊕ 1 = 1        0 ⊕ 1 ⊕ 1 ⊕ 0 = 0

Enables the detection of single-bit errors

Could be applied to a collection of disks
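A minimal sketch in Python of how XOR parity is computed and used for recovery; the one-byte "disks" are toy values, not a real storage interface:

from functools import reduce

def parity(blocks):
    # Bitwise XOR across same-sized byte blocks, one block per disk.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

disks = [b"\x0b", b"\x0d", b"\x06"]     # toy one-byte "disks"
p = parity(disks)                       # stored on a parity disk

# If disk 1 is lost, XOR-ing the survivors with the parity block rebuilds it:
assert parity([disks[0], disks[2], p]) == disks[1]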

Improvement of Reliability via Redundancy

Mirroring: for every disk, keep a duplicate copy.
• Every write is carried out on both disks (in parallel)
• If one disk in a pair fails, the data is still available on the other
• Different reads can take place from either disk and, in particular, from both disks at the same time

Data loss occurs only if (1) a disk fails, and (2) its mirror fails before the first is repaired:
• The probability of both events occurring is very small
• If the MTTF is 100,000 hours and the mean time to repair is 10 hours, then the mean time to data loss is approximately 57,000 years for a mirrored pair of disks
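The 57,000-year figure follows from the standard approximation mean time to data loss ≈ MTTF^2 / (2 × MTTR), which assumes independent failures; a quick check of the arithmetic:

mttf = 100_000                        # hours until a single disk fails
mttr = 10                             # hours to replace and re-copy a disk
mtdl_hours = mttf ** 2 / (2 * mttr)   # = 5 * 10**8 hours
print(mtdl_hours / (24 * 365))        # ≈ 57,000 years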

Improvement in Performance via Parallelism

RAID improves performance mainly through the use of parallelism.

Two main goals of parallelism in a disk system:
• Parallelize large accesses to reduce response time (transfer rate)
• Load-balance multiple small accesses to increase throughput (I/O rate)

Parallelism is achieved primarily through striping, but also through mirroring.

Striping can be done at varying levels:
• bit-level
• block-level (variable size)

Improvement in Performance via Parallelism

Bit-level striping:
• Abstractly, the data to be stored can be thought of as a list of bytes
• Suppose there are eight disks; write bit i of each byte to disk i
• In theory, each access can read data at eight times the rate of a single disk (reduces transfer time, and hence improves response time)
• Each byte read or written ties up all 8 disks (reduces throughput)
• Not commonly used, at least as described

Block-level striping:
• Abstractly, the data to be stored can be thought of as a list of blocks, numbered 0..(k-1)
• Suppose there are n disks, numbered 0..(n-1)
• Store block i of a file on disk (i mod n), as in the sketch below
• Requests for different blocks can run in parallel if the blocks reside on different disks (improves throughput)
• A request for a long sequence of blocks can use all disks in parallel (improves response time)
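The mapping itself is one line of arithmetic; in this sketch the within-disk offset (i // n) is an illustrative assumption about how a striped file might be laid out:

def place_block(i, n_disks):
    # Block-level striping: block i lives on disk (i mod n),
    # at stripe offset (i // n) within that disk.
    return i % n_disks, i // n_disks

# With 4 disks, blocks 0..7 land on disks 0, 1, 2, 3, 0, 1, 2, 3:
print([place_block(i, 4) for i in range(8)])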

RAID Levels

Different RAID organizations, or RAID “levels,” have differing cost, performance and reliability characteristics.

Levels vary by source and vendor:
• Standard levels: 0, 1, 2, 3, 4, 5, 6
• Nested levels: 0+1, 1+0, 5+0, 5+1
• Several other, non-standard, vendor-specific levels exist as well
• Our book covers: 0, 1, 2, 3, 4, 5

Performance and reliability do not increase monotonically with the level number.

It is helpful to compare each level to every other level, and to the single-disk option.

RAID Levels

RAID Level 0: Block-level striping.
• Used in high-performance applications where data loss is not critical

RAID Level 1: Mirrored disks with block striping.
• Offers the best write performance, according to the authors
• Sometimes called 0+1, 1+0, 01, or 10

RAID Levels, Cont.

RAID Level 2: Memory-style Error-Correcting Codes (ECC) with bit-level striping.
• Parity information is used to detect, locate, and correct errors
• Outdated: disks now contain embedded functionality to detect and locate sector errors, so only a single parity disk is needed (for correction)
• Also, disks don't read/write individual bits

RAID Levels, Cont.

RAID Level 3: Bit-level striping, with parity.
• Individual disks report errors
• To recover the data on a damaged disk, compute the XOR of the bits from the other disks (including the parity disk)
• Faster response time than with a single disk, but fewer I/Os per second, since every disk has to participate in every I/O
• When writing data, the corresponding parity bits must also be computed and written to a parity disk
• Subsumes Level 2 (provides all its benefits, at lower cost)

RAID Levels (Cont.)

RAID Level 4: Block-level striping, with parity.
• Individual disks report errors.
• To recover the data on a damaged disk, compute the XOR of the bits from the other disks (including the parity disk).
• When writing a data block, the corresponding block of parity bits must also be recomputed and written to the parity disk:
  - For a single block write, use the old parity block, the old value of the current block, and the new value of the current block (2 block reads + 2 block writes), as sketched below.
  - For a large sequential write, compute the parity directly from the new values of the blocks corresponding to the parity block.
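A sketch of the single-block ("read-modify-write") parity update, with blocks as Python bytes for illustration; the key identity is that the new parity equals the old parity XOR the old data XOR the new data, so no other data disk is touched:

def small_write_parity(old_parity, old_block, new_block):
    # 2 block reads (old data, old parity) + 2 block writes (new data, new parity)
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_block, new_block))

# Updating disk 0 gives the same parity as recomputing it from scratch:
full = lambda blocks: bytes(a ^ b ^ c for a, b, c in zip(*blocks))
d = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6])]
new_d0 = bytes([9, 9])
assert small_write_parity(full(d), d[0], new_d0) == full([new_d0, d[1], d[2]])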

RAID Levels (Cont.)

RAID Level 4 (cont.):
• Provides higher I/O rates for independent block reads than Level 3
• Provides higher transfer rates for large, multi-block reads and writes than a single disk
• The parity disk is a bottleneck for independent block writes

RAID Levels (Cont.)

RAID Level 5: Block-level striping with distributed parity; partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk.
• For example, with 5 disks, the parity block for the ith set of N blocks is stored on disk (i mod 5), with the data blocks stored on the other 4 disks.

RAID Levels (Cont.)

RAID Level 5 (cont.):
• Provides the same benefits as Level 4, but minimizes the parity-disk bottleneck
• Higher I/O rates than Level 4

RAID Level 6: P+Q redundancy scheme; similar to Level 5, but uses additional redundant information to guard against multiple disk failures.
• Better reliability than Level 5, at a higher cost; not used as widely

Choice of RAID Level

The “best” level for any particular application is not always clear.

Factors in choosing a RAID level:
• Monetary cost of disks, controllers, or RAID storage devices
• Reliability requirements
• Performance requirements:
  - throughput vs. response time
  - short vs. long I/Os
  - reads vs. writes
  - performance during failure and during rebuilding of a failed disk

Choice of RAID Level, Cont.

The book's analysis:

RAID 0 is used only when (data) reliability is not important.

Levels 2 and 4 are never used, since they are subsumed by Levels 3 and 5.

Level 3 is typically not used, since bit-level striping forces single-block reads to access all disks, wasting disk-arm movement. Vendors sometimes advertise a version of Level 3 that uses byte-level striping.

Level 6 is rarely used, since Levels 1 and 5 offer adequate safety for almost all applications.

So the competition is between Levels 1 and 5 only.

Choice of RAID Level (Cont.)

Level 1 provides much better write performance than Level 5:
• Level 5 requires at least 2 block reads and 2 block writes to write a single block
• Level 1 requires only 2 block writes

Level 1 has a higher storage cost than Level 5, however:
• I/O requirements have increased greatly, e.g., for Web servers
• When enough disks have been bought to satisfy the required I/O rate, they often have spare storage capacity
• In that case there is often no extra monetary cost for Level 1!

Level 5 is preferred for applications with low update rates and large amounts of data.

Level 1 is preferred for all other applications.

Other RAID Issues

Hardware vs. software RAID.

Local storage buffers.

Hot-swappable disks.

Spare components: disks, fans, power supplies, controllers.

Battery backup.

Storage Access

A DBMS will typically have several files allocated to it for storage, which are (usually) formatted and managed by the DBMS. Each file typically maps to either a disk, disk partition or storage device.

Each file is partitioned into blocks (a.k.a, pages). A block consists of one or more contiguous sectors.

A block is the smallest unit of DBMS storage allocation & transfer.

Each block is partitioned into records.

Each record is partitioned into fields.

Buffer Management

The DBMS transfers blocks of data between main memory and disk, in a manner similar to an operating system.

The DBMS seeks to minimize the number of block transfers between the disk and memory.

The portion of main memory available to store copies of disk blocks is called the buffer (a.k.a. the cache).

The DBMS subsystem responsible for allocating buffer space in main memory is called the buffer manager.

Buffer Manager Algorithm

Programs call the buffer manager when they need a (disk) block.

If a requested block is already in the buffer then the requesting program is given the address of the block in main memory.

If the block is not in the buffer, then the buffer manager:
• Allocates buffer space for the block, throwing out another block if required
• Writes the thrown-out block back to disk if it has been modified since it was last retrieved
• Reads the requested block from disk into the buffer once space is available
• Passes the address of the block in main memory to the requester
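A minimal Python sketch of this loop, using an LRU pool with write-back on eviction; the disk object and its read/write methods are hypothetical stand-ins, and pinning is omitted for brevity:

from collections import OrderedDict

class BufferManager:
    def __init__(self, disk, capacity):
        self.disk, self.capacity = disk, capacity
        self.pool = OrderedDict()                  # block_id -> [data, dirty]

    def get_block(self, block_id):
        if block_id in self.pool:                  # hit: just refresh recency
            self.pool.move_to_end(block_id)
        else:                                      # miss: make room, read in
            if len(self.pool) >= self.capacity:
                victim, (data, dirty) = self.pool.popitem(last=False)
                if dirty:                          # write back modified blocks
                    self.disk.write(victim, data)
            self.pool[block_id] = [self.disk.read(block_id), False]
        return self.pool[block_id][0]              # "address" of the block

    def mark_dirty(self, block_id):
        self.pool[block_id][1] = True              # must be written on eviction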

Buffer-Replacement Policies

How is a block selected for replacement?

Most operating systems use Least Recently Used (LRU).

In a DBMS, LRU can be a bad strategy for certain access patterns.

Queries have well-defined access patterns (such as sequential scans), so a DBMS can predict future references (an OS cannot):
• Mixed strategies are common
• Statistical information can be exploited
• Well-known access patterns can be exploited

Well-known access patterns:
• The data dictionary is frequently accessed, so keep those blocks in the buffer
• Index blocks are used frequently, so keep them in the buffer

Buffer-Replacement Policies, Cont.

Some terminology...

A memory block that has been loaded into the buffer and is not allowed to be replaced is said to be pinned.

At any given time, a buffer block is in one of the following states:
• Unused (free): does not contain a copy of a disk block
• Used, but not pinned: contains a copy of a disk block, which is available for replacement
• Pinned: contains a copy of a disk block that is not available for replacement

Buffer-Replacement Policies, Cont.

Page replacement strategies:
• Most recently used (MRU): the moment a disk block in the buffer becomes unpinned, it becomes the most recently used block
• Least recently used (LRU): of all the unpinned disk blocks in the buffer, replace the one referenced least recently

Buffer-Replacement Policies, Cont.

Recall the borrower and customer relations:

borrower = (customer-name, loan-number)

customer = (customer-name, customer-street, customer-city)

Consider a query that performs a join of borrower and customer:

select customer-name, loan-number, customer-street, customer-city
from borrower, customer
where borrower.customer-name = customer.customer-name;

Suppose that borrower consists of blocks b1, b2, and b3, that customer consists of blocks c1, c2, c3, c4, and c5, and that the buffer can fit five (5) blocks total.

Buffer-Replacement Policies, Cont.

Now consider the following query plan for the query:

Use a nested-loop join algorithm:

for each tuple b of borrower do              // scan borrower sequentially
    for each tuple c of customer do          // scan customer sequentially
        if b[customer-name] = c[customer-name] then begin
            x[customer-name] := b[customer-name];
            x[loan-number] := b[loan-number];
            x[street] := c[street];
            x[city] := c[city];
            include x in the result of the query;
        end if;
    end for;
end for;

Use LRU for block replacement:
• Read in a block of borrower, and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.
• Once all of the tuples from that block have been processed, unpin it.
• If a customer block is to be brought into the buffer, and another block must be moved out in order to make room, then move out the least recently used block (borrower or customer).

Buffer-Replacement Policies, Cont.

Applying the LRU replacement algorithm results in the following for the first two tuples of b1 (b1 is pinned; the buffer holds five blocks):

b1 c1 c2 c3 c4  =>  c5 replaces c1
b1 c5 c2 c3 c4  =>  c1 replaces c2
b1 c5 c1 c3 c4  =>  c2 replaces c3
b1 c5 c1 c2 c4  =>  c3 replaces c4
b1 c5 c1 c2 c3  =>  c4 replaces c5
b1 c4 c1 c2 c3  =>  c5 replaces c1, and so on

Notice that each time a customer block is read into the buffer, the block replaced is the next block required!

This results in a total of six (6) page replacements during the processing of the first two borrower tuples.

Buffer-Replacement Policies, Cont.

Suppose an MRU page replacement strategy is used:
• Read in a block of borrower, and keep it, i.e., pin it, in the buffer until the last tuple in that block has been processed.
• Once all of the tuples from that block have been processed, unpin it.
• If a customer block is to be brought into the buffer, and another block must be moved out in order to make room, then move out the most recently used block (borrower or customer).

Buffer-Replacement Policies, Cont.

Applying the MRU replacement algorithm results in the following for the first two tuples of b1 (b1 is pinned, and so is the customer block currently being scanned):

b1 c1 c2 c3 c4  =>  c5 replaces c4, the most recently unpinned block
b1 c1 c2 c3 c5  =>  on the second pass, c1, c2, c3 are already in the buffer
b1 c1 c2 c4 c5  =>  c4 replaces c3; c5 is then already in the buffer

This results in a total of two (2) page replacements for processing the first two borrower tuples.

Storage Mechanisms

Traditional "low-level" DBMS storage issues:
• Avoiding large numbers of block reads (algorithmically, the bar has been raised)
• Records crossing block boundaries
• Dangling pointers
• Disk fragmentation
• Allocating free space

Storage "mechanisms," which provide partial solutions:
• Record packing
• Free/reserved space lists
• Reserved space
• Record splitting
• Mixed block formatting
• Slotted-page structure

Storage Mechanisms

Recall:
• A database is stored as a collection of files
• A file consists of a sequence of blocks
• A block consists of a sequence of records
• A record consists of a sequence of fields

Assumptions:
• Record size may be fixed or variable
• Record size is usually small relative to the block size, but might be larger
• Each file may have records of one particular type, or different files may be used for different relations
• Sometimes tuples need to be sorted, other times not

Which assumptions apply determines which techniques are appropriate.

Record Packing

Record "packing," the basic approach: pack records as tightly as possible.

Works best for fixed-size records:
• Store record i, where i >= 1, starting at byte n(i - 1), where n is the record size
• Record access is simple, given i
• Add each new record to the end

Options for deleting record i:
• Move records i + 1, ..., n to positions i, ..., n - 1 (shift)
• Move record n to position i
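For fixed-size records, addressing is pure arithmetic; a small sketch (the 40-byte record size is an arbitrary assumption):

RECORD_SIZE = 40                       # n: the fixed record size, in bytes

def record_offset(i):
    # Record i (1-based) starts at byte n * (i - 1).
    return RECORD_SIZE * (i - 1)

def read_record(f, i):
    f.seek(record_offset(i))           # jump straight to the record
    return f.read(RECORD_SIZE)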

Record Packing

For mixed or variable-sized records:
• Attach an end-of-record character to the end of each record
• Pack the records as tightly as possible

Issues:
• Deletion, insertion, and record growth result in (1) lots of record movement or (2) fragmentation
• Record movement potentially results in dangling pointers

Free Lists

Free lists: connect free space (or used space) into a linked list.

Advantages:
• Insertion and deletion can be done with few block I/Os (assuming tuples aren't sorted)
• Records don't have to move, which eliminates dangling pointers

Free lists are used by virtually all DBMSs at the file level.

Reserved Space

Reserved space: use the maximum required space for each record.
• Can be used for repeating fields
• Also for varying-length attributes, e.g., a name field

Issue: could waste significant space.

Record Splitting

Record splitting: a single variable-length record is stored as a collection of smaller records, linked with pointers.
• Can be used even if the maximum record length is not known
• Once again, pointers can be used to chain free and occupied records

Disadvantages:
• Records are more likely to cross block boundaries
• Space is wasted in all records except the first in a chain

Mixed Block Formatting

Mixed block formatting: format blocks differently (typically in different files):
• Anchor blocks
• Overflow blocks

Records are likely to cross block boundaries.

Frequently used for very large attributes such as BLOBs, text, etc.

Slotted Page Structure

A slotted page (i.e., block) is formatted with three sections: header, free space, and records.

Header contents:
• Number of record entries
• Pointer to the end of free space in the block
• Location and size of each record

Key points:
• External pointers refer to header entries rather than directly to records
• Records can be moved within a page to eliminate empty space; the header must be updated
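A sketch of one plausible byte layout in Python (the field widths and little-endian encoding are assumptions, not the book's exact format); because lookups go through the slot entry, a record can move within the page without breaking external pointers:

import struct

HEADER = struct.Struct("<HH")            # (number of slots, end of free space)
SLOT = struct.Struct("<HH")              # per record: (offset, length)

def get_record(page, slot_no):
    n_slots, _free_end = HEADER.unpack_from(page, 0)
    if slot_no >= n_slots:
        raise IndexError("no such slot")
    off, length = SLOT.unpack_from(page, HEADER.size + slot_no * SLOT.size)
    return page[off:off + length]        # indirection: only the slot is fixed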

File Organization

More sophisticated techniques are typically used in combination with the preceding storage mechanisms:
• Hashing
• Sequential/sorted
• Clustering

These techniques can improve query performance substantially.

Sequential File Organization

Records in a file are ordered by a search-key.

Good for point queries and range queries.

Maintaining physical sequential order is difficult with insertions & deletions.

Sequential File Organization

Use pointers to keep track of logical record order:

Could be used in conjunction with a free-list.

Sequential File Organization

Insertion: locate the position where the record is to be inserted.
• If there is free space there, insert directly.
• If there is no free space, insert the record in another location taken off the free list.
• In either case, the pointer chain must be updated (see the sketch below).

Deletion: maintain the pointer chains in the obvious way.

Main issue: the file may degenerate over time, so reorganize it from time to time to restore sequential order.
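A minimal in-memory sketch of pointer-chain insertion, using list indexes as stand-ins for record addresses (an illustrative assumption): physical placement is arbitrary, while the next-pointers preserve sorted order.

def insert_sorted(records, nexts, head, new_rec):
    # Place the record wherever there is room (here: appended at the end),
    # then splice it into the next-pointer chain to keep logical order.
    records.append(new_rec)
    i = len(records) - 1
    prev, cur = None, head
    while cur is not None and records[cur] <= new_rec:
        prev, cur = cur, nexts[cur]        # walk the chain to the insert spot
    nexts.append(cur)                      # new record points at its successor
    if prev is None:
        return i                           # inserted at the front: new head
    nexts[prev] = i                        # predecessor now points at it
    return head

# Usage: records end up physically in arrival order, logically sorted.
records, nexts, head = [], [], None
for name in ["Perryridge", "Brighton", "Round Hill"]:
    head = insert_sorted(records, nexts, head, name)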

Clustering File Organization

A clustered file organization stores several relations in one file.

Clustered organization of customer and depositor:

Good for some queries, e.g., the join depositor ⋈ customer, and for queries involving an individual customer and their accounts.

Bad for others, e.g., queries involving only customer.

Large Objects – A Special Case

Large objects:
• text documents
• images
• computer-aided designs
• audio and video data

Most DBMS vendors provide supporting types:
• binary large objects (BLOBs)
• character large objects (CLOBs)
• text, image, etc.

Large objects place high demands on storage:
• Typically stored in a contiguous sequence of bytes when brought into memory
• If an object is bigger than a block, contiguous blocks in the buffer must be allocated for it
• Record splitting is commonly used

It might be preferable to disallow direct access to the data using SQL, and to allow access only through a third-party file system API.

End of Chapter

Data Dictionary Storage

The data dictionary (a.k.a. the system catalog) stores metadata:

Information about relations:
• names of relations
• names and types of attributes
• names and definitions of views
• integrity constraints

Physical file organization information:
• how the relation is stored (sequential, hash, etc.)
• physical location of the relation:
  - operating system file name, or
  - disk addresses of the blocks containing the records

Statistical and descriptive data:
• number of tuples in each relation

User and accounting information, including passwords

Information about indices (Chapter 12)

Data Dictionary Storage (Cont.)

Catalog structure can use either:
• specialized data structures designed for efficient access, or
• a set of relations, with existing system features used to ensure efficient access

The latter alternative is the standard.

A possible catalog representation:

Relation-metadata = (relation-name, number-of-attributes, storage-organization, location)
Attribute-metadata = (attribute-name, relation-name, domain-type, position, length)
User-metadata = (user-name, encrypted-password, group)
Index-metadata = (index-name, relation-name, index-type, index-attributes)
View-metadata = (view-name, definition)