Chapter 2 Data Storage

How does a computer system store and manage very large volumes of data?

Outline

Memory Hierarchy

Using Hard Disks Efficiently

Accessing Hard Disks Quickly

Keeping Hard Disks Safely

Mechanics of Hard Disks

The Memory Hierarchy

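[Figure: The memory hierarchy. Levels shown: cache, main memory, secondary storage (disk), tertiary storage. Labels on the levels: programs and main-memory DBMS's (main memory); disk used as virtual memory and by the file system; DBMS. Side scales: speed (fast to slow), cost (high to low), capacity (small to large).]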

Cache

• Capacity: up to 1 megabyte

• Speed between cache and processor: 10 nanoseconds

• Speed between cache and memory: 100 nanoseconds

Main Memory

• Capacity up to 10 gigabytes

• Random Access

• Access time in 10-100 nanosecond range

Virtual Memory

• Most machines use a 32-bit address space, which covers up to 4 gigabytes.

• Main memory is usually 256 megabytes.

• Virtual memory is supported by the machine hardware and the operating system through a paging mechanism.

• A main-memory database system can be implemented on top of virtual memory.

Secondary Storage

• Significantly more capacious than main memory

• Significantly cheaper than main memory

• Significantly slower than main memory

Magnetic disks are usually used as secondary storage.

Tertiary Storage

• Data volumes measured in terabytes

• Slower and cheaper than secondary storage

• Access times vary widely

Ad-hoc tape storage, optical-disk juke boxes and tape silos are common kinds of tertiary storage.

Volatile and Nonvolatile Storage

• A volatile device “forgets” its contents when the power goes off. Main memory is an example.

• A nonvolatile device keeps its contents intact through power failures. Magnetic disks, tapes and flash memory are examples.

[Figure: Access time versus capacity for various levels of the memory hierarchy (cache, main memory, floppy disk, Zip disk, secondary, tertiary). The horizontal axis measures seconds in exponents of 10 (from -9 to 2). The vertical axis measures bytes in exponents of 10 (from 5 to 13).]

Mechanics of Disks

[Figure: A typical disk, showing the platters (each platter = 2 surfaces), the disk heads, and a cylinder.]

[Figure: Top view of a disk surface, showing tracks, sectors, and gaps.]

Disk Controller

• Moving the disk heads and positioning them at a particular radius

• Selecting a surface, and selecting a sector from the track on that surface that is under the head

• Transferring data

[Figure: Schematic of a simple computer system. The processor, main memory, and disk controller are connected by a bus; the disks are attached to the disk controller.]

Disk Storage Characteristics

Typical measures:

---- Rotation speed of the disk assembly

---- Number of platters per unit

---- Number of tracks per surface

---- Number of bytes per track

Example: characteristics of the Megatron 747:

---- 3840 RPM

---- Four platters providing eight surfaces

---- 8192 tracks per surface

---- (On average) 256 sectors per track, 512 bytes per sector

Capacity of Megatron 747

8 surfaces × 8192 tracks × 256 sectors × 512 bytes = 2^33 bytes = 8 gigabytes
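The capacity arithmetic above can be checked with a few lines of Python. This is only a sketch; the function name `disk_capacity_bytes` is made up for illustration, and the figures are the Megatron 747 characteristics listed on the slide.

```python
def disk_capacity_bytes(surfaces, tracks_per_surface, sectors_per_track, bytes_per_sector):
    """Capacity is the product of the four disk characteristics."""
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector


# Megatron 747 figures from the slide.
capacity = disk_capacity_bytes(8, 8192, 256, 512)
print(capacity, "bytes =", capacity // 2**30, "gigabytes")
```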

Block Address:

• Physical Device

• Cylinder #

• Surface #

• Sector

Disk Access Characteristics

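[Figure: The cause of rotational latency. The head is at one point of the track while the block we want is at another; the disk must rotate until the block comes under the head.]

[Figure: Seek time varies with distance traveled. Horizontal axis: cylinders traveled, from 1 to MAX. Vertical axis: seek time, from x for one cylinder up to a value in the range 3x~20x for the maximum distance.]

[Figure: Average travel distance as a function of initial head position. Horizontal axis: starting track (0 to 8192). Vertical axis: average travel (0 to 4096).]

Disk Access Time = Seek Time + Rotational Delay + Transfer Time + Other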

Average Random Seek Time

$$S = \frac{\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \mathrm{SEEKTIME}(i \rightarrow j)}{N(N-1)}$$

“Typical” S: 10 ms to 40 ms
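As a rough illustration of the average over all ordered pairs of distinct cylinders, the sketch below evaluates S for a seek model that depends only on the distance traveled. The model used (1 ms plus 1 ms per 500 cylinders, over 8192 cylinders) is the Megatron 747 rule quoted elsewhere in these slides; the function names are hypothetical. Since there are 2(N - d) ordered pairs at distance d, the double sum collapses to a single loop.

```python
def average_seek_time(n_cylinders, seek_time):
    """Average of seek_time(|i - j|) over all ordered pairs i != j.

    There are 2 * (n_cylinders - d) ordered pairs at distance d,
    so the double sum can be evaluated with one loop over d.
    """
    total = sum(2 * (n_cylinders - d) * seek_time(d)
                for d in range(1, n_cylinders))
    return total / (n_cylinders * (n_cylinders - 1))


def megatron_seek_ms(distance):
    """Assumed model: 1 ms to start and stop plus 1 ms per 500 cylinders."""
    return 1.0 + distance / 500.0


avg_distance = average_seek_time(8192, lambda d: d)   # average cylinders traveled
avg_seek = average_seek_time(8192, megatron_seek_ms)  # average seek time in ms
print(round(avg_distance), round(avg_seek, 2))
```

Under this model the average distance comes out at about one third of the number of cylinders, which is where a figure such as 2730 cylinders comes from.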

Average Rotational Delay

R = 1/2 revolution

“typical” R = 8.33 ms (3600 RPM)

Transfer Rate: t

• “typical” t: 1 to 3 MB/second

• transfer time: block size / t

Other Delays

• CPU time to issue I/O

• Contention for controller

• Contention for bus, memory

“Typical” Value: 0

Average time to read a 4096-byte block from Megatron 747

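• At 3840 rpm, the disk makes one rotation in 1/64 of a second.

• The heads take one millisecond to start and stop, plus one additional millisecond for every 500 cylinders traveled.

Seek time: 1 + 2730/500 ≈ 6.5 milliseconds (an average travel of 2730 cylinders, about one third of the 8192 tracks)

Rotational latency: (1/64)/2 × 1000 ≈ 7.8 milliseconds

Transfer time: a 4096-byte block spans 8 sectors and the 7 gaps between them. With 36 degrees of the circle taken by gaps and 324 degrees by sectors, the block covers 36 × 7/256 + 324 × 8/256 = 11.109 degrees, so the transfer takes (11.109/360)/64 × 1000 ≈ 0.5 milliseconds.

The average latency is 6.5 + 7.8 + 0.5 = 14.8 ms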
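The three terms of the access time can be recomputed as follows. This is a sketch under the assumptions used on the slide (3840 rpm, 256 sectors per track, 36 degrees of gaps and 324 degrees of sectors per track, an average travel of 2730 cylinders); the constant and function names are illustrative only.

```python
RPM = 3840
SECTORS_PER_TRACK = 256
GAP_DEGREES = 36.0      # part of the circle occupied by gaps
SECTOR_DEGREES = 324.0  # part of the circle occupied by sectors


def seek_ms(cylinders):
    """1 ms to start and stop plus 1 ms per 500 cylinders traveled."""
    return 0.0 if cylinders == 0 else 1.0 + cylinders / 500.0


def rotation_ms():
    """Time for one full rotation, in milliseconds."""
    return 60_000.0 / RPM


def block_read_ms(block_bytes=4096, sector_bytes=512, avg_travel=2730):
    """Return (seek, rotational latency, transfer) in milliseconds."""
    sectors = block_bytes // sector_bytes
    # The block covers `sectors` sectors and the gaps between them.
    degrees = (GAP_DEGREES * (sectors - 1)
               + SECTOR_DEGREES * sectors) / SECTORS_PER_TRACK
    seek = seek_ms(avg_travel)
    latency = rotation_ms() / 2          # half a rotation on average
    transfer = degrees / 360.0 * rotation_ms()
    return seek, latency, transfer


seek, latency, transfer = block_read_ms()
total = seek + latency + transfer
print(round(seek, 2), round(latency, 2), round(transfer, 2), round(total, 1))
```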

Cost for Writing: similar to reading

… unless we want to verify! Then we need to add a (full) rotation + block size / t

• To Modify a Block:

(a) Read Block

(b) Modify in Memory

(c) Write Block

[(d) Verify?]

Using Hard Disk Efficiently

A disk access takes much longer than the time likely to be spent manipulating the data in main memory, so algorithms should be designed to limit the number of disk accesses.

The I/O Model of Computation

Dominance of I/O cost

When the data is so large it does not fit in main memory, reading and writing disk blocks between disk and memory often takes much longer than it does to process the data once it is in main memory.

Algorithms need to change under the I/O model. The evaluation of algorithms for data in secondary storage focuses on the number of disk I/O’s required.

Sorting Data in Secondary Storage

There are a number of well-known algorithms for sorting data in main memory. However, when the data is much larger than main memory, we should consider how to reduce the number of times each block is moved between main memory and secondary storage.

Step     List 1        List 2        Output
Start    1, 3, 4, 9    2, 5, 7, 8    none
1)       3, 4, 9       2, 5, 7, 8    1
2)       3, 4, 9       5, 7, 8       1, 2
3)       4, 9          5, 7, 8       1, 2, 3
4)       9             5, 7, 8       1, 2, 3, 4
5)       9             7, 8          1, 2, 3, 4, 5
6)       9             8             1, 2, 3, 4, 5, 7
7)       9             none          1, 2, 3, 4, 5, 7, 8
8)       none          none          1, 2, 3, 4, 5, 7, 8, 9

Merging two sorted lists to make one sorted list.
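The merge traced in the table can be sketched in Python as below. The function name `merge` is illustrative; at each step it moves the smaller of the two list heads to the output, as in the table.

```python
def merge(list1, list2):
    """Merge two sorted lists into one sorted list."""
    i = j = 0
    output = []
    # Repeatedly move the smaller of the two first unchosen elements.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            output.append(list1[i])
            i += 1
        else:
            output.append(list2[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    output.extend(list1[i:])
    output.extend(list2[j:])
    return output


print(merge([1, 3, 4, 9], [2, 5, 7, 8]))
```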

Two-Phase, Multiway Merge-Sort

• Phase 1: Repeatedly sort main-memory-sized pieces of the data, so that each piece becomes a sorted sublist.

• Phase 2: Merge all the sorted sublists into a single sorted list.

[Figure: Main-memory organization for multiway merging. There is one input buffer for each sorted list, with a pointer to the first unchosen record in each; the smallest unchosen record is selected and moved to the output buffer.]
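The two phases can be imitated in memory with a short Python sketch. It is not a disk-based implementation: the list `records` stands in for the file, `memory_records` is an assumed limit on how many records fit in main memory at once, and the standard-library `heapq.merge` plays the role of repeatedly selecting the smallest unchosen record across the sublists.

```python
import heapq


def two_phase_merge_sort(records, memory_records):
    """In-memory imitation of two-phase, multiway merge-sort."""
    # Phase 1: sort main-memory-sized pieces into sorted sublists.
    sublists = [sorted(records[start:start + memory_records])
                for start in range(0, len(records), memory_records)]
    # Phase 2: merge all sorted sublists, always taking the smallest
    # unchosen record among the heads of the sublists.
    return list(heapq.merge(*sublists))


print(two_phase_merge_sort([9, 4, 1, 3, 8, 7, 2, 5], memory_records=3))
```

In a real system each sublist would be written to disk after Phase 1 and read back one block at a time during Phase 2.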

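How many records can be sorted

• Block size: B bytes

• Memory size: M bytes

• Record size: R bytes

Phase 2 needs one input buffer per sorted sublist plus one output buffer, so there can be at most (M/B)-1 sublists, each holding M/R records.

Total number of records that can be sorted:

(M/R)((M/B)-1)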
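A small helper makes the formula easy to evaluate. The function name and the sample values of M, B and R below are arbitrary illustrations, not figures from the slides.

```python
def max_sortable_records(memory_bytes, block_bytes, record_bytes):
    """(M/R) * ((M/B) - 1): records per sublist times number of sublists."""
    records_per_sublist = memory_bytes // record_bytes
    max_sublists = memory_bytes // block_bytes - 1  # one buffer kept for output
    return records_per_sublist * max_sublists


# Illustrative values only: M = 1000 bytes, B = 100 bytes, R = 10 bytes.
print(max_sortable_records(1000, 100, 10))
```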

Accessing Hard Disk Quickly

• Organizing Data by Cylinders

• Using Multiple Disks

• Mirroring Disks

• Disk Scheduling and the Elevator Algorithm

• Prefetching and Large-Scale Buffering

Organizing Data by Cylinders

Disk Access Time = Seek Time + Rotational Delay + Transfer Time = 6.5 ms + 7.8 ms + 0.5 ms

Blocks distributed randomly on disk: sorting 10,000,000 records by Two-Phase, Multiway Merge-Sort takes 250 minutes.

Blocks organized by cylinders: first phase 2.15 minutes + second phase 125 minutes.

Place blocks that are accessed together on the same cylinder so we can often avoid seek time, and possibly rotational latency.

Using Multiple Disks

Megatron 747 ( four platters with eight surfaces)

Megatron 737 ( one platter with two surfaces) X 4

Two-Phase, Multiway Merge-Sort

1. Phase 1: Speed-up 4 times

2. Phase 2: Speed-up 2~3 times

Divide the data among several smaller disks rather than one large one. With more head assemblies, blocks can be fetched independently, which increases the number of block accesses per unit time.

Mirroring Disks

• Enhances reliability

• Speeds up reading but not writing

Disk Scheduling and the Elevator Algorithm

Cylinder of Request    First time available (ms)
1000                   0
3000                   0
7000                   0
2000                   20
8000                   30
5000                   40

Arrival times for six block-access requests

Cylinder of Request    Time completed (ms)
1000                   8.3
3000                   21.6
7000                   38.9
8000                   50.2
5000                   65.5
2000                   80.8

Finishing times for block accesses using the elevator algorithm

Cylinder of Request    Time completed (ms)
1000                   8.3
3000                   21.6
7000                   38.9
2000                   58.2
8000                   79.5
5000                   94.8

Finishing times for block accesses using the first-come-first-served algorithm
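The two schedules can be reproduced with a small simulation. The sketch below assumes the head starts at cylinder 1000, that a seek costs 1 ms plus 1 ms per 500 cylinders, and that each access adds 7.8 ms of rotational latency and 0.5 ms of transfer time; these assumptions are consistent with the finishing times in the tables. All names are illustrative.

```python
REQUESTS = [(1000, 0), (3000, 0), (7000, 0), (2000, 20), (8000, 30), (5000, 40)]


def access_time(head, cylinder):
    """Seek (if the head must move) plus rotational latency and transfer."""
    seek = 0.0 if head == cylinder else 1.0 + abs(cylinder - head) / 500.0
    return seek + 7.8 + 0.5


def first_come_first_served(requests, head=1000):
    now, done = 0.0, []
    for cylinder, arrival in sorted(requests, key=lambda r: r[1]):
        now = max(now, arrival) + access_time(head, cylinder)
        head = cylinder
        done.append((cylinder, round(now, 1)))
    return done


def elevator(requests, head=1000):
    now, direction, done = 0.0, +1, []
    waiting = list(requests)
    while waiting:
        ready = [r for r in waiting if r[1] <= now]
        if not ready:                      # idle until the next arrival
            now = min(r[1] for r in waiting)
            continue
        # Requests at or beyond the head in the current direction of travel.
        ahead = [r for r in ready if (r[0] - head) * direction >= 0]
        if not ahead:                      # nothing ahead: reverse direction
            direction = -direction
            continue
        nxt = min(ahead, key=lambda r: abs(r[0] - head))
        waiting.remove(nxt)
        now += access_time(head, nxt[0])
        head = nxt[0]
        done.append((head, round(now, 1)))
    return done


print(elevator(REQUESTS))
print(first_come_first_served(REQUESTS))
```

The elevator schedule serves cylinder 8000 before turning back, which is why it finishes at 80.8 ms rather than 94.8 ms.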

Prefetching and Large-Scale Buffering

[Figure: Two input buffers (Input Buffer 1 and Input Buffer 2) between the disk read and the merge.]

Prefetch blocks to main memory in anticipation of their later use. Using track-sized or cylinder-sized output buffers can eliminate seek time and rotational latency.

1. Store the sorted sublists on whole, consecutive cylinders, with the blocks on each track being consecutive blocks of the sorted sublist.

2. Read whole tracks or whole cylinders whenever we need some more records from a given list.

[Figure: Two output buffers (Output Buffer 1 and Output Buffer 2) between the merge and the disk write.]

Keeping Hard Disk Safely

• Intermittent failure

• Media decay

• Write failure

• Disk crashes

Intermittent Failures

Disk Reading (W, S)

W: the data in the sector that is read

S: status bit that tells whether or not the read was successful

• If S == “bad”, repeat the disk read.

• If S == “good”, accept W, although we may be fooled.

A write can be checked in the same way: disk writing, then disk reading, then status checking.

Checksums

• If there is an odd number of 1’s among a collection of bits, we say the bits have odd parity, or that their parity bit is 1.

• If there is an even number of 1’s among a collection of bits, we say the bits have even parity, or that their parity bit is 0.

01101000 ------- 011010001

11101110 ------- 111011100
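The parity rule can be illustrated with a short sketch that appends a parity bit to a bit string and later checks it. The helper names are hypothetical, and bits are held as strings of "0" and "1" purely for readability.

```python
def parity_bit(bits):
    """Return "1" if the number of 1's is odd, else "0"."""
    return str(bits.count("1") % 2)


def with_parity(bits):
    """Append the parity bit so the whole word has an even number of 1's."""
    return bits + parity_bit(bits)


def looks_valid(word):
    """A stored word passes the check if its number of 1's is even."""
    return word.count("1") % 2 == 0


print(with_parity("01101000"), with_parity("11101110"))
```

A single flipped bit makes the count of 1's odd, so `looks_valid` reports the error, but it cannot say which bit changed.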

Stable Storage

Each sector X is represented by a pair of sectors: a “left” copy XL and a “right” copy XR.

While checksums will almost certainly detect the existence of a media failure or a failure to read or write correctly, they do not help us correct the error. To deal with these problems, we can implement a policy known as stable storage.

The stable-storage writing policy:

(1) Write the value of X into XL. Check that the value has status “good”. If not, repeat the write. If the write still fails after a set number of attempts, assume a media failure and apply a fix-up to XL.

(2) Repeat (1) for XR.

The stable-storage reading policy:

(1) To obtain the value of X, read XL. If status “bad” is returned, repeat the read a set number of times. If a value with status “good” is eventually returned, take that value as X.

(2) If we cannot read XL, repeat (1) with XR.
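The two policies might be sketched as follows. Everything here is hypothetical scaffolding: `Sector` is a toy stand-in for a disk sector whose writes can be made to fail a chosen number of times, `MAX_ATTEMPTS` is an arbitrary retry limit, and the fix-up step is only signalled by a return value.

```python
MAX_ATTEMPTS = 3  # assumed retry limit


class Sector:
    """Toy sector: the first `failures` writes return status "bad"."""

    def __init__(self, failures=0):
        self.value, self.good, self.failures = None, False, failures

    def write(self, value):
        if self.failures > 0:
            self.failures -= 1
            self.good = False
            return "bad"
        self.value, self.good = value, True
        return "good"

    def read(self):
        return (self.value, "good") if self.good else (None, "bad")


def write_copy(sector, value):
    """Retry the write; False means a fix-up is needed for this copy."""
    for _ in range(MAX_ATTEMPTS):
        if sector.write(value) == "good":
            return True
    return False


def stable_write(xl, xr, value):
    """Write XL first, then XR; report whether each copy was written."""
    left_ok = write_copy(xl, value)
    right_ok = write_copy(xr, value)
    return left_ok, right_ok


def stable_read(xl, xr):
    """Try XL a set number of times, then fall back to XR."""
    for sector in (xl, xr):
        for _ in range(MAX_ATTEMPTS):
            value, status = sector.read()
            if status == "good":
                return value
    raise IOError("both copies are unreadable")


xl, xr = Sector(failures=5), Sector()
print(stable_write(xl, xr, "payload"), stable_read(xl, xr))
```

In the usage line the left copy never accepts the write, so the read falls back to the right copy.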

Error-Handling Capabilities of Stable Storage

• Media failure: if one copy fails, read the other.

• Write failure: if the failure occurred during the writing of XL, copy XR to XL; if the failure occurred after XL was written, copy XL to XR.

Recovery from Disk Crashes

RAID (Redundant Arrays of Independent Disks) was developed to reduce the risk of data loss from disk crashes.

RAID 1

[Figure: Mirroring. A data disk and a redundant disk.]

RAID 4

Disk 1: 11110000

Disk 2: 10101010

Disk 3: 00111000

The redundant disk will have the following parity check bits:

Disk 4: 01100010

While mirroring disks uses as many redundant disks as there are data disks, RAID 4 uses only one redundant disk no matter how many data disks there are.

Reading

Reading blocks from a data disk is no different from reading blocks from any disk. In some circumstances, we can actually get the effect of two simultaneous reads from one of the data disks.

Suppose Disk 1 is busy and we want to read one of its blocks, while none of the other disks are busy.

Disk 2: 10101010

Disk 3: 00111000

Disk 4: 01100010

If we take the modulo-2 sum of the bits in each column, we obtain the block of Disk 1:

Disk 1: 11110000
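The modulo-2 sum is a bitwise XOR, so both the parity block of Disk 4 and the reconstruction of Disk 1 from the other three disks can be sketched with one helper. The name `modulo2_sum` is illustrative, and blocks are held as Python integers.

```python
from functools import reduce
from operator import xor


def modulo2_sum(blocks):
    """Column-wise modulo-2 sum of equally long blocks (bitwise XOR)."""
    return reduce(xor, blocks, 0)


disk1, disk2, disk3 = 0b11110000, 0b10101010, 0b00111000

# The redundant disk holds the parity of the data disks.
disk4 = modulo2_sum([disk1, disk2, disk3])

# Disk 1 is busy: rebuild its block from the other three disks.
rebuilt = modulo2_sum([disk2, disk3, disk4])

print(format(disk4, "08b"), format(rebuilt, "08b"))
```

The same call recovers the block of any single crashed disk from the surviving ones.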

Writing

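Disk 1: 11110000

Disk 2: 10101010 ----- 11001100 (old value ----- new value)

Disk 3: 00111000

Modulo-2 sum of the old and new values of Disk 2: 10101010 + 11001100 = 01100110

New value of redundant Disk 4: 01100010 + 01100110 = 00000100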
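The write rule touches only the changed data disk and the redundant disk. A minimal sketch, with the hypothetical helper `updated_parity` and blocks held as integers:

```python
def updated_parity(old_parity, old_block, new_block):
    """Flip exactly the parity bits where the data block changed."""
    change = old_block ^ new_block      # modulo-2 sum of old and new values
    return old_parity ^ change


new_disk4 = updated_parity(0b01100010, 0b10101010, 0b11001100)
print(format(new_disk4, "08b"))
```

The result equals the parity recomputed from all three data disks after the write, without reading Disks 1 and 3.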

Failure Recovery

Disk 1: 11110000

Disk 2: ????????

Disk 3: 00111000

Disk 4: 01100010

The modulo-2 sum of Disks 1, 3 and 4 gives Disk 2: 10101010

RAID 5

RAID 4 suffers from a bottleneck that we can see when we re-examine the process of writing a new data block: every write must also update the single redundant disk. RAID 5 avoids this by treating each disk as the redundant disk for some of the blocks.

[Figure: Disk 1, Disk 2, Disk 3.]

Coping With Multiple Disk Crashes (RAID 6)

        Data disks     Redundant disks
Disk    1  2  3  4     5  6  7
        1  1  1  0     1  0  0
        1  1  0  1     0  1  0
        1  0  1  1     0  0  1

a) The columns are every possible column of three 0's and 1's, except for the all-0 column.

b) The columns for the redundant disks have a single 1.

c) The columns for the data disks each have at least two 1's.

Writing

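Disk    Content before    Content after
1)      11110000          11110000
2)      10101010          00001111
3)      00111000          00111000
4)      01000001          01000001
5)      01100010          11000111
6)      00011011          10111110
7)      10001001          10001001

Disk 2 is changed from 10101010 to 00001111. Redundant disks 5 and 6, whose rows include disk 2, change accordingly; disk 7 is unaffected.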

Failure Recovery

Disk    Disks 2 and 5 fail    Disk 2 recovered from disks 1, 4, 6    Disk 5 recovered from disks 1, 2, 3
1)      11110000              11110000                               11110000
2)      ????????              00001111                               00001111
3)      00111000              00111000                               00111000
4)      01000001              01000001                               01000001
5)      ????????              ????????                               11000111
6)      10111110              10111110                               10111110
7)      10001001              10001001                               10001001
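The two-disk recovery above can be sketched by scanning the rows of the 3 × 7 matrix: any row with exactly one failed disk lets that disk be rebuilt as the modulo-2 sum of the other disks in the row. The function name and the use of `None` for a failed disk are choices made for this illustration.

```python
# Rows of the matrix shown earlier; column d-1 corresponds to disk d.
ROWS = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def recover(disks):
    """disks maps disk number (1..7) to an int block, or None if failed."""
    disks = dict(disks)
    progress = True
    while progress and None in disks.values():
        progress = False
        for row in ROWS:
            members = [d for d in range(1, 8) if row[d - 1]]
            missing = [d for d in members if disks[d] is None]
            if len(missing) == 1:
                # Modulo-2 sum of the surviving disks in this row.
                value = 0
                for d in members:
                    if disks[d] is not None:
                        value ^= disks[d]
                disks[missing[0]] = value
                progress = True
    return disks


crashed = {1: 0b11110000, 2: None, 3: 0b00111000, 4: 0b01000001,
           5: None, 6: 0b10111110, 7: 0b10001001}
repaired = recover(crashed)
print(format(repaired[2], "08b"), format(repaired[5], "08b"))
```

In this run disk 2 is rebuilt first from the row containing disks 1, 2, 4 and 6, after which the row containing disks 1, 2, 3 and 5 has a single gap and yields disk 5.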
