andrea-welch
Chapter 2 Data Storage
How does a computer system store and manage very large volumes of data?
Outline
Memory Hierarchy
Using Hard Disks Efficiently
Accessing Hard Disks Quickly
Keeping Hard Disks Safely
Mechanics of Hard Disks
The Memory Hierarchy
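[Figure: the memory hierarchy. Levels shown: cache, main memory, secondary storage, tertiary storage. Labels on the connections: programs and main-memory DBMS's; as virtual memory; file system (disk); DBMS. Side scales: capacity from small to large, speed from fast to slow, cost from high to low.]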
Cache
• Capacity: up to 1 megabyte
• Speed between cache and processor: 10 nanoseconds
• Speed between cache and memory: 100 nanoseconds
Main Memory
• Capacity up to 10 gigabytes
• Random Access
• Access time in 10-100 nanosecond range
Virtual Memory
• Most machines use a 32-bit address space, which covers up to 4 gigabytes.
• Main memory is usually 256 megabytes.
• Virtual memory is supported by the machine hardware and the operating system through a paging mechanism.
• A main-memory database system can be implemented using virtual memory.
Secondary Storage
• Significantly more capacious than main memory
• Significantly cheaper than main memory
• Significantly slower than main memory
Magnetic Disks are usually used as secondary storage.
Tertiary Storage
• Data volumes measured in terabytes
• Slower and cheaper than secondary storage
• Access times varying widely
Ad-hoc tape storage, optical-disk juke boxes, and tape silos are the common forms of tertiary storage.
Volatile and Nonvolatile Storage
• A volatile device “forgets” its contents when the power goes off; main memory is an example.
• A nonvolatile device keeps its contents intact in the presence of power failures; examples are magnetic disks, tapes, and flash memory.
[Figure: access time versus capacity for various levels of the memory hierarchy (cache, main memory, floppy disk, Zip disk, secondary, tertiary). The horizontal axis measures seconds in exponents of 10 (from 2 down to -9). The vertical axis measures bytes in exponents of 10 (from 5 to 13).]
Mechanics of Disks
[Figure: a typical disk, showing the platters (each platter = 2 surfaces), the disk heads, and a cylinder.]
[Figure: top view of a disk surface, showing tracks, sectors, and gaps.]
Disk Controller
• Moving the disk heads and positioning them at a particular radius
• Selecting a surface, and selecting a sector from the track on that surface that is under the head
• Transferring data
[Figure: schematic of a simple computer system. Components shown: processor, main memory, disk controller, disks, and the bus.]
Disk Storage Characteristics
The typical measures:
---- Rotation speed of the disk assembly
---- Number of platters per unit
---- Number of tracks per surface
---- Number of bytes per track
Example: the Megatron 747's characteristics:
---- 3840 RPM
---- Four platters providing eight surfaces
---- 8192 tracks per surface
---- On average, 256 sectors per track and 512 bytes per sector
Capacity of Megatron 747
8 surfaces × 8192 tracks × 256 sectors × 512 bytes = 2^33 bytes = 8 gigabytes
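The capacity arithmetic can be checked with a short script. This is only a sketch: the function and parameter names are made up for illustration, and "gigabyte" is taken to mean 2^30 bytes, which is what makes the slide's figure come out to exactly 8.

```python
def disk_capacity_bytes(surfaces, tracks_per_surface, sectors_per_track, bytes_per_sector):
    """Total capacity as the product of the four disk parameters."""
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector

# Megatron 747 parameters from the slide above.
capacity = disk_capacity_bytes(surfaces=8, tracks_per_surface=8192,
                               sectors_per_track=256, bytes_per_sector=512)

# Assumption: one gigabyte = 2**30 bytes.
capacity_gb = capacity / 2**30
print(capacity, capacity_gb)
```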
Block Address:
• Physical Device
• Cylinder #
• Surface #
• Sector
Disk Access Characteristics
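[Figure: the cause of rotational latency. The head is at one point on the track, and the block we want is elsewhere on the track; the disk must rotate until the block arrives under the head.]
[Figure: seek time varies with distance traveled. Horizontal axis: cylinders traveled, from 1 to MAX. Vertical axis: time, from x up to a value in the range 3x~20x.]
[Figure: average travel distance as a function of initial head position. Horizontal axis: starting track (0, 4096, 8192). Vertical axis: average travel (0, 2048, 4096).]
Disk Access Time = Seek Time + Rotational Delay + Transfer Time + Other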
Average Random Seek Time
$$S = \frac{\sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \mathrm{SEEKTIME}(i \to j)}{N(N-1)}$$
“Typical” S: 10 ms to 40 ms
Average Rotational Delay
R = 1/2 revolution
“typical” R = 8.33 ms (3600 RPM)
Transfer Rate: t
• “typical” t: 1 to 3 MB/second
• transfer time: block size / t
Other Delays
• CPU time to issue I/O
• Contention for controller
• Contention for bus, memory
“Typical” Value: 0
Average time to read a 4096-byte block from Megatron 747
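• At 3840 rpm, the disk makes one rotation in 1/64th of a second.
• Moving the heads takes one millisecond to start and stop, plus one additional millisecond for every 500 cylinders travelled.
Seek time: 1 + 2730/500 = 6.46, about 6.5 milliseconds (for an average move of 2730 cylinders)
Rotational latency: (1/64)/2 × 1000 = 7.8 milliseconds
Transfer time: the 4096-byte block covers 8 sectors and the 7 gaps between them. With the 256 gaps of a track taking up 36 degrees and the 256 sectors taking up 324 degrees, the head must pass over 36 × 7/256 + 324 × 8/256 = 11.109 degrees, which takes (11.109/360)/64 × 1000 = 0.5 milliseconds.
The average latency is 6.5 + 7.8 + 0.5 = 14.8 ms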
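The same calculation can be written out step by step. This is a sketch under the slide's own figures: the average move of 2730 cylinders and the 36/324 degree split between gaps and sectors are taken from the numbers above, and all variable names are illustrative.

```python
RPM = 3840
SECTORS_PER_TRACK = 256
BLOCK_SECTORS = 4096 // 512        # a 4096-byte block spans 8 sectors
AVG_CYLINDERS_TRAVELLED = 2730     # figure used on the slide

rotation_ms = 60_000 / RPM         # one full rotation, in milliseconds

# Seek: 1 ms to start and stop, plus 1 ms per 500 cylinders travelled.
seek_ms = 1 + AVG_CYLINDERS_TRAVELLED / 500

# Rotational latency: on average, half a rotation.
latency_ms = rotation_ms / 2

# Transfer: the head passes over 8 sectors and the 7 gaps between them.
GAP_DEGREES, SECTOR_DEGREES = 36, 324   # per-track totals used on the slide
degrees = (GAP_DEGREES * (BLOCK_SECTORS - 1) / SECTORS_PER_TRACK
           + SECTOR_DEGREES * BLOCK_SECTORS / SECTORS_PER_TRACK)
transfer_ms = degrees / 360 * rotation_ms

total_ms = seek_ms + latency_ms + transfer_ms
print(round(seek_ms, 1), round(latency_ms, 1), round(transfer_ms, 1), round(total_ms, 1))
```

Rounded to one decimal place, the three components and their sum agree with the slide's 6.5, 7.8, 0.5, and 14.8 ms.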
Cost for Writing similar to Reading
…. unless we want to verify! Then we need to add a (full) rotation + block size / t
To Modify a Block:
(a) Read Block
(b) Modify in Memory
(c) Write Block
[(d) Verify?]
Using Hard Disk Efficiently
A disk access takes much longer than the time likely to be spent manipulating the data in main memory, so algorithms should be designed to limit the number of disk accesses.
The I/O Model of Computation
Dominance of I/O cost
When the data is so large it does not fit in main memory, reading and writing disk blocks between disk and memory often takes much longer than it does to process the data once it is in main memory.
Algorithms need to change under the I/O model. The evaluation of algorithms for data in secondary storage focuses on the number of disk I/O’s required.
Sorting Data in Secondary Storage
There are a number of well-known algorithms for sorting data in main memory. However, when the data is much larger than main memory, we should consider how to reduce the number of times each block is moved between main memory and secondary storage.
Step     List 1     List 2     Output
Start    1,3,4,9    2,5,7,8    none
1)       3,4,9      2,5,7,8    1
2)       3,4,9      5,7,8      1,2
3)       4,9        5,7,8      1,2,3
4)       9          5,7,8      1,2,3,4
5)       9          7,8        1,2,3,4,5
6)       9          8          1,2,3,4,5,7
7)       9          none       1,2,3,4,5,7,8
8)       none       none       1,2,3,4,5,7,8,9
Merging two sorted lists to make one sorted list.
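The merging steps tabulated above can be sketched in a few lines. This is a minimal in-memory illustration, not the disk-based version; the function name `merge_sorted` is made up for the example.

```python
def merge_sorted(list1, list2):
    """Merge two sorted lists by repeatedly taking the smaller head element."""
    output = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            output.append(list1[i])
            i += 1
        else:
            output.append(list2[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order.
    output.extend(list1[i:])
    output.extend(list2[j:])
    return output

merged = merge_sorted([1, 3, 4, 9], [2, 5, 7, 8])
print(merged)
```

Each loop iteration corresponds to one numbered step of the table: one element leaves the front of a list and is appended to the output.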
Two-Phase, Multiway Merge-Sort
• Phase 1: Repeatedly sort main-memory-sized pieces of the data, producing a number of sorted sublists.
• Phase 2: Merge all the sorted sublists into a single sorted list.
[Figure: main-memory organization for multiway merging. There are input buffers, one for each sorted list, with pointers to the first unchosen records; the smallest unchosen record is selected for output and moved to the output buffer.]
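The two phases can be sketched in memory as follows. This is an illustration under loud assumptions: ordinary Python lists stand in for disk blocks and sublists on disk, `memory_capacity` (a made-up parameter) is the number of records that fit in main memory at once, and the standard-library `heapq.merge` plays the role of selecting the smallest unchosen record among the input buffers.

```python
import heapq

def two_phase_multiway_merge_sort(records, memory_capacity):
    """Sort `records` using memory-sized sorted sublists followed by one multiway merge."""
    # Phase 1: sort one memory-sized piece at a time (each piece would be
    # written back to disk as a sorted sublist).
    sublists = [sorted(records[start:start + memory_capacity])
                for start in range(0, len(records), memory_capacity)]
    # Phase 2: merge all sorted sublists in a single pass, always emitting
    # the smallest record not yet chosen.
    return list(heapq.merge(*sublists))

data = [9, 4, 7, 1, 8, 2, 5, 3]
result = two_phase_multiway_merge_sort(data, memory_capacity=3)
print(result)
```

A real implementation would read and write whole blocks and keep only one buffer per sublist in memory during Phase 2; the sketch keeps just the control flow.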
How large a set of records can be sorted
• Block size: B bytes
• Memory size: M bytes
• Record size: R bytes
Total number of records that can be sorted:
(M/R)((M/B)-1)
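One way to read the formula: memory holds M/B block-sized buffers, one of which is reserved for output, so Phase 2 can merge at most (M/B)-1 sublists, and each sublist from Phase 1 holds M/R records. The sketch below evaluates it; the function name and the sample values of M, B, and R are assumptions chosen for illustration and do not come from the slides.

```python
def max_sortable_records(memory_bytes, block_bytes, record_bytes):
    """Evaluate (M/R) * ((M/B) - 1) using whole records and whole blocks."""
    records_per_sublist = memory_bytes // record_bytes   # M/R
    mergeable_sublists = memory_bytes // block_bytes - 1  # (M/B) - 1
    return records_per_sublist * mergeable_sublists

# Hypothetical configuration: 50 MB of memory, 4096-byte blocks, 100-byte records.
limit = max_sortable_records(memory_bytes=50 * 2**20, block_bytes=4096, record_bytes=100)
print(limit)
```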
Accessing Hard Disk Quickly
• Organizing Data by Cylinders
• Using Multiple Disks
• Mirroring Disks
• Disk Scheduling and the Elevator Algorithm
• Prefetching and Large-Scale Buffering
Organizing Data by Cylinders
Disk Access Time = Seek Time + Rotational Delay + Transfer Time = 6.5 ms + 7.8 ms + 0.5 ms
Blocks distributed randomly on disk: sorting 10,000,000 records by Two-Phase, Multiway Merge-Sort takes 250 minutes.
Blocks organized by cylinders: Phase 1 takes 2.15 minutes and Phase 2 takes 125 minutes.
Place blocks that are accessed together on the same cylinder so we can often avoid seek time, and possibly rotational latency.
Using Multiple Disks
Megatron 747 (four platters with eight surfaces)
Megatron 737 (one platter with two surfaces) × 4
Two-Phase, Multiway Merge-Sort
1. Phase 1: Speed-up 4 times
2. Phase 2: Speed-up 2~3 times
Divide the data among several smaller disks rather than one large one. With more head assemblies, each able to go after blocks independently, the number of block accesses per unit time increases.
Mirroring Disks
• Enhance reliability
• Speed up reading but not writing
Disk Scheduling and the Elevator Algorithm
Arrival times for six block-access requests:
Cylinder of Request    First time available
1000                   0
3000                   0
7000                   0
2000                   20
8000                   30
5000                   40

Finishing times for block accesses using the elevator algorithm:
Cylinder of Request    Time completed
1000                   8.3
3000                   21.6
7000                   38.9
8000                   50.2
5000                   65.5
2000                   80.8

Finishing times for block accesses using the first-come-first-served algorithm:
Cylinder of Request    Time completed
1000                   8.3
3000                   21.6
7000                   38.9
2000                   58.2
8000                   79.5
5000                   94.8
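The two schedules can be reproduced with a small simulation. It is a sketch under assumptions that are not stated on the slide but are consistent with its numbers: times are in milliseconds, the head starts at cylinder 1000 moving toward higher cylinders, a seek costs 1 ms plus 1 ms per 500 cylinders (nothing if the head does not move), and every access adds 7.8 ms of rotational latency and 0.5 ms of transfer. New requests are only considered between accesses, and all names are illustrative.

```python
ROTATE_AND_TRANSFER_MS = 7.8 + 0.5   # assumed per-access cost after the seek

def seek_ms(distance):
    """Assumed seek model: 1 ms to start/stop plus 1 ms per 500 cylinders."""
    return 0.0 if distance == 0 else 1.0 + distance / 500.0

def simulate(requests, policy, start_cylinder=1000):
    """requests: list of (cylinder, arrival_ms). Returns [(cylinder, finish_ms)]."""
    pending = sorted(requests, key=lambda r: r[1])  # not yet arrived
    waiting = []                                    # arrived, in arrival order
    head, now, direction = start_cylinder, 0.0, +1
    finished = []
    while pending or waiting:
        while pending and pending[0][1] <= now:
            waiting.append(pending.pop(0))
        if not waiting:                 # idle until the next request arrives
            now = pending[0][1]
            continue
        if policy == "fcfs":
            target = waiting[0]
        else:                           # elevator: keep going in one direction
            ahead = [r for r in waiting if (r[0] - head) * direction >= 0]
            if not ahead:               # nothing ahead, so reverse
                direction = -direction
                ahead = waiting
            target = min(ahead, key=lambda r: abs(r[0] - head))
        waiting.remove(target)
        now += seek_ms(abs(target[0] - head)) + ROTATE_AND_TRANSFER_MS
        head = target[0]
        finished.append((target[0], round(now, 1)))
    return finished

REQUESTS = [(1000, 0), (3000, 0), (7000, 0), (2000, 20), (8000, 30), (5000, 40)]
elevator = simulate(REQUESTS, "elevator")
fcfs = simulate(REQUESTS, "fcfs")
print(elevator)
print(fcfs)
```

Under these assumptions the elevator schedule finishes the last request at 80.8 and first-come-first-served at 94.8, matching the two tables.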
Prefetching and Large-Scale Buffering
[Figure: disk reads fill Input Buffer 1 and Input Buffer 2, which feed the merge.]
Prefetch blocks to main memory in anticipation of their later use. Using track-sized or cylinder-sized output buffers can eliminate seek time and rotational latency.
1. Store the sorted sublists on whole, consecutive cylinders, with the blocks on each track being consecutive blocks of the sorted sublist.
2. Read whole tracks or whole cylinders whenever we need some more records from a given list.
[Figure: the merge fills Output Buffer 1 and Output Buffer 2, which are written to disk.]
Keeping Hard Disk Safely
• Intermittent failure
• Media decay
• Write failure
• Disk crashes
Intermittent Failures
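A disk read returns a pair (W, S):
• W: the data in the sector that is read
• S: a status bit that tells whether or not the read was successful
Disk Reading: if S == “bad”, repeat the read; if S == “good”, take W as the contents of the sector.
We may be fooled: the status can be “good” even though the data read is wrong.
Disk Writing: Disk Writing, then Disk Reading, then Status Checking.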
Checksums
• If there is an odd number of 1’s among a collection of bits, we say the bits have odd parity, or that their parity bit is 1.
• If there is an even number of 1’s among a collection of bits, we say the bits have even parity, or that their parity bit is 0.
01101000 ------- 011010001
11101110 ------- 111011100
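The parity rule and the two examples above can be sketched as follows. The helper names are made up for illustration; bit strings are used so the values read exactly as on the slide.

```python
def parity_bit(bits):
    """Return '1' if the bit string has an odd number of 1's, else '0'."""
    return "1" if bits.count("1") % 2 == 1 else "0"

def with_parity(bits):
    """Append the parity bit, so the stored string always has an even number of 1's."""
    return bits + parity_bit(bits)

def passes_check(stored):
    """A stored string is consistent if its total number of 1's is even."""
    return stored.count("1") % 2 == 0

first = with_parity("01101000")
second = with_parity("11101110")
print(first, second)
```

Flipping any single bit of a stored string makes `passes_check` return False, which is how a one-bit error is detected; an even number of flipped bits goes unnoticed.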
Stable Storage
While checksums will almost certainly detect the existence of a media failure or a failure to read or write correctly, they do not help us correct the error. To deal with these problems, we can implement a policy known as stable storage.
Each sector X is represented by a pair of sectors, XL and XR.
The stable-storage writing policy:
(1) Write the value of X into XL. Check that the value has status “good”. If not, repeat the write. After a set number of failed write attempts, apply a fix-up to XL.
(2) Repeat (1) for XR.
The stable-storage reading policy:
(1) To obtain the value of X, read XL. If status “bad” is returned, repeat the read a set number of times. If a value with status “good” is eventually returned, take that value as X.
(2) If we cannot read XL, repeat (1) with XR.
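The two policies can be sketched against a toy sector model. Everything here is hypothetical scaffolding: the `Sector` class is a stand-in that reports a status of "good" or "bad" (not a real disk API), `MAX_TRIES = 3` is an arbitrary choice for the "set number" of attempts, and the fix-up itself is left to the caller.

```python
MAX_TRIES = 3  # assumed value for the "set number" of attempts

class Sector:
    """Toy stand-in for a disk sector (hypothetical, for illustration only)."""
    def __init__(self):
        self.value = None
        self.failed = False   # set to True to simulate a media failure

    def write(self, value):
        if self.failed:
            return "bad"
        self.value = value
        return "good"

    def read(self):
        if self.failed:
            return None, "bad"
        return self.value, "good"

def write_copy(sector, value):
    """Retry the write up to MAX_TRIES times; True if some attempt was good."""
    return any(sector.write(value) == "good" for _ in range(MAX_TRIES))

def stable_write(xl, xr, value):
    """Write XL first, then XR. Returns the names of copies that need a fix-up."""
    return [name for name, sector in (("XL", xl), ("XR", xr))
            if not write_copy(sector, value)]

def stable_read(xl, xr):
    """Read XL, retrying on bad status; fall back to XR if XL cannot be read."""
    for sector in (xl, xr):
        for _ in range(MAX_TRIES):
            value, status = sector.read()
            if status == "good":
                return value
    raise IOError("neither XL nor XR could be read")

xl, xr = Sector(), Sector()
needs_fixup = stable_write(xl, xr, "v1")
xl.failed = True                      # simulate a media failure in XL
recovered = stable_read(xl, xr)
print(needs_fixup, recovered)
```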
Error-Handling Capabilities of Stable Storage
• Media failure: if one of XL and XR fails, read the other.
• Write failure: if the failure occurred during the writing of XL, copy XR to XL; if the failure occurred after XL was written, copy XL to XR.
Recovery from Disk Crashes
RAID (Redundant Arrays of Independent Disks) was developed to reduce the risk of data loss from disk crashes.
RAID 1
[Figure: mirroring. A data disk and its redundant disk.]
RAID 4
Disk 1 : 11110000
Disk 2 : 10101010
Disk 3 : 00111000
The redundant disk will have the following parity check bits :
Disk 4 : 01100010
While mirroring disks uses as many redundant disks as there are data disks, RAID 4 uses only one redundant disk no matter how many data disks there are.
Reading
Reading blocks from a data disk is no different from reading blocks from any disk. In some circumstances, we can actually get the effect of two simultaneous reads from one of the data disks.
Suppose we want to read a block of Disk 1 while Disk 1 is busy and none of the other disks are busy. We can read the corresponding blocks of the other disks:
Disk 2: 10101010
Disk 3: 00111000
Disk 4: 01100010
If we take the modulo-2 sum of the bits in each column, we get the block of Disk 1:
Disk 1: 11110000
Writing
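Disk 1: 11110000
Disk 2: 10101010 ----- 11001100 (new value)
Disk 3: 00111000
Modulo-2 sum of the old and new values of Disk 2: 10101010 + 11001100 = 01100110
Redundant Disk 4: 01100010 + 01100110 = 00000100 (new value)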
Failure Recovery
disk 1: 11110000
disk 2: ????????
disk 3: 00111000
disk 4: 01100010
Taking the modulo-2 sum of disks 1, 3, and 4 in each column, disk 2 is: 10101010
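The three RAID 4 operations above (computing the redundant block, updating it on a write, and recovering a failed disk) all reduce to bitwise XOR, which is the modulo-2 sum taken column by column. The sketch below uses the slide's bit patterns; the function names are made up for illustration.

```python
from functools import reduce
from operator import xor

def parity_block(blocks):
    """Modulo-2 sum (bitwise XOR) of a list of equal-sized blocks."""
    return reduce(xor, blocks, 0)

def parity_after_write(old_parity, old_block, new_block):
    """Update the redundant block using only the old and new data block."""
    return old_parity ^ old_block ^ new_block

data = [0b11110000, 0b10101010, 0b00111000]   # Disks 1, 2, 3
redundant = parity_block(data)                # Disk 4

# Writing: Disk 2 changes from 10101010 to 11001100.
new_redundant = parity_after_write(redundant, 0b10101010, 0b11001100)

# Failure recovery: Disk 2 is lost; rebuild it from Disks 1, 3, and 4.
rebuilt_disk2 = parity_block([data[0], data[2], redundant])

print(f"{redundant:08b} {new_redundant:08b} {rebuilt_disk2:08b}")
```

Updating from the old and new block touches only two disks, the one written and the redundant one, rather than reading every data disk.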
RAID 5
RAID 4 suffers from a bottleneck that we can see when we re-examine the process of writing a new data block: every write must also update the single redundant disk. RAID 5 avoids this by treating each disk as the redundant disk for some of the blocks.
Disk 1 Disk 2 Disk 3
Coping With Multiple Disk Crashes (RAID 6)
            Data Disks    Redundant Disks
Disk:       1  2  3  4    5  6  7
            1  1  1  0    1  0  0
            1  1  0  1    0  1  0
            1  0  1  1    0  0  1
a) The columns are every possible column of three 0's and 1's, except for the all-0 column.
b) The columns for the redundant disks have a single 1.
c) The columns for the data disks each have at least two 1's.
Writing
Disk    Content before    Content after disk 2 is changed to 00001111
1)      11110000          11110000
2)      10101010          00001111
3)      00111000          00111000
4)      01000001          01000001
5)      01100010          11000111
6)      00011011          10111110
7)      10001001          10001001
Failure Recovery
Disk    Disk 2 and Disk 5 failure    Disk 2 recovery from Disks 1, 4, 6    Disk 5 recovery from Disks 1, 2, 3
1)      11110000                     11110000                              11110000
2)      ????????                     00001111                              00001111
3)      00111000                     00111000                              00111000
4)      01000001                     01000001                              01000001
5)      ????????                     ????????                              11000111
6)      10111110                     10111110                              10111110
7)      10001001                     10001001                              10001001
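The recovery above can be sketched using the three-row matrix from the RAID 6 slide: the disks marked 1 in a row always have a modulo-2 sum of zero, so a row with exactly one failed disk lets that disk be rebuilt by XOR-ing the others in the row. The function and variable names are illustrative, and blocks are held as integers.

```python
# Rows of the matrix from the slide; position d-1 is 1 if disk d takes part.
ROWS = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def recover(disks, failed):
    """disks: {disk number: block or None}; failed: disk numbers to rebuild."""
    disks = dict(disks)
    remaining = set(failed)
    while remaining:
        progress = False
        for row in ROWS:
            members = [d for d in range(1, 8) if row[d - 1]]
            missing = [d for d in members if d in remaining]
            if len(missing) == 1:          # this row can rebuild one disk
                value = 0
                for d in members:
                    if d != missing[0]:
                        value ^= disks[d]
                disks[missing[0]] = value
                remaining.discard(missing[0])
                progress = True
        if not progress:
            raise ValueError("too many failures to recover")
    return disks

state = {1: 0b11110000, 2: None, 3: 0b00111000, 4: 0b01000001,
         5: None, 6: 0b10111110, 7: 0b10001001}
repaired = recover(state, failed={2, 5})
print(f"{repaired[2]:08b} {repaired[5]:08b}")
```

With disks 2 and 5 failed, the first row contains both and cannot be used at first; the second row contains only disk 2, so it is rebuilt from disks 1, 4, and 6, after which the first row rebuilds disk 5 from disks 1, 2, and 3, as on the slide.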