CS 505: Computer Structures
Memory and Disk I/O
Thu D. Nguyen
Spring 2005
Computer Science
Rutgers University

Main Memory Background

• Performance of Main Memory:
  – Latency: Cache Miss Penalty
    » Access Time: time from request until the word arrives
    » Cycle Time: minimum time between requests
  – Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  – Dynamic since it needs to be refreshed periodically (8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):
    » RAS or Row Access Strobe
    » CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  – No refresh (6 transistors/bit vs. 1 transistor)
  – Size: DRAM/SRAM 4-8x
  – Cost/Cycle time: SRAM/DRAM 8-16x

DRAM logical organization (4 Mbit)

[Figure: a 2,048 x 2,048 memory array of one-transistor storage cells. The 11 address bits A0-A10 are used twice (row, then column); the row address selects a word line, the sense amps & I/O latch the row, and the column decoder selects the data bit (D in, Q out).]

4 Key DRAM Timing Parameters

• tRAC: minimum time from RAS line falling to valid data output.
  – Quoted as the speed of a DRAM when you buy one
  – A typical 4 Mbit DRAM has tRAC = 60 ns
• tRC: minimum time from the start of one row access to the start of the next.
  – tRC = 110 ns for a 4 Mbit DRAM with tRAC of 60 ns
• tCAC: minimum time from CAS line falling to valid data output.
  – 15 ns for a 4 Mbit DRAM with tRAC of 60 ns
• tPC: minimum time from the start of one column access to the start of the next.
  – 35 ns for a 4 Mbit DRAM with tRAC of 60 ns

DRAM Performance

• A 60 ns (tRAC) DRAM can
  – perform a row access only every 110 ns (tRC)
  – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
    » In practice, external address delays and turning around buses make it 40 to 50 ns
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
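
To make these rates concrete, here is a minimal sketch in C; the timing values are the 4 Mbit figures above, while the x8 width (1 byte per access) is an illustrative assumption, not from the slides:

    #include <stdio.h>

    int main(void) {
        /* 4 Mbit DRAM timings from the slides, in ns */
        double tRC = 110.0;  /* one new row access every 110 ns            */
        double tPC = 35.0;   /* one column access per 35 ns in an open row */
        double bytes_per_access = 1.0;  /* assumed x8 part (illustrative)  */

        /* Effective bandwidth: new row every access vs. page-mode bursts */
        printf("row accesses: %.1f MB/s\n", bytes_per_access / tRC * 1000.0); /* ~9.1  */
        printf("page mode   : %.1f MB/s\n", bytes_per_access / tPC * 1000.0); /* ~28.6 */
        return 0;
    }

The 3x gap between the two printed rates is why page-mode (multiple CAS per RAS) matters so much below.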

DRAM History

• DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in 3 years
• '98 DRAM fab line costs $2B
  – DRAM only: density, leakage vs. speed
• Rely on increasing no. of computers & memory per computer (60% market)
  – SIMM or DIMM is the replaceable unit => computers use any generation DRAM
• Commodity, second-source industry => high volume, low profit, conservative
  – Little organization innovation in 20 years
• Order of importance: 1) Cost/bit 2) Capacity
  – First RAMBUS: 10X BW, +30% cost => little impact

More esoteric Storage Technologies?

• Tunneling Magnetic Junction RAM (TMJ-RAM):
  – Speed of SRAM, density of DRAM, non-volatile (no refresh)
  – New field called "Spintronics": combination of quantum spin and electronics
  – Same technology used in high-density disk drives
• MEMS storage devices:
  – Large magnetic "sled" floating on top of lots of little read/write heads
  – Micromechanical actuators move the sled back and forth over the heads

MEMS-based Storage

• Magnetic "sled" floats on an array of read/write heads
  – Approx 250 Gbit/in²
  – Data rates: IBM: 250 MB/s with 1000 heads; CMU: 3.1 MB/s with 400 heads
• Electrostatic actuators move the media around to align it with the heads
  – Sweep the sled ±50 µm in < 0.5 ms
• Capacity estimated to be 1-10 GB in 10 cm²
See Ganger et al.: http://www.lcs.ece.cmu.edu/research/MEMS

Main Memory Performance

• Simple:
  – CPU, Cache, Bus, Memory all the same width (32 or 64 bits)
• Wide:
  – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  – CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved

Main Memory Performance

• Timing model (word size is 32 bits)
  – 1 cycle to send address
  – 6 cycles access time, 1 cycle to send data
  – Cache block is 4 words
• Simple miss penalty = 4 x (1 + 6 + 1) = 32 cycles
• Wide miss penalty = 1 + 6 + 1 = 8 cycles
• Interleaved miss penalty = 1 + 6 + 4 x 1 = 11 cycles
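
Each number falls out of a one-line model; a quick check in C, with the cycle counts taken straight from the slide:

    #include <stdio.h>

    int main(void) {
        int addr = 1, access = 6, xfer = 1;  /* cycles: send address, access, send data */
        int block_words = 4;

        int simple      = block_words * (addr + access + xfer); /* one word at a time    */
        int wide        = addr + access + xfer;                 /* whole block at once   */
        int interleaved = addr + access + block_words * xfer;   /* accesses overlap in
                                                                   banks; 1 word/cycle out */
        printf("simple=%d wide=%d interleaved=%d\n", simple, wide, interleaved);
        /* prints: simple=32 wide=8 interleaved=11 */
        return 0;
    }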

How Many Banks?

• Number of banks ≥ number of clock cycles to access a word in a bank
  – otherwise the CPU will return to the original bank before it can have the next word ready
• Increasing DRAM size => fewer chips => harder to have banks

DRAMs per PC over Time

Minimum memory size vs. DRAM generation (entries = number of DRAM chips per PC):

                 '86      '89      '92      '96      '99      '02
                 1 Mb     4 Mb     16 Mb    64 Mb    256 Mb   1 Gb
    4 MB         32       8
    8 MB                  16       4
    16 MB                          8        2
    32 MB                                   4        1
    64 MB                                   8        2
    128 MB                                           4        1
    256 MB                                           8        2

Avoiding Bank Conflicts

• Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
• SW: loop interchange or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks (see the sketch after this list)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime no. of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
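
A small sketch of why a prime number of banks helps for the loop above: the column-order traversal strides by 512 words, so with 128 banks every access lands in one bank, while 127 (prime) banks spread the accesses out. The mapping is the "address mod number of banks" rule from the slide; the array bounds are just those of the example.

    #include <stdio.h>

    /* Count distinct banks touched striding through x[i][j] for fixed j
       (stride = 512 words), for a given number of banks. */
    static int banks_touched(int nbanks) {
        int used[131] = {0}, count = 0;
        for (int i = 0; i < 256; i++) {
            int bank = (i * 512) % nbanks;   /* bank number = address mod #banks */
            if (!used[bank]) { used[bank] = 1; count++; }
        }
        return count;
    }

    int main(void) {
        printf("128 banks: %d bank(s) used\n", banks_touched(128)); /* 1: all conflict */
        printf("127 banks: %d bank(s) used\n", banks_touched(127)); /* 127: spread out */
        return 0;
    }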

Fast Memory Systems: DRAM specific

• Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
  – RAMBUS: startup company; reinvent the DRAM interface
    » Each chip a module vs. slice of memory
    » Short bus between CPU and chips
    » Does own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to the system clock (66 - 150 MHz)
• Niche memory or main memory?
  – e.g., Video RAM for frame buffers: DRAM + fast serial output

Potential DRAM Crossroads?

• After 20 years of 4X every 3 years, running into a wall? (64 Mb - 1 Gb)
• How can we keep $1B fab lines full if people buy fewer DRAMs per computer?
• Cost/bit –30%/yr if the 4X/3 yr trend stops?
• What will happen to the $40B/yr DRAM industry?

Main Memory Summary

• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
• DRAM future less rosy?

Virtual Memory: TB (TLB)

[Figure: three cache/TLB organizations. (1) Conventional organization: the CPU sends the VA to the TB (translation buffer); the resulting PA accesses the cache and then memory. (2) Virtually addressed cache: the VA indexes the cache directly and is translated only on a miss; raises the synonym problem. (3) Overlapped: cache access proceeds in parallel with VA translation, keeping PA tags (and VA tags); requires the cache index to remain invariant across translation, with a physically addressed L2 below.]

2. Fast hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache, or just Virtual Cache, vs. Physical Cache
  – Every time a process is switched, logically we must flush the cache; otherwise we get false hits
    » Cost is the time to flush + "compulsory" misses from an empty cache
  – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
  – I/O must interact with the cache, so it needs virtual addresses
• Solution to aliases
  – One possible solution in Wang et al.'s paper
• Solution to cache flush
  – Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong

2. Fast Cache Hits by Avoiding Translation: Process ID impact

[Chart: miss rate (Y axis, up to 20%) vs. cache size (X axis, 2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess with cache flush. Dark gray: multiprocess with Process ID tags.]

2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

• If the index is in the physical part of the address, we can start the tag access in parallel with translation and then compare against the physical tag (see the sketch below)
• Limits cache to page size: what if we want bigger caches while using the same trick?
  – Higher associativity is one solution

[Figure: the address splits into Page Address | Page Offset; the cache sees Address Tag | Index | Block Offset, with Index and Block Offset contained within the Page Offset.]
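
A hedged back-of-the-envelope for that limit, assuming (illustratively) 8 KB pages: the index and block-offset bits must come from the untranslated page-offset bits, so each way of the cache can cover at most one page.

    #include <stdio.h>

    int main(void) {
        int page_bytes = 8192;   /* assumed 8 KB pages (illustrative) */
        for (int assoc = 1; assoc <= 8; assoc *= 2)
            /* index + block offset stay inside the page offset, so each way
               is at most one page: max cache = page size * associativity   */
            printf("%d-way: max cache indexed in parallel = %d KB\n",
                   assoc, page_bytes * assoc / 1024);
        return 0;
    }

This prints 8, 16, 32, 64 KB, which is why higher associativity is listed as the way to grow the cache past the page size.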

Alpha 21064

• Separate Instr & Data TLBs & caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8 KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
• Victim buffer: to give reads priority over writes
• 4-entry write buffer between D$ & L2$

[Figure: 21064 memory hierarchy with instruction and data caches, stream buffer, write buffer, and victim buffer.]

Alpha Memory Performance: Miss Rates of SPEC92

[Chart: miss rate (log scale, 0.0001 to 1) for AlphaSort, Eqntott, Ora, Alvinn, and Spice, showing the 8 KB I$, 8 KB D$, and 2 MB L2.]

Alpha CPI Components

[Chart: CPI (0 to 5) broken down for AlphaSort, Espresso, Sc, Mdljsp2, Ear, Alvinn, and Mdljp2.]

• Instruction stall: branch mispredict (green)
• Data cache (blue); instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts

Pitfall: Predicting Cache Performance from Different Programs (ISA, compiler, ...)

• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
• Why 2X Alpha vs. MIPS?

[Chart: miss rate (0% to 35%) vs. cache size (1 to 128 KB) for the data and instruction caches on tomcatv, gcc, and espresso.]

Pitfall: Simulating Too Small an Address Trace

[Chart: cumulative average memory access time (1 to 4.5) vs. instructions executed (0 to 12 billion).]

I$ = 4 KB, B = 16 B
D$ = 4 KB, B = 16 B
L2 = 512 KB, B = 128 B
MP = 12, 200 (miss penalties)

Main Memory Summary

• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
• DRAM future less rosy?

Outline

• Disk Basics
• Disk History
• Disk options in 2000
• Disk fallacies and performance
• Tapes
• RAID

Disk Device Terminology

• Several platters, with information recorded magnetically on both surfaces (usually)
• The actuator moves a head (at the end of an arm, one per surface) over a track ("seek"), selects a surface, waits for the sector to rotate under the head, then reads or writes
  – "Cylinder": all tracks under the heads
• Bits are recorded in tracks, which in turn are divided into sectors (e.g., 512 bytes)

[Figure: platter with outer track, inner track, and sector; actuator, arm, and head.]

Photo of Disk Head, Arm, Actuator

[Photo: spindle with 12 platters; actuator, arm, and head.]

Disk Device Performance

• Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead
• Seek Time? depends on the number of tracks the arm must move and the seek speed of the disk
• Rotation Time? depends on how fast the disk rotates and how far the sector is from the head
• Transfer Time? depends on the data rate (bandwidth) of the disk (bit density) and the size of the request

[Figure: platter, arm, actuator, head, inner/outer tracks, sector, spindle, controller.]

Disk Device Performance

• Average distance of a sector from the head?
  – 1/2 the time of a rotation (see the sketch after this list)
    » 7200 Revolutions Per Minute = 120 Rev/sec
    » 1 revolution = 1/120 sec = 8.33 milliseconds
    » 1/2 rotation (revolution) = 4.17 ms
• Average no. of tracks to move the arm?
  – Sum all possible seek distances from all possible tracks / # possible
    » Assumes the seek target is random
  – Disk industry standard benchmark
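
The rotational-latency arithmetic above, as a check in C:

    #include <stdio.h>

    int main(void) {
        double rpm = 7200.0;
        double ms_per_rev = 60000.0 / rpm;      /* 8.33 ms per revolution   */
        double avg_rot_ms = ms_per_rev / 2.0;   /* half a rotation: 4.17 ms */
        printf("%.2f ms/rev, %.2f ms average rotational latency\n",
               ms_per_rev, avg_rot_ms);
        return 0;
    }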

Data Rate: Inner vs. Outer Tracks

• To keep things simple, disks originally kept the same number of sectors per track
  – Since the outer track is longer, it had lower bits per inch
• Competition decided to keep BPI the same for all tracks ("constant bit density")
  – More capacity per disk
  – More sectors per track towards the edge
  – Since the disk spins at constant speed, outer tracks have a faster data rate
• Bandwidth of the outer track is 1.7X that of the inner track!

Devices: Magnetic Disks

• Purpose:
  – Long-term, nonvolatile storage
  – Large, inexpensive, slow level in the storage hierarchy
• Characteristics:
  – Seek time (~8 ms avg)
• Transfer rate
  – 10-30 MByte/sec
  – Blocks
• Capacity
  – Gigabytes
  – Quadruples every 3 years (aerodynamics)

[Figure: platter, head, sector, track, cylinder.]

Example: 7200 RPM = 120 RPS => 8 ms per rev; avg rot. latency = 4 ms
128 sectors per track => 0.0625 ms per sector
1 KB per sector => 16 MB/s

Response time = Queue + Controller + Seek + Rot + Xfer
                (Controller + Seek + Rot + Xfer = service time)

Historical Perspective

• 1956 IBM Ramac to early 1970s Winchester
  – Developed for mainframe computers, proprietary interfaces
  – Steady shrink in form factor: 27 in. to 14 in.
• 1970s developments
  – 5.25 inch floppy disk formfactor (microcode into mainframe)
  – Early emergence of industry-standard disk interfaces
    » ST506, SASI, SMD, ESDI
• Early 1980s
  – PCs and first-generation workstations
• Mid 1980s
  – Client/server computing
  – Centralized storage on file server
    » accelerates disk downsizing: 8 inch to 5.25 inch
  – Mass-market disk drives become a reality
    » industry standards: SCSI, IPI, IDE
    » 5.25 inch drives for standalone PCs; end of proprietary interfaces

Disk History

[Figure: data density (Mbit/sq. in.) and capacity of unit shown (MBytes). 1973: 1.7 Mbit/sq. in., 140 MBytes. 1979: 7.7 Mbit/sq. in., 2,300 MBytes.]

source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"

Historical Perspective

• Late 1980s/Early 1990s:
  – Laptops, notebooks, (palmtops)
  – 3.5 inch, 2.5 inch, (1.8 inch formfactors)
  – Formfactor plus capacity drives the market, not so much performance
    » Recently bandwidth improving at 40%/year
  – Challenged by DRAM, flash RAM in PCMCIA cards
    » still expensive; Intel promises but doesn't deliver
    » unattractive MBytes per cubic inch
  – Optical disk fails on performance but finds a niche (CD-ROM)

Disk History

[Figure: 1989: 63 Mbit/sq. in., 60,000 MBytes. 1997: 1450 Mbit/sq. in., 2,300 MBytes. 1997: 3090 Mbit/sq. in., 8,100 MBytes.]

source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"

1 inch disk drive!

• 2000 IBM MicroDrive:
  – 1.7" x 1.4" x 0.2"
  – 1 GB, 3600 RPM, 5 MB/s, 15 ms seek
  – Digital camera, PalmPC?
• 2006 MicroDrive?
  – 9 GB, 50 MB/s!
    » Assuming it finds a niche in a successful product
    » Assuming past trends continue

Disk Performance Model / Trends

• Capacity: +100%/year (2X / 1.0 yrs)
• Transfer rate (BW): +40%/year (2X / 2.0 yrs)
• Rotation + seek time: –8%/year (1/2 in 10 yrs)
• MB/$: >100%/year (2X / <1.5 yrs)
  – Fewer chips + areal density

State of the Art: Ultrastar 72ZX

– 73.4 GB, 3.5 inch disk
– 2¢/MB
– 10,000 RPM; 3 ms = 1/2 rotation
– 11 platters, 22 surfaces
– 15,110 cylinders
– 7 Gbit/sq. in. areal density
– 17 watts (idle)
– 0.1 ms controller time
– 5.3 ms avg. seek
– 50 to 29 MB/s (internal)

Latency = Queuing Time + [Controller time + Seek Time + Rotation Time] (per access) + [Size / Bandwidth] (per byte)

[Figure: sector, track, cylinder, head, platter, arm, track buffer.]

source: www.ibm.com; www.pricewatch.com; 2/14/00

Disk Performance Example

• Calculate the time to read 1 sector (512 B) for the Ultrastar 72 using advertised performance; the sector is on an outer track
• Disk latency = average seek time + average rotational delay + transfer time + controller overhead
• = 5.3 ms + 0.5 * 1/(10000 RPM) + 0.5 KB / (50 MB/s) + 0.15 ms
• = 5.3 ms + 0.5 / (10000 RPM / (60000 ms/min)) + 0.5 KB / (50 KB/ms) + 0.15 ms
• = 5.3 + 3.0 + 0.01 + 0.15 ms = 8.46 ms

Areal Density

• Bits recorded along a track
  – Metric is Bits Per Inch (BPI)
• Number of tracks per surface
  – Metric is Tracks Per Inch (TPI)
• Care about bit density per unit area
  – Metric is Bits Per Square Inch, called Areal Density
  – Areal Density = BPI x TPI

Areal Density

    Year    Areal Density (Mbit/sq. in.)
    1973    1.7
    1979    7.7
    1989    63
    1997    3090
    2000    17100

[Chart: areal density (log scale, 1 to 100,000) vs. year, 1970 to 2000.]

– Areal Density = BPI x TPI
– Slope changed from ~30%/yr to ~60%/yr around 1991 (see the sketch below)
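
A quick check of that slope change from the table's own data points, computing the compound annual growth between successive rows (compile with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* (year, Mbit/sq. in.) pairs from the table above */
        double yr[]  = {1973, 1979, 1989, 1997, 2000};
        double den[] = {1.7,  7.7,  63,   3090, 17100};
        for (int i = 1; i < 5; i++) {
            double years = yr[i] - yr[i - 1];
            double cagr  = pow(den[i] / den[i - 1], 1.0 / years) - 1.0;
            printf("%.0f-%.0f: %.0f%%/yr\n", yr[i - 1], yr[i], 100.0 * cagr);
        }
        /* prints roughly 29, 23, 63, 77 %/yr: ~30%/yr before ~1991, ~60%+ after */
        return 0;
    }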

Disk Characteristics in 2000

                                      Seagate Cheetah     IBM Travelstar      IBM 1GB Microdrive
                                      ST173404LC          32GH DJSA-232       DSCM-11000
                                      Ultra160 SCSI       ATA-4
    Disk diameter (inches)            3.5                 2.5                 1.0
    Formatted data capacity (GB)      73.4                32.0                1.0
    Cylinders                         14,100              21,664              7,167
    Disks                             12                  4                   1
    Recording surfaces (heads)        24                  8                   2
    Bytes per sector                  512 to 4096         512                 512
    Avg sectors per track (512 B)     ~424                ~360                ~140
    Max. areal density (Gbit/sq. in.) 6.0                 14.0                15.2

Disk Characteristics in 2000

                                      Seagate Cheetah     IBM Travelstar      IBM 1GB Microdrive
    Rotation speed (RPM)              10,033              5,411               3,600
    Avg. seek ms (read/write)         5.6/6.2             12.0                12.0
    Minimum seek ms (read/write)      0.6/0.9             2.5                 1.0
    Max. seek ms                      14.0/15.0           23.0                19.0
    Data transfer rate (MB/s)         27 to 40            11 to 21            2.6 to 4.2
    Link speed to buffer (MB/s)       160                 67                  13
    Power idle/operating (Watts)      16.4 / 23.5         2.0 / 2.6           0.5 / 0.8

Disk Characteristics in 2000

                                      Seagate Cheetah     IBM Travelstar      IBM 1GB Microdrive
    Buffer size (MB)                  4.0                 2.0                 0.125
    Size: h x w x d (inches)          1.6 x 4.0 x 5.8     0.5 x 2.7 x 3.9     0.2 x 1.4 x 1.7
    Weight (pounds)                   2.00                0.34                0.035
    Rated MTTF (powered-on hours)     1,200,000           (300,000?)          (20K / 5-yr life?)
    % of POH per month                100%                45%                 20%
    % of POH seeking/reading/writing  90%                 20%                 20%

Disk Characteristics in 2000

                                      Seagate Cheetah     IBM Travelstar      IBM 1GB Microdrive
    Load/unload cycles
    (disk powered on/off)             250 per year        300,000             300,000
    Nonrecoverable read errors
    per bits read                     <1 per 10^15        <1 per 10^13        <1 per 10^13
    Seek errors                       <1 per 10^7         not available       not available
    Shock tolerance:
    operating, not operating          10 G, 175 G         150 G, 700 G        175 G, 1500 G
    Vibration tolerance:
    operating, not operating          5-400 Hz @ 0.5 G,   5-500 Hz @ 1.0 G,   5-500 Hz @ 1 G,
    (sine swept, 0 to peak)           22-400 Hz @ 2.0 G   2.5-500 Hz @ 5.0 G  10-500 Hz @ 5 G

Technology Trends

Disk capacity now doubles every 12 months; before 1990, every 36 months

• Today: processing power doubles every 18 months
• Today: memory size doubles every 18-24 months (4X/3yr)
• Today: disk capacity doubles every 12-18 months
• Disk positioning rate (seek + rotate) doubles every TEN years!

=> The I/O GAP

Fallacy: Use Data Sheet "Average Seek" Time

• Manufacturers needed a standard for fair comparison ("benchmark")
  – Calculate all seeks from all tracks, divide by the number of seeks => "average"
• A real average would be based on how data is laid out on the disk and where real applications seek, with performance then measured
  – Usually applications tend to seek to nearby tracks, not to random tracks
• Rule of thumb: the observed average seek time is typically about 1/4 to 1/3 of the quoted seek time (i.e., 3X-4X faster)
  – Ultrastar 72 avg. seek: 5.3 ms => 1.7 ms

Fallacy: Use Data Sheet Transfer Rate

• Manufacturers quote the speed of the data rate off the surface of the disk
• Sectors contain an error detection and correction field (can be 20% of sector size) plus a sector number as well as data
• There are gaps between sectors on a track
• Rule of thumb: disks deliver about 3/4 of the internal media rate (1.3X slower) for data
• For example, the Ultrastar 72 quotes 50 to 29 MB/s internal media rate
  => Expect 37 to 22 MB/s user data rate

Disk Performance Example

• Calculate the time to read 1 sector for the Ultrastar 72 again, this time using 1/3 of the quoted seek time and 3/4 of the internal outer-track bandwidth (8.46 ms before)
• Disk latency = average seek time + average rotational delay + transfer time + controller overhead
• = (5.3 ms / 3) + 0.5 * 1/(10000 RPM) + 0.5 KB / (0.75 * 50 MB/s) + 0.15 ms
• = 1.77 ms + 0.5 / (10000 RPM / (60000 ms/min)) + 0.5 KB / (37.5 KB/ms) + 0.15 ms
• = 1.77 + 3.0 + 0.01 + 0.15 ms = 4.93 ms
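
Both calculations in one C sketch of the slide's latency model, with the advertised Ultrastar 72 numbers vs. the two rules of thumb:

    #include <stdio.h>

    /* latency = seek + rotational delay + transfer + controller overhead */
    static double disk_ms(double seek_ms, double rpm,
                          double kbytes, double mb_per_s, double ctrl_ms) {
        double rot_ms  = 0.5 * 60000.0 / rpm;   /* half a rotation   */
        double xfer_ms = kbytes / mb_per_s;     /* MB/s == KB/ms     */
        return seek_ms + rot_ms + xfer_ms + ctrl_ms;
    }

    int main(void) {
        /* advertised: 5.3 ms seek, 10,000 RPM, 50 MB/s, 0.15 ms controller */
        printf("advertised : %.2f ms\n",
               disk_ms(5.3, 10000, 0.5, 50.0, 0.15));            /* ~8.46 ms */
        /* rules of thumb: 1/3 of quoted seek, 3/4 of the media rate */
        printf("rule of thumb: %.2f ms\n",
               disk_ms(5.3 / 3.0, 10000, 0.5, 0.75 * 50.0, 0.15)); /* ~4.93 ms */
        return 0;
    }

Note that rotational delay, which neither rule of thumb touches, now dominates the access.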

Future Disk Size and Performance

• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read the whole disk:

    Year    Sequentially    Randomly (1 sector/seek)
    1990    4 minutes       6 hours
    2000    12 minutes      1 week (!)

• Does the 3.5" form factor make sense in 5-7 yrs?
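
The same kind of estimate, as a sketch, for the Ultrastar-class disk described earlier; the 37 MB/s user rate and ~4.93 ms per random access come from the preceding slides, while the table's 1990/2000 rows are for representative disks of those years:

    #include <stdio.h>

    int main(void) {
        double capacity_gb = 73.4, mb_per_s = 37.0;   /* user data rate          */
        double per_access_ms = 4.93, sector_kb = 0.5; /* per random sector read  */

        double seq_min  = capacity_gb * 1000.0 / mb_per_s / 60.0;
        double nsect    = capacity_gb * 1e6 / sector_kb;
        double rnd_days = nsect * per_access_ms / 1000.0 / 86400.0;

        printf("sequential: %.0f minutes\n", seq_min);  /* ~33 minutes */
        printf("random    : %.0f days\n", rnd_days);    /* ~8 days     */
        return 0;
    }

The random case lands in the "about a week" range the table shows: seeks and rotation, not bandwidth, bound whole-disk random access.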

SCSI: Small Computer System Interface

• Clock rate: 5 MHz / 10 MHz (fast) / 20 MHz (ultra), up to 80 MHz (Ultra3)
• Width: n = 8 bits / 16 bits (wide); up to n - 1 devices can communicate on a bus or "string"
• Devices can be slave ("target") or master ("initiator")
• SCSI protocol: a series of "phases", during which specific actions are taken by the controller and the SCSI disks (sketched below):
  – Bus Free: no device is currently accessing the bus
  – Arbitration: when the SCSI bus goes free, multiple devices may request (arbitrate for) the bus; fixed priority by address
  – Selection: informs the target that it will participate (Reselection if disconnected)
  – Command: the initiator reads the SCSI command bytes from host memory and sends them to the target
  – Data Transfer: data in or out, between initiator and target
  – Message Phase: message in or out, between initiator and target (identify, save/restore data pointer, disconnect, command complete)
  – Status Phase: the target returns status, just before command complete

Use Arrays of Small Disks?

• Katz and Patterson asked in 1987:
• Can smaller disks be used to close the gap in performance between disks and CPUs?

[Figure: conventional designs span four disk form factors (14", 10", 5.25", 3.5") from low end to high end; a disk array uses one 3.5" disk design.]

Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

                IBM 3390K      IBM 3.5" 0061    x70
    Capacity    20 GBytes      320 MBytes       23 GBytes
    Volume      97 cu. ft.     0.1 cu. ft.      11 cu. ft.    (9X less)
    Power       3 KW           11 W             1 KW          (3X less)
    Data Rate   15 MB/s        1.5 MB/s         120 MB/s      (8X more)
    I/O Rate    600 I/Os/s     55 I/Os/s        3900 I/Os/s   (6X more)
    MTTF        250 KHrs       50 KHrs          ??? Hrs
    Cost        $250K          $2K              $150K

Disk arrays have the potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?

Array Reliability

• Reliability of N disks = reliability of 1 disk ÷ N
  – 50,000 hours ÷ 70 disks = 700 hours
  – Disk system MTTF drops from 6 years to 1 month!
• Arrays (without redundancy) too unreliable to be useful!
• Hot spares support reconstruction in parallel with access: very high media availability can be achieved
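
The MTTF arithmetic in C:

    #include <stdio.h>

    int main(void) {
        double disk_mttf_hrs = 50000.0;   /* one disk */
        int    ndisks = 70;
        /* Without redundancy, the array fails when ANY disk fails */
        double array_mttf = disk_mttf_hrs / ndisks;
        printf("array MTTF = %.0f hours (~%.0f days)\n",
               array_mttf, array_mttf / 24.0);   /* ~714 hours, ~30 days */
        return 0;
    }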

Redundant Arrays of (Inexpensive) Disks

• Files are "striped" across multiple disks
• Redundancy yields high data availability
  – Availability: service still provided to the user, even if some components have failed
• Disks will still fail
• Contents are reconstructed from data redundantly stored in the array
  – Capacity penalty to store redundant info
  – Bandwidth penalty to update redundant info

Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its "mirror" (forming a recovery group): very high availability can be achieved
• Bandwidth sacrifice on write: logical write = two physical writes
• Reads may be optimized
• Most expensive solution: 100% capacity overhead
• (RAID 2 not interesting, so skip)

Redundant Array of Inexpensive Disks
RAID 3: Parity Disk

[Figure: a logical record is striped as physical records across four data disks (10010011, 11001101, 10100011, 11001101) plus a parity disk P.]

• P contains the sum of the other disks per stripe, mod 2 ("parity")
• If a disk fails, subtract P from the sum of the other disks to find the missing information (see the sketch below)
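
A minimal sketch of the parity idea in C, using the four stripe bytes from the figure (0x93, 0xCD, 0xA3, 0xCD):

    #include <stdio.h>

    int main(void) {
        /* One stripe across 4 data disks; P = XOR of all data (sum mod 2) */
        unsigned char d[4] = { 0x93, 0xCD, 0xA3, 0xCD };
        unsigned char p = d[0] ^ d[1] ^ d[2] ^ d[3];

        /* Disk 2 fails: XOR of parity and the survivors recovers it */
        unsigned char recovered = p ^ d[0] ^ d[1] ^ d[3];
        printf("parity=0x%02X recovered=0x%02X (was 0x%02X)\n",
               p, recovered, d[2]);
        return 0;
    }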

RAID 3

• Sum computed across the recovery group to protect against hard disk failures, stored in the P disk
• Logically a single high-capacity, high-transfer-rate disk: good for large transfers
• Wider arrays reduce capacity costs, but decrease availability
• 33% capacity cost for parity in this configuration

Inspiration for RAID 4

• RAID 3 relies on the parity disk to discover errors on a read
• But every sector already has an error detection field
• Rely on the sector's error detection field to catch errors on read, not on the parity disk
• Allows independent reads of different disks simultaneously

Redundant Arrays of Inexpensive Disks
RAID 4: High I/O Rate Parity

Insides of 5 disks (logical disk addresses increase down the columns; parity on a dedicated disk):

    D0    D1    D2    D3    P     <- stripe
    D4    D5    D6    D7    P
    D8    D9    D10   D11   P
    D12   D13   D14   D15   P
    D16   D17   D18   D19   P
    D20   D21   D22   D23   P
    ...

Example: small read D0 & D5; large write D12-D15

Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (write to one disk):
  – Option 1: read the other data disks, create the new sum, and write to the parity disk
  – Option 2: since P has the old sum, compare old data to new data, and add the difference to P
• Small writes are limited by the parity disk: writes to D0 and D5 both also write to the P disk

    D0    D1    D2    D3    P
    D4    D5    D6    D7    P

Redundant Arrays of Inexpensive Disks
RAID 5: High I/O Rate Interleaved Parity

Independent writes possible because of interleaved parity (logical disk addresses increase down the columns):

    D0    D1    D2    D3    P
    D4    D5    D6    P     D7
    D8    D9    P     D10   D11
    D12   P     D13   D14   D15
    P     D16   D17   D18   D19
    D20   D21   D22   D23   P
    ...

Example: a write to D0 and D5 uses disks 0, 1, 3, 4

Problems of Disk Arrays: Small Writes

RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes

[Figure: to write new data D0' over D0 in stripe (D0, D1, D2, D3, P): (1) read old data D0, (2) read old parity P, (3) write new data D0', (4) write new parity P' = (D0 XOR D0') XOR P.]
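
The small-write parity update in C (byte values illustrative): the new parity folds in only the changed disk, so D1-D3 are never read:

    #include <stdio.h>

    int main(void) {
        /* Stripe: 4 data bytes + parity */
        unsigned char d[4] = { 0x93, 0xCD, 0xA3, 0xCD };
        unsigned char p = d[0] ^ d[1] ^ d[2] ^ d[3];

        /* Small write to D0: 2 reads (old D0, old P) + 2 writes (new D0, new P) */
        unsigned char d0_new = 0x3C;
        unsigned char p_new  = (d[0] ^ d0_new) ^ p;   /* fold the change into P */
        d[0] = d0_new;

        /* Check: matches a full-stripe recompute, without reading D1-D3 */
        printf("P'=0x%02X full recompute=0x%02X\n",
               p_new, d[0] ^ d[1] ^ d[2] ^ d[3]);
        return 0;
    }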

System Availability: Orthogonal RAIDs

[Figure: an array controller connected to multiple string controllers; each data recovery group spans one disk per string, orthogonal to the strings.]

• Data recovery group: unit of data redundancy
• Redundant support components: fans, power supplies, controller, cables
• End-to-end data integrity: internal parity-protected data paths

System-Level Availability

[Figure: fully dual-redundant system: two hosts, duplicated I/O controllers and array controllers, with recovery groups across the disks.]

• Goal: no single points of failure
• With duplicated paths, higher performance can be obtained when there are no failures

Summary: Redundant Arrays of Disks (RAID) Techniques

• Disk mirroring, shadowing (RAID 1)
  – Each disk is fully duplicated onto its "shadow"
  – Logical write = two physical writes
  – 100% capacity overhead
• Parity data bandwidth array (RAID 3)
  – Parity computed horizontally
  – Logically a single high-data-bandwidth disk
• High I/O rate parity array (RAID 5)
  – Interleaved parity blocks
  – Independent reads and writes
  – Logical write = 2 reads + 2 writes
  – Parity + Reed-Solomon codes