[Title slide image: GMR head]
Northern California CMG, February 2004
Ted Oatway
Enterprise Solution Specialist
© Copyright 2004 Storage Technology Corporation (StorageTek)
Know Your Data
1) I/O is typically dominated by Reads
2) I/O is very Random even for large files
3) Large caches have little to offer
4) A few files are responsible for the majority of the activity
5) These files are typically mapped in host-memory by the calling process
Reference Paper
Instructional Workload (INS)
> twenty laboratory machines
> eight months of traces
Research Workload (RES)
> 13 desktop machines
> one year of traces
WEB workload
> single web server for an online library project
> uses the Postgres database management system
> 2,300 accesses per day
> one month of traces
A Comparison of File System Workloads
> Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson
> University of California, Berkeley and University of Washington
> June, 2000
Basis Reference Papers
The BSD study
> In 1985, Ousterhout et al. presented a general characterization of dynamically collected traces [Oust85].
> They traced three servers running BSD UNIX for slightly over three days.
The Sprite study
> In 1991, Baker et al. conducted the same type of analysis on file server and client information and its effect on local cache activity.
> They collected four two-day sets of traces [Bake91].
The IBM study
> In 1991, Bozman et al. repeated many of the Sprite studies using traces from two separate IBM sites [Bozm91].
> This study confirmed that the Sprite study applied to non-academic sites as well.
The NT study
> In 1999, the same studies were repeated on three sets of two-week traces taken from 45 hosts running Windows NT [Voge99].
Conclusions (1 of 3)
Reads vs. Writes
> Reads typically dominate writes for block activity.
> WEB workloads are read intensive.
> RES – activity is dominated by writes to logs when the local cache is small.
– When the local cache is increased, logging activity occurs on the host.
– The workload then matches INS.
> INS workload – a local cache that is too small increases read activity significantly (as much as 5x).
Average Block Lifetime
> UNIX: newly created blocks typically die within an hour.
> NT: newly created blocks that survive one second are likely to remain alive for a day.
> All workloads:
– Overwrites cause most deleted blocks.
– Overwrites show substantial locality.
– A small write buffer is sufficient to absorb write traffic for nearly all workloads.
> A 30-second write delay in cache benefits most workloads.
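The 30-second write-delay point above can be made concrete with a small sketch. This is purely illustrative (not from the traced workloads): a write-back buffer that holds dirty blocks for a delay window lets overwrites of the same block collapse into a single physical disk write.

    # Illustrative only: count physical disk writes when dirty blocks are held
    # in a write-back buffer for `delay` seconds before being flushed.
    def disk_writes_with_delay(writes, delay=30.0):
        """writes: iterable of (timestamp_sec, block_id)."""
        pending = {}            # block_id -> time it first entered the buffer
        physical = 0
        for ts, block in sorted(writes):
            expired = [b for b, t0 in pending.items() if ts - t0 >= delay]
            for b in expired:   # flush blocks whose delay window has passed
                physical += 1
                del pending[b]
            pending.setdefault(block, ts)   # an overwrite of a pending block is absorbed
        return physical + len(pending)      # plus the final flush

    print(disk_writes_with_delay([(i * 0.5, "blk0") for i in range(10)]))   # 10 overwrites in 5 s -> 1 write
    print(disk_writes_with_delay([(i * 10.0, "blk0") for i in range(10)]))  # spread over 90 s -> 4 writes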
Conclusions (2 of 3)
Caching read traffic
> Small caches can sharply decrease disk read traffic
> There is no support for the claim that disk traffic becomes dominated by writes when a large cache is used
> Large caches show a diminishing return beyond the working set size
> Even a 1MB cache reduces read-bandwidth by 65–90%
Memory-mapping
> All modern workloads use memory-mapping to a large extent
> UNIX: a small number of memory-mapped files are shared among many active processes.
> If a file is kept in memory as long as it is memory-mapped by any process, the miss rate for file-map requests is extremely low.
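For readers less familiar with memory-mapping, here is a minimal illustration of what the traced applications do (Python's mmap module wrapping mmap(2); the file name and contents are made up for the example):

    import mmap, os, tempfile

    # Create a small scratch file to stand in for a shared library or data file.
    path = os.path.join(tempfile.gettempdir(), "mmap_demo.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 4096)

    # Map the whole file into the process's address space.
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mapped:
        first = mapped[:16]     # reads are serviced from the shared page cache
        mapped[0:4] = b"DATA"   # writes modify the same cached pages

    # Every process that maps this file shares the same cached pages, which is
    # why keeping a mapped file resident makes the file-map miss rate so low.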
Conclusions (3 of 3)
Read-ahead pre-fetching
> Applications are accessing larger files and the maximum file size has increased in recent years.
> However, larger files are also more likely to be accessed randomly than before.
> This renders straightforward pre-fetching less useful
File access patterns
> File access patterns are bimodal in that most files tend to be mostly read or mostly written.
> Especially true for files that are accessed frequently
Understanding the Basic Building Block
Disk capacity has increased by a factor of eight in the last five years
> 18GB, 36GB, 73GB, 146GB
> Soon to be 300GB
Disk Internal Transfer Rates, by contrast, have barely doubled in that time
> 32MB/sec to 60MB/sec
Critical Disk Drive Metrics
ITR – Internal Transfer Rate: how quickly the disk can transfer data between the platters and the buffer cache.
ETR – External Transfer Rate: how quickly the disk can transfer data between the buffer cache and the controller.
On-Board Cache – the buffer cache resident on the disk drive. Some disk controllers allow this cache to be operated in write-through mode, but this is unusual today.
RPM – Revolutions Per Minute, or "how quickly the next block of data comes around".
Average Access Time – on average, how quickly the head can move to a selected track.
IOPS – the rotational latency (set by RPM) plus the average access time determines the number of I/O operations the disk can perform per second.
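As a rough worked example of that last relationship (the seek times below are assumptions for illustration, not vendor specifications):

    # Estimate random IOPS from rotational speed and average seek time.
    def estimated_iops(rpm, avg_seek_ms):
        rotational_latency_ms = 0.5 * 60000.0 / rpm   # half a revolution, on average
        return 1000.0 / (avg_seek_ms + rotational_latency_ms)

    print(round(estimated_iops(10000, 5.0)))   # ~125, close to the 10K drives in the table below
    print(round(estimated_iops(15000, 3.5)))   # ~182, close to the 179 IOPS 15K drives below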
Disk Specifications
Capacity       RPM    Cache   ETR          ITR         IOPS
36GB (older)   10K    4MB     100 MB/sec   35 MB/sec   123
36GB           10K    4MB     200 MB/sec   35 MB/sec   100
72GB           10K    4MB     200 MB/sec   35 MB/sec   100
36GB           15K    8MB     200 MB/sec   60 MB/sec   179
72GB           15K    8MB     200 MB/sec   60 MB/sec   179
146GB          10K    16MB    200 MB/sec   60 MB/sec   130
180GB          7.2K   16MB    200 MB/sec   28 MB/sec   89

Barracuda SATA disk drive:
120GB          7.2K   8MB     150 MB/sec   71 MB/sec   75
Understanding Volume (RAID) Groups
1) Small RAID groups are better than larger RAID groups for most workloads
2) Small RAID groups better emulate individual disk drives
3) I/O is “bursty” by nature
4) “Bursty” and random access patterns cause I/Os to block at the disk drive
5) Contention occurs at the Volume Group level
[Diagram: a RAID5 (3+1) Volume Group carved into two volumes, LUN0 and LUN1. Volume = Logical Unit (LUN).]
Rules of Thumb
• The smallest disk in a stripe determines the overall size of the stripe.
• The slowest disk in a stripe determines the overall speed of the stripe.
• Contention occurs at the Volume Group level.
D-Series Disk – Basic Building Blocks
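A minimal sketch of the first two rules of thumb above (assumed sizes and speeds, ignoring RAID parity overhead):

    # The smallest disk sets how much every stripe member can contribute.
    def stripe_capacity_gb(disk_sizes_gb):
        return len(disk_sizes_gb) * min(disk_sizes_gb)

    # The slowest disk paces every stripe member.
    def stripe_speed_mb(disk_speeds_mb):
        return len(disk_speeds_mb) * min(disk_speeds_mb)

    print(stripe_capacity_gb([73, 73, 73, 36]))   # one 36GB drive drags the raw stripe to 4 x 36 = 144GB
    print(stripe_speed_mb([60, 60, 60, 35]))      # one 35 MB/sec drive paces the stripe at 4 x 35 = 140 MB/sec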
Queuing Theory
(Source: White Paper – Storage Systems Performance Considerations. AWQ = Average Wait Queue.)
[Diagram: the same 60 IOPS load presented to different disk configurations:
– A single 18GB drive rated at 68 IOPS: AWQ = 181 ms.
– Two 18GB drives rated at 68 IOPS each (parallelism): AWQ = 2.8 ms, a 64x improvement.
– A single 36GB 15K drive rated at 179 IOPS instead of a 36GB 10K drive rated at 100 IOPS (concurrency): AWQ = 4.3 ms, a 42x improvement.]
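The white paper's exact model is not reproduced here, but a simple M/M/1 queue sketch (an assumption for illustration) shows the same effect: wait time grows explosively as the offered IOPS approach a drive's capability, and collapses when the load is split across more or faster spindles.

    # Simple M/M/1 sketch; the numbers will not match the slide's figures exactly,
    # but the nonlinear behaviour is the point.
    def avg_wait_ms(offered_iops, drive_iops):
        service_ms = 1000.0 / drive_iops
        utilization = offered_iops / drive_iops
        return service_ms * utilization / (1.0 - utilization)   # time waiting in queue

    print(round(avg_wait_ms(60, 68), 1))    # one 68-IOPS drive at 60 IOPS: heavily queued
    print(round(avg_wait_ms(30, 68), 1))    # the same load split across two drives
    print(round(avg_wait_ms(60, 179), 1))   # the same load on a 179-IOPS (15K) drive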
Small RAID Groups
Small RAID Groups provide –
> more throughput than a single large RAID group (in aggregate)
• We do not design for a RAID level but determine the best layout for the drives we have configured.
> a better configuration for High Availability (H/A)
• A large RAID group often must have two or more disks in a single tray.
> smaller LUNs with less contention
• A RAID5 (3+1) using four 73GB disk drives is 210GB useable.
• A RAID5 (7+1) using eight 73GB disk drives is 420GB useable.
> better RAID5 write performance
> better balance on four FC-AL loop disk arrays
More Throughput
RAID5 (3+1), 128K Segment Size
A RAID5 (3+1) design benchmarks at about 9,000 IOPS.
RAID5 (7+1), 128K Segment Size
A RAID5 (7+1) design benchmarks at about 11,000 IOPS.
Two RAID5 (3+1) designs benchmark at about 18,000 IOPS in aggregate.
Better High Availability (H/A)
Configuring RAID10 vertically potentially puts the mirror in the same disk tray as the primary.
Configuring RAID10 horizontally puts the mirror in a separate disk tray from the primary.
Configuring RAID5 horizontally puts all disks in the same disk tray.
Configuring RAID5 vertically puts each disk in a separate disk tray.
Less Contention – One 8-way vs. two 4-way stripes
[Diagram: a clock-cycle timeline of 64K database retrievals against one RAID5 (7+1) 8-way stripe versus two RAID5 (3+1) 4-way stripes; S = serviced concurrently. The two smaller volume groups service more retrievals concurrently than the single large group.]
RAID5 Write Algorithms
Full-stripe writes: writes that update all the stripe units in a parity group.
> The new parity value is computed across all new blocks.
> No additional read or write operations are required to compute parity
> Full-stripe writes are the most efficient type of writes.
Reconstruct writes: writes that compute parity by reading in the data blocks of the stripe that are not being updated.
> Parity is then computed over this data and the new data.
> Reconstruct writes are less efficient than full-stripe writes
Read-modify writes: writes that compute the new parity value by
> 1) reading the old data blocks from the disks to be updated
> 2) reading the old parity blocks for the stripe
> 3) calculating how the new data is different from the old data
> 4) changing the old parity to reflect these differences.
Source: Striping in a RAID Level 5 Disk Array, University of Michigan
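All three paths rest on the same XOR parity arithmetic. A minimal sketch (illustrative, not array firmware):

    def xor_blocks(*blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    # Full-stripe write: parity comes from the new data alone; nothing is read back.
    def full_stripe_parity(new_blocks):
        return xor_blocks(*new_blocks)

    # Read-modify write: new parity = old parity XOR old data XOR new data.
    def read_modify_parity(old_parity, old_data, new_data):
        return xor_blocks(old_parity, old_data, new_data)

    data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
    parity = full_stripe_parity(data)
    new_d0 = b"\xff\xff\xff\xff"
    assert read_modify_parity(parity, data[0], new_d0) == full_stripe_parity([new_d0, data[1], data[2]])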
RAID5 Rules of Thumb
Reads in a RAID Level 5 are very similar to RAID Level 0
Writes in a RAID Level 5 are quite different
In general -
> writes that span a larger fraction of the stripe are more efficient than writes that span a smaller fraction.
Smaller RAID groups tend to be more efficient for writes than large RAID groups.
Source: Striping in a RAID Level 5 Disk Array, University of Michigan
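A rough accounting of physical I/Os per logical write makes the rule concrete (illustrative bookkeeping, not a controller model): a write that fills the stripe costs one I/O per data block plus one for parity, a one-block write costs four I/Os regardless of group width, and a small group reaches the efficient full-stripe case with far less data.

    # Physical I/Os per logical write for a RAID5 group of n_data data disks + 1 parity disk.
    def write_cost(n_data, blocks_written):
        if blocks_written == n_data:                    # full-stripe write
            return blocks_written + 1                   # write all data blocks + parity
        read_modify = 2 * blocks_written + 2            # read/write each updated block, read/write parity
        reconstruct = (n_data - blocks_written) + blocks_written + 1
        return min(read_modify, reconstruct)            # the controller picks the cheaper path

    print(write_cost(3, 3), write_cost(3, 1))   # 3+1 group: 4 I/Os full-stripe, 4 I/Os for one block
    print(write_cost(7, 7), write_cost(7, 1))   # 7+1 group: 8 I/Os full-stripe, 4 I/Os for one block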
Designing with four FC-AL Loops
Larger capacity disk drives are allowing for smaller arrays with 20TB capacities
Larger capacities but smaller configurations require fewer components overall
Attention to small details is now more important
9176 Customer Benchmark
[Chart: elapsed time (lower is better) for Tests 1–4 against RAID5 type and segment size – the 9145 baseline, RAID5 (3+1) 128K, RAID5 (7+1) 128K, and RAID5 (7+1) 256K. Small RAID groups with a matching segment size delivered an 80% improvement overall versus the baseline.]
Fibre Channel Arbitrated Loop Performance Estimate
[Chart: throughput (MB/sec) and loop overhead per command (msec) versus the number of drives on the loop (2 to 122). Series: Resultant Throughput, Loop Throughput, Drive Throughput, and Command Latency (msec).]
(Transfer size = 8,192 bytes; 5 Word delay; Full-Duplex; BB_Credit = Non-zero; 0% Reads / 100% Writes; 3 Write Tenancies; 2 Read Tenancies)
Loop Saturation – 1Gbps FC-AL
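The shape of that curve can be approximated with a deliberately simplified model (an assumption, not the estimate's actual math): aggregate drive throughput grows linearly with drive count until the loop's usable bandwidth becomes the ceiling.

    # Simplified loop-saturation model. Per-drive throughput assumes ~100 IOPS of
    # 8KB transfers (~0.8 MB/sec); the effective loop bandwidth assumes 1Gbps
    # minus protocol overhead. Both values are assumptions for illustration.
    def resultant_mb_per_sec(n_drives, per_drive_mb=0.8, effective_loop_mb=90.0):
        return min(n_drives * per_drive_mb, effective_loop_mb)

    for n in (8, 32, 64, 96, 122):
        print(n, resultant_mb_per_sec(n))   # beyond ~112 drives the loop, not the drives, is the limit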
Vertical Load Balancing (StorageTek Array)
A D200 tray has a 2Gbps back-plane.
> One tray supports two fibers for 2 x 2Gbps throughput.
> Two trays support four fibers for 4 x 2Gbps throughput.
> Three trays support four fibers for 4 x 2Gbps throughput, but the loops are unbalanced.
> Four trays support four fibers for 4 x 2Gbps throughput, and the loops are balanced.
[Diagram: a D280 controller pair with cache batteries, cabled through the tray LCCs to one, two, three, and four trays.]
Horizontal Load Balancing (StorageTek Array)
Each tray supports two fibre channel loops.
All Even Numbered slots are serviced by the red loop with fail-over to the green loop.
All Odd Numbered slots are serviced by the green loop with fail-over to the red loop.
[Diagram: one tray, slots 0–5, alternating between the two loops.]
3-Tray Design Problems
RAID5 (6+1)
[Diagram: three trays (Tray1–Tray3) of ten slots each, carved into Volume Groups 1–4 plus a hot spare and unused slots.]
3-Tray Design Problems – Back End Primary Paths
[Diagram: how Volume Groups 1–4 map onto the four back-end primary paths for a 4-tray RAID5 (6+1) layout, a 4-tray RAID5 (3+1) layout, and a 3-tray RAID5 (6+1) layout; the 3-tray layout spreads the member drives unevenly across the paths.]
Source: Jean Dion, StorageTek Canada
Spinning on a Cache Miss
[Chart: Sequential Reads – Throughput, 18GB 10K Seagate drives. 1 LUN per Controller = 50% throughput; 6 LUNs per Controller = 100% throughput.]
A Real-World Example
Traditional ORACLE layout using the Optimal Flexible Architecture (OFA)
Newer ORACLE layout using the “Stripe and Mirror Everything” Architecture (S.A.M.E.)
Parallel Access - 24 Disks - RAID1
Twelve volume groups: Data1–Data4, Index1–Index4, Archive, RollBack, /u01, Temp.
RAID1 – 72GB Drives – 144GB per Volume Group – 864GB Useable

Volume Group   GB   Contents   IOPS Used   IOPS Available
 1             72   Data1        230          200
 2             72   Data2        160          200
 3             72   Data3        100          200
 4             72   Data4        110          200
 5             72   Index1       230          200
 6             72   Index2       160          200
 7             72   Index3       100          200
 8             72   Index4       110          200
 9             72   Dump          50          200
10             72   RollBack      50          200
11             72   /U01          20          200
12             72   Temp          40          200

864GB used; 1360 IOPS used. Max IOPS 2400. # of Disks 24.
Since IOPS cannot be shared between volume groups, Hot Spots are created.
Twelve threads.
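The hot spots fall straight out of the table above (the 200 IOPS ceiling is the per-RAID1-group figure from that table):

    # Flag volume groups whose offered IOPS exceed what a single RAID1 pair delivers.
    used = {"Data1": 230, "Data2": 160, "Data3": 100, "Data4": 110,
            "Index1": 230, "Index2": 160, "Index3": 100, "Index4": 110,
            "Dump": 50, "RollBack": 50, "/u01": 20, "Temp": 40}
    AVAILABLE = 200   # IOPS per RAID1 volume group

    print([name for name, iops in used.items() if iops > AVAILABLE])   # ['Data1', 'Index1']
    print(sum(used.values()), "of", 12 * AVAILABLE, "IOPS used in aggregate")   # 1360 of 2400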
Parallel Access - 24 Disks - RAID5
Data1 – Index2
Data2 – Index3
Data3 – Index4
Data4 – Index1
Presented to the server as 12 separate LUNs that are NOT concatenated together.
Faster “Virtual” volumes are more resistant to Hot Spots.
RAID5 (3+1) – 72GB Drives

Volume Group        GB Used   GB Available   IOPS Used   IOPS Available
Data1 – Index2      144GB     216GB            390          400
Data2 – Index3      144GB     216GB            260          400
Data3 – Index4      144GB     216GB            210          400
Data4 – Index1      144GB     216GB            340          400
RollBack – Dump     144GB     216GB            100          400
Temp – /u01         144GB     216GB             60          400

864GB used of 1296GB; 1360 IOPS used. Max IOPS 2400. # of Disks 24. # of arrays 6.

For the same number of disks we see wasted disk, but no Hot Spots.
Six threads.
Stripe and Mirror Everything (SAME)
Under SAME, everything (Data, Index, Swap, Rollback, Dump, Temp, /u01) is striped across every volume group, so the same 1360 IOPS of load spreads evenly instead of concentrating on a few volumes:

> 16 Disks, 864GB total, 1 thread: ~340 of 400 IOPS used per volume group; 1360 of 1600 max IOPS (85%).
> 24 Disks, 864GB total, 1 thread: ~113 of 200 IOPS used per volume group; 1360 of 2400 max IOPS (56%).
> 24 Disks, 1296GB total: ~227 of 400 IOPS used per volume group; 1362 of 2400 max IOPS (57%).

Option #1: same-size database. Option #2: same number of disks.
ORACLE RDBMS Layout
[Diagram: volume groups split across Service Processor "A" and Service Processor "B".]
Data_vg – RAID5 Groups: RAID5 (3+1), 73GB disk drives, 10K RPM; ~219GB Volume Group carved into two ~110GB Volumes.
Oracle_vg – RAID1 Group: RAID1 (1+1), 73GB disk drives, 10K RPM; ~73GB Volume Group carved into two ~36GB Volumes.
Technologies Roadmap

Optical-electronic technologies (2000–2015):
> DVD R/W (18GB today)
> Blu-ray Disc (200GB planned)
> Holographic storage (3D)
> Colossal Storage Corp. project: atomic holography recording, density of 200 Tbits/in²

Other technologies – magnetic, MEMS (2000–2015):
> Atomic resolution storage (1,000 Gbits/in²)
> Heat-assisted magnetic recording (1,000 Gbits/in² planned)
> Perpendicular magnetization
> AFM storage (the IBM Millipede project)
> Superparamagnetic effect (2005), 60 Gbits/in² (source: IBM)
> Self-Ordered Magnetic Arrays (SOMA)
Increasing the Areal Density
SUPERPARAMAGNETIC LIMIT:
> Min elementary bit size 9 nm
> Could be reached by 2005
> 60 - 70 Gb/in² is the limit
TRANSITION WIDTH:
> The width between two neighboring bits of opposite magnetization: minimum distance 40 to 80 nm
SIDE TRACK EFFECT:
> Requires extra space between tracks to prevent over-writing
TRACKING:
> The smaller the bits get, the more difficult it is to read them
Source: Seagate – From Gigabytes to Exabytes
Single Pole Perpendicular Magnetic Recording
Source: Seagate – From Gigabytes to Exabytes
High-density magnetic data storage

Longitudinal Recording
> Longitudinal recording methods lay the magnetization of the media in the plane of the recording surface.

Perpendicular Recording
> Perpendicular recording methods stand the magnetization of the media on end, perpendicular to the plane of the recording surface.
> May approach 1 Tb per in²
> 1TB of data on a 3.5 inch disk
> 1TB of data on a tape cartridge
Heat Assisted Magnetic Recording
Heat Assisted Magnetic Recording (HAMR)
> Also known as optically assisted recording
> Involves producing a hot spot (commonly with a laser) on the media, while data is simultaneously written magnetically.
> The net effect is that when the media is heated, the coercivity or field required to write on the media is reduced
> Higher stability against superparamagnetism
Is it HAMR or OAR?
A laser heats spots on the disk to make them easier to magnetize.
Source: Seagate – From Gigabytes to Exabytes
Micro Electro-Mechanical System (MEMS)
Thermomechanical storage:
> Tiny depressions melted by an AFM tip into a polymer medium represent stored data bits that can then be read by the same tip
> 60Kbps Throughput – but there can be thousands of heads in an array
> 150 Gb/in² to 400 Gb/in²
The Millipede (IBM)
AFM (atomic-force microscopy) or "probe recording"
Questions, Concerns or Comments?