[Title slide image: GMR head]
Northern California CMG, February 2004
Ted Oatway
Enterprise Solution Specialist
© Copyright 2004 Storage Technology Corporation (StorageTek)
Know Your Data
1) I/O is typically dominated by Reads
2) I/O is very Random even for large files
3) Large caches have little to offer
4) A few files are responsible for the majority of the activity
5) These files are typically mapped in host-memory by the calling process
Reference Paper
Instructional Workload (INS)
> twenty laboratory machines
> eight months of traces
Research Workload (RES)
> 13 desktop machines
> one year of traces
WEB workload
> single web server for an online library project
> uses the Postgres database management system
> 2,300 accesses per day
> one month of traces
A Comparison of File System Workloads
> Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson
> University of California, Berkeley and University of Washington
> June, 2000
Basis Reference Papers
The BSD study
> In 1985, Ousterhout et al. presented a general characterization of dynamically collected traces [Oust85].
> They traced three servers running BSD UNIX for slightly over three days.
The Sprite study
> In 1991, Baker et al. conducted the same type of analysis on file server and client information and its effect on local cache activity.
> They collected four two-day sets of traces [Bake91].
The IBM study
> In 1991, Bozman et al. repeated many of the Sprite studies using traces from two separate IBM sites [Bozm91].
> This study confirmed that the Sprite study applied to non-academic sites as well.
The NT study
> In 1999, the same studies were repeated on three sets of two-week traces taken from 45 hosts running Windows NT [Voge99].
Conclusions (1 of 3)
Reads vs. Writes
> Reads typically dominate writes for block activity.
> WEB workloads are read intensive.
> RES – activity is dominated by writes to logs when the local cache is small.
– When the local cache is increased, logging activity occurs on the host.
– The workload then matches INS.
> INS workload – a local cache that is too small increases read activity significantly (as much as 5x).
Average Block Lifetime
> UNIX: newly created blocks typically die within an hour.
> NT: newly created blocks that survive one second are likely to remain alive for a day.
> All workloads:
– Overwrites cause most deleted blocks.
– Overwrites show substantial locality.
– A small write buffer is sufficient to absorb write traffic for nearly all workloads.
> A 30-second write delay in cache benefits most workloads.
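The 30-second write-delay point above can be made concrete with a small sketch. This is purely illustrative (not from the traced workloads): a write-back buffer that holds dirty blocks for a delay window lets overwrites of the same block collapse into a single physical disk write.

    # Illustrative only: count physical disk writes when dirty blocks are held
    # in a write-back buffer for `delay` seconds before being flushed.
    def disk_writes_with_delay(writes, delay=30.0):
        """writes: iterable of (timestamp_sec, block_id)."""
        pending = {}            # block_id -> time it first entered the buffer
        physical = 0
        for ts, block in sorted(writes):
            expired = [b for b, t0 in pending.items() if ts - t0 >= delay]
            for b in expired:   # flush blocks whose delay window has passed
                physical += 1
                del pending[b]
            pending.setdefault(block, ts)   # an overwrite of a pending block is absorbed
        return physical + len(pending)      # plus the final flush

    print(disk_writes_with_delay([(i * 0.5, "blk0") for i in range(10)]))   # 10 overwrites in 5 s -> 1 write
    print(disk_writes_with_delay([(i * 10.0, "blk0") for i in range(10)]))  # spread over 90 s -> 4 writes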
Conclusions (2 of 3)
Caching read traffic
> Small caches can sharply decrease disk read traffic
> There is no support for the claim that disk traffic becomes dominated by writes when a large cache is used
> Large caches show a diminishing return beyond the working set size
> Even a 1MB cache reduces read-bandwidth by 65–90%
Memory-mapping
> All modern workloads use memory-mapping to a large extent
> UNIX: a small number of memory-mapped files are shared among many active processes.
> If a file is kept in memory as long as it is memory-mapped by any process, the miss rate for file-map requests is extremely low.
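For readers less familiar with memory-mapping, here is a minimal illustration of what the traced applications do (Python's mmap module wrapping mmap(2); the file name and contents are made up for the example):

    import mmap, os, tempfile

    # Create a small scratch file to stand in for a shared library or data file.
    path = os.path.join(tempfile.gettempdir(), "mmap_demo.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 4096)

    # Map the whole file into the process's address space.
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mapped:
        first = mapped[:16]     # reads are serviced from the shared page cache
        mapped[0:4] = b"DATA"   # writes modify the same cached pages

    # Every process that maps this file shares the same cached pages, which is
    # why keeping a mapped file resident makes the file-map miss rate so low.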
Conclusions (3 of 3)
Read-ahead pre-fetching
> Applications are accessing larger files and the maximum file size has increased in recent years.
> However, larger files are also more likely to be accessed randomly than before.
> This renders straightforward pre-fetching less useful
File access patterns
> File access patterns are bimodal in that most files tend to be mostly read or mostly written.
> Especially true for files that are accessed frequently
Understanding the Basic Building Block
Disk capacity has increased by a factor of eight in the last five years
> 18GB, 36GB, 73GB, 146GB
> Soon to be 300GB
Disk Internal Transfer Rates, by contrast, have barely doubled in that time
> 32MB/sec to 60MB/sec
Critical Disk Drive Metrics
ITR – Internal Transfer Rate: how quickly the disk can transfer data between the platters and the buffer cache.
ETR – External Transfer Rate: how quickly the disk can transfer data between the buffer cache and the controller.
On-Board Cache – the buffer cache resident on the disk drive. Some disk controllers allow this cache to be operated in write-through mode, but this is unusual today.
RPM – Revolutions Per Minute, or "how quickly the next block of data comes around".
Average Access Time – on average, how quickly the head can move to a selected track.
IOPS – the rotational latency (set by RPM) plus the average access time determines the number of I/O operations the disk can perform per second.
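As a rough worked example of that last relationship (the seek times below are assumptions for illustration, not vendor specifications):

    # Estimate random IOPS from rotational speed and average seek time.
    def estimated_iops(rpm, avg_seek_ms):
        rotational_latency_ms = 0.5 * 60000.0 / rpm   # half a revolution, on average
        return 1000.0 / (avg_seek_ms + rotational_latency_ms)

    print(round(estimated_iops(10000, 5.0)))   # ~125, close to the 10K drives in the table below
    print(round(estimated_iops(15000, 3.5)))   # ~182, close to the 179 IOPS 15K drives below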
Disk Specifications
Capacity       RPM    Cache   ETR          ITR         IOPS
36GB (older)   10K    4MB     100 MB/sec   35 MB/sec   123
36GB           10K    4MB     200 MB/sec   35 MB/sec   100
72GB           10K    4MB     200 MB/sec   35 MB/sec   100
36GB           15K    8MB     200 MB/sec   60 MB/sec   179
72GB           15K    8MB     200 MB/sec   60 MB/sec   179
146GB          10K    16MB    200 MB/sec   60 MB/sec   130
180GB          7.2K   16MB    200 MB/sec   28 MB/sec   89

Barracuda SATA disk drive:
120GB          7.2K   8MB     150 MB/sec   71 MB/sec   75
Understanding Volume (RAID) Groups
1) Small RAID groups are better than larger RAID groups for most workloads
2) Small RAID groups better emulate individual disk drives
3) I/O is “bursty” by nature
4) “Bursty” and random access patterns cause I/Os to block at the disk drive
5) Contention occurs at the Volume Group level
[Diagram: a RAID5 (3+1) Volume Group carved into two volumes, LUN0 and LUN1. Volume = Logical Unit (LUN).]
Rules of Thumb
• The smallest disk in a stripe determines the overall size of the stripe.
• The slowest disk in a stripe determines the overall speed of the stripe.
• Contention occurs at the Volume Group level.
D-Series Disk – Basic Building Blocks
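A minimal sketch of the first two rules of thumb above (assumed sizes and speeds, ignoring RAID parity overhead):

    # The smallest disk sets how much every stripe member can contribute.
    def stripe_capacity_gb(disk_sizes_gb):
        return len(disk_sizes_gb) * min(disk_sizes_gb)

    # The slowest disk paces every stripe member.
    def stripe_speed_mb(disk_speeds_mb):
        return len(disk_speeds_mb) * min(disk_speeds_mb)

    print(stripe_capacity_gb([73, 73, 73, 36]))   # one 36GB drive drags the raw stripe to 4 x 36 = 144GB
    print(stripe_speed_mb([60, 60, 60, 35]))      # one 35 MB/sec drive paces the stripe at 4 x 35 = 140 MB/sec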
Queuing Theory
(Source: White Paper – Storage Systems Performance Considerations. AWQ = Average Wait Queue.)
[Diagram: the same 60 IOPS load presented to different disk configurations:
– A single 18GB drive rated at 68 IOPS: AWQ = 181 ms.
– Two 18GB drives rated at 68 IOPS each (parallelism): AWQ = 2.8 ms, a 64x improvement.
– A single 36GB 15K drive rated at 179 IOPS instead of a 36GB 10K drive rated at 100 IOPS (concurrency): AWQ = 4.3 ms, a 42x improvement.]
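The white paper's exact model is not reproduced here, but a simple M/M/1 queue sketch (an assumption for illustration) shows the same effect: wait time grows explosively as the offered IOPS approach a drive's capability, and collapses when the load is split across more or faster spindles.

    # Simple M/M/1 sketch; the numbers will not match the slide's figures exactly,
    # but the nonlinear behaviour is the point.
    def avg_wait_ms(offered_iops, drive_iops):
        service_ms = 1000.0 / drive_iops
        utilization = offered_iops / drive_iops
        return service_ms * utilization / (1.0 - utilization)   # time waiting in queue

    print(round(avg_wait_ms(60, 68), 1))    # one 68-IOPS drive at 60 IOPS: heavily queued
    print(round(avg_wait_ms(30, 68), 1))    # the same load split across two drives
    print(round(avg_wait_ms(60, 179), 1))   # the same load on a 179-IOPS (15K) drive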
Small RAID Groups
Small RAID Groups provide –
> more throughput than a single large RAID group (in aggregate)
• We do not design for a RAID level but determine the best layout for the drives we have configured.
> a better configuration for High Availability (H/A)
• A large RAID group often must have two or more disks in a single tray.
> smaller LUNs with less contention
• A RAID5 (3+1) using four 73GB disk drives is 210GB useable.
• A RAID5 (7+1) using eight 73GB disk drives is 420GB useable.
> better RAID5 write performance
> better balance on four FC-AL loop disk arrays
More Throughput
RAID5 (3+1), 128K Segment Size
A RAID5 (3+1) design benchmarks at about 9,000 IOPS.
RAID5 (7+1), 128K Segment Size
A RAID5 (7+1) design benchmarks at about 11,000 IOPS.
Two RAID5 (3+1) designs benchmark at about 18,000 IOPS in aggregate.
Better High Availability (H/A)
Configuring RAID10 vertically potentially puts the mirror in the same disk tray as the primary.
Configuring RAID10 horizontally puts the mirror in a separate disk tray from the primary.
Configuring RAID5 horizontally puts all disks in the same disk tray.
Configuring RAID5 vertically puts each disk in a separate disk tray.
Less Contention – One 8-way vs. two 4-way stripes
[Diagram: a clock-cycle timeline of 64K database retrievals against one RAID5 (7+1) 8-way stripe versus two RAID5 (3+1) 4-way stripes; S = serviced concurrently. The two smaller volume groups service more retrievals concurrently than the single large group.]
RAID5 Write Algorithms
Full-stripe writes: writes that update all the stripe units in a parity group.
> The new parity value is computed across all new blocks.
> No additional read or write operations are required to compute parity
> Full-stripe writes are the most efficient type of writes.
Reconstruct writes: writes that compute parity by reading in the data blocks of the stripe that are not being updated.
> Parity is then computed over this data and the new data.
> Reconstruct writes are less efficient than full-stripe writes
Read-modify writes: writes that compute the new parity value by
> 1) reading the old data blocks from the disks to be updated
> 2) reading the old parity blocks for the stripe
> 3) calculating how the new data is different from the old data
> 4) changing the old parity to reflect these differences.
Source: Striping in a RAID Level 5 Disk Array, University of Michigan
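All three paths rest on the same XOR parity arithmetic. A minimal sketch (illustrative, not array firmware):

    def xor_blocks(*blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    # Full-stripe write: parity comes from the new data alone; nothing is read back.
    def full_stripe_parity(new_blocks):
        return xor_blocks(*new_blocks)

    # Read-modify write: new parity = old parity XOR old data XOR new data.
    def read_modify_parity(old_parity, old_data, new_data):
        return xor_blocks(old_parity, old_data, new_data)

    data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
    parity = full_stripe_parity(data)
    new_d0 = b"\xff\xff\xff\xff"
    assert read_modify_parity(parity, data[0], new_d0) == full_stripe_parity([new_d0, data[1], data[2]])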
RAID5 Rules of Thumb
Reads in a RAID Level 5 are very similar to RAID Level 0
Writes in a RAID Level 5 are quite different
In general -
> writes that span a larger fraction of the stripe are more efficient than writes that span a smaller fraction.
Smaller RAID groups tend to be more efficient for writes than large RAID groups.
Source: Striping in a RAID Level 5 Disk Array, University of Michigan
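A rough accounting of physical I/Os per logical write makes the rule concrete (illustrative bookkeeping, not a controller model): a write that fills the stripe costs one I/O per data block plus one for parity, a one-block write costs four I/Os regardless of group width, and a small group reaches the efficient full-stripe case with far less data.

    # Physical I/Os per logical write for a RAID5 group of n_data data disks + 1 parity disk.
    def write_cost(n_data, blocks_written):
        if blocks_written == n_data:                    # full-stripe write
            return blocks_written + 1                   # write all data blocks + parity
        read_modify = 2 * blocks_written + 2            # read/write each updated block, read/write parity
        reconstruct = (n_data - blocks_written) + blocks_written + 1
        return min(read_modify, reconstruct)            # the controller picks the cheaper path

    print(write_cost(3, 3), write_cost(3, 1))   # 3+1 group: 4 I/Os full-stripe, 4 I/Os for one block
    print(write_cost(7, 7), write_cost(7, 1))   # 7+1 group: 8 I/Os full-stripe, 4 I/Os for one block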
Designing with four FC-AL Loops
Larger capacity disk drives are allowing for smaller arrays with 20TB capacities
Larger capacities but smaller configurations require fewer components overall
Attention to small details is now more important
9176 Customer Benchmark
[Chart: elapsed time (lower is better) for Tests 1–4 against RAID5 type and segment size – the 9145 baseline, RAID5 (3+1) 128K, RAID5 (7+1) 128K, and RAID5 (7+1) 256K. Small RAID groups with a matching segment size delivered an 80% improvement overall versus the baseline.]
Fibre Channel Arbitrated Loop Performance Estimate
[Chart: throughput (MB/sec) and loop overhead per command (msec) versus the number of drives on the loop (2 to 122). Series: Resultant Throughput, Loop Throughput, Drive Throughput, and Command Latency (msec).]
(Transfer size = 8,192 bytes; 5 Word delay; Full-Duplex; BB_Credit = Non-zero; 0% Reads / 100% Writes; 3 Write Tenancies; 2 Read Tenancies)
Loop Saturation – 1Gbps FC-AL
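The shape of that curve can be approximated with a deliberately simplified model (an assumption, not the estimate's actual math): aggregate drive throughput grows linearly with drive count until the loop's usable bandwidth becomes the ceiling.

    # Simplified loop-saturation model. Per-drive throughput assumes ~100 IOPS of
    # 8KB transfers (~0.8 MB/sec); the effective loop bandwidth assumes 1Gbps
    # minus protocol overhead. Both values are assumptions for illustration.
    def resultant_mb_per_sec(n_drives, per_drive_mb=0.8, effective_loop_mb=90.0):
        return min(n_drives * per_drive_mb, effective_loop_mb)

    for n in (8, 32, 64, 96, 122):
        print(n, resultant_mb_per_sec(n))   # beyond ~112 drives the loop, not the drives, is the limit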
Vertical Load Balancing (StorageTek Array)
A D200 tray has a 2Gbps back-plane.
> One tray supports two fibers for 2 x 2Gbps throughput.
> Two trays support four fibers for 4 x 2Gbps throughput.
> Three trays support four fibers for 4 x 2Gbps throughput, but the loops are unbalanced.
> Four trays support four fibers for 4 x 2Gbps throughput, and the loops are balanced.
[Diagram: a D280 controller pair with cache batteries, cabled through the tray LCCs to one, two, three, and four trays.]
Horizontal Load Balancing (StorageTek Array)
Each tray supports two fibre channel loops.
All Even Numbered slots are serviced by the red loop with fail-over to the green loop.
All Odd Numbered slots are serviced by the green loop with fail-over to the red loop.
[Diagram: one tray, slots 0–5, alternating between the two loops.]
3-Tray Design Problems
RAID5 (6+1)
[Diagram: three trays (Tray1–Tray3) of ten slots each, carved into Volume Groups 1–4 plus a hot spare and unused slots.]
3-Tray Design Problems – Back End Primary Paths
[Diagram: how Volume Groups 1–4 map onto the four back-end primary paths for a 4-tray RAID5 (6+1) layout, a 4-tray RAID5 (3+1) layout, and a 3-tray RAID5 (6+1) layout; the 3-tray layout spreads the member drives unevenly across the paths.]
Source: Jean Dion, StorageTek Canada
Spinning on a Cache Miss
[Chart: Sequential Reads – Throughput, 18GB 10K Seagate drives. 1 LUN per Controller = 50% throughput; 6 LUNs per Controller = 100% throughput.]
A Real-World Example
Traditional ORACLE layout using the Optimal Flexible Architecture (OFA)
Newer ORACLE layout using the “Stripe and Mirror Everything” Architecture (S.A.M.E.)
Parallel Access - 24 Disks - RAID1
Twelve volume groups: Data1–Data4, Index1–Index4, Archive, RollBack, /u01, Temp.
RAID1 – 72GB Drives – 144GB per Volume Group – 864GB Useable

Volume Group   GB   Contents   IOPS Used   IOPS Available
 1             72   Data1        230          200
 2             72   Data2        160          200
 3             72   Data3        100          200
 4             72   Data4        110          200
 5             72   Index1       230          200
 6             72   Index2       160          200
 7             72   Index3       100          200
 8             72   Index4       110          200
 9             72   Dump          50          200
10             72   RollBack      50          200
11             72   /U01          20          200
12             72   Temp          40          200

864GB used; 1360 IOPS used. Max IOPS 2400. # of Disks 24.
Since IOPS cannot be shared between volume groups, Hot Spots are created.
Twelve threads.
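The hot spots fall straight out of the table above (the 200 IOPS ceiling is the per-RAID1-group figure from that table):

    # Flag volume groups whose offered IOPS exceed what a single RAID1 pair delivers.
    used = {"Data1": 230, "Data2": 160, "Data3": 100, "Data4": 110,
            "Index1": 230, "Index2": 160, "Index3": 100, "Index4": 110,
            "Dump": 50, "RollBack": 50, "/u01": 20, "Temp": 40}
    AVAILABLE = 200   # IOPS per RAID1 volume group

    print([name for name, iops in used.items() if iops > AVAILABLE])   # ['Data1', 'Index1']
    print(sum(used.values()), "of", 12 * AVAILABLE, "IOPS used in aggregate")   # 1360 of 2400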
Parallel Access - 24 Disks - RAID5
Data1 – Index2
Data2 – Index3
Data3 – Index4
Data4 – Index1
Presented to the server as 12 separate LUNs that are NOT concatenated together.
Faster “Virtual” volumes are more resistant to Hot Spots.
RAID5 (3+1) – 72GB Drives

Volume Group        GB Used   GB Available   IOPS Used   IOPS Available
Data1 – Index2      144GB     216GB            390          400
Data2 – Index3      144GB     216GB            260          400
Data3 – Index4      144GB     216GB            210          400
Data4 – Index1      144GB     216GB            340          400
RollBack – Dump     144GB     216GB            100          400
Temp – /u01         144GB     216GB             60          400

864GB used of 1296GB; 1360 IOPS used. Max IOPS 2400. # of Disks 24. # of arrays 6.

For the same number of disks we see wasted disk, but no Hot Spots.
Six threads.
Stripe and Mirror Everything (SAME)
Under SAME, everything (Data, Index, Swap, Rollback, Dump, Temp, /u01) is striped across every volume group, so the same 1360 IOPS of load spreads evenly instead of concentrating on a few volumes:

> 16 Disks, 864GB total, 1 thread: ~340 of 400 IOPS used per volume group; 1360 of 1600 max IOPS (85%).
> 24 Disks, 864GB total, 1 thread: ~113 of 200 IOPS used per volume group; 1360 of 2400 max IOPS (56%).
> 24 Disks, 1296GB total: ~227 of 400 IOPS used per volume group; 1362 of 2400 max IOPS (57%).

Option #1: same-size database. Option #2: same number of disks.
ORACLE RDBMS Layout
[Diagram: volume groups split across Service Processor "A" and Service Processor "B".]
Data_vg – RAID5 Groups: RAID5 (3+1), 73GB disk drives, 10K RPM; ~219GB Volume Group carved into two ~110GB Volumes.
Oracle_vg – RAID1 Group: RAID1 (1+1), 73GB disk drives, 10K RPM; ~73GB Volume Group carved into two ~36GB Volumes.
Technologies Roadmap

Optical-electronic technologies (2000–2015):
> DVD R/W (18GB today)
> Blu-ray Disc (200GB planned)
> Holographic storage (3D)
> Colossal Storage Corp. project: atomic holography recording, density of 200 Tbits/in²

Other technologies – magnetic, MEMS (2000–2015):
> Atomic resolution storage (1,000 Gbits/in²)
> Heat-assisted magnetic recording (1,000 Gbits/in² planned)
> Perpendicular magnetization
> AFM storage (the IBM Millipede project)
> Superparamagnetic effect (2005), 60 Gbits/in² (source: IBM)
> Self-Ordered Magnetic Arrays (SOMA)
Increasing the Areal Density
SUPERPARAMAGNETIC LIMIT:
> Min elementary bit size 9 nm
> Could be reached by 2005
> 60 - 70 Gb/in² is the limit
TRANSITION WIDTH:
> The width between two neighboring bits of opposite magnetization: minimum distance 40 to 80 nm
SIDE TRACK EFFECT:
> Requires extra space between tracks to prevent over-writing
TRACKING:
> The smaller the bits get, the more difficult it is to read them
Source: Seagate – From Gigabytes to Exabytes
Single Pole Perpendicular Magnetic Recording
Source: Seagate – From Gigabytes to Exabytes
High-density magnetic data storage

Longitudinal Recording
> Longitudinal recording methods lay the magnetization of the media in the plane of the recording surface.

Perpendicular Recording
> Perpendicular recording methods stand the magnetization of the media on end, perpendicular to the plane of the recording surface.
> May approach 1 Tb per in²
> 1TB of data on a 3.5 inch disk
> 1TB of data on a tape cartridge
Heat Assisted Magnetic Recording
Heat Assisted Magnetic Recording (HAMR)
> Also known as optically assisted recording
> Involves producing a hot spot (commonly with a laser) on the media, while data is simultaneously written magnetically.
> The net effect is that when the media is heated, the coercivity or field required to write on the media is reduced
> Higher stability against superparamagnetism
Is it HAMR or OAR?
A laser heats spots on the disk to make them easier to magnetize.
Source: Seagate – From Gigabytes to Exabytes
Micro Electro-Mechanical System (MEMS)
Thermomechanical storage:
> Tiny depressions melted by an AFM tip into a polymer medium represent stored data bits that can then be read by the same tip
> 60Kbps Throughput – but there can be thousands of heads in an array
> 150 Gb/in² to 400 Gb/in²
The Millipede (IBM)
AFM (atomic-force microscopy) or "probe recording"
Questions, Concerns or Comments?