1 Yotta Zetta Exa Peta Tera Giga Mega Kilo Storage: Alternate Futures Jim Gray Microsoft Research Gray/talks IBM Almaden,

1

Yotta

Zetta

Exa

Peta

Tera

Giga

Mega

Kilo

Storage: Alternate Futures

Jim Gray

Microsoft Research

http://Research.Microsoft.com/~Gray/talks

IBM Almaden, 1 December 1999

2

Acknowledgments: Thank You!!

• Dave Patterson
– Convinced me that processors are moving to the devices.

• Kim Keeton and Erik Riedel
– Showed that many useful subtasks can be done by disk-processors, and quantified execution interval

• Remzi Arpaci-Dusseau
– Re-validated Amdahl's laws

3

Outline

• The Surprise-Free Future (5 years)
– 500 mips cpus for 10$
– 1 Gb RAM chips
– MAD at 50 Gbpsi
– 10 GBps SANs are ubiquitous
– 1 GBps WANs are ubiquitous

• Some consequences
– Absurd (?) consequences.
– Auto-manage storage
– Raid10 replaces Raid5
– Disc-packs
– Disk is the archive media of choice

• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”

4

The Surprise-free Storage Future

• 1 Gb RAM chips

• MAD at 50 Gbpsi

• Drives shrink one quantum

• Standard IO

• 10 GBps SANs are ubiquitous

• 1 Gbps WANs are ubiquitous

• 5 bips cpus for 1K$ and 500 mips cpus for 10$

5

1 Gb RAM Chips

• Moving to 256 Mb chips now

• 1 Gb will be “standard” in 5 years, 4 Gb will be premium product.

• Note:
– 256 Mb = 32 MB: the smallest memory
– 1 Gb = 128 MB: the smallest memory

6

System On A Chip

• Integrate Processing with memory on one chip
– chip is 75% memory now
– 1 MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM,… projects abound

• Integrate Networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet,.. Logic on chip.
– Direct IO (no intermediate bus)

• Functionally specialized cards shrink to a chip.

7

500 mips System On A Chip for 10$

• 486 now 7$
• 233 MHz ARM for 10$ system on a chip: http://www.cirrus.com/news/products99/news-product14.html
• AMD/Celeron 266 ~ 30$

• In 5 years, today’s leading edge will be
– System on chip (cpu, cache, mem ctlr, multiple IO)
– Low cost
– Low-power
– Have integrated IO

• High end is 5 BIPS cpus

8

Standard IO in 5 Years

• Probably

• Replace PCI with something better; will still need a mezzanine bus standard

• Multiple serial links directly from processor

• Fast (10 GBps/link) for a few meters

• System Area Networks (SANS) ubiquitous (VIA morphs to SIO?)

9

1 GBps

Ubiquitous 10 GBps SANs in 5 years

• 1 Gbps Ethernet is a reality now.
– Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,…

• 10 Gbps x4 WDM deployed now (OC192)

– 3 Tbps WDM working in lab

• In 5 years, expect 10x, progress is astonishing

• Gilder’s law: Bandwidth grows 3x/year http://www.forbes.com/asap/97/0407/090.htm

[Figure: link speed ladder: 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps)]

10

Thin Clients mean HUGE servers

• AOL hosting customer pictures

• Hotmail allows 5 MB/user, 50 M users

• Web sites offer electronic vaulting for SOHO.

• IntelliMirror: replicate client state on server

• Terminal server: timesharing returns

• …. Many more.
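The Hotmail bullet implies a server-side storage budget that is easy to work out. The sketch below is only that arithmetic, assuming decimal units and the worst case in which every user fills the quota; the function name is hypothetical.

```python
MB = 10**6
TB = 10**12

def server_storage_bytes(users, quota_bytes):
    """Worst-case server storage if every user fills the quota."""
    return users * quota_bytes

# Figures from the slide: 5 MB per user, 50 M users.
hotmail = server_storage_bytes(users=50 * 10**6, quota_bytes=5 * MB)
print(f"Hotmail worst case: {hotmail / TB:.0f} TB")
```

Under those assumptions the thin clients add up to 250 TB on the server side.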

11

Remember Your Roots?

12

MAD at 50 Gbpsi

• MAD: Magnetic Areal Density
– 3-10 Gbpsi in products
– 28 Gbpsi in lab
– 50 Gbpsi = superparamagnetic limit, but… people have ideas.

• Capacity: rise 10x in 5 years (conservative)
• Bandwidth: rise 4x in 5 years (density+rpm)
• Disk: 50 GB to 500 GB
– 60-80 MBps
– 1k$/TB
– 15 minute to 3 hour scan time.

13

The “Absurd” Disk

• 2.5 hr scan time (poor sequential access)

• 1 aps / 5 GB (VERY cold data)

• It’s a tape!

[Figure: a disk labeled 1 TB, 100 MB/s, 200 Kaps]

14

Disk vs Tape

• Disk
– 47 GB
– 15 MBps
– 5 ms seek time
– 3 ms rotate latency
– 9$/GB for drive, 3$/GB for ctlrs/cabinet
– 4 TB/rack

• Tape
– 40 GB
– 5 MBps
– 30 sec pick time
– Many minute seek time
– 5$/GB for media, 10$/GB for drive+library
– 10 TB/rack

The price advantage of tape is narrowing, and the performance advantage of disk is growing

Guesstimates: Cern: 200 TB; 3480 tapes; 2 col = 50 GB; Rack = 1 TB = 20 drives

15

Standard Storage Metrics

• Capacity:
– RAM: MB and $/MB: today at 512 MB and 3$/MB
– Disk: GB and $/GB: today at 50 GB and 10$/GB
– Tape: TB and $/TB: today at 50 GB and 12k$/TB (nearline)

• Access time (latency)
– RAM: 100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position

• Transfer rate
– RAM: 1 GB/s
– Disk: 15 MB/s (arrays can go to 1 GB/s)
– Tape: 5 MB/s (striping is problematic, but “works”)

16

New Storage Metrics: Kaps, Maps, SCAN?

• Kaps: How many kilobyte objects served per second
– The file server, transaction processing metric
– This is the OLD metric.

• Maps: How many megabyte objects served per second
– The Multi-Media metric

• SCAN: How long to scan all the data
– the data mining and utility metric

• And
– Kaps/$, Maps/$, TBscan/$
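As a rough illustration of the three metrics, the sketch below evaluates them for the disk on the Disk vs Tape slide (47 GB, 15 MBps, 5 ms seek, 3 ms rotate). It assumes one seek plus one rotate per access and no queuing, which is a simplification and not a model taken from the talk; the function names are hypothetical.

```python
KB, MB, GB = 10**3, 10**6, 10**9

def accesses_per_second(object_bytes, seek_s, rotate_s, bytes_per_s):
    """Objects served per second by one arm: seek + rotate + transfer, no queuing."""
    service_time = seek_s + rotate_s + object_bytes / bytes_per_s
    return 1.0 / service_time

def scan_seconds(capacity_bytes, bytes_per_s):
    """Time to read the whole device sequentially."""
    return capacity_bytes / bytes_per_s

# Disk parameters from the Disk vs Tape slide.
disk = dict(seek_s=0.005, rotate_s=0.003, bytes_per_s=15 * MB)

kaps = accesses_per_second(1 * KB, **disk)   # kilobyte objects per second
maps = accesses_per_second(1 * MB, **disk)   # megabyte objects per second
scan = scan_seconds(47 * GB, disk["bytes_per_s"])

print(f"Kaps ~ {kaps:.0f}, Maps ~ {maps:.0f}, SCAN ~ {scan / 60:.0f} min")
```

With these inputs the arm serves roughly 124 Kaps but only about 13 Maps, and a full scan takes about 52 minutes. Positioning dominates the small-object case and transfer dominates the large-object case.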

19

The Access Time Myth

• The Myth: seek or pick time dominates
• The reality:
– (1) Queuing dominates
– (2) Transfer dominates BLOBs
– (3) Disk seeks often short
• Implication: many cheap servers better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
• This is obvious for disk arrays
• This is even more obvious for tape arrays

[Figure: request timelines: Seek, Rotate, Transfer versus Wait, Seek, Rotate, Transfer]

20

Storage Ratios Changed

• 10x better access time

• 10x more bandwidth

• 4,000x lower media price

[Figure: Disk Performance vs Time, 1980-2000; curves: seeks per second, bandwidth (MB/s), capacity (GB)]

[Figure: Disk accesses/second vs Time, 1980-2000; axis: accesses per second]

[Figure: Storage Price vs Time, 1980-2000; axis: megabytes per kilo-dollar (MB/k$)]

• DRAM/disk media price ratio changed
– 1970-1990: 100:1
– 1990-1995: 10:1
– 1995-1997: 50:1
– today: 30:1 (~0.1$ per MB disk, 3$ per MB dram)

21

Data on Disk Can Move to RAM in 8 years

[Figure: Storage Price vs Time, 1980-2000; axis: megabytes per kilo-dollar (MB/k$); annotations: 30:1 price gap, 6 years]
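The step from a 30:1 price gap to a number of years can be sketched as a compound-improvement calculation. The annual improvement factors below are illustrative assumptions, not figures from the talk, and the function name is hypothetical.

```python
import math

def years_to_close_gap(price_ratio, annual_improvement):
    """Years until the pricier medium costs what the cheaper one costs today,
    if its price per MB improves by `annual_improvement` times each year."""
    if annual_improvement <= 1:
        raise ValueError("annual_improvement must be greater than 1")
    return math.log(price_ratio) / math.log(annual_improvement)

# Assumed improvement rates (hypothetical): 1.5x, 1.76x and 2x per year.
for factor in (1.5, 1.76, 2.0):
    print(f"{factor}x/year -> {years_to_close_gap(30, factor):.1f} years")
```

Under these assumptions a 30:1 gap closes in roughly 8 years at 1.5x per year and in roughly 6 years at about 1.76x per year, so the answer is quite sensitive to the assumed rate.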

22


– 500 mips cpus for 10$
– 1 Gb RAM chips
– MAD at 50 Gbpsi
– 10 GBps SANs are ubiquitous
– 1 GBps WANs are ubiquitous

• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”.

23

The (absurd?) consequences

• 256 way nUMA?

• Huge main memories
– now: 500 MB - 64 GB memories
– then: 10 GB - 1 TB memories

• Huge disks
– now: 5-50 GB 3.5” disks
– then: 50-500 GB disks

• Petabyte storage farms
– (that you can’t back up or restore).

• Disks >> tapes
– “Small” disks: one platter, one inch, 10 GB

• SAN convergence
– 1 GBps point to point is easy

• 1 Gb RAM chips

• MAD at 50 Gbpsi

• Drives shrink one quantum

• 10 GBps SANs are ubiquitous

• 500 mips cpus for 10$

• 5 bips cpus at high end

24

The Absurd? Consequences

• Further segregate processing from storage

• Poor locality

• Much useless data movement

• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips

[Figure: Processors (~1 Tips) connect to RAM Memory (~1 TB) at 10 TBps and to Disks (~100 TB) at 100 GBps]
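The bandwidth labels in the figure follow from the two Amdahl ratios on this slide. A minimal sketch of that arithmetic, assuming decimal units; the function and parameter names are hypothetical.

```python
def amdahl_balance(ips, bus_bytes_per_ips=10, io_bits_per_ips=1):
    """Return (bus bytes/s, io bytes/s) for a system executing `ips` instructions/s."""
    bus_bytes_per_s = ips * bus_bytes_per_ips
    io_bytes_per_s = ips * io_bits_per_ips / 8   # bits -> bytes
    return bus_bytes_per_s, io_bytes_per_s

bus, io = amdahl_balance(1e12)   # ~1 Tips of processors
print(f"bus: {bus / 1e12:.0f} TBps, io: {io / 1e9:.0f} GBps")
```

For 1 Tips this gives a 10 TBps bus and 125 GBps of io, which the slide rounds to 100 GBps.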

25

Storage Latency: How Far Away is the Data?

[Figure: relative access latency, with a travel-time analogy]

Registers: 1 = My Head (1 min)
On Chip Cache: 2 = This Room
On Board Cache: 10 = This Hotel (10 min)
Memory: 100 = Olympia (1.5 hr)
Disk: 10^6 = Pluto (2 Years)
Tape / Optical Robot: 10^9 = Andromeda (2,000 Years)
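The travel-time analogy appears to scale one unit of latency to one minute of travel. The sketch below applies that scaling to the relative latencies on the slide; the one-minute-per-unit factor is an inference from the "1 = 1 min" pairing, and the names are hypothetical.

```python
# Relative latencies from the slide (unit = one register access).
LATENCY = {
    "registers": 1,
    "on-chip cache": 2,
    "on-board cache": 10,
    "memory": 100,
    "disk": 10**6,
    "tape/optical robot": 10**9,
}

def human_scale(units, minutes_per_unit=1.0):
    """Express a relative latency as travel time, assuming 1 unit = 1 minute."""
    minutes = units * minutes_per_unit
    if minutes < 60:
        return f"{minutes:g} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    years = hours / 24 / 365
    return f"{years:,.1f} years"

for level, units in LATENCY.items():
    print(f"{level:>20}: {human_scale(units)}")
```

The results land close to the slide's rounded labels: memory is a trip of under two hours, disk is about two years away, and the tape robot is about two thousand years away.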

26

Consequences

• AutoManage Storage

• Sixpacks (for arm-limited apps)

• Raid5-> Raid10

• Disk-to-disk backup

• Smart disks

27

Auto Manage Storage

• 1980 rule of thumb:
– A DataAdmin per 10 GB, SysAdmin per mips

• 2000 rule of thumb
– A DataAdmin per 5 TB
– SysAdmin per 100 clones (varies with app).

• Problem:
– 5 TB is 60k$ today, 10k$ in a few years.
– Admin cost >> storage cost???

• Challenge:
– Automate ALL storage admin tasks

28

The “Absurd” Disk

• 2.5 hr scan time (poor sequential access)

• 1 aps / 5 GB (VERY cold data)

• It’s a tape!

[Figure: a disk labeled 1 TB, 100 MB/s, 200 Kaps]

29

Extreme case: 1TB disk: Alternatives

• Use all the heads in parallel
– Scan in 30 minutes
– Still one Kaps/5GB

• Use one platter per arm
– Share power/sheetmetal
– Scan in 30 minutes
– One KAPS per GB

[Figure: all heads in parallel: 1 TB, 500 MB/s, 200 Kaps; one platter per arm: 200 GB each, 500 MB/s, 1,000 Kaps]
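The scan times and Kaps-per-GB figures for the 1 TB disk and its two alternatives can be checked with a few lines of arithmetic. This is a sketch using the numbers on the slides; the five-arm reading of "200 GB each" is an inference, and the class and field names are hypothetical.

```python
from dataclasses import dataclass

GB, MB = 10**9, 10**6

@dataclass
class DiskConfig:
    name: str
    capacity_bytes: int
    bandwidth_bytes_per_s: int
    kaps: int

    def scan_minutes(self):
        """Minutes to read every byte sequentially."""
        return self.capacity_bytes / self.bandwidth_bytes_per_s / 60

    def gb_per_kaps(self):
        """GB of data behind each kilobyte access per second (higher = colder)."""
        return self.capacity_bytes / GB / self.kaps

configs = [
    DiskConfig("one arm, one head at a time", 1000 * GB, 100 * MB, 200),
    DiskConfig("one arm, all heads in parallel", 1000 * GB, 500 * MB, 200),
    DiskConfig("five arms, one platter each", 1000 * GB, 500 * MB, 1000),
]

for c in configs:
    print(f"{c.name}: scan {c.scan_minutes():.0f} min, {c.gb_per_kaps():g} GB per Kaps")
```

Parallel heads fix the scan time (about 33 minutes instead of nearly three hours) but leave the data just as cold. Only independent arms bring it down to one GB per Kaps.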

30

Drives shrink (1.8”, 1”)

• 150 kaps for 500 GB is VERY cold data

• 3 GB/platter today, 30 GB/platter in 5 years.

• Most disks are ½ full

• TPC benchmarks use 9 GB drives (need arms or bandwidth).

• One solution: smaller form factor– More arms per GB

– More arms per rack

– More arms per Watt

31

Prediction: 6-packs

• One way or another, when disks get huge
– Will be packaged as multiple arms
– Parallel heads give bandwidth
– Independent arms give bandwidth & aps

• Package shares power, package, interfaces…

32

Stripes, Mirrors, Parity (RAID 0,1, 5)

• RAID 0: Stripes
– bandwidth

• RAID 1: Mirrors, Shadows,…
– Fault tolerance
– Reads faster, writes 2x slower

• RAID 5: Parity
– Fault tolerance
– Reads faster
– Writes 4x or 6x slower.

[Figure: block placement per disk]

RAID 0 (3 disks): 0,3,6,.. | 1,4,7,.. | 2,5,8,..
RAID 1 (2 disks): 0,1,2,.. | 0,1,2,..
RAID 5 (3 disks): 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
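The block placements in the figure can be reproduced with a short sketch. The rotation rule below is simply the one that matches the RAID 5 layout drawn on the slide (parity moving from the last disk toward the first), and XOR is the standard parity function. The function names are hypothetical.

```python
def raid0_disk(block, n_disks):
    """RAID 0: block b lives on disk b mod n."""
    return block % n_disks

def raid5_layout(n_stripes, n_disks):
    """Place data and parity per disk, rotating parity from the last disk
    toward the first, one stripe at a time."""
    disks = [[] for _ in range(n_disks)]
    block = 0
    for stripe in range(n_stripes):
        parity_disk = (n_disks - 1 - stripe) % n_disks
        for d in range(n_disks):
            if d == parity_disk:
                disks[d].append(f"P{stripe}")
            else:
                disks[d].append(str(block))
                block += 1
    return disks

def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

print([raid0_disk(b, 3) for b in range(9)])
print(raid5_layout(3, 3))

# A lost data block is the XOR of the parity and the surviving blocks.
d0, d1 = b"alpha", b"bravo"
p0 = xor_parity([d0, d1])
recovered = xor_parity([p0, d1])
print(recovered == d0)
```

The parity update is also why a small RAID 5 write costs extra IO: the old data and old parity must be read before the new data and new parity can be written.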

33

RAID 10 (stripes of mirrors) Wins: “wastes space, saves arms”

RAID 5:
• Performance
– 225 reads/sec
– 70 writes/sec
– Write: 4 logical IO, 2 seek + 1.7 rotate
• SAVES SPACE
• Performance degrades on failure

RAID 1:
• Performance
– 250 reads/sec
– 100 writes/sec
– Write: 2 logical IO, 2 seek + 0.7 rotate
• SAVES ARMS
• Performance improves on failure

34

The Storage Rack Today

• 140 arms
• 4 TB
• 24 racks
24 storage processors, 6+1 in rack

• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports 500 MBps IO

35

Storage Rack in 5 years?

• 140 arms
• 50 TB
• 24 racks
24 storage processors, 6+1 in rack

• Disks = 14 GBps IO
• Controllers = 5 GBps IO
• Ports 1 GBps IO

• My suggestion: move the processors into the storage racks.

36

It’s hard to archive a PetaByte

It takes a LONG time to restore it.

• Store it in two (or more) places online (on disk?).

• Scrub it continuously (look for errors)

• On failure, refresh lost copy from safe copy.

• Can organize the two copies differently (e.g.: one by time, one by space)
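The store-twice, scrub, and refresh cycle in the bullets above can be sketched in a few lines. This is a toy model, assuming each copy is a dict of block id to bytes and that a table of known-good CRC32 checksums is kept separately; none of these names come from the talk.

```python
import zlib

def block_ok(copy, block_id, expected_crc):
    """True if the block exists in this copy and matches its checksum."""
    return block_id in copy and zlib.crc32(copy[block_id]) == expected_crc

def scrub(primary, replica, checksums):
    """Check every block in both copies; refresh a bad copy from the good one.
    Returns a list of (block_id, action) for every problem found."""
    report = []
    for block_id, expected in checksums.items():
        ok_p = block_ok(primary, block_id, expected)
        ok_r = block_ok(replica, block_id, expected)
        if ok_p and not ok_r:
            replica[block_id] = primary[block_id]
            report.append((block_id, "replica refreshed"))
        elif ok_r and not ok_p:
            primary[block_id] = replica[block_id]
            report.append((block_id, "primary refreshed"))
        elif not ok_p and not ok_r:
            report.append((block_id, "lost in both copies"))
    return report

primary = {"a": b"1999-12-01 log", "b": b"sky survey tile"}
checksums = {k: zlib.crc32(v) for k, v in primary.items()}
replica = {"a": b"1999-12-01 log", "b": b"sky survey t!le"}  # one corrupted block

report = scrub(primary, replica, checksums)
print(report)
```

Because the loop only compares checksums per block id, the two copies are free to use different physical layouts, for example one ordered by time and one by space.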

37

Crazy Disk Ideas

• Disk Farm on a card: surface mount disks

• Disk (magnetic store) on a chip: (micro machines in Silicon)

• Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram)

[Figure: ASIC]

The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail. Clayton M. Christensen. ISBN: 0875845851

38

The Disk Farm On a Card

• The 500 GB disc card
• An array of discs
• Can be used as
– 100 discs
– 1 striped disc
– 50 Fault Tolerant discs
– ....etc
• LOTS of accesses/second bandwidth

14"

39

Functionally Specialized Cards

• Storage
• Network
• Display

[Figure: each card = ASIC + P mips processor + M MB DRAM]

Today: P = 50 mips, M = 2 MB

In a few years: P = 200 mips, M = 64 MB

40

Data Gravity: Processing Moves to Transducers

• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition) & cameras (vision)
– Storage: Data storage and analysis

41

It’s Already True of Printers

Peripheral = CyberBrick

• You buy a printer
• You get
– several network interfaces
– A Postscript engine: cpu, memory, software, a spooler (soon)
– and… a print engine.

42

Disks Become Supercomputers

• 100x in 10 years: 2 TB 3.5” drive

• Shrink to 1” is 200 GB

• Disk replaces tape?

• Disk is super computer!

[Figure: prefix ladder: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

43

Tera Byte Backplane

• TODAY
– Disk controller is 10 mips risc engine with 2 MB DRAM
– NIC is similar power

• SOON
– Will become 100 mips systems with 100 MB DRAM.

• They are nodes in a federation (can run Oracle on NT in disk controller).

• Advantages
– Uniform programming model
– Great tools
– Security
– Economics (cyberbricks)
– Move computation to data (minimize traffic)

All Device Controllers will be Cray 1’s

[Figure: Central Processor & Memory]

44

With Tera Byte Interconnect and Super Computer Adapters

• Processing is incidental to
– Networking
– Storage
– UI

• Disk Controller/NIC is
– faster than device
– close to device
– Can borrow device package & power

• So use idle capacity for computation.

• Run app in device.

• Both Kim Keeton's (UCB) and Erik Riedel's (CMU) theses investigate this and show benefits of this approach.

[Figure: Tera Byte Backplane]

45

Implications

Conventional

• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.

Radical

• Move app to NIC/device controller
• higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.

[Figure: Central Processor & Memory on a Tera Byte Backplane]

46

How Do They Talk to Each Other?

• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
– CORBA? COM+? RMI?
– One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.

[Figure: two nodes joined by a SAN; each stacks Applications over RPC?, streams, and datagrams, over SIO]

47

Basic Argument for x-Disks

• Future disk controller is a super-computer.
– 1 bips processor
– 128 MB dram
– 100 GB disk plus one arm

• Connects to SAN via high-level protocols
– RPC, HTTP, DCOM, Kerberos, Directory Services,….
– Commands are RPCs
– management, security,….
– Services file/web/db/… requests
– Managed by general-purpose OS with good dev environment

• Move apps to disk to save data movement
– need programming environment in controller
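The data-movement argument can be illustrated with a toy model. The sketch below compares shipping every record to the host with running the filter next to the data. `SmartDisk` and its methods are hypothetical names for illustration, not an interface from the talk or from any real controller.

```python
class SmartDisk:
    """Toy disk that counts the bytes it ships over the interconnect."""

    def __init__(self, records):
        self.records = list(records)
        self.bytes_shipped = 0

    def read_all(self):
        """Block-server style: ship every record; the host does the filtering."""
        self.bytes_shipped += sum(len(r) for r in self.records)
        return list(self.records)

    def select(self, predicate):
        """App-in-disk style: filter next to the data, ship only the matches."""
        matches = [r for r in self.records if predicate(r)]
        self.bytes_shipped += sum(len(r) for r in matches)
        return matches

def is_error(record):
    return record.startswith(b"error")

records = [b"ok"] * 998 + [b"error: arm 3", b"error: fan"]

host_side = SmartDisk(records)
host_hits = [r for r in host_side.read_all() if is_error(r)]

disk_side = SmartDisk(records)
disk_hits = disk_side.select(is_error)

print(host_side.bytes_shipped, disk_side.bytes_shipped)
```

Both paths return the same two records, but the in-disk filter moves only the matching bytes. The price is the one the slide names: the controller needs a programming environment able to run the predicate.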

48

The Slippery Slope

• If you add function to server

• Then you add more function to server

• Function gravitates to data.

Nothing = Sector Server

Something = Fixed App Server

Everything = App Server

49

Why Not a Sector Server? (let’s get physical!)

• Good idea, that’s what we have today.
• But
– cache added for performance
– Sector remap added for fault tolerance
– error reporting and diagnostics added
– SCSI commands (reserve,..) are growing
– Sharing problematic (space mgmt, security,…)

• Slipping down the slope to a 2-D block server

50

Why Not a 1-D Block Server?

Put A LITTLE on the Disk Server

• Tried and true design
– HSC - VAX cluster
– EMC
– IBM Sysplex (3980?)

• But look inside
– Has a cache
– Has space management
– Has error reporting & management
– Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Has locking
– Has remote replication
– Has an OS
– Security is problematic
– Low-level interface moves too many bytes

51

Why Not a 2-D Block Server?

Put A LITTLE on the Disk Server

• Tried and true design
– Cedar -> NFS
– file server, cache, space,..
– Open file is many fewer msgs

• Grows to have
– Directories + Naming
– Authentication + access control
– RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Locking
– Backup/restore/admin
– Cooperative caching with client

• File Servers are a BIG hit: NetWare™
– SNAP! is my favorite today

52

Why Not a File Server?

Put a Little on the Disk Server

• Tried and true design
– Auspex, NetApp, ...
– NetWare

• Yes, but look at NetWare
– File interface gives you app invocation interface
– Became an app server: Mail, DB, Web,….
– NetWare had a primitive OS: hard to program, so optimized wrong thing

53

Why Not Everything?

Allow Everything on Disk Server (thin clients)

• Tried and true design
– Mainframes, Minis, ...
– Web servers,…
– Encapsulates data
– Minimizes data moves
– Scaleable

• It is where everyone ends up.

• All the arguments against are short-term.

54

The Slippery Slope

• If you add function to server

• Then you add more function to server

• Function gravitates to data.

Nothing = Sector Server

Something = Fixed App Server

Everything = App Server

55


– Astonishing hardware progress.


• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”
