1
Yotta
Zetta
Exa
Peta
Tera
Giga
Mega
Kilo
Storage: Alternate Futures
Jim Gray
Microsoft Research
http://Research.Microsoft.com/~Gray/talks
IBM Almaden, 1 December 1999
2
Acknowledgments: Thank You!!
• Dave Patterson
– Convinced me that processors are moving to the devices.
• Kim Keeton and Erik Riedel
– Showed that many useful subtasks can be done by disk-processors, and quantified execution interval.
• Remzi Dusseau
– Re-validated Amdahl's laws.
3
Outline
• The Surprise-Free Future (5 years)
– 500 mips cpus for 10$
– 1 Gb RAM chips
– MAD at 50 Gbpsi
– 10 GBps SANs are ubiquitous
– 1 GBps WANs are ubiquitous
• Some consequences
– Absurd (?) consequences.
– Auto-manage storage
– Raid10 replaces Raid5
– Disc-packs
– Disk is the archive media of choice
• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”
4
The Surprise-free Storage Future
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• Standard IO
• 10 GBps SANs are ubiquitous
• 1 Gbps WANs are ubiquitous
• 5 bips cpus for 1K$ and 500 mips cpus for 10$
5
1 Gb RAM Chips
• Moving to 256 Mb chips now
• 1 Gb will be “standard” in 5 years, 4 Gb will be premium product.
• Note:
– 256 Mb = 32 MB: the smallest memory
– 1 Gb = 128 MB: the smallest memory
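The note above is a bits-to-bytes conversion: a one-chip memory holds the chip's capacity divided by 8. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the talk.

```python
def megabits_to_megabytes(megabits: int) -> float:
    """Convert a chip capacity in megabits (Mb) to megabytes (MB), at 8 bits per byte."""
    return megabits / 8


# A 1 Gb chip is 1024 Mb.
for label, megabits in [("256 Mb", 256), ("1 Gb", 1024)]:
    print(f"{label} chip = {megabits_to_megabytes(megabits):.0f} MB")
```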
6
System On A Chip
• Integrate Processing with memory on one chip
– chip is 75% memory now
– 1MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM,… projects abound
• Integrate Networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet,.. Logic on chip.
– Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
7
500 mips System On A Chip for 10$
• 486 now 7$
• 233 MHz ARM for 10$ system on a chip http://www.cirrus.com/news/products99/news-product14.html
• AMD/Celeron 266 ~ 30$
• In 5 years, today’s leading edge will be
– System on chip (cpu, cache, mem ctlr, multiple IO)
– Low cost
– Low-power
– Have integrated IO
• High end is 5 BIPS cpus
8
Standard IO in 5 Years
• Probably replace PCI with something better; will still need a mezzanine bus standard
• Multiple serial links directly from processor
• Fast (10 GBps/link) for a few meters
• System Area Networks (SANs) ubiquitous (VIA morphs to SIO?)
9
1 GBps
Ubiquitous 10 GBps SANs in 5 years
• 1 Gbps Ethernet is a reality now.
– Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,…
• 10 Gbps x4 WDM deployed now (OC192)
– 3 Tbps WDM working in lab
• In 5 years, expect 10x, progress is astonishing
• Gilder’s law: Bandwidth grows 3x/year http://www.forbes.com/asap/97/0407/090.htm
[Figure labels: link speeds of 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps)]
10
Thin Clients Mean HUGE Servers
• AOL hosting customer pictures
• Hotmail allows 5 MB/user, 50 M users
• Web sites offer electronic vaulting for SOHO.
• IntelliMirror: replicate client state on server
• Terminal server: timesharing returns
• …. Many more.
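The Hotmail bullet shows how per-user quotas add up on the server side. A small sketch of the multiplication, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and that every user fills the quota; the helper name is illustrative.

```python
MB = 10**6   # bytes, decimal units assumed
TB = 10**12  # bytes


def server_storage_tb(users: int, mb_per_user: int) -> float:
    """Worst-case server storage in TB if every user fills the quota."""
    return users * mb_per_user * MB / TB


# 50 M users at 5 MB each, the figures on the slide.
print(server_storage_tb(50_000_000, 5), "TB")
```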
11
Remember Your Roots?
12
MAD at 50 Gbpsi
• MAD: Magnetic Areal Density
– 3-10 Gbpsi in products
– 28 Gbpsi in lab
– 50 Gbpsi = superparamagnetic limit, but… people have ideas.
• Capacity: rise 10x in 5 years (conservative)
• Bandwidth: rise 4x in 5 years (density + rpm)
• Disk: 50 GB to 500 GB
• 60-80 MBps
• 1k$/TB
• 15 minute to 3 hour scan time.
13
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Figure: a 1 TB disk, 100 MB/s, 200 Kaps]
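The numbers on this slide follow from the three figure labels. The sketch below assumes decimal units and treats 200 Kaps as 200 kilobyte-object accesses per second; the function names are illustrative. With those assumptions the scan time comes out a little under 3 hours, in line with the slide's rounded 2.5 hr.

```python
def scan_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole disk sequentially, in hours."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600


def gb_per_access_per_second(capacity_bytes: float, accesses_per_s: float) -> float:
    """How many GB of data share each access-per-second of arm capability."""
    return capacity_bytes / 1e9 / accesses_per_s


capacity = 1e12    # 1 TB
bandwidth = 100e6  # 100 MB/s
accesses = 200     # 200 Kaps

print(f"scan time: {scan_hours(capacity, bandwidth):.2f} hours")
print(f"1 aps per {gb_per_access_per_second(capacity, accesses):.0f} GB")
```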
14
Disk vs Tape
• Disk
– 47 GB
– 15 MBps
– 5 ms seek time
– 3 ms rotate latency
– 9$/GB for drive, 3$/GB for ctlrs/cabinet
– 4 TB/rack
• Tape
– 40 GB
– 5 MBps
– 30 sec pick time
– Many minute seek time
– 5$/GB for media, 10$/GB for drive+library
– 10 TB/rack
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
Guesstimates: Cern: 200 TB; 3480 tapes; 2 col = 50 GB; Rack = 1 TB = 20 drives
15
Standard Storage Metrics
• Capacity:
– RAM: MB and $/MB: today at 512MB and 3$/MB
– Disk: GB and $/GB: today at 50GB and 10$/GB
– Tape: TB and $/TB: today at 50GB and 12k$/TB (nearline)
• Access time (latency)
– RAM: 100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position
• Transfer rate
– RAM: 1 GB/s
– Disk: 15 MB/s (arrays can go to 1 GB/s)
– Tape: 5 MB/s (striping is problematic, but “works”)
16
New Storage Metrics: Kaps, Maps, SCAN?
• Kaps: How many kilobyte objects served per second
– The file server, transaction processing metric
– This is the OLD metric.
• Maps: How many megabyte objects served per second
– The Multi-Media metric
• SCAN: How long to scan all the data
– the data mining and utility metric
• And
– Kaps/$, Maps/$, TBscan/$
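The three metrics can be estimated from a drive's basic parameters. The sketch below is a simplified single-arm model, not the talk's own calculation: it assumes each object costs one seek, one rotational latency, and one transfer, and it ignores queuing and caching. The `Drive` class is illustrative; the sample values are the disk figures from the Disk vs Tape slide.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    capacity_bytes: float
    bandwidth_bytes_per_s: float
    seek_s: float
    rotate_s: float

    def objects_per_second(self, object_bytes: float) -> float:
        """Objects served per second when each needs seek + rotate + transfer."""
        service_time = self.seek_s + self.rotate_s + object_bytes / self.bandwidth_bytes_per_s
        return 1.0 / service_time

    def kaps(self) -> float:
        """Kilobyte objects served per second."""
        return self.objects_per_second(1e3)

    def maps(self) -> float:
        """Megabyte objects served per second."""
        return self.objects_per_second(1e6)

    def scan_seconds(self) -> float:
        """Time to read all the data sequentially."""
        return self.capacity_bytes / self.bandwidth_bytes_per_s


# 47 GB, 15 MBps, 5 ms seek, 3 ms rotate latency.
disk = Drive(capacity_bytes=47e9, bandwidth_bytes_per_s=15e6, seek_s=0.005, rotate_s=0.003)
print(f"Kaps: {disk.kaps():.0f}")
print(f"Maps: {disk.maps():.1f}")
print(f"SCAN: {disk.scan_seconds() / 60:.0f} minutes")
```

In this model positioning time dominates Kaps, while transfer time dominates Maps and SCAN.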
19
The Access Time Myth
• The Myth: seek or pick time dominates
• The reality:
– (1) Queuing dominates
– (2) Transfer dominates BLOBs
– (3) Disk seeks often short
• Implication: many cheap servers better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
• This is obvious for disk arrays
• This is even more obvious for tape arrays
[Figure: access timelines showing Wait, Seek, Rotate, and Transfer components]
20
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
[Figure: Disk Performance vs Time, 1980-2000: seeks per second, bandwidth (MB/s), capacity (GB)]
[Figure: Disk accesses/second vs Time, 1980-2000]
[Figure: Storage Price vs Time, 1980-2000, megabytes per kilo-dollar (MB/k$)]
• DRAM/disk media price ratio changed
– 1970-1990: 100:1
– 1990-1995: 10:1
– 1995-1997: 50:1
– today: ~0.1$pMB disk, 3$pMB dram: 30:1
21
Data on Disk Can Move to RAM in 8 years
[Figure: Storage Price vs Time, 1980-2000, megabytes per kilo-dollar (MB/k$); annotations: 30:1, 6 years]
22
Outline
• The Surprise-Free Future (5 years)
– 500 mips cpus for 10$
– 1 Gb RAM chips
– MAD at 50 Gbpsi
– 10 GBps SANs are ubiquitous
– 1 GBps WANs are ubiquitous
• Some consequences
– Absurd (?) consequences.
– Auto-manage storage
– Raid10 replaces Raid5
– Disc-packs
– Disk is the archive media of choice
• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”.
23
The (absurd?) consequences
• 256 way NUMA?
• Huge main memories
– now: 500 MB - 64 GB memories
– then: 10 GB - 1 TB memories
• Huge disks
– now: 5-50 GB 3.5” disks
– then: 50-500 GB disks
• Petabyte storage farms
– (that you can’t back up or restore).
• Disks >> tapes
– “Small” disks: one platter, one inch, 10 GB
• SAN convergence: 1 GBps point to point is easy
• 1 Gb RAM chips
• MAD at 50 Gbpsi
• Drives shrink one quantum
• 10 GBps SANs are ubiquitous
• 500 mips cpus for 10$
• 5 bips cpus at high end
24
The Absurd? Consequences
• Further segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips
[Figure: Processors ~ 1 Tips, RAM Memory ~ 1 TB, Disks ~ 100 TB, linked at 10 TBps and 100 GBps]
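The bandwidth labels in the figure can be reproduced by applying the two Amdahl ratios to a 1 Tips processor complex. This is a sketch of that arithmetic only; the constant and function names are illustrative. The io figure comes out at 125 GBps, which the slide rounds to about 100 GBps.

```python
BUS_BYTES_PER_INSTRUCTION = 10  # Amdahl: bus needs 10 bytes per instruction per second
IO_BITS_PER_INSTRUCTION = 1     # Amdahl: io needs 1 bit per instruction per second


def balanced_bandwidth(instructions_per_s: float) -> tuple:
    """Return (bus bytes/s, io bytes/s) for a balanced system at the given ips."""
    bus_bytes_per_s = instructions_per_s * BUS_BYTES_PER_INSTRUCTION
    io_bytes_per_s = instructions_per_s * IO_BITS_PER_INSTRUCTION / 8
    return bus_bytes_per_s, io_bytes_per_s


bus, io = balanced_bandwidth(1e12)  # ~1 Tips of processors
print(f"bus: {bus / 1e12:.0f} TBps")
print(f"io:  {io / 1e9:.0f} GBps")
```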
25
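Storage Latency: How Far Away is the Data?
[Figure: relative access time for each storage level, with a travel analogy]
• Registers: 1 (My Head, 1 min)
• On Chip Cache: 2 (This Room)
• On Board Cache: 10 (This Hotel, 10 min)
• Memory: 100 (Olympia, 1.5 hr)
• Disk: 10^6 (Pluto, 2 Years)
• Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)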
26
Consequences
• AutoManage Storage
• Sixpacks (for arm-limited apps)
• Raid5-> Raid10
• Disk-to-disk backup
• Smart disks
27
Auto Manage Storage
• 1980 rule of thumb:
– A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
– A DataAdmin per 5TB
– SysAdmin per 100 clones (varies with app).
• Problem:
– 5TB is 60k$ today, 10k$ in a few years.
– Admin cost >> storage cost???
• Challenge:
– Automate ALL storage admin tasks
28
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Figure: a 1 TB disk, 100 MB/s, 200 Kaps]
29
Extreme case: 1TB disk: Alternatives
• Use all the heads in parallel
– Scan in 30 minutes
– Still one Kaps/5GB
• Use one platter per arm
– Share power/sheetmetal
– Scan in 30 minutes
– One Kaps per GB
[Figure: one 1 TB disk at 500 MB/s and 200 Kaps, versus a pack of 200 GB units at 500 MB/s and 1,000 Kaps]
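The two alternatives can be compared with a small model. This sketch assumes that each independent arm delivers 200 Kaps (as on the single 1 TB disk) and that bandwidth scales with the number of heads or arms used in parallel; the `package` helper and its parameter names are illustrative.

```python
def package(arms: int, gb_per_arm: float, mb_per_s_total: float, kaps_per_arm: float) -> dict:
    """Summarize capacity, bandwidth, accesses, and scan time for a disk package."""
    capacity_gb = arms * gb_per_arm
    return {
        "capacity_gb": capacity_gb,
        "kaps": arms * kaps_per_arm,
        "scan_minutes": capacity_gb * 1000 / mb_per_s_total / 60,
    }


# One arm, all heads read in parallel: more bandwidth, same access rate.
parallel_heads = package(arms=1, gb_per_arm=1000, mb_per_s_total=500, kaps_per_arm=200)
# One platter per arm: five 200 GB units sharing power and sheet metal.
platter_per_arm = package(arms=5, gb_per_arm=200, mb_per_s_total=500, kaps_per_arm=200)

for name, p in [("parallel heads", parallel_heads), ("platter per arm", platter_per_arm)]:
    print(f"{name}: scan {p['scan_minutes']:.0f} min, "
          f"{p['kaps'] / p['capacity_gb']:.1f} Kaps per GB")
```

Both options scan in about half an hour, but only the independent arms raise the access rate to one Kaps per GB.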
30
Drives shrink (1.8”, 1”)
• 150 kaps for 500 GB is VERY cold data
• 3 GB/platter today, 30 GB/platter in 5 years.
• Most disks are ½ full
• TPC benchmarks use 9GB drives (need arms or bandwidth).
• One solution: smaller form factor– More arms per GB
– More arms per rack
– More arms per Watt
31
Prediction: 6-packs
• One way or another, when disks get huge
– Will be packaged as multiple arms
– Parallel heads give bandwidth
– Independent arms give bandwidth & aps
• Package shares power, package, interfaces…
32
Stripes, Mirrors, Parity (RAID 0, 1, 5)
• RAID 0: Stripes
– bandwidth
– layout: 0,3,6,.. | 1,4,7,.. | 2,5,8,..
• RAID 1: Mirrors, Shadows,…
– Fault tolerance
– Reads faster, writes 2x slower
– layout: 0,1,2,.. | 0,1,2,..
• RAID 5: Parity
– Fault tolerance
– Reads faster
– Writes 4x or 6x slower.
– layout: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
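The RAID 0 layout in the figure (0,3,6 / 1,4,7 / 2,5,8) is round-robin placement of logical blocks across three disks. A minimal sketch of that mapping, with illustrative function names:

```python
def stripe_location(block: int, disks: int = 3) -> tuple:
    """Map a logical block number to (disk index, block offset on that disk)."""
    return block % disks, block // disks


def layout(blocks: int, disks: int = 3) -> list:
    """List which logical blocks land on each disk."""
    columns = [[] for _ in range(disks)]
    for block in range(blocks):
        disk, _offset = stripe_location(block, disks)
        columns[disk].append(block)
    return columns


print(layout(9))
```

Consecutive blocks sit on different disks, so a large sequential transfer keeps all three arms busy, which is where the bandwidth gain comes from.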
33
RAID 10 (stripes of mirrors) Wins: “wastes space, saves arms”
RAID 5:
• Performance
– 225 reads/sec
– 70 writes/sec
– Write: 4 logical IO, 2 seek + 1.7 rotate
• SAVES SPACE
• Performance degrades on failure
RAID 1:
• Performance
– 250 reads/sec
– 100 writes/sec
– Write: 2 logical IO, 2 seek + 0.7 rotate
• SAVES ARMS
• Performance improves on failure
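The "4 logical IO" versus "2 logical IO" write counts come from how each scheme updates redundancy. The sketch below shows the standard RAID 5 small-write parity update (new parity = old parity XOR old data XOR new data) next to a mirrored write. The function names and IO labels are illustrative, and the model says nothing about seek or rotation time.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> tuple:
    """Update one data block and its parity; returns (new parity, logical IOs)."""
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    ios = ["read old data", "read old parity", "write new data", "write new parity"]
    return new_parity, ios


def mirrored_write(new_data: bytes) -> list:
    """A mirrored write just writes both copies."""
    return ["write copy A", "write copy B"]


d0, d1 = b"\x0f\xf0", b"\x33\x55"
parity = xor_bytes(d0, d1)
new_d0 = b"\xaa\xbb"
new_parity, raid5_ios = raid5_small_write(d0, parity, new_d0)
print(len(raid5_ios), "logical IOs for RAID 5,", len(mirrored_write(new_d0)), "for mirroring")
```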
34
The Storage Rack Today
• 140 arms
• 4 TB
• 24 racks, 24 storage processors, 6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports 500 MBps IO
35
Storage Rack in 5 years?
• 140 arms
• 50 TB
• 24 racks, 24 storage processors, 6+1 in rack
• Disks = 14 GBps IO
• Controllers = 5 GBps IO
• Ports 1 GBps IO
• My suggestion: move the processors into the storage racks.
36
It’s hard to archive a PetaByte
It takes a LONG time to restore it.
• Store it in two (or more) places online (on disk?).
• Scrub it continuously (look for errors)
• On failure, refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
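The store-twice, scrub, and refresh cycle can be sketched in a few lines. This is an illustration under simplifying assumptions, not the talk's design: both copies are assumed to hold the same blocks in the same order (the slide notes they could be organized differently), and a SHA-256 checksum recorded at write time decides which copy is the safe one. All names are hypothetical.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum recorded when the block was first written."""
    return hashlib.sha256(block).hexdigest()


def scrub(copy_a: list, copy_b: list, checksums: list) -> list:
    """Check every block in both copies; refresh a bad block from the good copy.

    Returns the indices of the blocks that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        a_ok = checksum(copy_a[i]) == expected
        b_ok = checksum(copy_b[i]) == expected
        if a_ok and not b_ok:
            copy_b[i] = copy_a[i]
            repaired.append(i)
        elif b_ok and not a_ok:
            copy_a[i] = copy_b[i]
            repaired.append(i)
    return repaired


blocks = [b"alpha", b"beta", b"gamma"]
sums = [checksum(b) for b in blocks]
site_a, site_b = list(blocks), list(blocks)
site_b[1] = b"bXta"  # simulate silent corruption in one copy
print("repaired blocks:", scrub(site_a, site_b, sums))
```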
37
Crazy Disk Ideas
• Disk Farm on a card: surface mount disks
• Disk (magnetic store) on a chip: (micro machines in Silicon)
• Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram)
[Figure label: ASIC]
The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen. ISBN: 0875845851
38
The Disk Farm On a Card
• The 500GB disc card
• An array of discs
• Can be used as
– 100 discs
– 1 striped disc
– 50 Fault Tolerant discs
– ....etc
• LOTS of accesses/second and bandwidth
[Figure: a 14" card]
39
Functionally Specialized Cards
• Storage
• Network
• Display
[Figure: each card is an ASIC with a P mips processor and M MB DRAM]
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB
40
Data Gravity: Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition) & cameras (vision)
– Storage: Data storage and analysis
41
It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get
– several network interfaces
– a Postscript engine: cpu, memory, software, a spooler (soon)
– and… a print engine.
42
Disks Become Supercomputers
• 100x in 10 years: 2 TB 3.5” drive
• Shrink to 1” is 200GB
• Disk replaces tape?
• Disk is super computer!
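The 100x-in-10-years projection is the same compound rate as the earlier 10x-in-5-years capacity estimate. A short sketch of the annual growth factor implied by each; the helper name is illustrative.

```python
def annual_growth(factor: float, years: float) -> float:
    """Annual growth multiplier that compounds to `factor` over `years`."""
    return factor ** (1 / years)


print(f"100x in 10 years: {annual_growth(100, 10):.3f}x per year")
print(f"10x in 5 years:   {annual_growth(10, 5):.3f}x per year")
```

Both work out to roughly 58% capacity growth per year.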
[Figure: scale from Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, to Yotta]
43
Tera Byte Backplane
• TODAY
– Disk controller is 10 mips risc engine with 2MB DRAM
– NIC is similar power
• SOON
– Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in disk controller).
• Advantages
– Uniform programming model
– Great tools
– Security
– Economics (cyberbricks)
– Move computation to data (minimize traffic)
All Device Controllers will be Cray 1’s
[Figure: Central Processor & Memory]
44
With Tera Byte Interconnect and Super Computer Adapters
• Processing is incidental to
– Networking
– Storage
– UI
• Disk Controller/NIC is
– faster than device
– close to device
– Can borrow device package & power
• So use idle capacity for computation.
• Run app in device.
• Both Kim Keeton (UCB) and Erik Riedel (CMU) theses investigate this and show benefits of this approach.
[Figure: Tera Byte Backplane]
45
Implications
Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
Radical
• Move app to NIC/device controller
• higher-higher level protocols: CORBA / COM+.
• Cluster parallelism is VERY important.
[Figure: Central Processor & Memory attached to a Tera Byte Backplane]
46
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
– CORBA? COM+? RMI?
– One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure: two node protocol stacks connected by a SAN; layers labeled Applications, RPC?, streams, datagrams, SIO]
47
Basic Argument for x-Disks
• Future disk controller is a super-computer.
– 1 bips processor
– 128 MB dram
– 100 GB disk plus one arm
• Connects to SAN via high-level protocols
– RPC, HTTP, DCOM, Kerberos, Directory Services,….
– Commands are RPCs
– management, security,….
– Services file/web/db/… requests
– Managed by general-purpose OS with good dev environment
• Move apps to disk to save data movement
– need programming environment in controller
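The data-movement argument can be illustrated with a toy comparison. Everything in this sketch is hypothetical (the class names, the `select` call, and the sample records are not from the talk): a sector-style server ships every record across the SAN for the host to filter, while an x-disk runs the filter next to the data and ships only the matches.

```python
class SectorServer:
    """Low-level device: the host must pull every record and filter it itself."""

    def __init__(self, records: list):
        self.records = records

    def read_all(self) -> list:
        return list(self.records)


class XDisk(SectorServer):
    """Hypothetical disk that accepts a high-level command and filters locally."""

    def select(self, predicate) -> list:
        return [r for r in self.records if predicate(r)]


def bytes_moved(records: list) -> int:
    """Bytes that cross the interconnect for a given reply."""
    return sum(len(r) for r in records)


def is_error(record: bytes) -> bool:
    return record.startswith(b"error")


log = [b"error: disk 7", b"ok", b"ok", b"error: disk 9"]

shipped_to_host = SectorServer(log).read_all()
host_result = [r for r in shipped_to_host if is_error(r)]

disk_result = XDisk(log).select(is_error)

print("sector server moved", bytes_moved(shipped_to_host), "bytes")
print("x-disk moved", bytes_moved(disk_result), "bytes")
```

The answers are identical; the difference is how many bytes travel, and that gap grows with the selectivity of the request.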
48
The Slippery Slope
• If you add function to server
• Then you add more function to server
• Function gravitates to data.
[Figure: the slope, from Nothing = Sector Server, through Something = Fixed App Server, to Everything = App Server]
49
Why Not a Sector Server? (let’s get physical!)
• Good idea, that’s what we have today.
• But
– cache added for performance
– Sector remap added for fault tolerance
– error reporting and diagnostics added
– SCSI commands (reserve,..) are growing
– Sharing problematic (space mgmt, security,…)
• Slipping down the slope to a 1-D block server
50
Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
• Tried and true design
– HSC - VAX cluster
– EMC
– IBM Sysplex (3980?)
• But look inside
– Has a cache
– Has space management
– Has error reporting & management
– Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Has locking
– Has remote replication
– Has an OS
– Security is problematic
– Low-level interface moves too many bytes
51
Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
• Tried and true design
– Cedar -> NFS
– file server, cache, space,..
– Open file is many fewer msgs
• Grows to have
– Directories + Naming
– Authentication + access control
– RAID 0, 1, 2, 3, 4, 5, 10, 50,…
– Locking
– Backup/restore/admin
– Cooperative caching with client
• File Servers are a BIG hit: NetWare™
– SNAP! is my favorite today
52
Why Not a File Server? Put a Little on the Disk Server
• Tried and true design
– Auspex, NetApp, ...
– Netware
• Yes, but look at NetWare
– File interface gives you app invocation interface
– Became an app server: Mail, DB, Web,….
– Netware had a primitive OS: hard to program, so optimized wrong thing
53
Why Not Everything?
Allow Everything on Disk Server (thin clients)
• Tried and true design
– Mainframes, Minis, ...
– Web servers,…
– Encapsulates data
– Minimizes data moves
– Scalable
• It is where everyone ends up.
• All the arguments against are short-term.
54
The Slippery Slope
• If you add function to server
• Then you add more function to server
• Function gravitates to data.
[Figure: the slope, from Nothing = Sector Server, through Something = Fixed App Server, to Everything = App Server]
55
Outline
• The Surprise-Free Future (5 years)
– Astonishing hardware progress.
• Some consequences
– Absurd (?) consequences.
– Auto-manage storage
– Raid10 replaces Raid5
– Disc-packs
– Disk is the archive media of choice
• A surprising future?
– Disks (and other useful things) become supercomputers.
– Apps run “in the disk”