HPC Storage Trends: Past, Present and Future
SAUDI ARABIA HPC USER GROUP
DECEMBER 2011
BRENT WELCH, PHD, DIRECTOR OF ARCHITECTURE
AGENDA
HPC Storage Background and Introduction
• Storage models for HPC
Technology trends
• pNFS update
• Panasas retrospective
• Industry trends in disk and solid state
Looking forward
COMPUTATIONAL SCIENCE
Use of computer simulation as a tool for
greater understanding of the real world
• Complements experimentation and
theory
Problems are increasingly
computationally expensive
• Large parallel machines needed to
perform calculations
• Critical to leverage parallelism in all
phases
Data access is a huge challenge
• Using parallelism to obtain performance
• Finding usable, efficient, and portable
interfaces
• Understanding and tuning I/O
Visualization of entropy in Terascale
Supernova Initiative application. Image from
Kwan-Liu Ma’s visualization team at UC Davis.
IBM Blue Gene/P system at Argonne
National Laboratory.
HPC I/O EVOLUTION 1
First generation
• Proprietary solutions from compute vendors
• Special programming models
• Limited disk storage
• Tape archive important
In the early days, everything (CPU, network, storage, compilers)
was custom and provided by a single vendor
• Cray, IBM, CDC, Pyramid, Convex, Thinking Machines, …
HPC I/O EVOLUTION 2
Second generation
• Commodity clusters and NFS file servers
• Standard POSIX programming and I/O model
Efforts to leverage commodity hardware and file
servers resulted in compromises and poor I/O
performance
• 128 compute nodes vs. 1 filer was an unfair contest
• All I/O routed through a single “head node” with local
storage or an NFS mount
• Dedicated storage systems tightly bound to compute
clusters
HPC I/O EVOLUTION 3
Third generation
• General purpose parallel file systems, still proprietary
• GPFS, Lustre, Panasas, StorNext, IBRIX, CXFS, PVFS …
Data striped across storage nodes and accessed in parallel from
many compute nodes
• Better scalability, especially under ideal conditions
• Metadata management moves out of the data path
− Different nodes manage metadata than data
• Proprietary software necessary on compute nodes to access storage
systems
PARALLEL FILE SYSTEMS
Building block for HPC I/O systems
• Present storage as a single, logical storage unit
• Stripe files across disks and nodes for performance (see the striping sketch below)
• Tolerate failures (in conjunction with other HW/SW)
An example parallel file system, with large astrophysics checkpoints distributed across multiple I/O
servers (IOS) while small bioinformatics files are each stored on a single IOS.
[Diagram] Compute nodes (C) reach the I/O servers over a communication network through parallel file system (PFS) client software; the checkpoint file chkpt32.nc under /pfs/astro is striped across several IOS, while the small files prot04.seq and prot17.seq under /pfs/bio each live on a single IOS.
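As a rough illustration of the striping idea above, the sketch below maps a byte offset in a file to an I/O server using simple round-robin striping. The stripe unit size, server count, and function name are illustrative assumptions, not the actual PanFS layout algorithm.

STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (example value)
NUM_IOS = 4               # number of I/O servers (example value)

def locate(offset):
    """Map a byte offset to (I/O server index, offset within that server's component object)."""
    stripe_index = offset // STRIPE_UNIT       # which stripe unit the byte falls in
    ios = stripe_index % NUM_IOS               # round-robin choice of I/O server
    local_unit = stripe_index // NUM_IOS       # how many units that server already holds
    return ios, local_unit * STRIPE_UNIT + offset % STRIPE_UNIT

# A large checkpoint touches every server; a file smaller than one stripe unit lands on just one.
print(locate(0))                # (0, 0)
print(locate(4 * STRIPE_UNIT))  # (0, 65536): wraps back to server 0, second unit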
EXAMPLE PARALLEL I/O PERFORMANCE GAIN
[Chart] Run time (minutes) for the Paradigm GeoDepth prestack migration workload on NFS vs. Panasas DirectFLOW, on 4-shelf and 1-shelf configurations; run times of 7 hours 17 mins (avg. read BW 300 MB/s), 5 hours 16 mins (avg. read BW 350 MB/s), and 2 hours 51 mins (avg. read BW 650 MB/s). Source: Paradigm & Panasas, February 2007.
HPC I/O EVOLUTION 4
Fourth generation
• Standards-based parallel file systems
• pNFS brings parallel I/O to the NFS standard
• OSD provides standard high-level building block
Much more detail on pNFS later in the presentation
[Diagram] pNFS clients access storage directly over block (FC), object (OSD), or file (NFS) data paths, with an NFSv4.1 server handling metadata.
I/O FOR COMPUTATIONAL SCIENCE
Additional I/O software provides improved performance and usability over directly accessing the
parallel file system. It reduces or (ideally) eliminates the need for optimization in application codes.
pNFS
Parallel I/O for the NFS standard
PNFS
IETF standard for parallel I/O in the NFS protocol
2004 inception, 2010 RFC
Architecture similar to Panasas DirectFlow
• Out-of-band metadata server and direct client-storage datapath
Supports different data servers
• Object-based StorageBlades (OSD)
• NFS Filers (NFSv4)
• RAID Controllers (SCSI Block)
WHY A STANDARD FOR PARALLEL I/O?
NFS is the only network file system standard
• Proprietary file systems have unique advantages, but aren't right for everyone
− PanFS, Lustre, GPFS, IBRIX, CXFS, etc.
pNFS widens the playing field
• Panasas, IBM, and EMC want to bring their experience in large-scale, high-performance file systems into the NFS community
• Sun/Oracle and NetApp want a standard HPC solution
• Broader market benefits vendors
• More competition benefits customers
THE PNFS STANDARD
The pNFS standard defines the NFSv4.1 protocol extensions
between the server and client
The I/O protocol between the client and storage is specified
elsewhere, for example:
• SCSI Object-based Storage Device (OSD) over iSCSI
• Network File System (NFS)
• SCSI Block Commands (SBC) over Fibre Channel (FC)
The control protocol between the server and storage devices is
also specified elsewhere, for example:
• SCSI Object-based Storage Device (OSD) over iSCSI
[Diagram] The three roles in the pNFS architecture: client, storage devices, and the NFSv4.1 server.
PNFS LAYOUTS
Client gets a layout from the NFS Server
The layout maps the file onto storage devices and addresses
The client uses the layout to perform direct I/O to storage
At any time the server can recall the layout
Client commits changes and returns the layout when it’s done
pNFS is optional; the client can always use regular NFSv4 I/O (the full flow is sketched below)
[Diagram] Clients obtain layouts from the NFSv4.1 server and then perform direct I/O to the storage devices.
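A minimal sketch of that client-side sequence, assuming a hypothetical client object; the method names are stand-ins, but the comments name the corresponding NFSv4.1 protocol operations.

def pnfs_write(client, path, offset, data):
    layout = client.layoutget(path, offset, len(data))    # LAYOUTGET from the NFSv4.1 server
    if layout is None:
        return client.nfs_write(path, offset, data)        # fall back to regular NFSv4 I/O
    for device, dev_offset, chunk in layout.map(offset, data):
        device.write(dev_offset, chunk)                    # direct I/O to the storage device
    client.layoutcommit(path, layout)                      # LAYOUTCOMMIT: publish new size/mtime
    client.layoutreturn(path, layout)                      # LAYOUTRETURN when done (or on recall)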
PNFS CLIENT
Common client for different storage back ends
Wider availability across operating systems
Fewer support issues for storage vendors
[Diagram] Client applications run over a common pNFS client; a pluggable layout driver handles the back-end I/O (1. SBC blocks, 2. OSD objects, 3. NFS files, 4. PVFS2 files, 5. future back ends), while the pNFS server, backed by a cluster file system, grants and revokes layouts and serves metadata over NFSv4.1. A sketch of the pluggable layout-driver structure follows.
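The class names below are illustrative only, not the Linux kernel implementation; they simply show one driver per back end behind a common client interface.

from abc import ABC, abstractmethod

class LayoutDriver(ABC):
    """Common interface the generic pNFS client calls, regardless of back end."""
    @abstractmethod
    def read(self, layout, offset, length): ...
    @abstractmethod
    def write(self, layout, offset, data): ...

class FileLayoutDriver(LayoutDriver):    # NFS data servers (files)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

class ObjectLayoutDriver(LayoutDriver):  # OSD StorageBlades (objects)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

class BlockLayoutDriver(LayoutDriver):   # SCSI block devices (SBC)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

# The client selects a driver from the layout type returned by the pNFS server.
DRIVERS = {"file": FileLayoutDriver, "object": ObjectLayoutDriver, "block": BlockLayoutDriver}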
OBJECTS, FILES, AND BLOCKS
Why does the standard support different back ends?
• Inclusive protocol to gain widespread adoption
How are Objects different from Files?
• Object layout driver supports per-file RAID for advanced data integrity features (sketched below)
• OSD interface has an efficient security protocol that optimizes metadata
operations
− iSCSI/OSD v1 standardized in 2005
− iSCSI/OSD v2 standardized in 2008
What about blocks?
• EMC (and others) have proprietary NFS variants that give clients direct
SAN access to block devices
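A minimal sketch of the per-file RAID idea using XOR parity across one file's component objects; real PanFS object RAID is more involved, so treat this purely as an illustration.

def xor_parity(data_blocks):
    """Compute one parity block over the data blocks of a single stripe of a single file."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Because parity is kept per file, losing one component object only requires reading
# that file's surviving blocks plus its parity; other files are untouched.
stripe = [b"aaaa", b"bbbb", b"cccc"]
p = xor_parity(stripe)
recovered = xor_parity([stripe[0], stripe[2], p])   # rebuild the lost second block
assert recovered == stripe[1]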
PNFS STANDARD STATUS
IETF approved Internet Drafts in December 2008
RFCs for NFSv4.1, pNFS-objects, and pNFS-blocks published January 2010
• RFC 5661 - Network File System (NFS) Version 4 Minor Version 1 Protocol
• RFC 5662 - Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description
• RFC 5663 - Parallel NFS (pNFS) Block/Volume Layout
• RFC 5664 - Object-Based Parallel NFS (pNFS) Operations
LINUX RELEASE CYCLE, 2009 AND 2010
Kernel | Merge Window | What's New
2.6.30 | Mar 2009 | RPC sessions, NFSv4.1 server, OSDv2 rev5, EXOFS
2.6.31 | June 2009 | NFSv4.1 client, sans pNFS
2.6.32 | Sep 2009 | 130 server-side patches add back-channel
2.6.33 | Dec 2009 | 43 pNFS patches
2.6.34 | Feb 2010 | 21 NFS 4.1 patches
2.6.35 | May 2010 | 1 client and 1 server patch (4.1 support)
2.6.36 | Aug 2010 | 16 patches
2.6.37 | Nov 2010 | First chunk of pNFS beyond generic 4.1 infrastructure. pNFS disabled.
LINUX RELEASE CYCLE 2011
RHEL 7 and SLES 12 should have kernels with pNFS
RHEL 6.X and SLES 11.X are interested in a pNFS backport
Kernel | Merge Window | What's New
2.6.38 | Jan 2011 | More generic pNFS code, still disabled, not fully functional
2.6.39 | Apr 2011 | Files-based back end; read, write, commit on the client. Linux server is read-only via pNFS.
2.6.40 / 3.0 | Jun 2011 | Object-based back end
3.1 | Sep 2011 | Block-based back end
3.2 | Dec 2011 (?) | More stabilization and bug fixing
3.3 | Apr 2012 (?) | More stabilization and bug fixing
TECHNOLOGY TRENDS
Capacity vs. Bandwidth
Why Disks won’t die
Solid state applications
What’s next for Panasas
DISK ACCESS RATES OVER TIME
Thanks to R. Freitas of IBM Almaden Research Center for providing much of the data for this graph.
LARGE-SCALE DATA SETS
Application teams are beginning to generate tens of TBytes of data
in a single simulation. Keeping hundreds of TBytes of data online
is increasingly common.
PI | Project | On-line Data (TBytes) | Off-line Data (TBytes)
Khokhlov | Combustion in Gaseous Mixtures | 1 | 17
Baker | Protein Structure | 1 | 2
Hinkel | Laser-Plasma Interactions | 60 | 60
Lamb | Type Ia Supernovae | 75 | 300
Vary | Nuclear Structure and Reactions | 6 | 15
Fischer | Fast Neutron Reactors | 100 | 100
Mackenzie | Lattice Quantum Chromodynamics | 300 | 70
Vashishta | Fracture Behavior in Materials | 12 | 72
Moser | Engineering Design of Fluid Systems | 3 | 200
Lele | Multi-material Mixing | 215 | 100
Kurien | Turbulent Flows | 10 | 20
Jordan | Earthquake Wave Propagation | 1000 | 1000
Tang | Fusion Reactor Design | 50 | 100
Data requirements for select 2011 INCITE applications at ALCF
BLADE CAPACITY AND SPEED HISTORY
[Charts] Bandwidth (MB/sec), capacity (GB), and end-to-end write time across four generations of Panasas blades (SB-160, SB-800, SB-2000, SB-4000).
Compare the time to write a blade (two disks) end-to-end over 4 generations of Panasas blades:
Capacity increased 25x
Bandwidth increased 3.4x
(function of CPU, memory, disk)
Time goes from 44 min to > 5 hrs
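A back-of-the-envelope check of the numbers above (write time scales with capacity divided by bandwidth):

# 25x more capacity, only 3.4x more bandwidth => roughly 7.4x longer to write a blade end-to-end
old_time_min = 44
new_time_min = old_time_min * 25 / 3.4
print(new_time_min / 60)   # about 5.4 hours, consistent with "> 5 hrs"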
BANDWIDTH/CAPACITY GAP
Whole-device protection (traditional block RAID)
• Recovery based on time to read single device end-to-end
• Media defects during rebuild are a problem
• Long exposure times are a problem, and worse all the time
Per-file protection (Panasas object RAID)
• Declustering spreads rebuild workload among StorageBlades
− Reading 1 device's worth of capacity as 1/10 from each of 10 devices is 10x faster, or 10x less intrusive (see the model below)
• Per-file fault isolation
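To make the rebuild argument concrete, here is a simple model with illustrative numbers (not measured Panasas figures):

CAPACITY_GB = 2000        # capacity of the failed device (example value)
BANDWIDTH_MBS = 100       # sustained per-device read bandwidth (example value)

def whole_device_rebuild_hours():
    # Traditional block RAID: a surviving device is read end-to-end at single-device speed.
    return CAPACITY_GB * 1000 / BANDWIDTH_MBS / 3600

def declustered_rebuild_hours(num_devices=10):
    # Declustered object RAID: each surviving device contributes ~1/num_devices of the data, in parallel.
    return CAPACITY_GB * 1000 / (BANDWIDTH_MBS * num_devices) / 3600

print(whole_device_rebuild_hours())    # ~5.6 hours
print(declustered_rebuild_hours(10))   # ~0.56 hours: 10x faster, or 10x less intrusive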
PAS-12 4TH GENERATION BLADE
Year | CPU / Memory | Storage | Bandwidth
2002 | 850 MHz / PC 100 | 80 GB PATA |
2004 | 1.2 GHz / PC 100 | 250 GB SATA | 330 MB/sec
2006 | 1.5 GHz / DDR 400 | 500 GB SATA | 400 MB/sec
2008 | 10 GE shelf switch | 750 GB SATA | 600 MB/sec
2009 | SSD Hybrid | 1000 GB SATA, 32 GB SSD | 600 MB/sec
2010 | 1.67 GHz / DDR3 800 | 2000 GB SATA (64 GB SSD) | 1.5 GB/sec
[Diagram] PAS-12 shelf: power supplies (PS1, PS2), battery (BAT), network modules (NET1, NET2), 11x blades.
PAS-12 BLADE CONFIGURATIONS
Director Blade
• 10 GbE NIC x 2 ports
• i7 multi-core CPU
• 3 memory channels
Storage Blade
• i7 single-core CPU
• 2 memory channels
• SSD options: 1) with SSD, 2) no SSD
• 40 TB (2 TB HDD x 2)
WHY CAN’T WE JUNK ALL THE DISKS
10x the per-bit cost, and the gap isn’t closing
We cannot manufacture enough bits on wafers to displace disks
• The per-bit cost of a semiconductor fab is off the charts compared to a disk manufacturing facility
Your laptops may all have SSDs, but not all of your compute nodes will get one
DISK BECOMES THE NEW TAPE
And Tape doesn’t go away, either
• Still half the per-bit cost, and much less lifetime cost
• Tape is just different
− No power at rest
− Physical mobility
− Higher per-device bandwidth (1.5x to 2x)
Disk capacity trends continue
• 3TB in 2011
• 4TB in 2012
• 8, 16, 32 TB drives are coming
− Patterned media
− Shingled recording
THE ROLE OF SOLID STATE
Response-time sensitive data
• Block-level metadata to know where the real data lives on HD
Efficient non-volatile storage
• Journals, etc.
Critical datasets
• Airline reservation systems (databases)
Personal devices
• Yeah, we all want some
THE IDEAL STORAGE BUILDING BLOCK
DRAM, SSD, Disk, CPU and Networking in one package
• Panasas PAS9 shipped in Q3 2009,
but constrained to one SSD and one HD
• Panasas actively engaged in next generation building block
High-level interface to storage devices
• iSCSI/OSDv2 includes high-level operations for
− Capacity management
− Snapshots
− High throughput
Industry standard access to high performance file system
• NFSv4.1 and pNFS
GETTING TO EXASCALE
Reliability first
• The ability to track and respond to failures is critical
• Panasas Realm Manager provides integrated fault management
• Harden individual components
• High availability fault model to stay online with multiple faults
• Sophisticated on-line error diagnostics and automated response
Speeds and feeds
• Performance scalability comes from base components and the
architecture to support aggregation of large numbers of them
• More sophisticated I/O scheduling to deal with QoS, jitter, and congestion
CONCLUSION
Thank you for the opportunity to speak at this event
Storage remains a challenging and interesting area
We look forward to a long partnership with the HPC user
community as we help advance the state of high performance I/O
systems
Panasas Background
TECHNICAL DIFFERENTIATION
Scalable Performance
• start with 12 servers -> grow to 12,000
Novel Data Integrity Protection
• File system and RAID are integrated
• highly reliable data w/ novel data protection systems
Maximum Availability
• built-in distributed system platform, tested under fire
• LANL measured > 99.9% availability across its Panasas systems, counting all causes of downtime
Simple to Deploy and Maintain
• integrated storage system with appliance model
Application Acceleration
• customer proven results
PANASAS FEATURES
Object RAID (2003-2004)
NFS w/ multiprotocol file locking (2005)
Replicated cluster management (2006)
Declustered Parallel Object RAID rebuild (2006)
Metadata Fail Over (2007)
Snapshots, NFS Fail Over, Tiered Parity (2008)
Async Mirror, Data Migration (2009)
SSD/SATA Hybrid Blade (2009)
64-bit multicore (2010)
User Group Quota (2010)
Dates are the year the feature shipped in a production release
PANASAS REFERENCES
www.panasas.com
Scalable Performance in the Panasas Parallel File System
http://www.usenix.org/event/fast08/tech/full_papers/welch/welch_html/
As complete a technical description as you can find if you want to know more
about how the system works