HPC Storage Trends: Past, Present and Future
SAUDI ARABIA HPC USER GROUP
DECEMBER 2011
BRENT WELCH, PHD, DIRECTOR OF ARCHITECTURE
AGENDA
HPC Storage Background and Introduction
• Storage models for HPC
Technology trends
• pNFS update
• Panasas retrospective
• Industry trends in disk and solid state
Looking forward
COMPUTATIONAL SCIENCE
Use of computer simulation as a tool for
greater understanding of the real world
• Complements experimentation and
theory
Problems are increasingly
computationally expensive
• Large parallel machines needed to
perform calculations
• Critical to leverage parallelism in all
phases
Data access is a huge challenge
• Using parallelism to obtain performance
• Finding usable, efficient, and portable
interfaces
• Understanding and tuning I/O
Visualization of entropy in Terascale
Supernova Initiative application. Image from
Kwan-Liu Ma’s visualization team at UC Davis.
IBM Blue Gene/P system at Argonne
National Laboratory.
HPC I/O EVOLUTION 1
First generation
• Proprietary solutions from compute vendors
• Special programming models
• Limited disk storage
• Tape archive important
In the early days, everything (CPU, network, storage, compilers)
was custom and provided by a single vendor
• Cray, IBM, CDC, Pyramid, Convex, Thinking Machines, …
HPC I/O EVOLUTION 2
Second generation
• Commodity clusters and NFS file servers
• Standard POSIX programming and I/O model
Efforts to leverage commodity hardware and file
servers resulted in compromises and poor I/O
performance
• 128 compute nodes vs. 1 filer was an unfair contest
• All I/O routed through a single “head node” with local
storage or an NFS mount
• Dedicated storage systems tightly bound to compute
clusters
HPC I/O EVOLUTION 3
Third generation
• General purpose parallel file systems, still proprietary
• GPFS, Lustre, Panasas, StorNext, IBRIX, CXFS, PVFS …
Data striped across storage nodes and accessed in parallel from
many compute nodes
• Better scalability, especially under ideal conditions
• Metadata management moves out of the data path
− Different nodes manage metadata than data
• Proprietary software necessary on compute nodes to access storage
systems
PARALLEL FILE SYSTEMS
Building block for HPC I/O systems
• Present storage as a single, logical storage unit
• Stripe files across disks and nodes for performance (see the striping sketch below)
• Tolerate failures (in conjunction with other HW/SW)
An example parallel file system, with large astrophysics checkpoints distributed across multiple I/O
servers (IOS) while small bioinformatics files are each stored on a single IOS.
[Diagram] Compute nodes (C) reach the I/O servers over a communication network through parallel file system (PFS) client software; the checkpoint file chkpt32.nc under /pfs/astro is striped across several IOS, while the small files prot04.seq and prot17.seq under /pfs/bio each live on a single IOS.
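As a rough illustration of the striping idea above, the sketch below maps a byte offset in a file to an I/O server using simple round-robin striping. The stripe unit size, server count, and function name are illustrative assumptions, not the actual PanFS layout algorithm.

STRIPE_UNIT = 64 * 1024   # bytes per stripe unit (example value)
NUM_IOS = 4               # number of I/O servers (example value)

def locate(offset):
    """Map a byte offset to (I/O server index, offset within that server's component object)."""
    stripe_index = offset // STRIPE_UNIT       # which stripe unit the byte falls in
    ios = stripe_index % NUM_IOS               # round-robin choice of I/O server
    local_unit = stripe_index // NUM_IOS       # how many units that server already holds
    return ios, local_unit * STRIPE_UNIT + offset % STRIPE_UNIT

# A large checkpoint touches every server; a file smaller than one stripe unit lands on just one.
print(locate(0))                # (0, 0)
print(locate(4 * STRIPE_UNIT))  # (0, 65536): wraps back to server 0, second unit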
EXAMPLE PARALLEL I/O PERFORMANCE GAIN
[Chart] Run time (minutes) for the Paradigm GeoDepth prestack migration workload on NFS vs. Panasas DirectFLOW, on 4-shelf and 1-shelf configurations; run times of 7 hours 17 mins (avg. read BW 300 MB/s), 5 hours 16 mins (avg. read BW 350 MB/s), and 2 hours 51 mins (avg. read BW 650 MB/s). Source: Paradigm & Panasas, February 2007.
HPC I/O EVOLUTION 4
Fourth generation
• Standards-based parallel file systems
• pNFS brings parallel I/O to the NFS standard
• OSD provides standard high-level building block
Much more detail on pNFS later in the presentation
[Diagram] pNFS clients access storage directly over block (FC), object (OSD), or file (NFS) data paths, with an NFSv4.1 server handling metadata.
I/O FOR COMPUTATIONAL SCIENCE
Additional I/O software provides improved performance and usability over directly accessing the
parallel file system. It reduces or (ideally) eliminates the need for optimization in application codes.
pNFS
Parallel I/O for the NFS standard
PNFS
IETF standard for parallel I/O in the NFS protocol
2004 inception, 2010 RFC
Architecture similar to Panasas DirectFlow
• Out-of-band metadata server and direct client-storage datapath
Supports different data servers
• Object-based StorageBlades (OSD)
• NFS Filers (NFSv4)
• RAID Controllers (SCSI Block)
WHY A STANDARD FOR PARALLEL I/O?
NFS is the only network file system standard
• Proprietary file systems have unique advantages, but aren't right for everyone
− PanFS, Lustre, GPFS, IBRIX, CXFS, etc.
pNFS widens the playing field
• Panasas, IBM, and EMC want to bring their experience in large-scale, high-performance file systems into the NFS community
• Sun/Oracle and NetApp want a standard HPC solution
• Broader market benefits vendors
• More competition benefits customers
THE PNFS STANDARD
The pNFS standard defines the NFSv4.1 protocol extensions
between the server and client
The I/O protocol between the client and storage is specified
elsewhere, for example:
• SCSI Object-based Storage Device (OSD) over iSCSI
• Network File System (NFS)
• SCSI Block Commands (SBC) over Fibre Channel (FC)
The control protocol between the server and storage devices is
also specified elsewhere, for example:
• SCSI Object-based Storage Device (OSD) over iSCSI
[Diagram] The three roles in the pNFS architecture: client, storage devices, and the NFSv4.1 server.
PNFS LAYOUTS
Client gets a layout from the NFS Server
The layout maps the file onto storage devices and addresses
The client uses the layout to perform direct I/O to storage
At any time the server can recall the layout
Client commits changes and returns the layout when it’s done
pNFS is optional; the client can always use regular NFSv4 I/O (the full flow is sketched below)
[Diagram] Clients obtain layouts from the NFSv4.1 server and then perform direct I/O to the storage devices.
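A minimal sketch of that client-side sequence, assuming a hypothetical client object; the method names are stand-ins, but the comments name the corresponding NFSv4.1 protocol operations.

def pnfs_write(client, path, offset, data):
    layout = client.layoutget(path, offset, len(data))    # LAYOUTGET from the NFSv4.1 server
    if layout is None:
        return client.nfs_write(path, offset, data)        # fall back to regular NFSv4 I/O
    for device, dev_offset, chunk in layout.map(offset, data):
        device.write(dev_offset, chunk)                    # direct I/O to the storage device
    client.layoutcommit(path, layout)                      # LAYOUTCOMMIT: publish new size/mtime
    client.layoutreturn(path, layout)                      # LAYOUTRETURN when done (or on recall)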
PNFS CLIENT
Common client for different storage back ends
Wider availability across operating systems
Fewer support issues for storage vendors
[Diagram] Client applications run over a common pNFS client; a pluggable layout driver handles the back-end I/O (1. SBC blocks, 2. OSD objects, 3. NFS files, 4. PVFS2 files, 5. future back ends), while the pNFS server, backed by a cluster file system, grants and revokes layouts and serves metadata over NFSv4.1. A sketch of the pluggable layout-driver structure follows.
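The class names below are illustrative only, not the Linux kernel implementation; they simply show one driver per back end behind a common client interface.

from abc import ABC, abstractmethod

class LayoutDriver(ABC):
    """Common interface the generic pNFS client calls, regardless of back end."""
    @abstractmethod
    def read(self, layout, offset, length): ...
    @abstractmethod
    def write(self, layout, offset, data): ...

class FileLayoutDriver(LayoutDriver):    # NFS data servers (files)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

class ObjectLayoutDriver(LayoutDriver):  # OSD StorageBlades (objects)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

class BlockLayoutDriver(LayoutDriver):   # SCSI block devices (SBC)
    def read(self, layout, offset, length): ...
    def write(self, layout, offset, data): ...

# The client selects a driver from the layout type returned by the pNFS server.
DRIVERS = {"file": FileLayoutDriver, "object": ObjectLayoutDriver, "block": BlockLayoutDriver}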
OBJECTS, FILES, AND BLOCKS
Why does the standard support different back ends?
• Inclusive protocol to gain widespread adoption
How are Objects different from Files?
• Object layout driver supports per-file RAID for advanced data integrity features (sketched below)
• OSD interface has an efficient security protocol that optimizes metadata
operations
− iSCSI/OSD v1 standardized in 2005
− iSCSI/OSD v2 standardized in 2008
What about blocks?
• EMC (and others) have proprietary NFS variants that give clients direct
SAN access to block devices
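A minimal sketch of the per-file RAID idea using XOR parity across one file's component objects; real PanFS object RAID is more involved, so treat this purely as an illustration.

def xor_parity(data_blocks):
    """Compute one parity block over the data blocks of a single stripe of a single file."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Because parity is kept per file, losing one component object only requires reading
# that file's surviving blocks plus its parity; other files are untouched.
stripe = [b"aaaa", b"bbbb", b"cccc"]
p = xor_parity(stripe)
recovered = xor_parity([stripe[0], stripe[2], p])   # rebuild the lost second block
assert recovered == stripe[1]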
PNFS STANDARD STATUS
IETF approved Internet Drafts in December 2008
RFCs for NFSv4.1, pNFS-objects, and pNFS-blocks published January 2010
• RFC 5661 - Network File System (NFS) Version 4 Minor Version 1 Protocol
• RFC 5662 - Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description
• RFC 5663 - Parallel NFS (pNFS) Block/Volume Layout
• RFC 5664 - Object-Based Parallel NFS (pNFS) Operations
LINUX RELEASE CYCLE, 2009 AND 2010
Kernel | Merge Window | What's New
2.6.30 | Mar 2009 | RPC sessions, NFSv4.1 server, OSDv2 rev5, EXOFS
2.6.31 | June 2009 | NFSv4.1 client, sans pNFS
2.6.32 | Sep 2009 | 130 server-side patches add back-channel
2.6.33 | Dec 2009 | 43 pNFS patches
2.6.34 | Feb 2010 | 21 NFS 4.1 patches
2.6.35 | May 2010 | 1 client and 1 server patch (4.1 support)
2.6.36 | Aug 2010 | 16 patches
2.6.37 | Nov 2010 | First chunk of pNFS beyond generic 4.1 infrastructure. pNFS disabled.
LINUX RELEASE CYCLE 2011
RHEL 7 and SLES 12 should have kernels with pNFS
RHEL 6.X and SLES 11.X are interested in a pNFS backport
Kernel | Merge Window | What's New
2.6.38 | Jan 2011 | More generic pNFS code, still disabled, not fully functional
2.6.39 | Apr 2011 | Files-based back end; read, write, commit on the client. Linux server is read-only via pNFS.
2.6.40 / 3.0 | Jun 2011 | Object-based back end
3.1 | Sep 2011 | Block-based back end
3.2 | Dec 2011 (?) | More stabilization and bug fixing
3.3 | Apr 2012 (?) | More stabilization and bug fixing
TECHNOLOGY TRENDS
Capacity vs. Bandwidth
Why Disks won’t die
Solid state applications
What’s next for Panasas
DISK ACCESS RATES OVER TIME
Thanks to R. Freitas of IBM Almaden Research Center for providing much of the data for this graph.
LARGE-SCALE DATA SETS
Application teams are beginning to generate tens of TBytes of data
in a single simulation. Keeping hundreds of TBytes of data online
is increasingly common.
PI | Project | On-line Data (TBytes) | Off-line Data (TBytes)
Khokhlov | Combustion in Gaseous Mixtures | 1 | 17
Baker | Protein Structure | 1 | 2
Hinkel | Laser-Plasma Interactions | 60 | 60
Lamb | Type Ia Supernovae | 75 | 300
Vary | Nuclear Structure and Reactions | 6 | 15
Fischer | Fast Neutron Reactors | 100 | 100
Mackenzie | Lattice Quantum Chromodynamics | 300 | 70
Vashishta | Fracture Behavior in Materials | 12 | 72
Moser | Engineering Design of Fluid Systems | 3 | 200
Lele | Multi-material Mixing | 215 | 100
Kurien | Turbulent Flows | 10 | 20
Jordan | Earthquake Wave Propagation | 1000 | 1000
Tang | Fusion Reactor Design | 50 | 100
Data requirements for select 2011 INCITE applications at ALCF
BLADE CAPACITY AND SPEED HISTORY
[Charts] Bandwidth (MB/sec), capacity (GB), and end-to-end write time across four generations of Panasas blades (SB-160, SB-800, SB-2000, SB-4000).
Compare the time to write a blade (two disks) end-to-end over 4 generations of Panasas blades:
Capacity increased 25x
Bandwidth increased 3.4x
(function of CPU, memory, disk)
Time goes from 44 min to > 5 hrs
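A back-of-the-envelope check of the numbers above (write time scales with capacity divided by bandwidth):

# 25x more capacity, only 3.4x more bandwidth => roughly 7.4x longer to write a blade end-to-end
old_time_min = 44
new_time_min = old_time_min * 25 / 3.4
print(new_time_min / 60)   # about 5.4 hours, consistent with "> 5 hrs"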
BANDWIDTH/CAPACITY GAP
Whole-device protection (traditional block RAID)
• Recovery based on time to read single device end-to-end
• Media defects during rebuild are a problem
• Long exposure times are a problem, and worse all the time
Per-file protection (Panasas object RAID)
• Declustering spreads rebuild workload among StorageBlades
− Reading 1 device's worth of capacity as 1/10 from each of 10 devices is 10x faster, or 10x less intrusive (see the model below)
• Per-file fault isolation
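To make the rebuild argument concrete, here is a simple model with illustrative numbers (not measured Panasas figures):

CAPACITY_GB = 2000        # capacity of the failed device (example value)
BANDWIDTH_MBS = 100       # sustained per-device read bandwidth (example value)

def whole_device_rebuild_hours():
    # Traditional block RAID: a surviving device is read end-to-end at single-device speed.
    return CAPACITY_GB * 1000 / BANDWIDTH_MBS / 3600

def declustered_rebuild_hours(num_devices=10):
    # Declustered object RAID: each surviving device contributes ~1/num_devices of the data, in parallel.
    return CAPACITY_GB * 1000 / (BANDWIDTH_MBS * num_devices) / 3600

print(whole_device_rebuild_hours())    # ~5.6 hours
print(declustered_rebuild_hours(10))   # ~0.56 hours: 10x faster, or 10x less intrusive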
PAS-12 4TH GENERATION BLADE
Year | CPU / Memory | Storage | Bandwidth
2002 | 850 MHz / PC 100 | 80 GB PATA |
2004 | 1.2 GHz / PC 100 | 250 GB SATA | 330 MB/sec
2006 | 1.5 GHz / DDR 400 | 500 GB SATA | 400 MB/sec
2008 | 10 GE shelf switch | 750 GB SATA | 600 MB/sec
2009 | SSD Hybrid | 1000 GB SATA, 32 GB SSD | 600 MB/sec
2010 | 1.67 GHz / DDR3 800 | 2000 GB SATA (64 GB SSD) | 1.5 GB/sec
[Diagram] PAS-12 shelf: power supplies (PS1, PS2), battery (BAT), network modules (NET1, NET2), 11x blades.
PAS-12 BLADE CONFIGURATIONS
Director Blade
• 10 GbE NIC x 2 ports
• i7 multi-core CPU
• 3 memory channels
Storage Blade
• i7 single-core CPU
• 2 memory channels
• SSD options: 1) with SSD, 2) no SSD
• 40 TB (2 TB HDD x 2)
WHY CAN’T WE JUNK ALL THE DISKS
10x the per-bit cost, and the gap isn’t closing
We cannot manufacture enough bits on wafers to displace disks
• The per-bit cost of a semiconductor fab is off the charts compared to a disk manufacturing facility
Your laptops may all have SSDs, but not all of your compute nodes will get one
DISK BECOMES THE NEW TAPE
And Tape doesn’t go away, either
• Still half the per-bit cost, and much less lifetime cost
• Tape is just different
− No power at rest
− Physical mobility
− Higher per-device bandwidth (1.5x to 2x)
Disk capacity trends continue
• 3TB in 2011
• 4TB in 2012
• 8, 16, 32 TB drives are coming
− Patterned media
− Shingled recording
THE ROLE OF SOLID STATE
Response-time sensitive data
• Block-level metadata to know where the real data lives on HD
Efficient non-volatile storage
• Journals, etc.
Critical datasets
• Airline reservation systems (databases)
Personal devices
• Yeah, we all want some
THE IDEAL STORAGE BUILDING BLOCK
DRAM, SSD, Disk, CPU and Networking in one package
• Panasas PAS9 shipped in Q3 2009,
but constrained to one SSD and one HD
• Panasas actively engaged in next generation building block
High-level interface to storage devices
• iSCSI/OSDv2 includes high-level operations for
− Capacity management
− Snapshots
− High throughput
Industry standard access to high performance file system
• NFSv4.1 and pNFS
GETTING TO EXASCALE
Reliability first
• The ability to track and respond to failures is critical
• Panasas Realm Manager provides integrated fault management
• Harden individual components
• High availability fault model to stay online with multiple faults
• Sophisticated on-line error diagnostics and automated response
Speeds and feeds
• Performance scalability comes from base components and the
architecture to support aggregation of large numbers of them
• More sophisticated I/O scheduling to deal with QoS, jitter, and congestion
CONCLUSION
Thank you for the opportunity to speak at this event
Storage remains a challenging and interesting area
We look forward to a long partnership with the HPC user
community as we help advance the state of high performance I/O
systems
Panasas Background
TECHNICAL DIFFERENTIATION
Scalable Performance
• start with 12 servers -> grow to 12,000
Novel Data Integrity Protection
• File system and RAID are integrated
• highly reliable data w/ novel data protection systems
Maximum Availability
• built-in distributed system platform, tested under fire
• LANL measured > 99.9% availability across its Panasas systems, counting all causes of downtime
Simple to Deploy and Maintain
• integrated storage system with appliance model
Application Acceleration
• customer proven results
PANASAS FEATURES
Object RAID (2003-2004)
NFS w/ multiprotocol file locking (2005)
Replicated cluster management (2006)
Declustered Parallel Object RAID rebuild (2006)
Metadata Fail Over (2007)
Snapshots, NFS Fail Over, Tiered Parity (2008)
Async Mirror, Data Migration (2009)
SSD/SATA Hybrid Blade (2009)
64-bit multicore (2010)
User Group Quota (2010)
Dates are the year the feature shipped in a production release
PANASAS REFERENCES
www.panasas.com
Scalable Performance in the Panasas Parallel File System
http://www.usenix.org/event/fast08/tech/full_papers/welch/welch_html/
As complete a technical description as you can find if you want to know more
about how the system works