Accelerating Lustre! with Cray DataWarp
Steve Woods, Solutions Architect


Page 1:

Accelerating Lustre! with Cray DataWarp
Steve Woods, Solutions Architect

Page 2:

Accelerate Your Storage!

● The Problem

● A New Storage Hierarchy

● DataWarp Overview

● End-User Perspectives

● Use Cases

● Features

● Examples

● Configuration Considerations

● Summary

Page 3:

The Problem
Buying Disk for Bandwidth is Expensive

HPCwire, May 1, 2014; attributed to Gary Grider, LANL

Page 4:

New Storage Hierarchy

[Diagram: the storage hierarchy, traditional vs. today.
Traditional: CPU → Memory (DRAM) → Storage (HDD).
Today: CPU → Near Memory (HBM/HMC) and Far Memory (DRAM/NVDIMM) on node; Near Storage (SSD) and Far Storage (HDD) off node.
The tiers run from highest effective cost and lowest latency at the top to lowest effective cost and highest latency at the bottom.]

Page 5:

New Storage Hierarchy


● DataWarp
  ● Software-defined storage
  ● High-performance storage pool

● Sonexion
  ● Scalable file system
  ● Resilient storage

● Problem solved! Scale bandwidth separately from capacity; reduce overall solution cost; improve application run time

[Diagram: Cray's hierarchy today. DataWarp SSDs at the Near Storage tier supply the bandwidth needed; Sonexion HDDs at the Far Storage tier supply the capacity needed.]

Page 6:

Blending Flash with Disk for High-Performance Lustre

● Sonexion-only solution: lots of SSUs for bandwidth, which drives up the cost of bandwidth ($/GB/s)

● Blended solution: DataWarp satisfies the bandwidth needs and Sonexion satisfies the capacity needs, which drives down the cost of bandwidth ($/GB/s)

Page 7:

DataWarp Overview

● Software
  ● Virtualizes the underlying hardware
  ● Single solution of flash and HDD
  ● Automation via policy
  ● Intuitive interface
  = Harnesses the performance

● Hardware
  ● Intel server
  ● Block-based SSDs
  ● Aries I/O blade
  = Raw performance

Page 8:

Software Phases of DataWarp


● Phase 0 (available 2014)
  ● Statically configured compute node swap
  ● Single-server file systems, /flash/

● Phase 1 (fall 2015) [CLE 5.2UP04 + patches]
  ● Dynamic allocation and configuration of DataWarp storage to jobs (WLM support)
  ● Application-controlled explicit movement of data between DataWarp and the parallel file system (stage_in and stage_out; see the directive sketch below)
  ● DVS striping across DataWarp nodes

● Phase 2 (late 2016) [CLE 6.0UP02]
  ● DVS client caching
  ● Implicit movement of data between DataWarp and PFS storage (cache)
  ● No application changes required
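A minimal sketch of the Phase 1 explicit staging, in the style of the job scripts on pages 22-23. The jobdw, stage_in, and stage_out directives are DataWarp's; the paths and application name are illustrative:

#!/bin/bash
#PBS -l walltime=1:00:00 -j oe -l nodes=4
#DW jobdw type=scratch access_mode=striped capacity=200GiB
#DW stage_in type=directory source=/lus/scratch/users/me/input destination=$DW_JOB_STRIPED/input
#DW stage_out type=directory source=$DW_JOB_STRIPED/output destination=/lus/scratch/users/me/output
cd $PBS_O_WORKDIR
# The WLM stages the input into DataWarp before the job starts and
# stages the output back to the parallel file system after it completes.
aprun -n 16 ./my_app $DW_JOB_STRIPED/input $DW_JOB_STRIPED/output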


Page 9:

DataWarp Hardware

● Package
  ● Standard XC I/O blade
  ● SSDs instead of PCIe cables
  = Plugs right into the Aries network

● Capacity
  ● 2 nodes per blade
  ● 2 SSDs per node
  = 12.6 TB per blade (shown)

● Performance
  = Node processors are already optimized for I/O and the Cray Aries network

[Blade diagram: Aries routers (A) link compute nodes (C), DataWarp nodes (DW, each with two 3.2 TB SSDs, 12.6 TB per blade), and LNET router nodes (LN, with HCAs) to Lustre storage.]

Page 10:

DataWarp Software

[Stack diagram: the User and the WLM drive the DataWarp Service; application I/O flows through the Data Virtualization Service (DVS) and DWFS (built on an open-source file system) over the Logical Volume Manager to the devices, with data staged to and from the PFS; DVS provides the file presentation to the compute nodes.]

● Distributed file system layer (DWFS): virtualizes the pool of flash

● Service layer (DWS): defines the user experience

● Service layer (DVS): virtualizes I/O

Page 11:

DataWarp User Perspectives

● Transparent: new user; no change to their experience (e.g., PFS cache)

● Active: experienced user; WLM script commands; common for most use cases

● Optimized: power user; control via library/CLI (e.g., asynchronous workflows)

Page 12:

DataWarp User Perspectives

● Workload Manager (WLM) integration
  ● The researcher/engineer inserts DataWarp commands into the job script:
    ● "I need this much space in the DataWarp pool"
    ● "I need the space in DataWarp to be shared"
    ● "I need the results saved out to the Parallel File System"

● The job script requests resources via the WLM
  ● DataWarp capacity
  ● Compute nodes, files, file locations

● The WLM automates clean-up after the application completes

WLM integration is the key: ease of use and dynamic provisioning.

Page 13:

DataWarp User Perspectives

[Stack diagram as on page 10, with DWFS layered over XFS: the User and the WLM drive the DataWarp Service; application I/O flows through DVS and DWFS to the devices and the PFS.]

● Supported workload managers (a SLURM sketch follows):
  ● SLURM
  ● Moab/Torque
  ● PBS Pro
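A minimal sketch of the same kind of request under SLURM, assuming a system where Cray's DataWarp burst-buffer plugin is enabled; the application name and argument are illustrative:

#!/bin/bash
#SBATCH --nodes=8 --time=00:30:00
#DW jobdw type=scratch access_mode=striped capacity=790GiB
# $DW_JOB_STRIPED points at the striped DataWarp mount on the compute nodes
srun -n 32 ./my_app --workdir=$DW_JOB_STRIPED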

Page 14:

Use Cases for DataWarp

● Shared storage: reference files, file interchange, high-performance scratch (we'll focus here)

● Local storage: private scratch space, swap space

● Burst buffer: checkpoint/restart

● PFS cache: local cache for the PFS; transparent user model

Page 15:

Use Cases for DataWarp


● Reference files
  ● Read-intensive
  ● Commonly used by many compute nodes

● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes reading shared reference files from DataWarp nodes; shared-storage model.]

Page 16:

Use Cases for DataWarp


● File interchange
  ● Sharing intermediate work

● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes exchanging intermediate files through DataWarp nodes; shared-storage model.]

Page 17:

Use Cases for DataWarp


● High-performance scratch
  ● Files are striped across the pool

● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes striping scratch files across DataWarp nodes; shared-storage model.]

Page 18:

Use Cases for DataWarp

● Shared storage: reference files, file interchange, high-performance scratch

● Local storage: private scratch space, swap space

● Burst buffer: checkpoint/restart

● PFS cache: local cache for the PFS; transparent user model

Page 19:

DataWarp Application Flexibility


[Diagrams: four deployment models.
● Local storage: each compute node uses its own slice of the DataWarp nodes.
● Shared storage: compute nodes share a striped allocation across the DataWarp nodes.
● Burst buffer: checkpoints burst from the compute nodes to the DataWarp nodes, then trickle out to Sonexion Lustre.
● PFS cache: the DataWarp nodes transparently cache Sonexion Lustre for the compute nodes.]

Page 20:

#DW jobdw ...

● Requests a job DataWarp instance (see the sketch below)
  ● Lifetime is the same as the batch job
  ● Only usable by that batch job

● capacity=<size>
  ● Indirect control over server count, based on the allocation granularity
  ● It might help to request more space than you need

● type=scratch
  ● Selects the DWFS file system

● type=cache
  ● Selects the DWCFS file system
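A minimal sketch of the directive, with hypothetical sizes (the allocation granularity varies by site and is visible via dwstat):

#DW jobdw type=scratch access_mode=striped capacity=790GiB
# capacity is rounded up to the site's allocation granularity; requesting
# more space than strictly needed can place the instance on more server
# nodes, and therefore deliver more aggregate bandwidth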


Page 21:

#DW jobdw ... (continued)

● access_mode=striped
  ● All compute nodes see the same file system
  ● Files are striped across all allocated DW server nodes
  ● Files are visible to all compute nodes using the instance
  ● Aggregates both capacity and bandwidth per file

● access_mode=private
  ● Each compute node sees a different file system
  ● Files go to only a single DW server node
  ● A compute node always uses the same DW node, and its files are seen only by that compute node

● access_mode=striped,private (see the sketch below)
  ● Two mount points are created on each compute node
  ● They share the same space
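A short sketch of requesting both modes at once; the environment variable names follow the examples on pages 22-23:

#DW jobdw type=scratch access_mode=striped,private capacity=500GiB
# Two mounts on each compute node, backed by the same allocation:
#   $DW_JOB_STRIPED - shared, striped across the DW server nodes
#   $DW_JOB_PRIVATE - per-compute-node private space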


Page 22:

Simple DataWarp job with Moab


#!/bin/bash
#PBS -l walltime=2:00 -j oe -l nodes=8
#DW jobdw type=scratch access_mode=striped capacity=790GiB
. /opt/modules/default/init/bash
module load dws
dwstat most                        # show DW space available and allocated
cd $PBS_O_WORKDIR
aprun -n 1 df -h $DW_JOB_STRIPED   # only visible on compute nodes
IOR=/home/users/dpetesch/bin/IOR.XC
aprun -n 32 -N 4 $IOR -F -t 1m -b 2g -o $DW_JOB_STRIPED/IOR_file


Page 23:

DataWarp scratch vs. cache


● Scratch (phase 1):

#!/bin/bash
#PBS -l walltime=4:00:00 -j oe -l nodes=1
#DW jobdw type=scratch access_mode=striped capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_out

● Cache (phase 2):

#!/bin/bash
#PBS -l walltime=4:00:00 -j oe -l nodes=1
#DW jobdw type=cache access_mode=striped pfs=/lus/scratch/dw_cache capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED_CACHE
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_cache_out
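The key difference between the two scripts: with type=cache, data written under $DW_JOB_STRIPED_CACHE is moved to and from the pfs= path implicitly, so no explicit staging step is needed and the application itself requires no changes.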


Page 24:

DataWarp Bandwidth

The DataWarp bandwidth seen by an application depends on multiple factors:

● Transfer size of the I/O requests
● Number of active streams (files) per DataWarp server (for file-per-process I/O, this equals the number of processes)
● Number of DataWarp server nodes (which is related to the capacity requested; see the sketch below)
● Other activity on the DW server nodes: administrative work and other user jobs. It is a shared resource.
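A back-of-the-envelope sketch in the same shell style; both the granularity and the per-server bandwidth are hypothetical placeholders for your site's actual figures:

#!/bin/bash
# Rough peak-bandwidth estimate for a striped DataWarp instance
capacity_gib=790      # capacity requested via #DW jobdw
granularity_gib=200   # hypothetical site allocation granularity
per_server_gbs=5      # hypothetical per-server streaming bandwidth (GB/s)
servers=$(( (capacity_gib + granularity_gib - 1) / granularity_gib ))
echo "~${servers} servers, ~$(( servers * per_server_gbs )) GB/s peak"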


Page 25:

Minimize Compute Residence Time with DataWarp


[Charts: node count vs. wall time, Lustre vs. DataWarp. With Lustre, compute nodes sit idle during the initial data load, the timestep writes, and the final data writes. With DataWarp, the DW nodes preload the data before the compute allocation begins and dump the results after it ends, and timestep writes go to DataWarp, so compute residence time shrinks to little more than the computation itself.]

Page 26:

DataWarp with MSC NASTRAN


[Chart: MSC NASTRAN job wall-clock time, DataWarp vs. Lustre only.]

Job wall clock reduced by 2x with DataWarp.

Cray blog reference: http://www.cray.com/blog/io-accelerator-boosts-msc-nastran-simulations/

Page 27:

[Chart: elapsed seconds for the Abaqus 2016 standard benchmark (s4e, 24M elements, 2 ranks per node, 16-core 2.3 GHz Haswell, 128 GB nodes), scaling from 4 nodes (cpus=128) to 48 nodes (cpus=1536); series: XC40 ABI Lustre, XC40 ABI DW, CS400 Lustre, CS400 /tmp.]

Page 28:

DataWarp Considerations

• Know your workload
  • Capacity requirement
  • Bandwidth requirement
  • Iteration interval

• Calculate the ratio of DataWarp to spinning disk (worked example below)
  • % of calculated bandwidth needed from DW vs. HDD
  • Whether excess bandwidth is needed to sync to HDD
  • % of storage capacity needed in DW to maintain performance: capacity for multiple iterations

• Budget
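A worked example with hypothetical numbers: if each iteration writes a 50 TB checkpoint every 60 minutes and must finish in 5 minutes so compute can resume, DataWarp must absorb 50 TB / 300 s ≈ 167 GB/s of burst bandwidth, while the PFS only needs 50 TB / 3600 s ≈ 14 GB/s to drain it before the next iteration; DataWarp capacity should hold at least two iterations (≥ 100 TB) so one checkpoint can drain while the next lands.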

Page 29:

DataWarp Bottom Line

• It is about reducing "Time to Solution"
  • Returning control back to compute

• Reducing the cost of "Time to Solution"

Page 30:

DataWarp Summary

Faster time to insight:

1. Easy to use
2. Accelerates performance
3. Dynamic and flexible

Page 31:

Questions?