Accelerating Lustre with Cray DataWarp
Steve Woods, Solutions Architect
Accelerate Your Storage!
● The Problem
● A New Storage Hierarchy
● DataWarp overview
● End User Perspectives
● Use cases
● Features
● Examples
● Configuration Considerations
● Summary
The Problem: Buying Disk for Bandwidth Is Expensive
HPC Wire, May 1, 2014. Attributed to Gary Grider, LANL.
New Storage Hierarchy

● Traditional: CPU → Memory (DRAM, on node) → Storage (HDD, off node)
● Today: CPU → Near Memory (HBM/HMC) → Near Storage (SSD) → Far Memory (DRAM/NVDIMM) → Far Storage (HDD), spanning on-node and off-node tiers

The tiers nearest the CPU have the highest effective cost and lowest latency; the tiers farthest away have the lowest effective cost and highest latency.
New Storage Hierarchy

● DataWarp
  ● Software-defined storage
  ● High-performance storage pool
● Sonexion
  ● Scalable file system
  ● Resilient storage
● Problem solved!
  ● Scale bandwidth separately from capacity
  ● Reduce overall solution cost
  ● Improve application run time

[Diagram: Cray today — Near Storage (SSD) supplies the bandwidth needed; Far Storage (HDD) supplies the capacity needed]
Blending Flash with Disk for High-Performance Lustre

● Sonexion-only solution: lots of SSUs for bandwidth, which drives up the cost of bandwidth ($/GB/s)
● Blended solution: DataWarp satisfies the bandwidth needs and Sonexion satisfies the capacity needs, which drives down the cost of bandwidth ($/GB/s)
DataWarp Overview
● Software
  ● Virtualizes the underlying hardware
  ● Single solution of flash and HDD
  ● Automation via policy
  ● Intuitive interface
  = Harnesses the performance
● Hardware
  ● Intel server
  ● Block-based SSDs
  ● Aries I/O blade
  = Raw performance
Software Phases of DataWarp
● Phase 0 (available 2014)
  ● Statically configured compute node swap
  ● Single-server file systems, /flash/
● Phase 1 (fall 2015) [CLE 5.2UP04 + patches]
  ● Dynamic allocation and configuration of DataWarp storage to jobs (WLM support)
  ● Application-controlled explicit movement of data between DataWarp and the parallel file system (stage_in and stage_out; sketched below)
  ● DVS striping across DataWarp nodes
● Phase 2 (late 2016) [CLE 6.0UP02]
  ● DVS client caching
  ● Implicit movement of data between DataWarp and PFS storage (cache)
  ● No application changes required
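Phase 1's explicit staging is driven by batch-script directives; a minimal sketch, with hypothetical file names and Lustre paths:

#DW jobdw type=scratch access_mode=striped capacity=1000GiB
#DW stage_in type=file source=/lus/scratch/user/input.dat destination=$DW_JOB_STRIPED/input.dat
#DW stage_out type=file source=$DW_JOB_STRIPED/output.dat destination=/lus/scratch/user/output.dat

The WLM performs the stage-in before the compute allocation starts and the stage-out after it ends, so the data movement does not hold compute nodes.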
DataWarp Hardware

● Package
  ● Standard XC I/O blade
  ● SSDs instead of PCIe cables
  = Plugs right into the Aries network
● Capacity
  ● 2 nodes per blade
  ● 2 SSDs per node
  = 12.6 TB per blade (shown)
● Performance
  = Node processors are already optimized for I/O and the Cray Aries network
[Diagram: XC I/O blade — Aries (A) network ASICs connecting compute (C) nodes, two DataWarp (DW) nodes each carrying two 3.2 TB SSDs (12.6 TB per blade), and LNET (LN) nodes with HCAs out to Lustre storage]
DataWarp Software

[Diagram: software stack — SSD devices managed by a Logical Volume Manager; DWFS (an open-source file system) as the distributed file system layer that virtualizes the pool of flash; the Data Virtualization Service (DVS) as the service layer that virtualizes I/O and handles file presentation to the application and the PFS; the DataWarp Service (DWS) as the service layer that defines the user experience for the user and the WLM]
DataWarp User Perspectives

● Transparent: new user; no change to their experience (e.g., PFS cache)
● Active: experienced user; WLM script commands; common for most use cases
● Optimized: power user; control via library/CLI (e.g., async workflows)
DataWarp User Perspectives
● Workload Manager (WLM) Integration
  ● Researcher/engineer inserts DataWarp commands into the job script:
    ● "I need this much space in the DataWarp pool"
    ● "I need the space in DataWarp to be shared"
    ● "I need the results saved out to the Parallel File System"
  ● Job script requests resources via the WLM
    ● DataWarp capacity
    ● Compute nodes, files, file locations
  ● WLM automates clean-up after the application completes
WLM integration is the key: ease of use and dynamic provisioning. Each of the requests above maps to a directive, as sketched below.
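A minimal sketch of that mapping; the results path is hypothetical:

#DW jobdw type=scratch access_mode=striped capacity=500GiB   # this much space, shared across compute nodes
#DW stage_out type=directory source=$DW_JOB_STRIPED/results destination=/lus/scratch/user/results   # save results to the PFS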
DataWarp User Perspectives

● Supported Workload Managers (a SLURM sketch follows below)
  ● SLURM
  ● Moab/Torque
  ● PBS-Pro
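Under SLURM, the same #DW directives are placed in the sbatch script; a minimal sketch, with node count and walltime assumed:

#!/bin/bash
#SBATCH --nodes=8 --time=00:30:00
#DW jobdw type=scratch access_mode=striped capacity=790GiB
srun -n 8 df -h $DW_JOB_STRIPED   # the DataWarp mount is visible on the allocated compute nodes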
Use Cases for DataWarp

● Burst Buffer: checkpoint/restart
● PFS Cache: local cache for the PFS; transparent user model
● Local Storage: private scratch space; swap space
● Shared Storage: reference files; file interchange; high-performance scratch (we'll focus here)
Use Cases for DataWarp

● Reference files
  ● Read intensive
  ● Commonly used by multiple compute nodes
● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes reading shared reference files from DataWarp nodes]
Use Cases for DataWarp

● File interchange
  ● Sharing intermediate work
● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes exchanging intermediate files through DataWarp nodes]
Use Cases for DataWarp

● High-performance scratch
  ● Files are striped across the pool
● DataWarp
  ● User-directed behavior
  ● Automated provisioning of resources

[Diagram: Cray HPC compute nodes striping scratch files across DataWarp nodes]
DataWarp Application Flexibility

[Diagram: four configurations of Cray HPC compute nodes with DataWarp nodes — local storage; shared storage; burst buffer, where writes burst to DataWarp nodes and trickle out to Sonexion Lustre; and PFS cache, where DataWarp nodes cache Sonexion Lustre]
#DW jobdw ...
● Requests a job DataWarp instance
  ● Lifetime is the same as the batch job
  ● Only usable by that batch job
● capacity=<size>
  ● Indirect control over server count, based on the pool's granularity
  ● It might help to request more space than you need (see the sketch below)
● type=scratch
  ● Selects use of the DWFS file system
● type=cache
  ● Selects use of the DWCFS file system
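Capacity is carved out in units of the pool's granularity, and the allocation units are spread across DataWarp servers, so asking for more capacity is the indirect way to get more servers and more bandwidth. A sketch, assuming a pool granularity of roughly 400 GiB:

dwstat pools   # shows each pool's granularity and free space
#DW jobdw type=scratch access_mode=striped capacity=1200GiB   # ~3 allocation units at ~400 GiB granularity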
#DW jobdw ... (continued)
● access_mode=striped
  ● All compute nodes see the same file system
  ● Files are striped across all allocated DW server nodes
  ● Files are visible to all compute nodes using the instance
  ● Aggregates both capacity and bandwidth per file
● access_mode=private
  ● Each compute node sees a different file system
  ● Files only go to a single DW server node
  ● A compute node uses the same DW node, and its files are seen only by that compute node
● access_mode=striped,private
  ● Two mount points created on each compute node
  ● They share the same space (see the sketch below)
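A sketch of requesting both modes at once; each mode appears as its own mount point and environment variable:

#DW jobdw type=scratch access_mode=striped,private capacity=500GiB
aprun -n 4 -N 1 df -h $DW_JOB_STRIPED $DW_JOB_PRIVATE   # two mounts per compute node, backed by the same space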
Simple DataWarp job with Moab
#!/bin/bash
#PBS -l walltime=2:00 -joe -l nodes=8
#DW jobdw type=scratch access_mode=striped capacity=790GiB
. /opt/modules/default/init/bash
module load dws
dwstat most                       # show DW space available and allocated
cd $PBS_O_WORKDIR
aprun -n 1 df -h $DW_JOB_STRIPED  # only visible on compute nodes
IOR=/home/users/dpetesch/bin/IOR.XC
aprun -n 32 -N 4 $IOR -F -t 1m -b 2g -o $DW_JOB_STRIPED/IOR_file
DataWarp scratch vs. cache
● Scratch (phase 1)

#!/bin/bash
#PBS -l walltime=4:00:00 -joe -l nodes=1
#DW jobdw type=scratch access_mode=striped capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_out

● Cache (phase 2)

#!/bin/bash
#PBS -l walltime=4:00:00 -joe -l nodes=1
#DW jobdw type=cache access_mode=striped pfs=/lus/scratch/dw_cache capacity=200GiB
cd $PBS_O_WORKDIR
export TMPDIR=$DW_JOB_STRIPED_CACHE
NAST="/msc/nast20131/bin/nast20131 scr=yes bat=no sdir=$TMPDIR"
ccmrun ${NAST} input.dat mem=16gb mode=i8 out=dw_cache_out
DataWarp Bandwidth
The DataWarp bandwidth seen by an application depends on multiple factors:
● Transfer size of the I/O requests (see the sweep sketched below)
● Number of active streams (files) per DataWarp server (for file-per-process I/O, this equals the number of processes)
● Number of DataWarp server nodes (which is related to the capacity requested)
● Other activity on the DW server nodes: administrative work and other user jobs. It is a shared resource.
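One way to see the transfer-size effect is to sweep it with IOR, as in the earlier job script; a sketch, with the rank counts and IOR path assumed:

IOR=/home/users/dpetesch/bin/IOR.XC
for ts in 64k 512k 1m 4m; do
  aprun -n 32 -N 4 $IOR -F -t $ts -b 1g -o $DW_JOB_STRIPED/ior_$ts   # file-per-process at each transfer size
done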
Minimize Compute Residence Time with DataWarp

[Chart: node count vs. wall time, Lustre vs. DataWarp. With Lustre alone, compute nodes sit idle during the initial data load, each timestep write, and the final data writes. With DataWarp, the DW nodes handle the preload before the job and the post dump after it, timestep writes go to DW, compute I/O time shrinks, and total wall time drops.]
DataWarp with MSC NASTRAN

[Chart: MSC NASTRAN elapsed time, DataWarp vs. Lustre only]

Job wall clock reduced by 2x with DataWarp.

Cray blog reference: http://www.cray.com/blog/io-accelerator-boosts-msc-nastran-simulations/
[Chart: elapsed seconds for Abaqus/Standard — Abaqus 2016 s4e model, 24M elements, 2 ranks per node on 16-core 2.3 GHz Haswell nodes with 128 GB; scaling from cpus=128 (4 nodes) to cpus=1536 (48 nodes); series: XC40 ABI Lustre, XC40 ABI DataWarp, CS400 Lustre, CS400 /tmp]
DataWarp Considerations
• Know your workload
  • Capacity requirement
  • Bandwidth requirement
  • Iteration interval
• Calculate the ratio of DataWarp to spinning disk (a worked sketch follows below)
  • % of the calculated bandwidth needed from DW vs. HDD
  • Whether excess bandwidth is needed to sync to HDD
  • % of storage capacity needed in DW to maintain performance: capacity for multiple iterations
• Budget
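A worked sizing sketch in shell arithmetic; every input number is an assumption, not Cray guidance:

ckpt_tb=50         # checkpoint size in TB (assumed)
ckpt_min=5         # time budget per checkpoint in minutes (assumed)
interval_min=60    # interval between checkpoints in minutes (assumed)
iterations=3       # checkpoint iterations kept in DataWarp (assumed)
echo "DW burst bandwidth:  $(( ckpt_tb * 1000 / (ckpt_min * 60) )) GB/s"      # ~166 GB/s to absorb the burst
echo "HDD drain bandwidth: $(( ckpt_tb * 1000 / (interval_min * 60) )) GB/s"  # ~13 GB/s to sync to disk
echo "DW capacity:         $(( ckpt_tb * iterations )) TB"                    # hold several iterations

The gap between the burst and drain bandwidths is exactly what the blended DataWarp/Sonexion design exploits.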
DataWarp Bottom Line
• It is about reducing "time to solution": returning control back to compute sooner
• And reducing the cost of "time to solution"
DataWarp Summary
Faster time to insight:
1. Easy to use
2. Accelerates performance
3. Dynamic and flexible
Questions?