
Non-Volatile Memory for Next Generation I/O

Dr Michèle Weiland

m.weiland@epcc.ed.ac.uk

Current trends & approaches

28/03/2017 39th ORAP Forum 2

Burst buffer

[Figure: two diagrams. First: compute nodes, high performance network, external filesystem. Second: compute nodes, high performance network, burst filesystem, external filesystem.]

Moving beyond burst buffer

• Non-volatile storage is coming to the node rather than the filesystem
• Argonne Theta machine has a 128 GB SSD in each compute node

[Figure: compute nodes, high performance network, external filesystem.]

Non-volatile memory

• Non-volatile RAM
  • 3D XPoint technology is one example
• Much larger capacity than DRAM
• Hosted in the DRAM slots (DIMM form factor), controlled by a standard memory controller
• Slower than DRAM by a small factor, but significantly faster than SSDs

[Figure: two memory hierarchies. First: Cache, Memory, Storage. Second: Cache, Memory, NVRAM, Fast Storage, Slow Storage.]

The NEXTGenIO approach


NEXTGenIO key facts

• FETHPC Research & Innovation Action
• Active for 18 months so far
• 8 partners, covering
  • Hardware
  • HPC centres and users
  • Software
  • Tools developers


Our objectives

• Hardware platform prototype
  ➢ Demonstrating the prototype’s broad applicability for both HPC and data centric applications
• Exascale I/O investigation
  ➢ Understanding how best to exploit NVRAM
• Systemware development
  ➢ Producing the necessary software to enable (Exascale) application execution on the hardware platform
• Application co-design
  ➢ Understanding individual application I/O profiles and typical I/O workloads on shared systems running multiple different applications


Systemware

• System software must understand the extra level present in the memory hierarchy
• Work on adapting job scheduler (SLURM)
• Development of a data scheduler
• Object stores as alternatives to file systems
  • DAOS (Distributed Asynchronous Object Storage)
  • dataClay
• Multi-node NVRAM file system
  • echoFS

☛ Key goal: Platform must be usable “as is” for legacy applications


Workloads & I/O

• Try to understand how different I/O behaviour and scheduling policies will impact job throughput
• Three different workloads
  • Generic → EPCC
  • Special purpose → ECMWF
  • Commercial → Arctur
• I/O Workload Simulator
  • Create benchmark of synthetic jobs generated from real workloads → to be deployed on HPC system
  • Create simulator of workload schedule to test impact of policies and I/O performance → to be deployed on laptop
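To illustrate the second idea, a schedule simulator small enough to run on a laptop might look like the sketch below. It is not the NEXTGenIO simulator: the `Job` fields, the `makespan` function, the first-come-first-served policy and the assumption that every job sees the full I/O bandwidth without contention are all hypothetical simplifications.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    compute_s: float  # seconds spent computing
    io_gb: float      # data read plus written, in GB


def makespan(jobs, total_nodes, io_gb_per_s):
    """Run jobs first-come-first-served; return when the last one ends."""
    running = []      # min-heap of (end_time, nodes_held)
    free = total_nodes
    now = 0.0
    last_end = 0.0
    for job in jobs:
        if job.nodes > total_nodes:
            raise ValueError(f"{job.name} needs more nodes than exist")
        # Advance time until enough nodes have been released.
        while free < job.nodes:
            end, nodes = heapq.heappop(running)
            now = max(now, end)
            free += nodes
        # Assumption: I/O time is volume / bandwidth, with no contention.
        end = now + job.compute_s + job.io_gb / io_gb_per_s
        heapq.heappush(running, (end, job.nodes))
        free -= job.nodes
        last_end = max(last_end, end)
    return last_end


# Made-up workload, purely for illustration.
workload = [
    Job("a", nodes=2, compute_s=100, io_gb=50),
    Job("b", nodes=2, compute_s=100, io_gb=50),
    Job("c", nodes=4, compute_s=50, io_gb=200),
]
slow = makespan(workload, total_nodes=4, io_gb_per_s=1.0)
fast = makespan(workload, total_nodes=4, io_gb_per_s=10.0)
print(slow, fast)
```

With these made-up numbers the makespan falls from 400 s to 175 s when the I/O bandwidth rises tenfold. A sketch like this makes that kind of policy and performance question testable before anything is deployed on an HPC system.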


ARCHER workload

[Figure: ARCHER workload plot; legend: metadata, write, read.]

ARCHER workload

[Figure: ARCHER workload plot; legend: metadata, write, read.]

Tools support

• Profiling and debugging tools need to be able to understand the implications of an additional (potentially persistent) memory layer


Applications

• Traditional HPC
  • OpenFOAM → CFD
  • CASTEP → Chemistry
  • IFS → weather forecasting
  • MONC → cloud modelling
• Novel uses
  • OSPRay → ray tracing, rendering engine
  • Halvade → genome sequencing
  • Tiramisu → deep learning (based on Caffe)
  • K-means → ML


Usage models


NVRAM usage models

• The “memory” usage model allows for the extension of the main memory
  • The data is volatile, like normal DRAM-based main memory
• The “storage” usage model supports the use of NVRAM as a classic block device
  • E.g. like a very fast SSD
• The “application direct” usage model maps persistent storage from the NVRAM directly into the main memory address space
  • Direct CPU load/store instructions for persistent main memory regions
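The application direct model can be approximated on any machine with a memory-mapped file. The sketch below uses Python's standard `mmap` module on an ordinary temporary file. On real hardware the file would be assumed to sit on a DAX-mounted persistent memory filesystem, and a dedicated library such as PMDK would usually take care of flushing and consistency. The path and the `bump_counter` helper are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Stand-in for a file on persistent memory (assumption: a plain temp file).
path = os.path.join(tempfile.mkdtemp(), "counter.bin")


def bump_counter(path):
    """Map 8 bytes of the file into memory and increment the integer there."""
    if not os.path.exists(path):
        # First use: create the backing file holding a zero counter.
        with open(path, "wb") as f:
            f.write(struct.pack("<Q", 0))
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 8) as region:
            (value,) = struct.unpack_from("<Q", region, 0)  # load
            struct.pack_into("<Q", region, 0, value + 1)    # store
            region.flush()  # push the update to the backing medium
            return value + 1


first = bump_counter(path)
second = bump_counter(path)   # sees the value persisted by the first call
print(first, second)
```

The point of the sketch is that the data is updated through ordinary memory accesses on the mapped region, with no `read` or `write` calls, and that the value survives between the two mappings.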


Exploiting distributed storage

[Figure: three diagrams of six nodes with memory, connected through a network to a filesystem. First: no NVRAM in the nodes. Second: NVRAM in every node. Third: NVRAM in every other node.]

Using distributed storage

• Without changing applications
  • Large memory space/in-memory database etc…
  • Local filesystem
    ➢ Users manage data themselves
    ➢ No global data access/namespace, large number of files
    ➢ Still require global filesystem for persistence

[Figure: six nodes, each with memory and a local /tmp, connected through a network to a filesystem.]

Using distributed storage

• Without changing applications
  • Filesystem buffer
    ➢ Pre-load data into NVRAM from filesystem
    ➢ Use NVRAM for I/O and write data back to filesystem at the end
    ➢ Requires systemware to preload and postmove data
    ➢ Uses filesystem as namespace manager
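The preload and postmove steps can be sketched in a few lines. This is only an illustration of the pattern, not the NEXTGenIO systemware: `run_with_buffer` and `sum_numbers` are hypothetical names, and two temporary directories stand in for the external filesystem and the node-local NVRAM.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_buffer(global_dir, local_dir, inputs, compute):
    """Pre-load inputs, run compute against local_dir, post-move outputs."""
    global_dir, local_dir = Path(global_dir), Path(local_dir)
    for name in inputs:
        shutil.copy2(global_dir / name, local_dir / name)    # pre-load
    outputs = compute(local_dir)                             # job I/O stays local
    for name in outputs:
        shutil.move(str(local_dir / name), str(global_dir / name))  # post-move
    return outputs


def sum_numbers(workdir):
    """Toy application: reads input.txt, writes sum.txt, returns output names."""
    values = (workdir / "input.txt").read_text().split()
    (workdir / "sum.txt").write_text(str(sum(int(v) for v in values)))
    return ["sum.txt"]


# Temporary directories stand in for the external filesystem and node NVRAM.
global_fs = Path(tempfile.mkdtemp(prefix="global_"))
nvram = Path(tempfile.mkdtemp(prefix="nvram_"))
(global_fs / "input.txt").write_text("1 2 3")

run_with_buffer(global_fs, nvram, ["input.txt"], sum_numbers)
result = (global_fs / "sum.txt").read_text()
print(result)
```

The application itself only ever sees `workdir`, which is why this usage model needs no application changes. The file names on the global filesystem remain the single namespace.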

[Figure: six nodes, each with memory and a local buffer, connected through a network to a filesystem.]

Using distributed storage

• Without changing applications
  • Global filesystem
    ➢ Requires functionality to create and tear down global filesystems for individual jobs
    ➢ Requires filesystem that works across nodes
    ➢ Requires functionality to preload and postmove filesystems
    ➢ Need to be able to support multiple filesystems across system

[Figure: six nodes with memory and a filesystem spanning the nodes, connected through a network to a filesystem.]

Using distributed storage

• With changes to applications
  • Object store
    ➢ Needs same functionality as global filesystem
    ➢ Removes need for POSIX, or POSIX-like functionality

[Figure: six nodes with memory and an object store spanning the nodes, connected through a network to a filesystem.]

Using distributed storage

• Without changing applications
  • Automatic check-pointing
• Resiliency
  • Local check-pointing without hitting the filesystem
• Pause and restart
  • Just-in-time scheduling/high priority jobs
  • Waiting for something else to happen…
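A minimal sketch of local check-pointing with pause and restart is given below. It assumes the checkpoint file sits on node-local persistent storage (here a temporary directory) and uses a write-then-rename pattern so that an interrupted write never leaves a half-written checkpoint. The function names and the toy computation are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes reach the device
    os.replace(tmp, path)      # replaces the old checkpoint in one step


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, max_steps_this_run):
    """Sum 1..total_steps, checkpointing after every step."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    done = 0
    while state["step"] < total_steps and done < max_steps_this_run:
        state["step"] += 1
        state["total"] += state["step"]
        save_checkpoint(state, path)
        done += 1
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
paused = run(ckpt, total_steps=10, max_steps_this_run=4)     # job is paused
resumed = run(ckpt, total_steps=10, max_steps_this_run=10)   # job restarts
print(paused, resumed)
```

The second call picks up at step 5 rather than starting again, which is the behaviour a scheduler would rely on when pausing a job to make room for a high priority one.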


Using distributed storage

• New usage models
  • Resident data sets
    • Sharing preloaded data across a range of jobs
    • Data analytic workflows
    • How to control access/authorisation/security/etc.?
  • Workflows
    • Producer-consumer model
    • Remove filesystem from intermediate stages

[Figure: workflow of Job1, Job2, Job3 and Job4, with the filesystem.]

Using distributed storage

• Workflows
  • How to enable different sized applications?
  • How to schedule these jobs fairly?
  • How to enable secure access?

[Figure: workflow of Job1, Job3 and several instances of Job2 and Job4, with the filesystem.]

The Challenge of distributed storage

• Enabling all the use cases in a multi-user, multi-job environment is the real challenge
  • Heterogeneous scheduling mix
  • Different requirements on the NVRAM
  • Scheduling across these resources
  • Enabling sharing of nodes
  • Not impacting on node compute performance
• Enabling applications to do more I/O
  • Large numbers of our applications don’t heavily use I/O at the moment
  • What can we enable if I/O is significantly cheaper?


Questions?
