Efficient Scientific Data Management on Supercomputers
Suren Byna
Scientific Data Management Group, LBNL
▪ Simulations
▪ Experiments
▪ Observations
2
Scientific Data - Where is it coming from?
3
Life of scientific data
Generation
In situ analysis
Processing
Storage
Analysis
Preservation
Sharing
Refinement
4
Supercomputing systems
5
Typical supercomputer architecture
Cori system
[Figure: Cori system architecture. Compute nodes (CN) on the Aries high-speed network; burst buffer blades (each blade = 2x burst buffer nodes, each with 2x SSD); I/O nodes (2x InfiniBand HCA) connecting to the InfiniBand storage fabric; storage servers (Lustre OSSs/OSTs).]
▪ Data representation – Metadata, data structures, data models
▪ Data storage – Storing and retrieving data and metadata to and from file systems quickly
▪ Data access – Improving the performance of the data accesses that scientists need
▪ Facilitating analysis – Strategies for helping scientists find meaning in the data
▪ Data transfers – Transferring data within a supercomputing system and between different systems
6
Scientific Data Management in supercomputers
▪ Storing and retrieving data
  – Parallel I/O
  – Software stack
  – Modes of parallel I/O
  – Tuning parallel I/O performance
▪ Autonomous data management system
  – Proactive Data Containers (PDC) system
  – Metadata management service
  – Data management service
8
Focus of this presentation
Trends – Storage system transformation
9
Conventional: Memory → (I/O gap) → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
Current (e.g., Cori @ NERSC): Memory → (I/O gap) → Shared burst buffer → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
Upcoming (e.g., Aurora @ ALCF): Memory → Node-local storage → Shared burst buffer → Parallel file system → Campaign storage → Archival storage (HPSS tape)
• The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
• SSDs and new memory technologies are filling the gap, but they increase the depth of the storage hierarchy
Applications
High Level I/O Libraries
I/O Middleware
I/O Forwarding
Parallel File System
I/O Hardware
10
Contemporary Parallel I/O Software Stack
Applications
High Level I/O Library (HDF5, NetCDF, ADIOS)
I/O Middleware (MPI-IO)
I/O Forwarding
Parallel File System (Lustre, GPFS,..)
I/O Hardware
11
Parallel I/O software stack
§ I/O Libraries
  – HDF5 (The HDF Group) [LBL, ANL]
  – ADIOS (ORNL)
  – PnetCDF (Northwestern, ANL)
  – NetCDF-4 (UCAR)
• Middleware
  – POSIX-IO, MPI-IO (ANL)
• I/O Forwarding
• File systems: Lustre (Intel), GPFS (IBM), DataWarp (Cray), …
§ I/O Hardware (disk-based, SSD-based, …)
▪ Types of parallel I/O
  • 1 writer/reader, 1 file
  • N writers/readers, N files (file-per-process)
  • N writers/readers, 1 file
  • M writers/readers, 1 file
    – Aggregators
    – Two-phase I/O
  • M aggregators, M files (file-per-aggregator)
    – Variations of this mode
12
Parallel I/O – Application view
[Figure: process-to-file mappings for these modes: 1 writer/reader to 1 file; n writers/readers to n files (file.0 … file.n); n writers/readers to 1 shared file; M writers/readers to M files; M writers/readers to 1 shared file.]
A minimal MPI-IO sketch of the shared-file mode follows below.
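To make the shared-file mode concrete, here is a minimal sketch of the "n writers/readers, 1 file" pattern using MPI-IO collective writes. The file name, element count, and the assumption that every rank writes an equal-sized contiguous slice are illustrative; a production code would also check return values.

```c
/* Sketch: n writers, 1 shared file, using MPI-IO collective writes.
 * File name and per-rank element count are illustrative placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_local = 1 << 20;                 /* elements per rank */
    double *buf = malloc(n_local * sizeof(double));
    for (int i = 0; i < n_local; i++)
        buf[i] = rank + i * 1e-6;                /* fill with some data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its contiguous slice at a rank-dependent offset.
     * The *_all variant makes the write collective, enabling two-phase I/O
     * (aggregation) inside the MPI-IO layer. */
    MPI_Offset offset = (MPI_Offset)rank * n_local * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n_local, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```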
▪ Parallel file systems
  – Lustre and Spectrum Scale (GPFS)
▪ Typical building blocks of parallel file systems
  – Storage hardware: HDD or SSD RAID
  – Storage servers (in Lustre, Object Storage Servers [OSSs] and Object Storage Targets [OSTs])
  – Metadata servers
  – Client-side processes and interfaces
▪ Management
  – Stripe files for parallelism
  – Tolerate failures
13
Parallel I/O – System view
[Figure: logical view of a file versus its physical view on a parallel file system: the file is striped across OST 0, OST 1, OST 2, and OST 3, accessed over the communication network.]
A small sketch of the striping arithmetic follows below.
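Striping is what lets many storage servers serve one file in parallel. As an illustration of the logical-to-physical mapping in the figure above, the sketch below computes which stripe object a logical file offset falls into under a simple round-robin layout; the stripe size and count are illustrative, and real Lustre layouts can be more elaborate.

```c
/* Sketch: map logical file offsets to stripe objects under simple
 * round-robin striping. Stripe size and count are illustrative. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t stripe_size  = 4ULL << 20;   /* 4 MiB stripes            */
    const unsigned stripe_count = 4;            /* file striped over 4 OSTs */

    const uint64_t offsets[] = { 0, 3ULL << 20, 5ULL << 20, 42ULL << 20 };
    for (int i = 0; i < 4; i++) {
        uint64_t off       = offsets[i];
        uint64_t stripe_no = off / stripe_size;         /* which stripe            */
        unsigned ost_index = stripe_no % stripe_count;  /* which OST (round-robin) */
        uint64_t in_stripe = off % stripe_size;         /* offset within stripe    */
        printf("offset %12llu -> stripe %llu, OST %u, intra-stripe offset %llu\n",
               (unsigned long long)off, (unsigned long long)stripe_no,
               ost_index, (unsigned long long)in_stripe);
    }
    return 0;
}
```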
14
How to achieve peak parallel I/O performance?
Application
HDF5 (alignment, chunking, etc.)
MPI-IO (enabling collective buffering, sieve buffer size, collective buffer size, collective buffer nodes, etc.)
Parallel file system (number of I/O nodes, stripe size, enabling prefetching buffer, etc.)
Storage hardware
▪ Parallel I/O software stack provides options for performance optimization
▪ Challenge: Complex inter-dependencies among SW and HW layers
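As a hedged, concrete example of how these knobs are typically exposed to an application, the sketch below passes Lustre striping and collective-buffering hints to MPI-IO through HDF5's MPI-IO file access property list, and sets HDF5 alignment and sieve-buffer sizes. The hint names follow common ROMIO/Cray conventions, the values are illustrative, and which hints are honored depends on the MPI-IO implementation and the file system.

```c
/* Sketch: pushing tuning parameters down the parallel I/O stack from an
 * HDF5 application. Hint names and values are illustrative; support is
 * implementation- and file-system-dependent. Requires parallel HDF5. */
#include <hdf5.h>
#include <mpi.h>

hid_t open_tuned_file(const char *name, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* Lustre striping and collective buffering hints (ROMIO-style names) */
    MPI_Info_set(info, "striping_factor", "64");      /* stripe count         */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* stripe size in bytes */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* collective buffering */
    MPI_Info_set(info, "cb_nodes",        "32");      /* aggregator nodes     */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);      /* MPI-IO driver + hints       */
    H5Pset_alignment(fapl, 1, 4194304);      /* align HDF5 objects to 4 MiB */
    H5Pset_sieve_buf_size(fapl, 4194304);    /* data sieving buffer size    */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}
```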
Tuning parameter space
[Figure: the whole tuning space visualized: stripe_count (4 to 128), stripe_size (1 to 128 MB), cb_nodes (1 to 128), cb_buffer_size (1 to 128 MB), alignment (524288 or 1048576), sieve buffer size (64 KB to 1 MB); on the order of 23,040 combinations in total.]
15
▪ Simulation of magnetic reconnection (a space weather phenomenon) with the VPIC code
  – 120,000 cores
  – 8 arrays (HDF5 datasets)
  – 32 TB to 42 TB files at 10 time steps
▪ Extracted I/O kernel
▪ M aggregators to 1 shared file
▪ Trial-and-error selection of Lustre file system parameters while scaling the problem size
▪ Reached peak performance in many instances in a real simulation
16
Tuning for writing trillion particle datasets
More details: SC12 and CUG 2013 papers
Tuning combinations are abundant
• Searching through all combinations manually is impractical
• Users, typically domain scientists, should not be burdened with tuning
• Performance auto-tuning has been explored heavily for optimizing matrix operations
• Auto-tuning for parallel I/O is challenging due to shared I/O subsystem and slow I/O
• Need a strategy to reduce the search space using some knowledge of the system
17
Our solution: I/O Auto-tuning
• Auto-tuning framework to search the parameter space with a reduced number of combinations
• HDF5 I/O library sets the optimization parameters
• H5Tuner: dynamic interception of HDF5 calls (a sketch of the general interception technique follows below)
• H5Evolve:
  – Genetic algorithm based selection
  – Model-based selection
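The slides describe H5Tuner's interception only at a high level, so the block below is merely a sketch of the general LD_PRELOAD/dlsym interposition technique such a shim can use: the wrapper shadows H5Fcreate, injects tuned file-access properties, and forwards to the real function, all without modifying the application. It is not the actual H5Tuner source, and the injected values are illustrative.

```c
/* Sketch of dynamic interception of an HDF5 call. Build as a shared library
 * and load it with LD_PRELOAD so this wrapper is resolved before the real
 * H5Fcreate. Not the actual H5Tuner code; injected values are illustrative. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <hdf5.h>

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Look up the real H5Fcreate that this wrapper shadows. */
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t) = NULL;
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                             dlsym(RTLD_NEXT, "H5Fcreate");

    /* Copy (or create) the file access property list and inject tuned values,
     * e.g. values chosen by the auto-tuner and read from a config file. */
    hid_t tuned_fapl = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                                : H5Pcopy(fapl_id);
    H5Pset_alignment(tuned_fapl, 1, 4194304);     /* illustrative value */
    H5Pset_sieve_buf_size(tuned_fapl, 4194304);   /* illustrative value */

    hid_t file = real_H5Fcreate(name, flags, fcpl_id, tuned_fapl);
    H5Pclose(tuned_fapl);
    return file;
}
```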
18
Dynamic Model-driven Auto-tuning
• Auto-tuning using empirical performance models of I/O
• Steps
  – Training phase to develop an I/O model
– Pruning phase to select the top-K configurations
– Exploration phase to select the best configuration
– Refitting step to refine performance model
[Figure: overview of dynamic model-driven I/O tuning. In the training phase, an I/O kernel is run with a training set on the HPC system and its storage system to develop an I/O model (model generation). In the pruning phase, all possible values are evaluated against the I/O model to select the top-k configurations. In the exploration phase, those configurations are run and the best-performing configuration is selected from the performance results. The model can then be refitted (controlled by the user) using the new results.]
19
Empirical Performance Model
• Non-linear regression model
• A linear combination of n_b non-linear, low-degree polynomial basis functions (ϕ_k), with hyper-parameters β (selected with a standard regression approach), for a parameter configuration x
• For example:
• f: file size; a: number of aggregators; c: stripe count; s: stripe size
m(x; β) = Σ_{k=1}^{n_b} β_k ϕ_k(x)

For VPIC-IO on Edison (Eq. (3) below): m(x) = β_1 + β_2 (1/s) + β_3 (1/a) + β_4 (c/s) + β_5 (f/c) + β_6 (f/s) + β_7 (c·f/a)

β̂ = [10.59, 68.99, 59.83, −1.23, 2.26, 0.18, 0.01]
[Figure 4: Summary of the best I/O performance (I/O bandwidth, GB/s) obtained in the top-20 configurations for each I/O benchmark (VPIC-IO, VORPAL-IO, GCRM-IO) on Edison, Hopper, and Stampede, compared with the default performance, running on (a) 4K cores, (b) 8K cores, and (c) 16K cores. Note that (a) and (b) are log-scale plots.]
formed our model on the basis of the remaining 216 configurations. As illustrated in Figure 3, the five-term model is as follows:

m(c, s, a) = β_1 + β_2 (c/a) + β_3 (s/a) + β_4 (1/a) + β_5 (a / (c·s)).   (2)
We observed this model reproducing the training data well. The accuracy is assessed by having acceptable statistical measures (e.g., an R-squared of 0.95 for VPIC on Stampede). Furthermore, even though the models we consider in this paper do not directly account for the variability, we observe that our realized predictions tend to yield more accurate predictions for those configurations where little variability is seen, as can be seen in Figure 2.

Thus far, we have only considered a single file size when building nonlinear regression models. This modeling approach reflects the typical workflow in automatic empirical performance tuning, where one wishes to determine parameter values for actionable decisions. One of the main benefits of our models is that their simple, parameterized, algebraic form allows us to very quickly solve optimization problems involving them.

B. Training the Performance Models

We now consider models for multiple different file sizes. The four independent variables, i.e., x = (c, s, a, f), form a total of 81 possible terms in the basis set.

We conducted experiments for all three I/O kernels mentioned in Section III-B and different file sizes on all three platforms, i.e., Hopper, Edison, and Stampede. The training set size on 512 cores was 336 configurations, on 1028 cores it was 180, and on 2048 cores it was 96. The size of the training set decreases as the core counts and file sizes increase, due to the increase in the required resources.

The selection of the training set can be automatic, with simple heuristics of limits on the allowable value ranges in order to cover the parameter space well. For example, the maximum number of aggregators is limited by the number of MPI processes of the application. Additionally, commands such as "lfs osts" obtain the number of OSTs available on a Lustre file system, which can be stored in a configuration file. Once the limits are known, to establish a training set one can use all discrete integer values as possible tunable parameter values. Another strategy is to use powers-of-two or halves-of-max-allowable values. An expert can set these values more judiciously. Since the training is done infrequently, this can be decided based on the training set exploration time budget.

Following the forward-selection approach on the entire training data set, as we defined in [1], we obtain one model for each application on each platform. Due to lack of space, we only provide the model for VPIC-IO on Edison:
m(x) = β_1 + β_2 (1/s) + β_3 (1/a) + β_4 (c/s) + β_5 (f/c) + β_6 (f/s) + β_7 (c·f/a),   (3)

with a fit to the data yielding

β̂ = [10.59, 68.99, 59.83, −1.23, 2.26, 0.18, 0.01].
The terms in (3) are interpretable from the parallel I/O point of view. For instance, the write time would have an inverse relationship with the number of aggregators and the stripe count, because as we increase those, the I/O performance tends to increase; it should have a linear relationship with the file size, as increasing the file size causes an increase in the write time. We describe the detailed validation of this model in Section V-B. Additionally, in the next section we will analyze in detail this model's ability to perform space reduction and optimization for a variety of I/O tuning tasks.

C. Refitting the Performance Models

After training the model for the search space pruning step, the process of choosing the top k configurations only involves evaluating the model, a task whose computational expense is negligible (relative to the evaluation of a configuration) for our simple choice of models. Therefore, using such an approach will only require an evaluation of a few configurations on the platform, decreasing the optimization time significantly. In our experiments, the top twenty configurations always resulted in high I/O bandwidth. As opposed to our existing GA-based approach, our approach does not
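To illustrate why the pruning step is cheap, here is a small sketch that evaluates the fitted VPIC-IO/Edison model in (3), using the coefficients above, over a handful of candidate configurations and keeps the best one. It assumes, following the interpretation above, that the model predicts write time (lower is better); the candidate values and the fixed file size are illustrative, and the actual pruning step ranks the full space and keeps the top-k configurations rather than a single best.

```c
/* Sketch: model-based pruning. Evaluate the fitted model of Eq. (3) over
 * candidate configurations and keep the best-ranked one (top-1 for brevity;
 * the real pruning keeps the top-k). Candidate values and file size are
 * illustrative. */
#include <stdio.h>

/* Eq. (3): m(x) = b1 + b2/s + b3/a + b4*c/s + b5*f/c + b6*f/s + b7*c*f/a,
 * with x = (c, s, a, f): stripe count, stripe size, aggregators, file size. */
static double model(double c, double s, double a, double f)
{
    const double b[7] = { 10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01 };
    return b[0] + b[1] / s + b[2] / a + b[3] * c / s
                + b[4] * f / c + b[5] * f / s + b[6] * c * f / a;
}

int main(void)
{
    const double f = 512.0;                                  /* file size (illustrative units) */
    const double counts[] = { 4, 8, 16, 32, 64, 128 };       /* stripe count candidates */
    const double sizes[]  = { 1, 2, 4, 8, 16, 32, 64, 128 }; /* stripe size candidates  */
    const double aggs[]   = { 16, 32, 64, 128, 256 };        /* aggregator candidates   */

    double best = 1e300, bc = 0, bs = 0, ba = 0;
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 8; j++)
            for (int k = 0; k < 5; k++) {
                double t = model(counts[i], sizes[j], aggs[k], f);
                if (t < best) {   /* assuming the model predicts write time */
                    best = t; bc = counts[i]; bs = sizes[j]; ba = aggs[k];
                }
            }
    printf("predicted best: stripe_count=%.0f stripe_size=%.0f aggregators=%.0f (m=%.2f)\n",
           bc, bs, ba, best);
    return 0;
}
```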
20
Performance Improvement: 4K cores
[Figure: best I/O bandwidth (GB/s) after tuning for VPIC-IO, VORPAL-IO, and GCRM-IO on Edison, Hopper, and Stampede at 4K cores, compared with the default VPIC-IO, VORPAL-IO, and GCRM-IO performance on Hopper.]
21
Performance Improvement: 8K cores
[Figure: best I/O bandwidth (GB/s) after tuning for VPIC-IO, VORPAL-IO, and GCRM-IO on Edison and Hopper at 8K cores, compared with the default VPIC-IO and VORPAL-IO performance on Hopper; the highlighted improvement is 94x.]
22
23
Autonomous data management
Storage Systems and I/O: Current status
24
[Figure: current usage: application data in memory passes through the I/O software stack (high-level libraries such as HDF5, I/O middleware such as POSIX and MPI-IO, I/O forwarding, and parallel file systems) and is stored as files in the file system on the storage hardware.]
• Challenges
  – Multi-level hierarchy complicates data movement, especially if the user has to be involved
  – POSIX-IO semantics hinder scalability and performance of file systems and I/O software
[Figure: tuning points across today's storage hierarchy (memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage on HPSS tape): tune the middleware, tune the file systems.]
HPC data management requirements
Use case (domain; simulation/EOD/analysis; data size): I/O requirements
FLASH (high-energy density physics; simulation; ~1 PB): data transformations, scalable I/O interfaces, correlation among simulation and experimental data
CMB / Planck (cosmology; simulation, EOD/analysis; 10 PB): automatic data movement optimizations
DECam & LSST (cosmology; EOD/analysis; ~10 TB): easy interfaces, data transformations
ACME (climate; simulation; ~10 PB): async I/O, derived variables, automatic data movement
TECA (climate; analysis; ~10 PB): data organization and efficient data movement
HipMer (genomics; EOD/analysis; ~100 TB): scalable I/O interfaces, efficient and automatic data movement
25
Easy interfaces and superior performance
Autonomous data management
Information capture and management
25
Storage Systems and I/O: Next Generation
[Figure: next-generation usage: applications access data in memory through a high-level API, and the data management software maps it onto the storage hardware.]
• Next-generation I/O software
  – Autonomous, proactive data management system beyond POSIX restrictions
    • Transparent data movement
    • Proactive analysis
  – Object-centric storage interface
    • Rich metadata
    • Data and metadata accessible through queries
  – Transparent data object placement and organization across storage hardware layers
26
[Figure: the storage hierarchy: memory, node-local storage, shared burst buffer, parallel file system, campaign storage, and archival storage (HPSS tape).]
What is an object store? Simple
POSIX file system interface: open, read, lseek, write, close, stat, chmod, unlink
Object store interface: get, put, delete
Slide from Glenn Lockwood
27
What is an object?
• Chunks of a file (current parallel file systems)
• Files, e.g., images and videos (cloud services such as S3)
• Arrays (HDF5, DAOS, etc.)
• Key-value pairs (OpenStack Swift, MarFS, Ceph, etc.)
• File + metadata
PDC Interpretation of Objects
Data + Metadata + Provenance + Analysis operations + Information (data products)
Proactive Data Containers (PDC)
Proactive Data Containers
Container Collection
PDC Locus
30
▪ Interface
  – Programming and client-level interfaces
▪ Services
  – Metadata management
  – Autonomous data movement
  – Analysis and transformation task execution
▪ PDC locus services
  – Object mapping
  – Local metadata management
  – Locus task execution
PDC System – High-level Architecture
31
Persistent Storage API: BB FS, Lustre, DAOS, …
PDC System – High-level Architecture
32
Data Management Using the PDC System
33
[Figure: application processes interacting with PDC system processes.]
• Storing data
  – Application declares persistent data objects → PDC creates metadata objects
  – Application adds 'tags' / properties to identify objects in the future → PDC adds these as metadata
  – Application processes map memory buffers to regions of objects
  – When data in objects are ready, the PDC system moves the data to storage and updates the metadata → asynchronous and autonomous
• Retrieving data
  – Application queries metadata to find desired objects ← PDC system returns handles to the desired objects
  – Application maps to a region of the object or gives a query condition ← PDC system brings the desired data to memory
• Create & Open objects
  – Create sets object properties (metadata): name, lifetime, user info, provenance, tags, dimensions, data type, transformations, etc.
• Create an object region
  – Similar to HDF5 hyperslab selections
• Map / Unmap an object region
  – Object region <=> memory region
• Lock / Unlock a mapped region
  – Read / write locks
  – Transparently update the memory buffer / object, asynchronously
  – Transforms occur "outside" of lock time, managed by the PDC system
• Close & Release (delete) objects
An illustrative sketch of this create/map/lock workflow follows below.
PDC API – Object Manipulation
34
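The slides do not show the PDC function signatures, so the block below uses hypothetical pdc_* names purely to illustrate the create, region-map, lock, write, unlock, close sequence described above; it is not the real PDC API (see sdm.lbl.gov/pdc for that).

```c
/* Illustrative pseudocode of the PDC object-manipulation workflow. Every
 * pdc_* name and type below is a hypothetical placeholder, not the actual
 * PDC API. */
#include <stddef.h>

typedef struct pdc_obj    pdc_obj;      /* hypothetical handle types */
typedef struct pdc_region pdc_region;

/* hypothetical calls mirroring the steps on the slide */
pdc_obj    *pdc_obj_create(const char *name, const char *tags,
                           int ndim, const size_t *dims);
pdc_region *pdc_region_create(pdc_obj *obj, const size_t *offset,
                              const size_t *count);
int         pdc_region_map(pdc_region *reg, void *buf);    /* region <=> memory  */
int         pdc_region_lock(pdc_region *reg, int write);   /* read/write lock    */
int         pdc_region_unlock(pdc_region *reg);            /* async flush starts */
int         pdc_region_unmap(pdc_region *reg);
int         pdc_obj_close(pdc_obj *obj);

void write_particles(float *x, size_t n)
{
    size_t dims[1]   = { n };
    size_t offset[1] = { 0 };

    /* create the persistent object with its metadata (name, tags, dims) */
    pdc_obj *obj = pdc_obj_create("particles/x", "run=42,timestep=0", 1, dims);

    /* select an object region (similar to an HDF5 hyperslab) and map the
     * in-memory buffer onto it */
    pdc_region *reg = pdc_region_create(obj, offset, dims);
    pdc_region_map(reg, x);

    /* write-lock, update the buffer, unlock; after the unlock the PDC system
     * moves the data to storage asynchronously and updates the metadata */
    pdc_region_lock(reg, 1);
    /* ... fill or update x here ... */
    pdc_region_unlock(reg);

    pdc_region_unmap(reg);
    pdc_obj_close(obj);
}
```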
PDC API – I/O
35
PDC operation
• Create query with conditions
  – Sets up the query execution; will invoke a query optimization framework in the future
  – Allows application developers to search for named objects, as well as objects with particular characteristics
• Execute query
  – Query execution can occur at multiple tiers and can execute locally on sharded / striped objects
• Iterate_start / Iterate_next
  – Iterate over objects from query results, as well as generic actions
• Get_object_handle / Get_object_info
  – Retrieve metadata for an object
PDC API – Object Access
40
Metadata Object Management
Capabilities
● Create, update, search, and delete metadata objects.
● All tags are searchable.
● Maintain extended attributes and object relationships.
A collection of tags (key-value pairs)
PDC Namespace Management
PDC Metadata Management
Metadata Search
● Exact match search
  ○ Similar to stat.
  ○ Requires all ID attributes.
  ○ Retrieves a single metadata object, directly from one target server.
● Partial match search
  ○ Similar to find or grep.
  ○ Any tag can be specified.
  ○ Retrieves multiple metadata objects; needs to scan all servers.
    ■ Done in parallel.
    ■ Indexing is being implemented.
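One way to see why exact-match lookups touch a single server while partial-match searches must fan out: if metadata objects are distributed by hashing their full identifying attributes, only a complete identifier determines a home server. The sketch below illustrates that idea with a simple string hash; the hash function, server count, and naming scheme are illustrative and not necessarily how SoMeta/PDC distributes metadata.

```c
/* Sketch: exact-match vs. partial-match metadata lookup. The hash, server
 * count, and object naming below are illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define NUM_SERVERS 128

/* simple 64-bit FNV-1a string hash (illustrative choice) */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
    return h;
}

int main(void)
{
    /* Exact match: the full identifying attributes determine one home server,
     * so the client contacts only that server. */
    const char *full_id = "particles/x@timestep=42";
    printf("exact match  -> server %llu only\n",
           (unsigned long long)(hash_name(full_id) % NUM_SERVERS));

    /* Partial match (e.g. "all objects tagged run=42"): a tag alone does not
     * determine a home server, so all servers are queried, in parallel. */
    printf("partial match -> query all %d servers in parallel\n", NUM_SERVERS);
    return 0;
}
```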
Performance: Metadata Creation
SoMeta 1: all objects have same name but different values in other ID attributes (timestep). SoMeta 4: four unique names are used and each name is used by a quarter of metadata objects. The objects with an identical name have different ID attributes. SoMeta Unique: each metadata object has a unique name.
Performance of scaling SoMeta by creating 10000 to 100 million metadata objects with 512 servers and 2560 clients on Cori.
Performance: Metadata Search
Exact match search. Partial match search.
Searching up to 20% of 1 million objects takes only a fraction of a second with 128 servers. Network transfer time dominates the total time; exact match search requires many more small network transfers.
Searching for BOSS objects
Total elapsed time to group objects by adding tags (SoMeta), attributes (SciDB), or symlinks (Lustre) with different selectivity.
Total elapsed time for searching and retrieving the metadata of previously assigned tags/attributes with different selectivity.
SoMeta is 10 to 90X faster for metadata grouping (tagging), and 2 to 16X faster in searching attributes (tags) than SciDB and MongoDB, up to 800X faster with 80 clients searching in parallel.
PDC System - High-level Architecture
Persistent Storage API: BB FS, Lustre, DAOS, …
Asynchronous I/O
● PDC supports asynchronous I/O through its client-server architecture
  ○ Client sends an I/O request
  ○ Server confirms receipt of the request
  ○ Client continues to the next computation
Write
Read
VPIC-IO (Weak Scaling) Multi-timestep Write
Total time to write 5 timesteps from the VPIC-IO kernel to Lustre and the burst buffer on Cori. PDC is 5x faster than HDF5 and 23x faster than PLFS.
BD-CATS-IO (Weak scaling) Multi-timestep Read
Total time to read 5 timesteps of data with the BD-CATS-IO kernel from Lustre and from the burst buffer. PDC is 11x faster than PLFS and HDF5.
Conclusions
Easy interfaces and superior performance
Autonomous data management
Information capture and management
54
• Simpler object interface
  • Applications produce data objects and declare them to be kept persistent
  • Applications request the desired data
• Asynchronous and autonomous data movement
  • Bring interesting data to applications
• Manage rich metadata and enhance search capabilities
• Perform analysis and transformations in the data path
▪ Contact: • Suren Byna (sdm.lbl.gov/~sbyna/) [[email protected]]
▪ Contributions to this presentation • ExaHDF5 project team (sdm.lbl.gov/exahdf5) • Proactive Data Containers (PDC) team (sdm.lbl.gov/pdc) • SDM group: sdm.lbl.gov
55