Efficient Scientific Data Management on Supercomputers
Suren Byna
Scientific Data Management Group, LBNL
▪ Simulations
▪ Experiments
▪ Observations
2
Scientific Data - Where is it coming from?
3
Life of scientific data
Generation
In situ analysis
Processing
Storage
Analysis
Preservation
Sharing
Refinement
4
Supercomputing systems
5
Typical supercomputer architecture
Cori system
[Figure: Cori system architecture. Compute nodes (CN) on the Aries high-speed network; burst buffer blades (each blade = 2x burst buffer nodes, each with 2x SSD); I/O nodes (2x InfiniBand HCA) connecting to the InfiniBand storage fabric; storage servers (Lustre OSSs/OSTs).]
▪ Data representation – Metadata, data structures, data models
▪ Data storage – Storing and retrieving data and metadata to and from file systems quickly
▪ Data access – Improving the performance of the data accesses that scientists need
▪ Facilitating analysis – Strategies for helping scientists find meaning in the data
▪ Data transfers – Transferring data within a supercomputing system and between different systems
6
Scientific Data Management in supercomputers
▪ Storing and retrieving data
  – Parallel I/O
  – Software stack
  – Modes of parallel I/O
  – Tuning parallel I/O performance
▪ Autonomous data management system
  – Proactive Data Containers (PDC) system
  – Metadata management service
  – Data management service
8
Focus of this presentation
Trends – Storage system transformation
9
Conventional: Memory → (I/O gap) → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
Current (e.g., Cori @ NERSC): Memory → (I/O gap) → Shared burst buffer → Parallel file system (Lustre, GPFS) → Archival storage (HPSS tape)
Upcoming (e.g., Aurora @ ALCF): Memory → Node-local storage → Shared burst buffer → Parallel file system → Campaign storage → Archival storage (HPSS tape)
• The I/O performance gap in HPC storage is a significant bottleneck because of slow disk-based storage
• SSDs and new memory technologies are filling the gap, but they increase the depth of the storage hierarchy
Applications
High Level I/O Libraries
I/O Middleware
I/O Forwarding
Parallel File System
I/O Hardware
10
Contemporary Parallel I/O Software Stack
Applications
High Level I/O Library (HDF5, NetCDF, ADIOS)
I/O Middleware (MPI-IO)
I/O Forwarding
Parallel File System (Lustre, GPFS,..)
I/O Hardware
11
Parallel I/O software stack
§ I/O Libraries
  – HDF5 (The HDF Group) [LBL, ANL]
  – ADIOS (ORNL)
  – PnetCDF (Northwestern, ANL)
  – NetCDF-4 (UCAR)
• Middleware
  – POSIX-IO, MPI-IO (ANL)
• I/O Forwarding
• File systems: Lustre (Intel), GPFS (IBM), DataWarp (Cray), …
§ I/O Hardware (disk-based, SSD-based, …)
▪ Types of parallel I/O
  • 1 writer/reader, 1 file
  • N writers/readers, N files (file-per-process)
  • N writers/readers, 1 file
  • M writers/readers, 1 file
    – Aggregators
    – Two-phase I/O
  • M aggregators, M files (file-per-aggregator)
    – Variations of this mode
12
Parallel I/O – Application view
[Figure: process-to-file mappings for these modes: 1 writer/reader to 1 file; n writers/readers to n files (file.0 … file.n); n writers/readers to 1 shared file; M writers/readers to M files; M writers/readers to 1 shared file.]
A minimal MPI-IO sketch of the shared-file mode follows below.
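To make the shared-file mode concrete, here is a minimal sketch of the "n writers/readers, 1 file" pattern using MPI-IO collective writes. The file name, element count, and the assumption that every rank writes an equal-sized contiguous slice are illustrative; a production code would also check return values.

```c
/* Sketch: n writers, 1 shared file, using MPI-IO collective writes.
 * File name and per-rank element count are illustrative placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_local = 1 << 20;                 /* elements per rank */
    double *buf = malloc(n_local * sizeof(double));
    for (int i = 0; i < n_local; i++)
        buf[i] = rank + i * 1e-6;                /* fill with some data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its contiguous slice at a rank-dependent offset.
     * The *_all variant makes the write collective, enabling two-phase I/O
     * (aggregation) inside the MPI-IO layer. */
    MPI_Offset offset = (MPI_Offset)rank * n_local * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n_local, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```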
▪ Parallel file systems
  – Lustre and Spectrum Scale (GPFS)
▪ Typical building blocks of parallel file systems
  – Storage hardware: HDD or SSD RAID
  – Storage servers (in Lustre, Object Storage Servers [OSSs] and Object Storage Targets [OSTs])
  – Metadata servers
  – Client-side processes and interfaces
▪ Management
  – Stripe files for parallelism
  – Tolerate failures
13
Parallel I/O – System view
[Figure: logical view of a file versus its physical view on a parallel file system: the file is striped across OST 0, OST 1, OST 2, and OST 3, accessed over the communication network.]
A small sketch of the striping arithmetic follows below.
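Striping is what lets many storage servers serve one file in parallel. As an illustration of the logical-to-physical mapping in the figure above, the sketch below computes which stripe object a logical file offset falls into under a simple round-robin layout; the stripe size and count are illustrative, and real Lustre layouts can be more elaborate.

```c
/* Sketch: map logical file offsets to stripe objects under simple
 * round-robin striping. Stripe size and count are illustrative. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t stripe_size  = 4ULL << 20;   /* 4 MiB stripes            */
    const unsigned stripe_count = 4;            /* file striped over 4 OSTs */

    const uint64_t offsets[] = { 0, 3ULL << 20, 5ULL << 20, 42ULL << 20 };
    for (int i = 0; i < 4; i++) {
        uint64_t off       = offsets[i];
        uint64_t stripe_no = off / stripe_size;         /* which stripe            */
        unsigned ost_index = stripe_no % stripe_count;  /* which OST (round-robin) */
        uint64_t in_stripe = off % stripe_size;         /* offset within stripe    */
        printf("offset %12llu -> stripe %llu, OST %u, intra-stripe offset %llu\n",
               (unsigned long long)off, (unsigned long long)stripe_no,
               ost_index, (unsigned long long)in_stripe);
    }
    return 0;
}
```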
14
How to achieve peak parallel I/O performance?
Application
HDF5 (alignment, chunking, etc.)
MPI-IO (enabling collective buffering, sieve buffer size, collective buffer size, collective buffer nodes, etc.)
Parallel file system (number of I/O nodes, stripe size, enabling prefetching buffer, etc.)
Storage hardware
▪ Parallel I/O software stack provides options for performance optimization
▪ Challenge: Complex inter-dependencies among SW and HW layers
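As a hedged, concrete example of how these knobs are typically exposed to an application, the sketch below passes Lustre striping and collective-buffering hints to MPI-IO through HDF5's MPI-IO file access property list, and sets HDF5 alignment and sieve-buffer sizes. The hint names follow common ROMIO/Cray conventions, the values are illustrative, and which hints are honored depends on the MPI-IO implementation and the file system.

```c
/* Sketch: pushing tuning parameters down the parallel I/O stack from an
 * HDF5 application. Hint names and values are illustrative; support is
 * implementation- and file-system-dependent. Requires parallel HDF5. */
#include <hdf5.h>
#include <mpi.h>

hid_t open_tuned_file(const char *name, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* Lustre striping and collective buffering hints (ROMIO-style names) */
    MPI_Info_set(info, "striping_factor", "64");      /* stripe count         */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* stripe size in bytes */
    MPI_Info_set(info, "romio_cb_write",  "enable");  /* collective buffering */
    MPI_Info_set(info, "cb_nodes",        "32");      /* aggregator nodes     */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);      /* MPI-IO driver + hints       */
    H5Pset_alignment(fapl, 1, 4194304);      /* align HDF5 objects to 4 MiB */
    H5Pset_sieve_buf_size(fapl, 4194304);    /* data sieving buffer size    */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}
```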
Tuning parameter space
[Figure: the whole tuning space visualized: stripe_count (4 to 128), stripe_size (1 to 128 MB), cb_nodes (1 to 128), cb_buffer_size (1 to 128 MB), alignment (524288 or 1048576), sieve buffer size (64 KB to 1 MB); on the order of 23,040 combinations in total.]
15
▪ Simulation of magnetic reconnection (a space weather phenomenon) with the VPIC code
  – 120,000 cores
  – 8 arrays (HDF5 datasets)
  – 32 TB to 42 TB files at 10 time steps
▪ Extracted I/O kernel
▪ M aggregators to 1 shared file
▪ Trial-and-error selection of Lustre file system parameters while scaling the problem size
▪ Reached peak performance in many instances in a real simulation
16
Tuning for writing trillion particle datasets
More details: SC12 and CUG 2013 papers
Tuning combinations are abundant
• Searching through all combinations manually is impractical
• Users, typically domain scientists, should not be burdened with tuning
• Performance auto-tuning has been explored heavily for optimizing matrix operations
• Auto-tuning for parallel I/O is challenging due to shared I/O subsystem and slow I/O
• Need a strategy to reduce the search space using some knowledge of the system
17
Our solution: I/O Auto-tuning
• Auto-tuning framework to search the parameter space with a reduced number of combinations
• HDF5 I/O library sets the optimization parameters
• H5Tuner: dynamic interception of HDF5 calls (a sketch of the general interception technique follows below)
• H5Evolve:
  – Genetic algorithm based selection
  – Model-based selection
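The slides describe H5Tuner's interception only at a high level, so the block below is merely a sketch of the general LD_PRELOAD/dlsym interposition technique such a shim can use: the wrapper shadows H5Fcreate, injects tuned file-access properties, and forwards to the real function, all without modifying the application. It is not the actual H5Tuner source, and the injected values are illustrative.

```c
/* Sketch of dynamic interception of an HDF5 call. Build as a shared library
 * and load it with LD_PRELOAD so this wrapper is resolved before the real
 * H5Fcreate. Not the actual H5Tuner code; injected values are illustrative. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <hdf5.h>

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Look up the real H5Fcreate that this wrapper shadows. */
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t) = NULL;
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                             dlsym(RTLD_NEXT, "H5Fcreate");

    /* Copy (or create) the file access property list and inject tuned values,
     * e.g. values chosen by the auto-tuner and read from a config file. */
    hid_t tuned_fapl = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                                : H5Pcopy(fapl_id);
    H5Pset_alignment(tuned_fapl, 1, 4194304);     /* illustrative value */
    H5Pset_sieve_buf_size(tuned_fapl, 4194304);   /* illustrative value */

    hid_t file = real_H5Fcreate(name, flags, fcpl_id, tuned_fapl);
    H5Pclose(tuned_fapl);
    return file;
}
```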
18
Dynamic Model-driven Auto-tuning
• Auto-tuning using empirical performance models of I/O
• Steps
  – Training phase to develop an I/O model
– Pruning phase to select the top-K configurations
– Exploration phase to select the best configuration
– Refitting step to refine performance model
[Figure: overview of dynamic model-driven I/O tuning. In the training phase, an I/O kernel is run with a training set on the HPC system and its storage system to develop an I/O model (model generation). In the pruning phase, all possible values are evaluated against the I/O model to select the top-k configurations. In the exploration phase, those configurations are run and the best-performing configuration is selected from the performance results. The model can then be refitted (controlled by the user) using the new results.]
19
Empirical Performance Model
• Non-linear regression model
• A linear combination of n_b non-linear, low-degree polynomial basis functions (ϕ_k), with hyper-parameters β (selected with a standard regression approach), for a parameter configuration x
• For example:
• f: file size; a: number of aggregators; c: stripe count; s: stripe size
m(x; β) = Σ_{k=1}^{n_b} β_k ϕ_k(x)

For VPIC-IO on Edison (Eq. (3) below): m(x) = β_1 + β_2 (1/s) + β_3 (1/a) + β_4 (c/s) + β_5 (f/c) + β_6 (f/s) + β_7 (c·f/a)

β̂ = [10.59, 68.99, 59.83, −1.23, 2.26, 0.18, 0.01]
[Figure 4: Summary of the best I/O performance (I/O bandwidth, GB/s) obtained in the top-20 configurations for each I/O benchmark (VPIC-IO, VORPAL-IO, GCRM-IO) on Edison, Hopper, and Stampede, compared with the default performance, running on (a) 4K cores, (b) 8K cores, and (c) 16K cores. Note that (a) and (b) are log-scale plots.]
formed our model on the basis of the remaining 216 configurations. As illustrated in Figure 3, the five-term model is as follows:

m(c, s, a) = β_1 + β_2 (c/a) + β_3 (s/a) + β_4 (1/a) + β_5 (a / (c·s)).   (2)
We observed this model reproducing the training data well. The accuracy is assessed by having acceptable statistical measures (e.g., an R-squared of 0.95 for VPIC on Stampede). Furthermore, even though the models we consider in this paper do not directly account for the variability, we observe that our realized predictions tend to yield more accurate predictions for those configurations where little variability is seen, as can be seen in Figure 2.

Thus far, we have only considered a single file size when building nonlinear regression models. This modeling approach reflects the typical workflow in automatic empirical performance tuning, where one wishes to determine parameter values for actionable decisions. One of the main benefits of our models is that their simple, parameterized, algebraic form allows us to very quickly solve optimization problems involving them.

B. Training the Performance Models

We now consider models for multiple different file sizes. The four independent variables, i.e., x = (c, s, a, f), form a total of 81 possible terms in the basis set.

We conducted experiments for all three I/O kernels mentioned in Section III-B and different file sizes on all three platforms, i.e., Hopper, Edison, and Stampede. The training set size on 512 cores was 336 configurations, on 1028 cores it was 180, and on 2048 cores it was 96. The size of the training set decreases as the core counts and file sizes increase, due to the increase in the required resources.

The selection of the training set can be automatic, with simple heuristics of limits on the allowable value ranges in order to cover the parameter space well. For example, the maximum number of aggregators is limited by the number of MPI processes of the application. Additionally, commands such as "lfs osts" obtain the number of OSTs available on a Lustre file system, which can be stored in a configuration file. Once the limits are known, to establish a training set one can use all discrete integer values as possible tunable parameter values. Another strategy is to use powers-of-two or halves-of-max-allowable values. An expert can set these values more judiciously. Since the training is done infrequently, this can be decided based on the training set exploration time budget.

Following the forward-selection approach on the entire training data set, as we defined in [1], we obtain one model for each application on each platform. Due to lack of space, we only provide the model for VPIC-IO on Edison:
m(x) = β_1 + β_2 (1/s) + β_3 (1/a) + β_4 (c/s) + β_5 (f/c) + β_6 (f/s) + β_7 (c·f/a),   (3)

with a fit to the data yielding

β̂ = [10.59, 68.99, 59.83, −1.23, 2.26, 0.18, 0.01].
The terms in (3) are interpretable from the parallel I/O point of view. For instance, the write time would have an inverse relationship with the number of aggregators and the stripe count, because as we increase those, the I/O performance tends to increase; it should have a linear relationship with the file size, as increasing the file size causes an increase in the write time. We describe the detailed validation of this model in Section V-B. Additionally, in the next section we will analyze in detail this model's ability to perform space reduction and optimization for a variety of I/O tuning tasks.

C. Refitting the Performance Models

After training the model for the search space pruning step, the process of choosing the top k configurations only involves evaluating the model, a task whose computational expense is negligible (relative to the evaluation of a configuration) for our simple choice of models. Therefore, using such an approach will only require an evaluation of a few configurations on the platform, decreasing the optimization time significantly. In our experiments, the top twenty configurations always resulted in high I/O bandwidth. As opposed to our existing GA-based approach, our approach does not
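To illustrate why the pruning step is cheap, here is a small sketch that evaluates the fitted VPIC-IO/Edison model in (3), using the coefficients above, over a handful of candidate configurations and keeps the best one. It assumes, following the interpretation above, that the model predicts write time (lower is better); the candidate values and the fixed file size are illustrative, and the actual pruning step ranks the full space and keeps the top-k configurations rather than a single best.

```c
/* Sketch: model-based pruning. Evaluate the fitted model of Eq. (3) over
 * candidate configurations and keep the best-ranked one (top-1 for brevity;
 * the real pruning keeps the top-k). Candidate values and file size are
 * illustrative. */
#include <stdio.h>

/* Eq. (3): m(x) = b1 + b2/s + b3/a + b4*c/s + b5*f/c + b6*f/s + b7*c*f/a,
 * with x = (c, s, a, f): stripe count, stripe size, aggregators, file size. */
static double model(double c, double s, double a, double f)
{
    const double b[7] = { 10.59, 68.99, 59.83, -1.23, 2.26, 0.18, 0.01 };
    return b[0] + b[1] / s + b[2] / a + b[3] * c / s
                + b[4] * f / c + b[5] * f / s + b[6] * c * f / a;
}

int main(void)
{
    const double f = 512.0;                                  /* file size (illustrative units) */
    const double counts[] = { 4, 8, 16, 32, 64, 128 };       /* stripe count candidates */
    const double sizes[]  = { 1, 2, 4, 8, 16, 32, 64, 128 }; /* stripe size candidates  */
    const double aggs[]   = { 16, 32, 64, 128, 256 };        /* aggregator candidates   */

    double best = 1e300, bc = 0, bs = 0, ba = 0;
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 8; j++)
            for (int k = 0; k < 5; k++) {
                double t = model(counts[i], sizes[j], aggs[k], f);
                if (t < best) {   /* assuming the model predicts write time */
                    best = t; bc = counts[i]; bs = sizes[j]; ba = aggs[k];
                }
            }
    printf("predicted best: stripe_count=%.0f stripe_size=%.0f aggregators=%.0f (m=%.2f)\n",
           bc, bs, ba, best);
    return 0;
}
```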
20
Performance Improvement: 4K cores
[Figure: best I/O bandwidth (GB/s) after tuning for VPIC-IO, VORPAL-IO, and GCRM-IO on Edison, Hopper, and Stampede at 4K cores, compared with the default VPIC-IO, VORPAL-IO, and GCRM-IO performance on Hopper.]
21
Performance Improvement: 8K cores
[Figure: best I/O bandwidth (GB/s) after tuning for VPIC-IO, VORPAL-IO, and GCRM-IO on Edison and Hopper at 8K cores, compared with the default VPIC-IO and VORPAL-IO performance on Hopper; the highlighted improvement is 94x.]
22
23
Autonomous data management
Storage Systems and I/O: Current status
24
[Figure: current usage: application data in memory passes through the I/O software stack (high-level libraries such as HDF5, I/O middleware such as POSIX and MPI-IO, I/O forwarding, and parallel file systems) and is stored as files in the file system on the storage hardware.]
• Challenges
  – Multi-level hierarchy complicates data movement, especially if the user has to be involved
  – POSIX-IO semantics hinder scalability and performance of file systems and I/O software
[Figure: tuning points across today's storage hierarchy (memory, node-local storage, shared burst buffer, parallel file system, campaign storage, archival storage on HPSS tape): tune the middleware, tune the file systems.]
HPC data management requirements
Use case (domain; simulation/EOD/analysis; data size): I/O requirements
FLASH (high-energy density physics; simulation; ~1 PB): data transformations, scalable I/O interfaces, correlation among simulation and experimental data
CMB / Planck (cosmology; simulation, EOD/analysis; 10 PB): automatic data movement optimizations
DECam & LSST (cosmology; EOD/analysis; ~10 TB): easy interfaces, data transformations
ACME (climate; simulation; ~10 PB): async I/O, derived variables, automatic data movement
TECA (climate; analysis; ~10 PB): data organization and efficient data movement
HipMer (genomics; EOD/analysis; ~100 TB): scalable I/O interfaces, efficient and automatic data movement
25
Easy interfaces and superior performance
Autonomous data management
Information capture and management
25
Storage Systems and I/O: Next Generation
[Figure: next-generation usage: applications access data in memory through a high-level API, and the data management software maps it onto the storage hardware.]
• Next-generation I/O software
  – Autonomous, proactive data management system beyond POSIX restrictions
    • Transparent data movement
    • Proactive analysis
  – Object-centric storage interface
    • Rich metadata
    • Data and metadata accessible through queries
  – Transparent data object placement and organization across storage hardware layers
26
[Figure: the storage hierarchy: memory, node-local storage, shared burst buffer, parallel file system, campaign storage, and archival storage (HPSS tape).]
What is an object store? Simple
POSIX file system interface: open, read, lseek, write, close, stat, chmod, unlink
Object store interface: get, put, delete
Slide from Glenn Lockwood
27
What is an object?
• Chunks of a file (current parallel file systems)
• Files, e.g., images and videos (cloud services such as S3)
• Arrays (HDF5, DAOS, etc.)
• Key-value pairs (OpenStack Swift, MarFS, Ceph, etc.)
• File + metadata
PDC Interpretation of Objects
Data + Metadata + Provenance + Analysis operations + Information (data products)
Proactive Data Containers (PDC)
Proactive Data Containers
Container Collection
PDC Locus
30
▪ Interface
  – Programming and client-level interfaces
▪ Services
  – Metadata management
  – Autonomous data movement
  – Analysis and transformation task execution
▪ PDC locus services
  – Object mapping
  – Local metadata management
  – Locus task execution
PDC System – High-level Architecture
31
Persistent Storage API: BB FS, Lustre, DAOS, …
PDC System – High-level Architecture
32
Data Management Using the PDC System
33
[Figure: application processes interacting with PDC system processes.]
• Storing data
  – Application declares persistent data objects → PDC creates metadata objects
  – Application adds 'tags' / properties to identify objects in the future → PDC adds these as metadata
  – Application processes map memory buffers to regions of objects
  – When data in objects are ready, the PDC system moves the data to storage and updates the metadata → asynchronous and autonomous
• Retrieving data
  – Application queries metadata to find desired objects ← PDC system returns handles to the desired objects
  – Application maps to a region of the object or gives a query condition ← PDC system brings the desired data to memory
• Create & Open objects
  – Create sets object properties (metadata): name, lifetime, user info, provenance, tags, dimensions, data type, transformations, etc.
• Create an object region
  – Similar to HDF5 hyperslab selections
• Map / Unmap an object region
  – Object region <=> memory region
• Lock / Unlock a mapped region
  – Read / write locks
  – Transparently update the memory buffer / object, asynchronously
  – Transforms occur "outside" of lock time, managed by the PDC system
• Close & Release (delete) objects
An illustrative sketch of this create/map/lock workflow follows below.
PDC API – Object Manipulation
34
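The slides do not show the PDC function signatures, so the block below uses hypothetical pdc_* names purely to illustrate the create, region-map, lock, write, unlock, close sequence described above; it is not the real PDC API (see sdm.lbl.gov/pdc for that).

```c
/* Illustrative pseudocode of the PDC object-manipulation workflow. Every
 * pdc_* name and type below is a hypothetical placeholder, not the actual
 * PDC API. */
#include <stddef.h>

typedef struct pdc_obj    pdc_obj;      /* hypothetical handle types */
typedef struct pdc_region pdc_region;

/* hypothetical calls mirroring the steps on the slide */
pdc_obj    *pdc_obj_create(const char *name, const char *tags,
                           int ndim, const size_t *dims);
pdc_region *pdc_region_create(pdc_obj *obj, const size_t *offset,
                              const size_t *count);
int         pdc_region_map(pdc_region *reg, void *buf);    /* region <=> memory  */
int         pdc_region_lock(pdc_region *reg, int write);   /* read/write lock    */
int         pdc_region_unlock(pdc_region *reg);            /* async flush starts */
int         pdc_region_unmap(pdc_region *reg);
int         pdc_obj_close(pdc_obj *obj);

void write_particles(float *x, size_t n)
{
    size_t dims[1]   = { n };
    size_t offset[1] = { 0 };

    /* create the persistent object with its metadata (name, tags, dims) */
    pdc_obj *obj = pdc_obj_create("particles/x", "run=42,timestep=0", 1, dims);

    /* select an object region (similar to an HDF5 hyperslab) and map the
     * in-memory buffer onto it */
    pdc_region *reg = pdc_region_create(obj, offset, dims);
    pdc_region_map(reg, x);

    /* write-lock, update the buffer, unlock; after the unlock the PDC system
     * moves the data to storage asynchronously and updates the metadata */
    pdc_region_lock(reg, 1);
    /* ... fill or update x here ... */
    pdc_region_unlock(reg);

    pdc_region_unmap(reg);
    pdc_obj_close(obj);
}
```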
PDC API – I/O
35
PDC operation
• Create query with conditions
  – Sets up the query execution; will invoke a query optimization framework in the future
  – Allows application developers to search for named objects, as well as objects with particular characteristics
• Execute query
  – Query execution can occur at multiple tiers and can execute locally on sharded / striped objects
• Iterate_start / Iterate_next
  – Iterate over objects from query results, as well as generic actions
• Get_object_handle / Get_object_info
  – Retrieve metadata for an object
PDC API – Object Access
40
Metadata Object Management
Capabilities
● Create, update, search, and delete metadata objects.
● All tags are searchable.
● Maintain extended attributes and object relationships.
A collection of tags (key-value pairs)
PDC Namespace Management
PDC Metadata Management
Metadata Search
● Exact match search
  ○ Similar to stat.
  ○ Requires all ID attributes.
  ○ Retrieves a single metadata object, directly from one target server.
● Partial match search
  ○ Similar to find or grep.
  ○ Any tag can be specified.
  ○ Retrieves multiple metadata objects; needs to scan all servers.
    ■ Done in parallel.
    ■ Indexing is being implemented.
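One way to see why exact-match lookups touch a single server while partial-match searches must fan out: if metadata objects are distributed by hashing their full identifying attributes, only a complete identifier determines a home server. The sketch below illustrates that idea with a simple string hash; the hash function, server count, and naming scheme are illustrative and not necessarily how SoMeta/PDC distributes metadata.

```c
/* Sketch: exact-match vs. partial-match metadata lookup. The hash, server
 * count, and object naming below are illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define NUM_SERVERS 128

/* simple 64-bit FNV-1a string hash (illustrative choice) */
static uint64_t hash_name(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 0x100000001b3ULL; }
    return h;
}

int main(void)
{
    /* Exact match: the full identifying attributes determine one home server,
     * so the client contacts only that server. */
    const char *full_id = "particles/x@timestep=42";
    printf("exact match  -> server %llu only\n",
           (unsigned long long)(hash_name(full_id) % NUM_SERVERS));

    /* Partial match (e.g. "all objects tagged run=42"): a tag alone does not
     * determine a home server, so all servers are queried, in parallel. */
    printf("partial match -> query all %d servers in parallel\n", NUM_SERVERS);
    return 0;
}
```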
Performance: Metadata Creation
SoMeta 1: all objects have same name but different values in other ID attributes (timestep). SoMeta 4: four unique names are used and each name is used by a quarter of metadata objects. The objects with an identical name have different ID attributes. SoMeta Unique: each metadata object has a unique name.
Performance of scaling SoMeta by creating 10000 to 100 million metadata objects with 512 servers and 2560 clients on Cori.
Performance: Metadata Search
Exact match search. Partial match search.
Searching up to 20% of 1 million objects takes only a fraction of a second with 128 servers. Network transfer time dominates the total time; exact match search requires many more small network transfers.
Searching for BOSS objects
Total elapsed time to group objects by adding tags (SoMeta), attributes (SciDB), or symlinks (Lustre) with different selectivity.
Total elapsed time for searching and retrieving the metadata of previously assigned tags/attributes with different selectivity.
SoMeta is 10 to 90X faster for metadata grouping (tagging), and 2 to 16X faster in searching attributes (tags) than SciDB and MongoDB, up to 800X faster with 80 clients searching in parallel.
PDC System - High-level Architecture
Persistent Storage API: BB FS, Lustre, DAOS, …
Asynchronous I/O
● PDC supports asynchronous I/O through its client-server architecture
  ○ Client sends an I/O request
  ○ Server confirms receipt of the request
  ○ Client continues to the next computation
Write
Read
VPIC-IO (Weak Scaling) Multi-timestep Write
Total time to write 5 timesteps from the VPIC-IO kernel to Lustre and the burst buffer on Cori. PDC is 5x faster than HDF5 and 23x faster than PLFS.
BD-CATS-IO (Weak scaling) Multi-timestep Read
Total time to read 5 timesteps of data with the BD-CATS-IO kernel from Lustre and from the burst buffer. PDC is 11x faster than PLFS and HDF5.
Conclusions
Easy interfaces and superior performance
Autonomous data management
Information capture and management
54
• Simpler object interface
  • Applications produce data objects and declare them to be kept persistent
  • Applications request the desired data
• Asynchronous and autonomous data movement
  • Bring interesting data to applications
• Manage rich metadata and enhance search capabilities
• Perform analysis and transformations in the data path
▪ Contact: • Suren Byna (sdm.lbl.gov/~sbyna/) [[email protected]]
▪ Contributions to this presentation • ExaHDF5 project team (sdm.lbl.gov/exahdf5) • Proactive Data Containers (PDC) team (sdm.lbl.gov/pdc) • SDM group: sdm.lbl.gov
55