Copyright © Intel Corporation 2016. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Gabriele Paciucci
HPC Solution Architect
Legal Information
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation
Market trends: The Evolving Hierarchy of Storage
[Diagram: the storage hierarchy today]
• Hot Tier: NAND
• Warm Tier: HDDs
• Cold Tier: HDDs + TAPEs
Market trends: The Evolving Hierarchy of Storage
[Diagram: the emerging storage hierarchy]
• Memory Domain / Hot Tier: 3D XPoint — driving the transition to the all-flash data center
• Warm Tier: 3D NAND — 3D NAND economics drive the warm tier to transition from HDDs to flash
• Cold Tier: HDDs (+ TAPEs?)
Market trends: The Evolving Hierarchy of Storage — The Hot Gets Hotter and the Warm Tier Tips
[Diagram: memory domain and tiers]
• 3D XPoint™ Technology (selector + memory cell): enabling the highest-performance SSDs
• 3D MLC and TLC NAND: enabling high-capacity SSDs at the lowest price
• Cold Tier: HDDs (+ TAPEs?)
Cluster topologies
[Diagram: Compute Nodes, I/O Nodes, Storage Nodes, Burst Buffer Nodes]
• Two common HPC system architectures in the industry:
• A Dragonfly (or similar) network connects the Compute Nodes; I/O Nodes access the Storage Nodes, which export a parallel file system
• Island-based design: Compute Node islands are connected to the I/O islands by a top-level core switch, normally oversubscribed at the top level
• Compute Nodes: commodity high-density servers or special compute blades
• I/O Nodes: LNet routers, Cray DVS nodes, CIOD forwarders
• Storage Nodes: Lustre Object Storage Servers or Spectrum Scale Network Shared Disk servers
• Burst Buffer Nodes: specialized nodes running burst buffer server software and hosting NV devices
Flash and storage
• Requires high-endurance, dual-port NVMe devices
• Normally used to accelerate metadata performance and small-file I/O
• Storage vendors provide transparent caching algorithms to accelerate specific use cases
• Parallel file systems can provide similar software solutions:
• Lustre with ZFS/L2ARC and Data on MDT (DoM)
• Spectrum Scale storage pools and tiering
[Chart: average latency (µsec) on Lustre over NVMe — sequential and random 4k reads and writes, OPA bulk and OPA metadata]
Traditional Lustre: Glimpse-ahead, Lock on Open, Read on Open
[Diagram: Client/MDS/OST message flows in traditional Lustre]
• Glimpse-ahead: Client1 issues MDS_GETATTR + LOCK without size; Client2 blocks; a GLIMPSE goes to the OST — 2 RPCs (3 with GLIMPSE)
• Lock on Open: OPEN to the MDS plus the OST lock — 2 RPCs
• Read on Open: OPEN to the MDS, then an EXTENT LOCK and READ on the OST, then the bulk I/O — 3 RPCs + BULK
Glimpse-ahead, Lock on Open, Read on Open: Traditional Lustre vs. new DoM
[Diagram: Client/MDS/OST message flows; with DoM the MDS returns the IO lock and the data with the open, with a bulk transfer only for sizes over 128k]
• Glimpse-ahead: traditional 2 RPCs (3 with GLIMPSE) → DoM 1 RPC (2 with GLIMPSE)
• Lock on Open: traditional 2 RPCs → DoM 1 RPC (OPEN + IO LOCK)
• Read on Open: traditional 3 RPCs + BULK → DoM 1 RPC (OPEN returns IO LOCK + DATA), + BULK if size > 128k
Small file creates directly on the Lustre MDT
• Now in the Lustre master branch, planned for Lustre 2.11
• Allows data to be written directly to the metadata target instead of going to OSTs
• Files of up to 1 MiB can now be stored directly on the MDT
• Reduces the number of RPCs and the access time
[Chart: file creates per second, 4 KiB files, DoM with DNE II scaling across 1x, 2x, 3x and 4x MDTs]
File creates per second, 4 KiB files, HDD vs. DoM:
• 1 SSD MDT + 1 HDD OST: 20,039
• 1 SSD MDT + 1 SSD OST: 44,403
• 1 SSD MDT (DoM): 79,690
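The effect DoM exploits can be sketched with a toy model (illustrative only, with invented names — not Lustre code): files at or below a size threshold live inline with their metadata record, so a read needs one server round-trip; larger files need a second hop to a separate object store.

```python
# Toy sketch of the Data-on-MDT idea (illustrative, not Lustre code).
# Small files live inline with their metadata; large files are stored
# in a separate object store and cost an extra round-trip ("RPC").

DOM_THRESHOLD = 1 << 20  # 1 MiB, the DoM cutoff mentioned on the slide

class ToyFS:
    def __init__(self):
        self.mdt = {}   # name -> (metadata dict, inline data or None)
        self.ost = {}   # object id -> data
        self.rpcs = 0   # count of server round-trips

    def create(self, name, data):
        if len(data) <= DOM_THRESHOLD:
            self.mdt[name] = ({"size": len(data)}, data)  # inline on "MDT"
        else:
            oid = len(self.ost)
            self.ost[oid] = data
            self.mdt[name] = ({"size": len(data), "oid": oid}, None)

    def read(self, name):
        meta, inline = self.mdt[name]
        self.rpcs += 1                 # one trip to the metadata server
        if inline is not None:
            return inline              # data came back with the open
        self.rpcs += 1                 # extra trip to the object store
        return self.ost[meta["oid"]]

fs = ToyFS()
fs.create("small", b"x" * 4096)
fs.create("big", b"y" * (2 << 20))
fs.read("small")
after_small = fs.rpcs                  # round-trips for the small file
fs.read("big")
after_big = fs.rpcs - after_small      # round-trips for the big file
print(after_small, after_big)          # -> 1 2
```

The small file is served entirely from the metadata lookup; the large one pays the extra object-store hop, which is the RPC reduction the charts above quantify.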
NVMe in the Storage Fabric
[Diagram: Compute Nodes, I/O Nodes, Storage Nodes, Burst Buffer Nodes]
• Dedicated Burst Buffer Nodes:
• DDN's IME
• Spectrum Scale with AFM
• Lustre with the new File Level Redundancy (FLR)
• Absorb the peak performance and design the legacy parallel file system for capacity only
• Transparent for users and integrated in the scheduler (DDN only)
• Erasure coding calculated on the client enables single-port, low-cost NVMe devices (DDN only)
• Does not solve the latency problem (flash is far from the compute nodes and its latency is masked by the software stack)
• Does not solve the network design problem (I/O node count and island bisection bandwidth)
Different approaches to the checkpoint use-case: DDN's IME vs. Lustre FLR
[Diagram]
• DDN's IME: a separate 2 PiB NVMe tier (/mnt/IME) in front of a 50 PiB HDD scratch file system (/mnt/SCRATCH); files are absorbed on NVMe and drained asynchronously to the HDDs
• Lustre FLR: a single 50+2 PiB namespace (/mnt/SCRATCH); each file has mirror0 on the NVMe burst-buffer pool (POOL: BB) and mirror1 on the HDD pool (POOL: DEFAULT), resynchronized asynchronously
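The mirrored-file side of the checkpoint comparison can be sketched as follows (a minimal model with invented names, not the Lustre FLR implementation): a write lands on the fast NVMe mirror and marks the HDD mirror stale; an asynchronous resync later copies the data down, after which a read can be served from either pool.

```python
# Minimal sketch of file-level mirroring across two pools (invented
# model, not Lustre FLR code): writes go to the fast pool and mark the
# slow copy stale; an async resync brings it back in sync.

class MirroredFile:
    def __init__(self):
        self.mirrors = {"BB": None, "DEFAULT": None}  # NVMe pool, HDD pool
        self.stale = set()

    def write(self, data):
        self.mirrors["BB"] = data    # checkpoint absorbed by the NVMe pool
        self.stale.add("DEFAULT")    # HDD copy is now out of date

    def resync(self):
        # The asynchronous step: copy the fresh data to the slow mirror.
        self.mirrors["DEFAULT"] = self.mirrors["BB"]
        self.stale.discard("DEFAULT")

    def read(self, prefer="DEFAULT"):
        # Serve from the preferred pool only if its copy is in sync.
        pool = prefer if prefer not in self.stale else "BB"
        return self.mirrors[pool]

f = MirroredFile()
f.write(b"checkpoint-1")
print(f.read())              # -> b'checkpoint-1' (served from BB while stale)
f.resync()
print("DEFAULT" in f.stale)  # -> False (both mirrors in sync)
```

The single-namespace property is the point: unlike the two-mount-point IME layout, the application only ever sees one file, and the tiering is a property of its layout.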
NV attached to the CNs Fabric
[Diagram: Compute Nodes, I/O Nodes, Storage Nodes, Burst Buffer Nodes]
• Dedicated Burst Buffer Nodes that do not route the I/O:
• Cray DataWarp
• Intel's DAOS
• Flash devices are very close to the Compute Nodes and will be integrated into the I/O Nodes in the near future
• Dramatically reduces the cost of the network and of the parallel file system
• Cray DataWarp: specific to and integrated with Cray's ecosystem; a cache solution; no resilience; near-POSIX; "private" file system
• Intel's DAOS: open-source and open-architecture solution; resilient; needs integration with I/O middleware; POSIX available as middleware; the parallel file system is optional
NVMe Storage on the Compute Nodes
[Diagram: Compute Nodes, I/O Nodes, Storage Nodes]
• Local NVMe devices or NVMe-over-Fabrics exported devices
• Local acceleration of the parallel file system:
• Local Read-Only Cache and Highly Available Write Cache with GPFS
• "Private" file systems:
• BeeOND
• Local XFS/ext4
• Extremely high peak I/O performance with very low variation
• Limited support for shared-file I/O; data movement outside of jobs becomes difficult; node failures become more problematic; performance cannot be decoupled from job size
• The low latency of the NV devices is masked by the file system software stack
NV DIMM on the Compute Nodes
[Diagram: Compute Nodes, I/O Nodes, Storage Nodes, Burst Buffer Nodes]
• Some applications can take advantage of on-package memory and no longer need DDR memory.
• More cost-effective NV DIMMs can then be used for I/O:
• Intel's CPPR + SCR library (focused on a specific use case, POSIX compatible)
• Intel's DAOS
• Extremely high peak I/O performance with very low variation
• Limited support for shared-file I/O; data movement outside of jobs becomes difficult; node failures become more problematic; performance cannot be decoupled from job size
• The file system software stack masks the low latency of NVMe devices; ultra-low latency and "byte" granularity are provided by the NV DIMM devices
Distributed Asynchronous Object Storage (DAOS)
• Lightweight Storage Stack
• Ultra-fine grained I/O
• New Storage Model
• Extreme Scale-out & Resilience
• Multi-Tier
• New Workflow Methodologies
• Open source (Apache 2.0 license)
DAOS Deployment
[Diagram: traditional vs. new (DAOS) I/O stacks]
• Traditional stack: HPC Application → POSIX / Data Model Library → Parallel Filesystem Client (kernel) → Parallel Filesystem Server → HDD / NVMe
• New DAOS stack: HPC Application → Data Model Library → DAOS API → DAOS Client (user space) → DAOS Server (user space) → NVDIMM, plus optional NVMe and NVDIMM on the client
• Both run over LibFabric / Mercury fabric RDMA
Lightweight Storage Stack
• Mercury user-space function shipping
• MPI-equivalent communications latency
• Built over libfabric
• Applications link directly with the DAOS library
• Direct call, no context switch
• No locking, caching or data copies
• User-space DAOS server
• Mmap'd non-volatile memory (NVML)
• NVMe access through SPDK/BlobFS
• User-level threads with Argobots
[Diagram: HPC application and I/O middleware (HDF5, POSIX, ADIOS, …) over the DAOS library; RPCs via Mercury over libfabric to DAOS xstreams; bulk transfers to NVMe via SPDK and to NVDIMM via NVML]
Ultra-fine grained I/O
• Mix of storage technologies
• NV DIMM (3D XPoint™ DIMMs)
– DAOS metadata & application metadata
– Byte-granular application data
• NVMe (NAND, 3D XPoint™)
– Cheaper storage for bulk data
– Multi-KB I/Os
• I/Os are logged & inserted into a persistent index
• All I/O operations are tagged/indexed by version (version = epoch)
– Non-destructive write & consistent read
• No alignment constraints
– No read-modify-write
[Diagram: record extents written at versions v1–v3 into an index on NVRAM/NVMe; a read@v3 sees committed extents and ignores those still being written]
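The version-tagged index can be sketched like this (a simplified model under assumed semantics, not the DAOS on-disk format): every write is logged with its epoch rather than overwriting in place, and a read at epoch v resolves each byte to the newest committed write with epoch ≤ v.

```python
# Sketch of version-tagged, non-destructive writes (simplified model,
# not the DAOS index format): each write is logged with its epoch; a
# read at epoch v sees only committed writes with epoch <= v.

class VersionedIndex:
    def __init__(self):
        self.log = []          # (epoch, offset, bytes), append-only
        self.committed = set() # epochs visible to readers

    def write(self, epoch, offset, data):
        self.log.append((epoch, offset, data))   # never overwrites in place

    def commit(self, epoch):
        self.committed.add(epoch)

    def read_at(self, epoch, offset, length):
        out = bytearray(length)
        # Replay committed writes up to the requested epoch, oldest
        # first, so newer extents shadow older ones byte by byte.
        for e, off, data in sorted(self.log, key=lambda r: r[0]):
            if e > epoch or e not in self.committed:
                continue
            for i, b in enumerate(data):
                pos = off + i - offset
                if 0 <= pos < length:
                    out[pos] = b
        return bytes(out)

idx = VersionedIndex()
idx.write(1, 0, b"aaaa")
idx.write(2, 2, b"BB")       # overlapping extent at a later epoch
idx.commit(1)
idx.commit(2)
idx.write(3, 0, b"zzzz")     # epoch 3 still being written: not committed
print(idx.read_at(3, 0, 4))  # -> b'aaBB' (epoch 3 invisible until committed)
print(idx.read_at(1, 0, 4))  # -> b'aaaa' (consistent read at an older epoch)
```

Because nothing is overwritten, a reader pinned to an older epoch keeps seeing a consistent view while newer epochs are still in flight, and no read-modify-write is ever needed for unaligned extents.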
Scalability & Resilience
• Scalable communications and scalable I/O
• Shared-nothing distribution & redundancy schema: placement by Hash(object.Dkey) with fault-domain separation
• Storage system failure
• Failed target(s) evicted
• Automated online rebuild & rebalancing
• End-to-end integrity
• Checksums can be provided by the I/O middleware
• Stored, and verified on read & during scrubbing
• Application failure
• Scalable distributed transactions exported to the I/O middleware
• Automated recovery rolls back uncommitted changes
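The placement scheme in the diagram can be sketched as follows (an illustrative algorithm with an assumed example topology, not the actual DAOS placement map): the object's distribution key is hashed to pick a starting target, and each redundant copy is forced onto a target in a fault domain not used yet.

```python
# Sketch of shared-nothing placement with fault-domain separation
# (illustrative, not the DAOS placement algorithm): hash the object's
# distribution key (dkey) to a starting target, then place each
# replica in a distinct fault domain.

import hashlib

# target id -> fault domain (e.g. rack); assumed example topology
TARGETS = {0: "rackA", 1: "rackA", 2: "rackB", 3: "rackB", 4: "rackC"}

def place(dkey: bytes, nreplicas: int = 2):
    # Deterministic start point derived from the dkey hash.
    h = int.from_bytes(hashlib.sha256(dkey).digest()[:8], "big")
    order = sorted(TARGETS, key=lambda t: (t + h) % len(TARGETS))
    chosen, used_domains = [], set()
    for t in order:                       # walk targets from the hashed start
        if TARGETS[t] not in used_domains:
            chosen.append(t)
            used_domains.add(TARGETS[t])  # never reuse a fault domain
        if len(chosen) == nreplicas:
            break
    return chosen

replicas = place(b"object.dkey")
domains = {TARGETS[t] for t in replicas}
print(replicas, domains)  # two targets in two distinct racks
```

Since any client can recompute the same placement from the dkey alone, no central metadata lookup is needed, which is what makes the distribution shared-nothing and scalable.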
DAOS Ecosystem
[Diagram: I/O middleware over DAOS]
• HPC: HDF5 + extensions, Legion, NetCDF, POSIX, DataSpaces, MPI-IO, …
• Analytics: HDFS/Spark RDD
• Cloud: S3/Swift, NFS Ganesha, block device
ESSIO Storage Stack
[Diagram: Application / I/O Middleware / Storage Backend layers]
• New storage API (DAOS) provides extended capabilities and low-latency I/O to middleware
• Port I/O middleware (HDF5, ADIOS, …) to the new storage backend and augment its API to take advantage of the new capabilities
• Evaluate applications (HACC, ACME, CLAMR) and a new programming model (Legion) over the enhanced I/O middleware
Conclusion and References
• Parallel file systems are adopting 3D NAND technologies to accelerate the most demanding workloads.
• Current burst-buffer technologies are too far from the compute nodes and are designed to solve the checkpoint use-case only.
• Leveraging smart user-space libraries can change the way HPC systems are designed and take advantage of low-latency, high-bandwidth local storage.
• https://glennklockwood.blogspot.co.uk/2017/03/reviewing-state-of-art-of-burst-buffers.html
• https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Flash%20Storage
• https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-gains-ibm-spectrum-scale.pdf
• https://wiki.hpdd.intel.com/download/attachments/36966823/MS54_CORAL_NRE_DAOSM_HLD_v2.3_final.pdf?version=1&modificationDate=1460746910000&api=v2