16
An Exploration into Object Storage for Exascale Supercomputers Raghu Chandrasekar

Raghu Chandrasekar - CUGRaghu Chandrasekar [email protected]. Title: saroja Created Date: 5/11/2017 8:21:04 PM

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

An Exploration into Object Storagefor Exascale SupercomputersRaghu Chandrasekar

Page 2: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Agenda

● Introduction

● Trends and Challenges

● Design and Implementation of SAROJA

● Preliminary evaluations

● Summary and Conclusion

CUG 2017 Copyright 2017 Cray Inc. 2

Page 3: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Safe Harbor Statement

This presentation may contain forward-looking statements that are basedon our current expectations. Forward looking statements may includestatements about our financial guidance and expected operating results,our opportunities and future potential, our product development and newproduct introduction plans, our ability to expand and penetrate ouraddressable markets and other statements that are not historicalfacts. These statements are only predictions and actual results maymaterially vary from those projected. Please refer to Cray's documents filedwith the SEC from time to time concerning factors that could affect theCompany and these forward-looking statements.

CUG 2017 Copyright 2017 Cray Inc. 3

Page 4: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Storage Hierarchy Data Path Concepts

CUG 2017 Copyright 2017 Cray Inc. 4

NVRAM

CPUs

DRAM

HBM

IONodes

NAND

ColdStorageServers

HDD

ArchiveStorageServers

Tape

High SpeedCompute Fabric

Site Network

WAN / Cloud

…Switch

…Switch

…Switch

O(1µs) Nonvolatile StoragePrivate Scratch NamespaceRelaxed POSIX or Key-Value APISharable Namespace (upon flush to backing store)App-Controlled Caching; Close-to-Open ConsistencyInter-Node Cache-Consistent in Small Clusters

O(100µs) Nonvolatile StorageSharable Across ComputersLightweight Object or POSIX InterfacePrimary Resilient Random StorageFor HPC / Analytics / General

Distributed Storage ClientNode-Local Cache & Working MemoryRelaxed POSIX or Key-Value APILocal Cache ControlSharding and Resiliency Controls

O(10ms) Backing StoreSite-Wide AccessBulk Object/Cloud Storage APIHigh-9’s Data ResiliencyStreaming Sequential Throughput

NamespaceServers

NAND

Scalable Metadata ServiceServes Metadata to One Or More TiersProvide Key-Value, POSIX, namespace APIsAttributes and Rich Metadata StructuresMap App Structures to One AnotherCollections and Manifests of Large Data Sets

Multi-Second Data Distribution & ProtectionMulti-Site ReachBulk Object Storage APICloud Bursting, Disaster Recovery, Archival

O(5ns) In-Package MemoryCustomer Workload

O(50ns) DDR MemoryCustomer Workload

Page 5: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Storage Media Latencies and IOPs

CUG 2017 Copyright 2017 Cray Inc. 5

525l

Disk-Based(~5 msec)

With Pmem(~0.03 msec)

32K IOPS*

200 IOPS*

* Max potential 1-thread random sector

Storage

5000us

100usNetwork

200usSofttware

25 25lFlash+RDMA(~0.05 msec)

20k IOPS*

Flash/pmem(~0.0x msec)

Software becomes the largest fraction of latency when usingpersistent memory, even with 4x improved software efficiency

Page 6: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Cray Compute and Fabric Topology

CUG 2017 Copyright 2017 Cray Inc. 6

Group0 Group1 Group2 Group3 Group4 Group5 Group6 Group7

Flexiblecompute Highdensitycompute

Compute Node-Local StoragePotential 256k Nodes

Enclosure-Based StoragePotentially 64k (or more) Devices

High bandwidth DragonflyFabric

Page 7: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Analytics and HPC Software Convergence

CUG 2017 Copyright 2017 Cray Inc. 7

256k NodeManagement,

Monitor,Service

Infrastructure

Analytics Frameworkwith Local Caching

ScalableMetadataServices

User Application

High-speed dragonfly fabricComputeStore

HPC File or Objectwith Optional Caching

Flash

Pmem

Flash

Flash

Flash

Flash

Flash

RDMATransport

Flash

Flash

Flash

Flash

Flash

User Application

Flash

Pmem

RDMATransport

POSIX Files,HDF5 Containers,

K/V

Spark RDDs,K/V, or Other

Discover,Query

Page 8: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

SAROJA Proof-of-Concept

CUG 2017 Copyright 2017 Cray Inc. 8

Page 9: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Scalable And Resilient ObJect StorAge

CUG 2017 Copyright 2017 Cray Inc.

ObjectStorageClusterNoSQLService

PersistentMemory

NVMeFlashNVMe Flash

ComputeNodes

ParallelApplications

POSIX Object

SAROJAclient(libsaroja)

NativeAPI

PAXOS/RAFT/ZabCluster

DatapathMetadataConsensus

9

Page 10: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Preliminary Evaluations

CUG 2017 Copyright 2017 Cray Inc. 10

Page 11: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Metadata Evaluations

CUG 2017 Copyright 2017 Cray Inc. 11

10,000

12,000

1 2 4 8 16 32 64 128

Cre

ates

per

sec

ond

Number of client processes

Ceph vs Lustre: File Creation Rates

ceph−fuse cephfs (kernel) Lustre

0

2,000

4,000

6,000

8,000

Ceph POSIX support still has a long way to go

(Higher is better)

Page 12: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Metadata Evaluations

CUG 2017 Copyright 2017 Cray Inc. 12

Scaling trends not ideal; but promising approach functionally

SAROJA File Creates vs. Cassandra Servers

0

20,000

40,000

60,000

80,000

100,000

120,000

140,000

160,000

1 2 4 8

Cre

ates

per

sec

on

d

Number of Cassandra Servers

Peak Lustre file creation rate(w/o DNE)

• POSIX over SAROJA• 4480 MPI ranks• 56 XC compute nodes• 500 files/rank• TCP over GNI• Replication disabled

Page 13: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Data Path Evaluation

CUG 2017 Copyright 2017 Cray Inc. 13

32 64 128

Thro

ughput

(MB

/s)

Number of client processes

Ceph vs Lustre Throughput

Ceph (mean) Lustre (mean) Ceph (peak) Lustre (peak)

0

2,000

4,000

6,000

8,000

10,000

12,000

1 2 4 8 16

Viable for use in the data path;Plenty of opportunities for tuning

Page 14: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Summary

● Inflection point in storage system design

● Three-tier storage topology for supercomputers

● Promising early investigations with object storage tech

● Gradual transition

● Call for feedback

CUG 2017 Copyright 2017 Cray Inc. 14

Page 15: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.

CUG 2017 Copyright 2017 Cray Inc. 15

Page 16: Raghu Chandrasekar - CUGRaghu Chandrasekar raghu@cray.com. Title: saroja Created Date: 5/11/2017 8:21:04 PM

Questions & Answers

Raghu [email protected]