In search of the right way for extreme-scale HPC file system metadata
Qing Zheng (1), Kai Ren (1), Garth Gibson (1), Bradley W. Settlemyer (2)
(1) Carnegie Mellon University, (2) Los Alamos National Laboratory
[LA-UR-15-25703]

In search of the right way for extreme-scale HPC file ...qingzhen/talk/batchfs_talk_minihpc15.pdf · In search of the right way for extreme-scale HPC file system metadata Qing Zheng


Page 2: HPC defined by …

• parallel scientific applications
  o running on compute nodes with pooled storage for highly parallel I/O

[Diagram: App1 to App5 on compute nodes w/ fast interconnect (10,000+); pooled storage nodes (100+)]

LANL/HPC_showcase | Parallel Data Lab - http://www.pdl.cmu.edu/

Page 3: HPC Checkpointing

• copying memory from compute nodes to a backend file system

[Diagram: App1 to App5 on compute nodes w/ fast interconnect (10,000+) with memory (1PB); pooled storage nodes (100+) with LustreFS storage (80PB – 1.45TB/s)]

Page 4: HPC Checkpointing

• copying memory from compute nodes to a parallel file system

[Diagram: same components as Page 3]

Page 5: HPC Checkpointing

• two-tier storage for higher bandwidth

[Diagram: same components as Page 3, plus a flash burst buffer (4PB – 3.3TB/s)]
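
The bandwidth figures on these slides can be turned into a rough checkpoint time. The sketch below is a back-of-the-envelope estimate only: it assumes the full 1PB of memory is written, that the quoted bandwidths are sustained, and that 1PB = 1000TB. The function name is illustrative and not from the talk.

```python
TB_PER_PB = 1000  # assumption: decimal units


def checkpoint_seconds(memory_pb: float, bandwidth_tb_per_s: float) -> float:
    """Time to copy `memory_pb` of memory at a sustained bandwidth."""
    return memory_pb * TB_PER_PB / bandwidth_tb_per_s


lustre_s = checkpoint_seconds(1, 1.45)  # LustreFS storage tier
burst_s = checkpoint_seconds(1, 3.3)    # flash burst buffer tier

print(f"LustreFS:     {lustre_s:.0f} s")
print(f"burst buffer: {burst_s:.0f} s")
```

Under these assumptions the data phase drops from roughly 690 s to roughly 303 s when the checkpoint lands in the burst buffer first.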

Page 6: HPC Checkpointing

• metadata can also become a bottleneck

[Diagram: App1 to App5 on compute nodes w/ fast interconnect (320,000 CPU cores – 640,000 CPU threads) with memory (1PB); pooled storage nodes (100+); a head node serving LustreFS metadata (5,000 creates/s)]

Page 7: HPC Checkpointing

• file creations are non-trivial at extreme scale

[Diagram: same components as Page 6, annotated "128 s to create all files"]
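
The 128 s figure is consistent with a simple division of the numbers on the slide. The sketch below assumes one checkpoint file per CPU thread and a single metadata server that sustains its quoted create rate; both are assumptions used for illustration, and the function name is hypothetical.

```python
def create_seconds(num_files: int, creates_per_second: float) -> float:
    """Time for one metadata server to create `num_files` files serially."""
    return num_files / creates_per_second


cpu_threads = 640_000        # from the slide
lustre_create_rate = 5_000   # creates/s, from the slide

# Assumption: every thread creates exactly one file.
print(create_seconds(cpu_threads, lustre_create_rate))  # 128.0
```

Because the create rate is fixed by the single head node, this time grows linearly with the number of threads.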

Page 8: [Figure: FILE CREATES (s) vs. DATA (s)]

Page 9: [Figure: FILE CREATES (s) vs. DATA (s), with 2X CPU cores and 2X storage bandwidth]

Page 10: Bias for data …

Lustre File System
• metadata service [a single (or a few) machines]
• data service [a large collection of machines]

metadata eventually a problem !!

Page 11: SCALING-UP …

Harden Lustre File System
• 30,000 creates/s [6X faster]

Page 12: SCALING-UP …

• 30,000 creates/s [6X faster]
• expensive w/ limited improvements

Page 13: SCALING-OUT …

DISTRIBUTED METADATA
• Use multiple dedicated machines to serve file system metadata

Page 14: SCALING-OUT …

• hierarchical directory structure
• dynamic directory allocation & splitting

[Diagram: a namespace tree (/, home, bws, zhengq, garth, src, f1, f2, f3) next to the same entries (ROOT, home, bws, zhengq, garth, src, f1, f2, f3) spread across machines]

Page 15: SCALING-OUT …

• directory structure
• dynamic directory allocation & splitting
• requires a large # of dedicated machines

[Diagram: same as Page 14]
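
One way to picture "dynamic directory allocation & splitting" is a directory that starts on one server and splits into more hash partitions as it grows. The sketch below is a simplified hash-partitioning scheme chosen for illustration, not the algorithm of any system named in the talk; the class name, threshold, and partition limit are all assumptions.

```python
import hashlib


def name_hash(name: str) -> int:
    """Stable hash of a file name (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")


class SplittingDirectory:
    """One directory whose entries are spread over a growing set of partitions."""

    def __init__(self, max_partitions: int = 8, split_threshold: int = 4):
        # Assumption: max_partitions is a power of two, one partition per server.
        self.max_partitions = max_partitions
        self.split_threshold = split_threshold  # entries allowed before a split
        self.partitions = [set()]               # starts on a single server

    def _index(self, name: str, count: int) -> int:
        return name_hash(name) % count

    def create(self, name: str) -> None:
        part = self.partitions[self._index(name, len(self.partitions))]
        part.add(name)
        too_big = len(part) > self.split_threshold
        if too_big and len(self.partitions) < self.max_partitions:
            self._split()

    def _split(self) -> None:
        """Double the partition count and redistribute existing entries."""
        count = len(self.partitions) * 2
        fresh = [set() for _ in range(count)]
        for part in self.partitions:
            for name in part:
                fresh[self._index(name, count)].add(name)
        self.partitions = fresh

    def lookup(self, name: str) -> bool:
        return name in self.partitions[self._index(name, len(self.partitions))]


d = SplittingDirectory()
for i in range(100):
    d.create(f"f{i}")
print(len(d.partitions), [len(p) for p in d.partitions])
```

Each partition would live on its own dedicated machine, which is why throughput under this approach scales only as fast as machines are added.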

Page 16: GOAL

• Cost-effective highly-parallel metadata
• Orders of magnitude faster than LustreFS

Page 18: Traditional Metadata Model

[Diagram: App1, App2, and App3 on compute nodes; a shared file system view (namespace tree: /, home, bws, zhengq, garth, src, f1, f2, f3); a dedicated mediator providing global coordination]

Page 19: Traditional Metadata Model

• A single authoritative view of the file system
• Allows different apps to share data and achieve synchronization

[Diagram: same as Page 18]

Page 20: HPC I/O is much simpler …

[Diagram: HPC Application with INPUT, OUTPUT, and CHECKPOINT DATA]

Page 21: HPC I/O is much simpler …

• HPC applications don’t need the FS to achieve synchronization
• Different apps operate on different sets of files

Page 22: HPC I/O is much simpler …

• HPC applications don’t need the FS to achieve synchronization
• Different apps operate on different sets of files
• A single unified file system view is overkill for HPC applications
• Global coordination via a dedicated service is bad for performance

Page 23: Middleware Design

[Diagram, top to bottom: Parallel Scientific Applications; metadata ops; Shared Underlying Storage [parallel data access | oblivious of file system semantics] holding data storage and metadata storage]

Page 24: Serverless Metadata

[Diagram: on the compute nodes, each app holds its own metadata: App1 (/, proj, s3d, data), App2 (/, proj, ocean, input), App3 (/, proj, mpas, repo)]

Page 25: Serverless Metadata

• No global namespace managed by a centralized mediator
• Each app only communicates with its private metadata servers

Page 26: Serverless Metadata

• No global namespace managed by a centralized mediator
• Each app only communicates with its private metadata servers

Fundamental Assumption: HPC applications don’t communicate with each other

Page 27: App Execution Model

[snapshot = static view of a file system]

[Diagram: SNAPSHOT CATALOG; snapshot; HPC Application; new snapshot]

• Implemented as database, k/v-store, ZooKeeper, …

Page 28: App Execution Model

• Keeps each app isolated from other running apps
• Truly parallel metadata processing among different apps
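
The execution model can be sketched as a small catalog from which an app checks out a static snapshot, mutates a private copy, and publishes the result as a new snapshot. This is a minimal in-memory illustration under assumed names (`SnapshotCatalog`, `checkout`, `publish`, `run_app`, and the file paths are all hypothetical); the slides only say the catalog could be a database, k/v-store, or ZooKeeper.

```python
from types import MappingProxyType


class SnapshotCatalog:
    """Hypothetical in-memory catalog of immutable namespace snapshots."""

    def __init__(self):
        # snapshot_0 is assumed to be an empty starting namespace.
        self._snapshots = {"snapshot_0": MappingProxyType({})}
        self._next_id = 1

    def checkout(self, name: str) -> dict:
        """Return a private, mutable copy of a published snapshot."""
        return dict(self._snapshots[name])

    def publish(self, namespace: dict) -> str:
        """Freeze a namespace as a new snapshot and return its name."""
        name = f"snapshot_{self._next_id}"
        self._next_id += 1
        self._snapshots[name] = MappingProxyType(dict(namespace))
        return name


def run_app(catalog: SnapshotCatalog, base: str, new_files: dict) -> str:
    namespace = catalog.checkout(base)  # static view taken at start
    namespace.update(new_files)         # metadata ops stay private to the app
    return catalog.publish(namespace)   # visible to others only from here on


catalog = SnapshotCatalog()
s1 = run_app(catalog, "snapshot_0", {"/proj/s3d/ckpt.0": "written by App1"})
s2 = run_app(catalog, s1, {"/proj/s3d/plot.dat": "written by App2"})
print(s1, s2, sorted(catalog.checkout(s2)))
```

The only shared step is the publish call, so apps that run at the same time never coordinate on individual file creates.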

Page 29: Sharing Data Between Apps

[snapshot = static view of a file system]

[Diagram: HPC workflow; SNAPSHOT CATALOG; snapshot_0, App1, snapshot_1]

Page 30: Sharing Data Between Apps

[Diagram: as on Page 29, plus App2 and snapshot_2]

Page 31: Sharing Data Between Apps

[Diagram: as on Page 30, plus App3 and snapshot_3]

Page 32: Sharing Data Between Apps

• Converts online to batched offline sharing
• Allows apps to decide file visibility and timings for communication

Page 33: Sharing Data Between Apps

• Converts online to batched offline sharing
• Allows apps to decide file visibility and timings for communication
• Avoids unnecessary coordination synchronously enforced by a centralized metadata service

Page 34: Snapshot Merging

[Diagram: snapshot_145, snapshot_8, and snapshot_54 combined into one file system namespace]

Page 35: Snapshot Merging

[Diagram: snapshot_145, snapshot_8, and snapshot_54 combined into one file system view]

• Conflicts resolved at read path to avoid unnecessary computation
• Resolution performed by apps instead of a super authority
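
Read-path conflict resolution can be sketched as a merged view that does no work at merge time and consults an app-supplied policy only when a looked-up path appears in more than one snapshot. The class, the `newest_wins` policy, and the file paths below are assumptions for illustration; the slides do not prescribe a particular policy.

```python
class MergedView:
    """Lazy union of snapshots; conflicts are settled only when a path is read."""

    def __init__(self, snapshots, resolve):
        self._snapshots = snapshots  # list of (snapshot name, {path: value})
        self._resolve = resolve      # app-supplied conflict policy

    def lookup(self, path):
        candidates = [
            (name, entries[path])
            for name, entries in self._snapshots
            if path in entries
        ]
        if not candidates:
            raise FileNotFoundError(path)
        if len(candidates) == 1:
            return candidates[0][1]
        return self._resolve(path, candidates)


def newest_wins(path, candidates):
    """Example policy (an assumption): prefer the highest snapshot number."""
    return max(candidates, key=lambda c: int(c[0].rsplit("_", 1)[1]))[1]


view = MergedView(
    [
        ("snapshot_145", {"/out/a": "a from 145", "/out/b": "b from 145"}),
        ("snapshot_8", {"/out/a": "a from 8"}),
        ("snapshot_54", {"/out/c": "c from 54"}),
    ],
    resolve=newest_wins,
)
print(view.lookup("/out/a"))
print(view.lookup("/out/c"))
```

Paths that are never read never pay for resolution, and a different app could pass a different `resolve` function over the same snapshots.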

Page 36: HPC in 10 years …

• two-tier storage for cost-effective fast data
• client-funded namespace for highly-parallel metadata

[Diagram: App1 to App5 on compute nodes w/ fast interconnect (10,000+); pooled storage nodes (100+) with a performance tier and a capacity tier]

Page 37: Reference

• Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14)
• Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)

Page 38: QUESTIONS