Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
In search of the right way forextreme-scale HPC file system metadata
Qing Zheng1, Kai Ren1, Garth Gibson1, Bradley W. Settlemyer21Carnegie MellonUniversity
2Los AlamosNationalLaboratory[LA-UR-15-25703]
++
HPC defined by …• parallel scientific applications
o running on compute nodes with pooled storage for highly parallel I/O
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 2
compute nodes w/ fast interconnect(10,000+)
pooled storage nodes(100+)
App3App2
App1
App4App5
HPC Checkpointing• copying memory from compute nodes to a backend file system
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 3
compute nodes w/ fast interconnect(10,000+)
pooled storage nodes(100+)
LustreFS storage(80PB – 1.45TB/s)
memory(1PB)App3
App2
App1
App4App5
HPC Checkpointing• copying memory from compute nodes to a parallel file system
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 4
compute nodes w/ fast interconnect(10,000+)
pooled storage nodes(100+)
LustreFS storage(80PB – 1.45TB/s)
memory(1PB)App3
App2
App1
App4App5
HPC Checkpointing• two-tier storage for higher bandwidth
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 5
memory(1PB)App3
App2
App1
App4App5
compute nodes w/ fast interconnect(10,000+)
pooled storage nodes(100+)
LustreFS storage(80PB – 1.45TB/s)
flash burst buffer(4PB – 3.3TB/s)
HPC Checkpointing• metadata can also become a bottleneck
6
App3
App1
App4App5
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/
compute nodes w/ fast interconnect320,000 CPU cores – 640,000 CPU threads
pooled storage nodes(100+)
head node
LustreFS metadata(5,000 creates/s)
memory(1PB)
App2
HPC Checkpointing• file creations are non-trivial in extrema-scale
7
App3
App1
App4App5
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/
128 s to createall files
compute nodes w/ fast interconnect320,000 CPU cores – 640,000 CPU threads
pooled storage nodes(100+)
head node
LustreFS metadata(5,000 creates/s)
memory(1PB)
App2
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 8
s sFILE CREATES DATA
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 9
s sFILE CREATES DATA
2X CPU cores and 2X storage bandwidth
Bias for data … Lustre File System
metadata service data service
metadata eventually a problem !!LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 10
[a single (or a few) machines] [a large collection of machines]
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 11
SCALING-UP …
30,000creates/s
[6X faster]
Harden Lustre File System
SCALING-UP …
30,000creates/s
[6X faster]
expensive w/ limited improvementsLANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 12
SCALING-OUT …
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 13
Use multiple dedicated machines
to serve file system metadata
DISTRIBUTED METADATA
SCALING-OUT …
bws
/
home
zhengq garth
src f1 f2 f3
f1
ROOT
f3
f2
home
garth
zhengqsrc
bws
hierarchical directory structure
dynamic directoryallocation & splitting
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 14
SCALING-OUT …
bws
/
home
zhengq garth
src f1 f2 f3
f1
ROOT
f3
f2
home
garth
zhengqsrc
bws
directory structure
dynamic directoryallocation & splitting
requires a large # of dedicated machinesLANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 15
!Cost-effective highly-parallel metadata
Orders of magnitude faster then LustreFS
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 16
GOAL
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 17
++
Traditional Metadata Model
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 18
bws
/
home
zhengq garth
src f1 f2 f3
compute nodes
App2
App1App3
shared file system view
dedicated mediator
global coordination
Traditional Metadata Model
bws
/
home
zhengq garth
src f1 f2 f3
compute nodes
App2
App1App3
shared file system view
dedicated mediator
global coordination
A single authoritative view of the file systemAllows different apps to shared data and achieve synchronization
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 19
HPC I/O is much simpler …
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 20
HPC ApplicationINPUT OUTPUT
CHECKPOINT DATA
HPC I/O is much simpler …
HPC ApplicationINPUT OUTPUT
CHECKPOINT DATA
HPC applications don’t need the FS to achieve synchronizationDifferent apps operate on different sets of files
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 21
HPC I/O is much simpler …
HPC ApplicationINPUT OUTPUT
CHECKPOINT DATA
HPC applications don’t need the FS to achieve synchronizationDifferent apps operate on different sets of files
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 22
A single unified file system view overkill for HPC applicationsGlobal coordination via a dedicated service bad for performance
Middleware Design
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 23
Parallel Scientific Applications
Shared Underlying Storage[parallel data access | oblivious of file system semantics]
metadata ops
data storage metadata storage
++
Serverless Metadata
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 24
compute nodes
App1
metadata
/
projs3d
dataApp3
metadata
/
projmpas
repo
App2
metadata
/
projocean
input
Serverless Metadata
compute nodes
App1
metadata
/
projs3d
dataApp3
metadata
/
projmpas
repo
App2
metadata
/
projocean
input
No such global namespace managed by a centralized mediatorEach app only communicates with its private metadata servers
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 25
Serverless Metadata
compute nodes
App1
metadata
/
projs3d
dataApp3
metadata
/
projmpas
repo
App2
metadata
/
projocean
input
No such global namespace managed by a centralized mediatorEach app only communicates with its private metadata servers
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 26
Fundamental AssumptionHPC applications don’t communicate with each other
snapshot
App Execution Model
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 27
SNAPSHOT
CATALOG
[snapshot = static view of a file system]
HPC Application
new snapshotImplemented as database, k/v-store, zookeeper, …
snapshot
App Execution Model
SNAPSHOT
CATALOG
[snapshot = static view of a file system]
HPC Application
new snapshotImplemented as database, k/v-store, zookeeper, …
Keeps each app isolated from other running appsTruly parallel metadata processing among different apps
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 28
Sharing Data Between Apps
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 29
HPC workflow
snapshot_1SNAPSHOT
CATALOG
APP1
snapshot_0
[snapshot = static view of a file system]
Sharing Data Between Apps
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 30
snapshot_2
HPC workflow
snapshot_1SNAPSHOT
CATALOG
APP1
snapshot_0
[snapshot = static view of a file system]
App2
snapshot_3
Sharing Data Between Apps
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 31
snapshot_2
HPC workflow
snapshot_1SNAPSHOT
CATALOG
APP1
snapshot_0
[snapshot = static view of a file system]
App2 APP3
snapshot_3
Sharing Data Between Apps
snapshot_2
HPC workflow
snapshot_1SNAPSHOT
CATALOG
APP1
snapshot_0
[snapshot = static view of a file system]
App2 APP3Converts online to batched offline sharingAllows apps to decide file visibility and timings for communication
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 32
snapshot_3
Sharing Data Between Apps
snapshot_2
HPC workflow
snapshot_1SNAPSHOT
CATALOG
APP1
snapshot_0
[snapshot = static view of a file system]
App2 APP3Converts online to batched offline sharingAllows apps to decide file visibility and timings for communication
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 33
Avoids unnecessary coordinationsynchronously enforced by a centralized metadata service
Snapshot Merging
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 34
snapshot_145 snapshot_8 snapshot_54
file system namespace
Snapshot Merging
snapshot_145 snapshot_8 snapshot_54
file system viewConflicts resolved at read path to avoid unnecessary computationResolution performed by apps instead of a super authority
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 35
HPC in 10 years …• two-tier storage for cost-effective fast data• client-funded namespace for highly-parallel metadata
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 36
App3App2
App1
App4App5
compute nodes w/ fast interconnect(10,000+)
pooled storage nodes(100+)
capacity tier
performance tier
Reference
Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14)
Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)
LANL/HPC_showcaseParallel Data Lab - http://www.pdl.cmu.edu/ 37
QUESTIONS