Let’s Make Parallel File System More Parallel [LA-UR-15-25811]

Qing Zheng (1), Kai Ren (1), Garth Gibson (1), Bradley W. Settlemyer (2)

(1) Carnegie Mellon University
(2) Los Alamos National Laboratory

HPC defined by …

LANL/Summer_SchoolParallel_Data_Lab - http://www.pdl.cmu.edu/ 2

Parallel scientific apps

low-latency network for msg passing

tiered cluster deployments

PFS for highly scalable storage I/O

[Figure: App1, App2, App3 on compute nodes (10,000+); Parallel File System [Lustre] on storage nodes (100+)]

Failure Handling …

Nodes/network will fail
• apps use checkpoints to avoid complete re-execution
• each proc dumps its memory to a file

When failure happens
• an app is simply re-scheduled and resumes execution from its latest checkpoint

Checkpointing …

• Assuming 20,000 nodes and 32 CPUs per node (20,000 × 32 = 640K processes)
• 640K open()/close()
• N * 640K write()

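    /* Per-process checkpoint dump; <proc_id> and "<…..>" are placeholders from the slide. */
    if (proc_id == 0) {
        mkdir("/proj/a/chk/001", 0755);   /* one process creates the directory; mode is illustrative */
    }
    sync();   /* presumably a barrier across all processes, not POSIX sync(2) */
    int fd = open("/proj/a/chk/001/<proc_id>",
                  O_CREAT | O_EXCL | O_WRONLY, 0644);   /* O_CREAT needs a mode; 0644 is illustrative */
    write(fd, "<…..>");
    write(fd, "<…..>");
    close(fd);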

Will existing PFS deliver sufficient perf?

[ DATA ] YES?
[ METADATA ] NO?

[ METADATA ]
open(), close(), unlink(), mkdir(), rmdir(), rename(), getattr(), chmod(), readdir(), …

Metadata

1. Namespace Tree
• hierarchical directory structure

2. File Attributes
• file name, file size, last modification time, …

3. Data Location
• where to find file/directory data?

Decoupled PFS

Parallel File System
• metadata service [a single (or a few) machines], e.g. Lustre MDS
• data service [a large collection of machines], e.g. Lustre OSS

Allow data to scale without scaling metadata

Isn’t Metadata a Problem?

NO: metadata is small in size
NO: 90% of ops are I/O
NO: FS only stores large files

Median file size is actually tiny/small
• < 64KB in cloud computing data centers
• < 64MB in super computing environments
(64MB is the default block size for Google File System)

HPC is growing fast

bigger & bigger clusters
• # of app processes
• metadata size
• # of metadata ops

Tomorrow we will have EXASCALE computing facilities
more intensive METADATA WORKLOADS

Metadata: eventually a huge problem!!

Will existing PFS deliver sufficient perf?

[ METADATA ] NO!!

GOAL: PARALLEL DATA/METADATA

Middleware Design

[Figure: Parallel Scientific Applications send metadata ops to the middleware; data storage and metadata storage sit in the Underlying Storage Infrastructure [Object Storage/Parallel File System]]

[Figure: inside a Parallel Scientific Application, a Client Proc and a Private Server; metadata operations travel over the fast interconnect to the Primary Server; data/metadata storage in the Underlying Storage Infrastructure [Object Storage/Parallel File System]]

Enables metadata to be potentially served from compute nodes

Client-funded File System Metadata Architecture

Agenda

1. Metadata Representation
2. Bulk Insertion

1. Metadata Representation

Block-based Metadata

UNIX Model

[Figure: on-disk layout: superblock, inode map, data block map, inode blocks, data blocks]

inode: id=157, size=4096, type=[directory], time=2015-07-27

directory entry list
    [..] -> 132
    [.] -> 157
    zhengq -> 158
    kair -> 159
    garth -> 160
    bws -> 161

inode: id=161, size=64, type=[file], time=2015-07-27

• file creates -> disk seeks
• linear directory entry search cost
• zero per-directory concurrency

Table-based Metadata

KEY = parent_id + hash(fname), VALUE = an embedded inode + fname

Example namespace

    ROOT (id=0)
        proj (id=1)
            batchfs (id=5)
        src (id=2)
            fs.h
            fs.c

Ordered KV pairs

    key             value
    0,h(src)        id=2, type=dir, fname=src
    0,h(proj)       id=1, type=dir, fname=proj
    1,h(batchfs)    id=5, type=dir, fname=batchfs
    2,h(fs.h)       id=3, type=file, fname=fs.h
    2,h(fs.c)       id=4, type=file, fname=fs.c

readdir "[ROOT]": the rows whose key starts with 0
readdir "/src": the rows whose key starts with 2

A large distributed sorted directory entry table with embedded inodes
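The key layout above can be sketched in C. This is only an illustration: the 16-byte big-endian encoding, the FNV-1a hash standing in for hash(fname), and the helper names make_key, key_cmp and key_in_dir are all assumptions, not details taken from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 8-byte big-endian parent_id, then 8-byte big-endian name hash. */
enum { KEY_LEN = 16 };

/* FNV-1a, used here only as a stand-in for the unspecified hash(fname). */
static uint64_t name_hash(const char *fname) {
    uint64_t h = 14695981039346656037ULL;
    for (const unsigned char *p = (const unsigned char *)fname; *p; ++p) {
        h ^= *p;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Store v most-significant byte first, so byte order matches numeric order. */
static void put_be64(unsigned char *out, uint64_t v) {
    for (int i = 7; i >= 0; --i) {
        out[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* KEY = parent_id + hash(fname) */
static void make_key(unsigned char key[KEY_LEN], uint64_t parent_id, const char *fname) {
    assert(fname != NULL);
    put_be64(key, parent_id);
    put_be64(key + 8, name_hash(fname));
}

/* Byte-wise comparison orders keys by (parent_id, hash). */
static int key_cmp(const unsigned char a[KEY_LEN], const unsigned char b[KEY_LEN]) {
    return memcmp(a, b, KEY_LEN);
}

/* readdir test: does this key belong to the directory with the given id? */
static int key_in_dir(const unsigned char key[KEY_LEN], uint64_t parent_id) {
    unsigned char prefix[8];
    put_be64(prefix, parent_id);
    return memcmp(key, prefix, 8) == 0;
}
```

Because all entries of one directory share the same key prefix, they sit next to each other in the sorted table, so a readdir becomes a single range scan.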

Table Representation
Log-structured Merge Trees [LSM]

A collection of B-trees at different levels

[Figure: create file/directory -> in-mem B-Tree (Level-0) -> merge -> Level-1 -> merge -> Level-2; each level holds more k/v pairs than the one before]

• level-0 always sits in memory
• level-0 FULL: merge level-0 into level-1
• level-1 FULL: merge partial level-1 into level-2

• convert random disk I/O into sequential I/O
• avoids disk seeks
(optimized for K/V insertion)
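A minimal sketch of the merge step between two levels, assuming each level is a run sorted by key with at most one record per key. The struct kv type and the merge_runs name are hypothetical, and keeping only the record with the higher sequence number on a key collision is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative record: an integer key plus the sequence number of the write. */
struct kv { uint64_t key; uint64_t seq; };

/* Merge two key-sorted runs into out (capacity na + nb).
 * On equal keys, keep the record with the higher seq.
 * Returns the number of records written. */
static size_t merge_runs(const struct kv *a, size_t na,
                         const struct kv *b, size_t nb, struct kv *out) {
    size_t i = 0, j = 0, k = 0;
    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i].key < b[j].key) {
            out[k++] = a[i++];
        } else if (b[j].key < a[i].key) {
            out[k++] = b[j++];
        } else {
            out[k++] = (a[i].seq > b[j].seq) ? a[i] : b[j];
            ++i;
            ++j;
        }
    }
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

Both inputs are read front to back and the output is written front to back, which is why a merge costs sequential I/O only.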

LSM - Updates

Convert K/V updates to K/V insertion operations

chmod("/proj/batchfs", …)

    1,h(batchfs)    perm=xxx, fname=batchfs, seq=245
    1,h(batchfs)    perm=yyy, fname=batchfs, seq=361

seq 361 > 245: no write in-place

LSM - Deletions

Convert K/V deletions to K/V insertion operations

rmdir("/proj/batchfs", …)

    1,h(batchfs)    live=true, fname=batchfs, seq=245
    1,h(batchfs)    live=false, fname=batchfs, seq=361

seq 361 > 245: no explicit deletion

1. immutable data structure
2. snapshotting a file system image is trivial
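The update and deletion rules can be sketched as a lookup that returns the record with the highest sequence number and treats live=false as a tombstone. The struct record layout, the integer key, and the lsm_lookup name are assumptions for illustration; a real table would search sorted levels rather than scan an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative record mirroring the perm/live/seq fields on the slides. */
struct record {
    uint64_t key;    /* stands in for the (parent_id, hash(fname)) key */
    uint64_t seq;    /* larger seq = more recent write */
    int live;        /* 0 marks a deletion (tombstone) */
    unsigned perm;
};

/* Return the newest record for key, or NULL if the key is absent or deleted. */
static const struct record *lsm_lookup(const struct record *log, size_t n, uint64_t key) {
    const struct record *newest = NULL;
    assert(log != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (log[i].key == key && (newest == NULL || log[i].seq > newest->seq)) {
            newest = &log[i];
        }
    }
    return (newest != NULL && newest->live) ? newest : NULL;
}
```

Older records are never modified, so a reader that ignores every record above a chosen sequence number sees a consistent snapshot.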

LSM - Storage

[Figure: the namespace is represented as an LSM-Tree, which is formatted as table files T1, T2, T3, T4 (e.g. 32MB each) in the Underlying Storage Infrastructure [Object Storage/Parallel File System]]

• Pack metadata into large files
• Reuse data path to deliver scalable metadata

Experiments

Each client process creates 1 private directory and inserts a set of empty files into that directory
(CHECKPOINT WORKLOAD)

Hadoop File System (HDFS) Cluster
• 1 NameNode [metadata node]
• 8 DataNodes
• Each node has two CPUs, 8GB RAM, one SATA HDD, and one 1Gb Ethernet port

The original Hadoop file system gives 600 op/s

Experiment Settings

[Figure: HDFS Name Node; 1 BatchFS Server; HDFS Data Nodes, each with a DISK and 1-8 BatchFS clients]

1 million files inserted without bulk insertion

HDFS Baseline vs. BatchFS

Throughput (K op/s)

    client processes    8      16     32     64
    HDFS baseline       0.6    0.6    0.6    0.6
    BatchFS             11     13     13     12
    speedup             20X    20X    20X    20X

Efficient Metadata Representation

2. Bulk Insertion

Traditional Model

[Figure: a Parallel Scientific Application (320K client processes) calls mkdir(), create() on a Dedicated Metadata Server (the bottleneck); the server writes tree files T1 T2 T3 T4 to on-disk namespace storage in the Shared Underlying Storage Infrastructure]

• Sync. interface
• Strongly consistent

1. Dedicated service doesn’t work in exascale
2. Traditional model is overkill for scientific applications

Bulk Insertion

[Figure: Parallel Scientific Application, Dedicated Metadata Server, and on-disk namespace storage holding tree files T1 T2 T3 T4; the client's metadata mutations add T5 T6]

(1) write tree files: mkdir(), create() go via private servers, and the client's metadata mutations are written out as tree files (T5, T6)

(2) bulk submit: the server finishes execution as easily as picking up all submitted tree files

Similar to database pre-loading: data inserted via a low-level protocol instead of SQL

1. More efficient h/w utilization
2. Fewer calls to dedicated servers: more scalable metadata

Concurrency Control

Total ordering of mutations from different clients

    client1: chmod("/proj", …)          client2: rmdir("/proj", …)
    client1: chmod("/proj", …)          client2: chmod("/proj", …)
    client1: mkdir("/proj", …)          client2: mkdir("/proj", …)
    client1: rename("/proj", "/a")      client2: rename("/proj", "/b")

Optimistic Locking

[Figure: a BatchFS Client moves through BOOTSTRAP, SNAPSHOT, SUBMIT, CHECK/MERGE; the namespace (ROOT, proj, batchfs, src, fs.h, fs.c) is snapshotted, the client's copy gains checkpoint/ck1, and the result is checked and merged back]

Similar to source code control (github/svn)
Except there is no data copying (we do copy-by-ref)

Fundamental Assumption
Scientific applications rarely produce conflicts

Phase 1: Branching
Client instantiates a private namespace from a global snapshot

[Figure: Client calls snapshot(…) on the global namespace (global branch, KV pairs in T1 T2 T3) to get a global snapshot, applies mkdir(…), chmod(…) in the client's private branch, then calls bulk_insert(…)]

Phase 2: Merging
Server picks up and schedules a check on client’s metadata mutations

[Figure: after bulk_insert(…) the global namespace holds T1 T2 T3 T4 T5; Client2 can already open(…) the new entries]

tentatively accepted, subject to future rejection

Phase 3: Verification

[Figure: client’s metadata mutations (T5, T6, T7) are read by the SST Interpreter as a metadata operation log view (Log) and go through soft re-execution against the global namespace (T1 T2 T3 T4); after conflict resolution, COMMIT yields T5 T6 T7 T8]

concurrent updates that mostly don’t produce conflicts
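The check performed at merge time can be sketched as follows, assuming a conflict means that a key the client mutated has changed in the global namespace after the client took its snapshot. The slides do not spell out the actual rule, so this definition, the struct entry type, and the count_conflicts name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative global table entry: a key and the seq of its latest change. */
struct entry { uint64_t key; uint64_t seq; };

/* Latest sequence number recorded for key, or 0 if the key is absent. */
static uint64_t latest_seq(const struct entry *global, size_t n, uint64_t key) {
    uint64_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        if (global[i].key == key && global[i].seq > best) {
            best = global[i].seq;
        }
    }
    return best;
}

/* Count client mutations whose key changed globally after the client's snapshot. */
static size_t count_conflicts(const struct entry *global, size_t n_global,
                              const uint64_t *mutated_keys, size_t n_mut,
                              uint64_t snapshot_seq) {
    size_t conflicts = 0;
    assert(mutated_keys != NULL || n_mut == 0);
    for (size_t i = 0; i < n_mut; ++i) {
        if (latest_seq(global, n_global, mutated_keys[i]) > snapshot_seq) {
            ++conflicts;
        }
    }
    return conflicts;
}
```

A client that only writes into its own private directory never touches a key another client changes, so under the deck's fundamental assumption the count is normally zero and the batch commits as is.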

Experiments

Previous Setting

[Figure: HDFS Name Node; 1 BatchFS Server; HDFS Data Nodes, each with a DISK and 1-8 BatchFS clients]

1 million files inserted without bulk insertion

New Setting

[Figure: same cluster layout as the previous setting]

8 million files inserted with bulk insertion

No vs. w/ Bulk Insertion

Throughput (K op/s)

    client processes              8      16     32     64
    HDFS baseline                 0.6    0.6    0.6    0.6
    BatchFS, no bulk insertion    11     13     13     12
    BatchFS, w/ bulk insertion    139    188    203    216
    speedup from bulk insertion   8X     15X    15X    18X

Bulk Insertion - 20X * 18X = 360X faster than HDFS
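The 360X figure can be checked from the chart values at 64 client processes. The constants below are those values converted from K op/s to op/s; the speedup helper is a hypothetical name, and it uses integer division.

```c
#include <assert.h>

/* Throughput at 64 client processes, in op/s, as read off the two charts. */
enum {
    HDFS_OPS = 600,            /* 0.6 K op/s */
    BATCHFS_OPS = 12000,       /* 12 K op/s, no bulk insertion */
    BATCHFS_BULK_OPS = 216000  /* 216 K op/s, with bulk insertion */
};

/* Ratio of two throughputs (integer division). */
static long speedup(long faster_ops, long slower_ops) {
    assert(slower_ops > 0);
    return faster_ops / slower_ops;
}
```

With these inputs, speedup(BATCHFS_OPS, HDFS_OPS) is 20 and speedup(BATCHFS_BULK_OPS, BATCHFS_OPS) is 18, and the end-to-end ratio speedup(BATCHFS_BULK_OPS, HDFS_OPS) is 360, matching 20X * 18X.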

Client-funded File System Metadata Architecture

Agenda

1. Metadata Representation
2. Bulk Insertion

Why is FS slow?

• Inefficient metadata representation
• At least one RPC per operation
• Synchronous metadata interface
• Pessimistic concurrency control
• Dedicated authorization service

Client-funded HPC

Exascale PFS architecture
• Move metadata computation from servers to apps
• Better h/w utilization
• FS scales w/ # of clients

[Figure: App1, App2, App3 on compute nodes pre-execute metadata ops privately; per-batch synchronization with the Primary Metadata Server, which is not in the critical path; Underlying Storage below]

Apps have long had rich h/w resources
Now they can buy themselves scalable metadata

Future Work

Implementation

Metadata Traces

Reference

Scaling the File System Control Plane with Client-Funded Metadata Servers (PDSW14)

Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion (SC14)


QUESTIONS