Distributed Namespace Status Phase I - Remote Directories

Page 2: Distributed Namespace Status Phase I - Remote Directories

© 2012 Whamcloud, Inc.

Distributed Namespace Status
Phase I - Remote Directories

• Wang Di, Whamcloud, Inc.

Page 3: Distributed Namespace Status Phase I - Remote Directories

DNE Phase I - Remote Directory

• Subdirectories on a remote metadata target
• Scales MDT namespace, like OSTs can today
• Dedicated performance for users/jobs
• All MDTs can use any/all OSTs to create objects

[Diagram: example namespace spread across three MDTs. Labels under MDT0: root, file, bill. Labels under MDT1: rose, dir1, file. Labels under MDT2: frank, file, dir2.]

Lustre User Group 2012

Page 4: Distributed Namespace Status Phase I - Remote Directories

Remote Directory Implementation

• Remote directory creation by administrator only
  – Remote directory creation is a synchronous disk operation

    lfs mkdir -i {mdtidx} /path/to/remote_dir

• Files/subdirs created in remote dir stay on MDT
  – Local operations (create, unlink, open, close) at maximum performance
  – Limit RPCs that need to communicate with multiple MDTs
  – Simplifies implementation for initial deployment

Page 5: Distributed Namespace Status Phase I - Remote Directories

Remote Directory Limitations

• Failed/disabled MDT affects all of its subtrees
  – Accessing failed/disabled MDT will return EIO
  – Disabling MDT0 causes whole namespace to be inaccessible
• Remote directory can only be created in a parent directory on MDT0
  – Otherwise, failure of one MDT would isolate other MDTs
• Rename or link across MDTs returns -EXDEV
• Deliberate limitation of complexity
  – Limit testing, recovery, failure scenarios for initial deployment
  – Restrictions relaxed as experience is gained, or via override
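
Because rename across MDTs returns -EXDEV, an application that moves files between a remote directory and the rest of the namespace needs a fallback path. The following is a minimal sketch of one way to handle that from user space, assuming the error surfaces as an ordinary `OSError` with `errno.EXDEV`; the function name `move_with_exdev_fallback` and its `rename` parameter are hypothetical and exist only to make the sketch testable.

```python
import errno
import os
import shutil


def move_with_exdev_fallback(src, dst, rename=os.rename):
    """Rename src to dst; if the rename fails with EXDEV, copy then delete.

    The `rename` parameter is an illustration hook so the EXDEV path can be
    exercised without a real multi-MDT filesystem.
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        # Only the cross-device case is handled here; everything else propagates.
        if exc.errno != errno.EXDEV:
            raise

    # Fallback: copy the data to the destination, then remove the source.
    if os.path.isdir(src) and not os.path.islink(src):
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
    else:
        shutil.copy2(src, dst)
        os.unlink(src)
    return "copied"
```

The standard library's `shutil.move` already falls back to copying when `os.rename` fails, so in practice it covers the same case; the explicit version above only makes the EXDEV branch visible. Unlike a rename, the copy-and-delete path is not atomic.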

Page 6: Distributed Namespace Status Phase I - Remote Directories

Enable DNE on new/existing filesystem

• MDT disk format must use ldiskfs dir_data feature
  – Default for any 2.x formatted filesystem
  – Allows storing remote directory entry pointers
  – Enable on 1.x filesystems:

    tune2fs -O dir_data /dev/mdt0

• Upgrade clients, MGS, MDS, OSS to Lustre 2.4+
  – Not required to enable DNE when upgrading to Lustre 2.4+
  – Once DNE is enabled, downgrade to older Lustre difficult
    • requires copying/deleting all files not on MDT0
• Add new MDTs to running filesystem
  – Clients without DNE support evicted at this point
  – New MDTs only used once a remote directory entry is created

    mkfs.lustre --reformat --mgsnode={mgsnode} --mdt --index=N /dev/{mdtN}
    mount -t lustre /dev/{mdtN} /mnt/{mdtN}

Page 7: Distributed Namespace Status Phase I - Remote Directories

DNE Phase II - Shard/Stripe Directory

• Hash a single directory across multiple MDTs
• Multiple servers active for directory/inodes
• Improve performance for large directories

[Diagram: a single directory split into shards dir.0, dir.1, dir.2 and dir.3 across MDT0, MDT1, MDT2 and MDT3; one shard is labeled master and the other three slave. Example entries spread over the shards: car, cat, bob, dog, dale, bee, ale, ace.]
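
The slide says only that a directory is hashed across multiple MDTs; it does not name the hash function or the placement rule. As an illustration of the general idea, the sketch below assumes a 64-bit FNV-1a hash of the entry name taken modulo the number of shards. Both choices, and the names `fnv1a_64` and `shard_for`, are assumptions made for this example, not the Phase II design.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (xor each byte in, then multiply by the prime)."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def shard_for(name: str, stripe_count: int) -> int:
    """Pick the shard index (0 .. stripe_count-1) for a directory entry name.

    Assumption for illustration: placement is hash(name) mod stripe_count.
    """
    if stripe_count <= 0:
        raise ValueError("stripe_count must be positive")
    return fnv1a_64(name.encode("utf-8")) % stripe_count


if __name__ == "__main__":
    # Entry names taken from the slide's diagram; 4 shards as drawn there.
    for entry in ["car", "cat", "bob", "dog", "dale"]:
        print(entry, "-> dir.%d" % shard_for(entry, 4))
```

Any client that knows the shard count can compute the target shard from the name alone, which is what lets several servers stay active for one large directory.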

Page 8: Distributed Namespace Status Phase I - Remote Directories

Lustre File IDentifier (FID)

• Unique cluster-wide identifier for file/directory
  – Introduced in Lustre 2.0
  – Three components form object address {f_seq, f_oid, f_ver}
  – Large sequence range is allocated to each server
  – Sequences are large, so FIDs are never re-used
• FID Location Database (FLDB) maps FID->server
  – FLDB is known to all clients and servers
  – Kept small due to few sequence ranges
  – Sequence is looked up in FLDB to find MDT/OST index
• Object Index (OI) maps FID->inode on server
  – OI maps FID to local inode number

Sequence #    Object #    Version
64 bits       32 bits     32 bits

Page 9: Distributed Namespace Status Phase I - Remote Directories

DNE Master and Slave MDTs

• Client does filename lookup in parent directory
  – Root directory lives on MDT0
• Client maps FID to Master MDT via FLDB
  – If request only involves one MDT, same as current single MDT
• Some operations need to access Slave MDTs
  – Called cross-MDT operations
  – Master MDT forwards update(s) to other MDT(s) to finish the request
  – Create/unlink of a remote directory are the only cross-MDT operations today

[Diagram: the client consults the FLDB ("Get Master MDT for this operation") and sends its request to the Master MDT; the Master MDT communicates with Slave MDT2; a reply returns to the client.]
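
The FID-to-MDT step can be pictured as a lookup in a small table of sequence ranges. The sketch below is one way to model it, assuming the FLDB is a sorted list of half-open ranges `[start, end)` each owned by one MDT index; the class name `Fldb`, the range representation, and the sequence values are all hypothetical.

```python
import bisect


class Fldb:
    """Toy FID Location Database: maps a FID sequence number to an MDT index."""

    def __init__(self, ranges):
        # ranges: iterable of (start_seq, end_seq_exclusive, mdt_index),
        # assumed not to overlap.
        self._ranges = sorted(ranges)
        self._starts = [start for start, _end, _idx in self._ranges]

    def lookup(self, f_seq):
        """Return the MDT index whose range contains f_seq, or raise KeyError."""
        pos = bisect.bisect_right(self._starts, f_seq) - 1
        if pos >= 0:
            start, end, mdt_index = self._ranges[pos]
            if start <= f_seq < end:
                return mdt_index
        raise KeyError(f"sequence {f_seq:#x} not found in FLDB")


if __name__ == "__main__":
    # Made-up ranges: each MDT owns one large block of sequences.
    fldb = Fldb([
        (0x1000, 0x2000, 0),  # MDT0
        (0x2000, 0x3000, 1),  # MDT1
        (0x3000, 0x4000, 2),  # MDT2
    ])
    print("Master MDT for sequence 0x2abc:", fldb.lookup(0x2ABC))
```

With only a few large ranges, the table stays small enough for every client and server to hold, and a binary search keeps the lookup cheap.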

Page 10: Distributed Namespace Status Phase I - Remote Directories

DNE Operation

• Create Remote Directory

Page 11: Distributed Namespace Status Phase I - Remote Directories

Create Resend between MDTs

• Master MDT checks RPC XID against last_rcvd file
  – Determines whether the operation was committed to disk or not
  – Committed: Master MDT reconstructs RPC reply from last_rcvd entry
  – Uncommitted: Master MDT redoes creation
• Resend same directory creation RPC to Slave MDT using same FID
• Slave MDT checks if remote directory was created
  – Looks up FID requested by Master in local OI
  – Creates new subdirectory with FID if missing
  – Returns success to Master
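
The resend logic above relies on the Slave's create being safe to repeat with the same FID. The sketch below models that with in-memory dictionaries standing in for last_rcvd and the OI; the class names, the tuple used as a reply, and the `lose_uncommitted` helper that simulates a crash before commit are all hypothetical simplifications, not the on-disk formats.

```python
class CreateSlave:
    """Slave MDT: creates the remote directory object, keyed by FID."""

    def __init__(self):
        self.oi = {}          # FID -> local inode number (stand-in for the OI)
        self._next_ino = 100  # arbitrary starting inode number

    def create_dir(self, fid):
        # Look the FID up first; only create the directory if it is missing.
        if fid not in self.oi:
            self.oi[fid] = self._next_ino
            self._next_ino += 1
        return "success"


class CreateMaster:
    """Master MDT: owns the name entry and the last_rcvd record."""

    def __init__(self, slave):
        self.slave = slave
        self.last_rcvd = {}  # RPC XID -> saved reply (committed operations)
        self.names = {}      # directory entry name -> FID

    def create_remote_dir(self, xid, name, fid):
        if xid in self.last_rcvd:
            # Committed: reconstruct the reply without redoing the work.
            return self.last_rcvd[xid]
        # Uncommitted: redo the creation, reusing the same FID on the Slave.
        self.slave.create_dir(fid)
        self.names[name] = fid
        reply = ("created", name, fid)
        self.last_rcvd[xid] = reply
        return reply

    def lose_uncommitted(self, xid, name):
        """Simulate a Master crash before the operation reached disk."""
        self.last_rcvd.pop(xid, None)
        self.names.pop(name, None)
```

If the Master crashes after the Slave has already created the object, the client's resend takes the uncommitted path, and the Slave finds the FID in its OI and simply reports success, so no duplicate directory appears.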

Page 12: Distributed Namespace Status Phase I - Remote Directories

DNE Operation

• Unlink Remote Directory

Page 13: Distributed Namespace Status Phase I - Remote Directories

Unlink Resend between MDTs

• Master MDT checks RPC XID against last_rcvd file
  – Determines whether the operation was committed to disk or not
  – Committed: Master MDT reconstructs RPC reply from last_rcvd entry
  – Uncommitted: Master unlinks, deletes name, adds destroy log, etc.
• If Slave MDT fails during this process
  – llog sync thread on Master MDT will resend destroy to Slave MDT
  – Directory unlinks are idempotent, can be retried
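
The destroy-log retry can be sketched the same way. In the model below the Master removes the name, appends a destroy record, and a sync pass later replays any records the Slave has not yet acknowledged. The class names, the list standing in for the llog, and the `up` flag that simulates a failed Slave are hypothetical.

```python
class UnlinkSlave:
    """Slave MDT holding the remote directory objects."""

    def __init__(self, fids):
        self.objects = set(fids)
        self.up = True  # set to False to simulate a failed Slave MDT

    def destroy(self, fid):
        if not self.up:
            raise ConnectionError("Slave MDT unavailable")
        # Idempotent: destroying an object that is already gone is not an error.
        self.objects.discard(fid)


class UnlinkMaster:
    """Master MDT owning the name entries and the destroy log."""

    def __init__(self, slave, names):
        self.slave = slave
        self.names = dict(names)  # directory entry name -> FID
        self.destroy_log = []     # FIDs whose destroy is not yet acknowledged

    def unlink_remote_dir(self, name):
        # Delete the name and record the pending destroy in the log.
        fid = self.names.pop(name)
        self.destroy_log.append(fid)

    def sync_llog(self):
        """One pass of the sync thread: resend destroys, keep failures queued."""
        still_pending = []
        for fid in self.destroy_log:
            try:
                self.slave.destroy(fid)
            except ConnectionError:
                still_pending.append(fid)
        self.destroy_log = still_pending
        return len(still_pending)
```

A record leaves the log only after the Slave accepts the destroy, so a Slave failure at any point results in a later retry rather than an orphaned object, and a duplicate destroy is harmless.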

Page 14: Distributed Namespace Status Phase I - Remote Directories

Remote Directory Entry

• FID is packed into the name entry
• Each remote entry will have a local agent inode
• Real object (inode) on Remote MDT found via OI

Page 15: Distributed Namespace Status Phase I - Remote Directories

MDT Disk Layout

• Two directories (AGENT and REMOTE) added
• AGENT
  – Each remote entry has a local agent inode
  – Agent inodes located under /AGENT/MDTn, one for each remote MDT
• REMOTE
  – Remote directories on Slave MDT created under /REMOTE
• Keeps local disk filesystem consistent
• Allows efficient checking of cross links by LFSCK

Page 16: Distributed Namespace Status Phase I - Remote Directories

DNE High Availability

• Active-Active MDT failover available with DNE
  – Allows multiple MDTs to be exported from one MDS
  – Ensures file system remains available in face of MDS node failure
  – Prevents isolation of large parts of the filesystem

[Diagram: MDS1 and MDS2 connected to MDT1 and MDT2, with a "Take over" arrow.]

Page 17: Distributed Namespace Status Phase I - Remote Directories

Internal Architecture

Page 18: Distributed Namespace Status Phase I - Remote Directories

Early Test Results

• Testing done on LLNL Hyperion
  – 100 clients, 8 mount points
  – Separate directory per mount point
  – One stripe per file

Page 19: Distributed Namespace Status Phase I - Remote Directories

Thank You

• Wang Di, Whamcloud, Inc.