Design of an Active Storage Cluster File System for DAG Workflows. Patrick Donnelly and Douglas Thain, University of Notre Dame. 18 November 2013, DISCS-2013.


Page 1:

Design of an Active Storage Cluster File System for DAG Workflows

Patrick Donnelly and Douglas Thain

University of Notre Dame
18 November 2013

DISCS-2013

Page 2:

Task-based Workflow Engines

8/15/2013

part1 part2 part3: input.data split.py
	./split.py input.data
out1: part1 mysim.exe
	./mysim.exe part1 > out1
out2: part2 mysim.exe
	./mysim.exe part2 > out2
out3: part3 mysim.exe
	./mysim.exe part3 > out3
result: out1 out2 out3 join.py
	./join.py out1 out2 out3 > result

• Works on: Work Queue, SGE, Condor, Local

• Systems similar to Makeflow: Pegasus, Condor’s DAGMan, Dryad

Page 3:

Today’s (DAG-structured) Workflows

Page 4:

Big Data is “hard”

(Diagram: two deployment scenarios.)

Scenario A: Master-Worker. A master node distributes a 1 TB dataset directly to many workers.

Scenario B: Distributed File System. A workflow management system (WMS) dispatches tasks to a cloud and/or grid, and the workers read their data from a DFS.

Page 5:

Data Size Increases? Turn to a DFS

Distributing task dependencies over the network becomes costly as data size grows, and many datasets are simply too large to be distributed from a master node.

What is used? NFS, AFS, Ceph, PVFS, GPFS: generic POSIX-compliant cluster file systems.

Problem: Data locality is hard for the workflow manager to achieve.
  – Contributing factor: the file system offers no interface for locating the storage nodes that hold a file.
Problem: Parallel applications accessing data are a denial-of-service waiting to happen (the herd effect).
  – Contributing factor: maintaining POSIX semantics.


Page 6:

Other Options?

Specialized Workflow Abstractions


Page 7:

Map-Reduce

Source: developers.google.com

Page 8:

Distributed File Systems -- Specialized

Hadoop Distributed File System: a specialized cluster file system for executing Map-Reduce.

#include "hdfs.h"

hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
tSize nwritten = hdfsWrite(fs, writeFile, "hello", 6);
hdfsCloseFile(fs, writeFile);

Page 9:

Source: hadoop.apache.org

Page 10:

Task-based Workflows on Hadoop?

• Whole-file access by a single task is inefficient.
• Hadoop job execution is not built for single tasks.
• Jobs express only single-file dependencies.

(Cartoon: Hadoop the Elephant vs. your DAG workflow.)

Page 11:

DAG Execution on Hadoop

(Diagram: Makeflow reads w.makeflow and submits each batch job through Work Queue to Hadoop; the Hadoop NameNode dispatches the work to the Hadoop DataNodes.)

Job 1234
Map: ./split.py input.data
Map Input: input.data
Reduce: <none>

Page 12:

Hadoop Job Throughput

SWEET’12: Makeflow: Portable Workflow Management for Distributed Computing, Albrecht et al.

Page 13:

Summary: Running Workflows on Large Datasets is Hard

Today, users have two solutions:
• Use a generic POSIX distributed file system.
  – Problem: data locality is hard for workflow managers to achieve.
  – Problem: parallel applications accessing data are a denial-of-service waiting to happen.
• Use a specialized file system that executes a specific workflow abstraction.
  – Problem: users must rewrite applications to fit the workflow pattern (abstraction).
  – Problem: task-based workflows run inefficiently.


Page 14:

Observations on DAG-Structured Workflows

1. Scientific workflows often re-use large datasets in multiple workflows.

2. Metadata interactions occur at task start/end.

3. Tasks consume whole files.


Page 15:

Cluster File System Overview

We have designed Confuga, an active storage cluster file system purpose-built for running task-based workflows.

Distinguishing Features
• Data-locality-aware scheduling with multiple dependencies.
• Drop-in replacement for other compute engines.
• Consistency maintained at task boundaries.


Page 16:

Confuga: An Active Storage Cluster File System

(Diagram: a single MDS with multiple storage nodes S1, S2, S3. The MDS hosts a Replica Manager (RM) and a Namespace Manager (NM); files F1, F2, F3 are spread across the storage nodes.)

Replica Manager: file granularity
F1: S1, S2
F2: S1, S3
F3: S2

Namespace Manager: a regular directory hierarchy in which files point to file identifiers
/
|__ readme.txt --> F1
|__ users/
    |__ patrick/
        |__ blast.db --> F2
        |__ blast    --> F3

Page 17:

Replica Manager

• Files are indexed using content-addressable storage.
• Tasks:
  – Ensure sufficient replication of files, restriping the cluster as necessary.
  – Garbage-collect extra, unneeded replicas.

(Diagram: the MDS’s Replica Manager maps file identifiers to storage nodes.)

F1: S1, S2
F2: S1, S3
F3: S2

Example: the identifier SHA1: abcdef123456789 resolves to a replica on s2.cluster.nd.edu.

Page 18:

Namespace Manager

• Maintains a mirror file system layout on the head node.
  – Regular files hold file identifiers (checksums).
• Acts as the global synchronization point for file system updates.

(Diagram: the Namespace Manager on the MDS.)

/
|__ readme.txt --> F1
|__ users/
    |__ patrick/
        |__ blast.db --> F2
        |__ blast    --> F3

Page 19:

Job Scheduler

(Diagram: storage nodes S1 holding F1 and F2, S2 holding F1 and F3, S3 holding F3 and F2, plus the head node.)

Job Description
command: "blast blast.db > out",
inputs: { blast: "/users/patrick/blast", blast.db: "/users/patrick/blast.db" },
outputs: { out: "/users/patrick/out" }

Step 1: submit job
Step 2: copy F3 to S3
Step 3:
Step 4: execute T1
Step 5: result T1

Page 20:

Job Scheduler: Task Namespace

• Context-free execution.
• Atomic, side-effect-free commits.

Task 1
command: "blast blast.db > out",
namespace:
  blast.db: F2
  blast: F3
  out: OUTPUT

drwxr-x--- 2 user users 4K 8:00 .
lrwxrwxrwx 1 user users 49 8:00 blast.db -> ../store/F2
lrwxrwxrwx 1 user users 49 8:00 blast -> ../store/F3
-rw-r----- 1 user users  0 8:00 out

$ ./blast blast.db > out

Task 1 Result
exit status: 0
namespace:
  out: Fout

(Diagram: storage node S3 with files F2, F1, Fout.)

Page 21:

Exploiting DAG-structured Workflow Semantics

(Diagram: a user machine interacting with a storage node under each model.)

POSIX: open, write, close, with each operation sent to the storage node.
AFS (commit-on-close): open, write, then flush + close.

Page 22:

read-on-exec/commit-on-exit

(Diagram: under POSIX, every task's open/write/close and open/read/close sequences each go to the storage node. Under read-on-exec/commit-on-exit, a task's inputs arrive in one "open + read + close" at dispatch and its outputs leave in one "open + write + close" at exit.)

• Eliminates inter-task synchronization requirements.
• Batches metadata operations.

Page 23:

Why Confuga?

• Integrates cleanly with current DAG workflows.
  – The task namespace encapsulates data dependencies.
  – Writes and reads resolve at task boundaries.
• A global namespace allows data sharing and workflow checkpointing.
• Tasks can express multiple dependencies.
• Unnecessary metadata interactions are minimized.

Page 24:

Feature Comparison

(Table: Workflow on DFS, Hadoop, and Confuga compared on data locality, metadata scaling, large-file support, application as abstraction, and task-based workflows.)

Page 25:

Implementation: Confuga Using Chirp

• Why Chirp?
  – Most of Confuga can be implemented using a (slightly modified) remote file server.
  – It interoperates with existing distributed computation tools.

(Diagram: a Chirp server layers network RPC and ACL policy over a local file system; user applications reach it through libchirp, network RPC, or FUSE.)

Page 26:

Extending Chirp

(Diagram: three server stacks. Standard Chirp: libchirp/FUSE clients -> Chirp RPC -> quota + ACL -> local FS. Confuga storage node: the same stack with a Job component added. Confuga head node: adds the Scheduler, Namespace Manager, and Replica Manager on top of the Chirp file system, which clients reach through libchirp as the Confuga FS.)

Page 27:

Concluding Thoughts

• Smart adaptation to workflow semantics allows the file system to reduce metadata operations and to minimize cluster synchronization steps.

• The task namespace is explicit in the job description, allowing the file system to schedule tasks near multiple dependencies.


Page 28:

Questions?

Patrick Donnelly ([email protected])
Douglas Thain ([email protected])

Have a challenging distributed systems problem?
Visit our lab at http://www.nd.edu/~ccl/ !

Source Code: http://www.github.com/cooperative-computing-lab/cctools


Page 29:

Why content-addressable storage?


Page 30:

Batch Job Submission

Integrating Chirp with Makeflow


(Diagram: today Makeflow stages files such as ./exe, ./data/a.db, and ./data/b.db from the local FS, using submit/put/get against SGE, Condor, or Work Queue. With Chirp, Makeflow instead issues stat and submit calls against a Chirp server that holds the same namespace under an ACL policy.)

• Requires Makeflow to abstract access to the workflow namespace.
• Chirp needs to support a job submission interface.

Page 31:

Rearchitect Chirp’s multi interface

• multi volume management in Chirp:
  – Files are striped round-robin across a static set of nodes.
  – No replication.
  – Locating a file requires traversal of the namespace.
  – Access is not provided by Chirp itself.

(Diagram: a client using the multi library contacts the Chirp "head node", which stores ./volume/hosts and the namespace entry ./volume/root/users/pdonnel3/blast --> S1:/abcd; the data itself lives as ./abcd on Chirp server S1.)

Page 32:

Changes to Chirp

• Two services within the Confuga back-end file system:
  – Replica Manager
  – Namespace Manager

Page 33:

Publications

• Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop, 2010 IEEE CloudCom.

• Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids, 2012 SWEET at ACM SIGMOD.

• Fine-Grained Access Control in the Chirp Distributed File System, 2012 IEEE CCGrid.


Page 34:

Map-Reduce

8/15/2013

• HDFS architecture is influenced by Map-Reduce:
  – Block-oriented; no whole-file access.
  – No multiple-file access.

Source: yahoo.com