Design of an Active Storage Cluster File System for DAG Workflows
Patrick Donnelly and Douglas Thain
University of Notre Dame, November 18th, 2013
DISCS-2013
Task-based Workflow Engines
part1 part2 part3: input.data split.py
    ./split.py input.data

out1: part1 mysim.exe
    ./mysim.exe part1 > out1

out2: part2 mysim.exe
    ./mysim.exe part2 > out2

out3: part3 mysim.exe
    ./mysim.exe part3 > out3

result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result

Each rule lists its outputs, then its inputs, then the command that produces the former from the latter; the engine runs rules as their dependencies become available.
• Works on: Work Queue, SGE, Condor, Local
• Systems similar to Makeflow: Pegasus, Condor's DAGMan, Dryad
Today’s (DAG-structured) Workflows
Big Data is “hard”
(Diagram: Scenario A, Master-Worker: a master node distributes a 1 TB dataset directly to many workers. Scenario B, Distributed File System: a workflow management system (WMS) dispatches tasks to a cloud and/or grid whose nodes share a DFS.)
Data Size Increases? Turn to a DFS

Distributing task dependencies over the network becomes costly as data sizes increase; many datasets are simply too large to be distributed from a master node. The usual answer is a generic POSIX-compliant cluster file system: NFS, AFS, Ceph, PVFS, or GPFS.

Problem: data-locality is hard for the workflow to achieve.
  Contributing factor: the file system offers no interface for locating the storage nodes that hold a file.
Problem: parallel applications accessing data are a denial-of-service waiting to happen (the herd effect).
  Contributing factor: maintaining POSIX semantics.
Other Options?
Specialized Workflow Abstractions
Map-Reduce
(Map-Reduce diagram; source: developers.google.com)
Distributed File Systems -- Specialized
Hadoop Distributed File System: a specialized cluster file system for executing Map-Reduce.
#include "hdfs.h"

/* connect to the configured default NameNode; the path is illustrative */
hdfsFS fs = hdfsConnect("default", 0);
const char *writePath = "/tmp/hello.txt";
hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
tSize nwritten = hdfsWrite(fs, writeFile, "hello", 6);
hdfsCloseFile(fs, writeFile);
hdfsDisconnect(fs);
(HDFS architecture diagram; source: hadoop.apache.org)
Task-based Workflows on Hadoop?
• Single-task whole-file access is inefficient.
• Hadoop job execution is not built for single tasks.
• Jobs express only single-file dependencies.
(Cartoon: Hadoop the Elephant and your DAG workflow.)
DAG Execution on Hadoop
(Diagram: Makeflow reads w.makeflow and submits batch jobs through its batch job interface (Work Queue, Hadoop) to the Hadoop NameNode, which runs them on the Hadoop DataNodes. Each rule becomes a map-only job, e.g. Job 1234 with Map: ./split.py input.data, Map Input: input.data, Reduce: <none>.)
Hadoop Job Throughput
SWEET’12: Makeflow: Portable Workflow Management for Distributed Computing, Albrecht et al.
Summary: Running Workflows on Large Datasets is Hard
Today, users have two solutions:
• Use a generic POSIX distributed file system.
  – Problem: data-locality is hard for workflow managers to achieve.
  – Problem: parallel applications accessing data are a denial-of-service waiting to happen.
• Use a specialized file system that executes a specific workflow abstraction.
  – Problem: users must rewrite applications to fit the workflow pattern (abstraction).
  – Problem: task-based workflows run inefficiently.
Observations on DAG-Structured Workflows
1. Scientific workflows often re-use large datasets in multiple workflows.
2. Metadata interactions occur at task start/end.
3. Tasks consume whole files.
Cluster File System Overview
We have designed Confuga, an active storage cluster file system purpose-built for running task-based workflows.
Distinguishing features:
• Data-locality-aware scheduling with multiple dependencies.
• Drop-in replacement for other compute engines.
• Consistency maintained at task boundaries.
Confuga: An Active Storage Cluster File System
(Diagram: a single MDS coordinating multiple storage nodes S1, S2, and S3; replicas of F1, F2, and F3 are spread across the nodes.)

Replica Manager (RM): file granularity.
F1: S1, S2
F2: S1, S3
F3: S2

Namespace Manager (NM): a regular directory hierarchy whose files point to file identifiers.
/
|__ readme.txt --> F1
|__ users/
    |__ patrick/
        |__ blast.db --> F2
        |__ blast --> F3
Replica Manager
• Files are indexed using content-addressable storage.
• Tasks:
  – Ensure sufficient replication of files, restriping the cluster as necessary.
  – Garbage-collect extra, unneeded replicas.
(Diagram: the MDS's Replica Manager maps identifiers to storage nodes, F1: S1, S2; F2: S1, S3; F3: S2. A file's identifier is its content checksum, e.g. SHA1: abcdef123456789, and replicas live on hosts such as s2.cluster.nd.edu.)
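The checksum step can be sketched with OpenSSL's SHA1 interface. This is a minimal illustration of content addressing, not Confuga's actual code; file_identifier is a made-up helper.

/* Minimal sketch of content addressing with OpenSSL SHA1 (illustrative). */
#include <openssl/sha.h>
#include <stdio.h>

/* Hash a file's bytes; the hex digest becomes its identifier. */
int file_identifier(const char *path, char hex[2 * SHA_DIGEST_LENGTH + 1])
{
    unsigned char buf[8192], digest[SHA_DIGEST_LENGTH];
    SHA_CTX ctx;
    size_t n;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    SHA1_Init(&ctx);
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        SHA1_Update(&ctx, buf, n);
    fclose(f);
    SHA1_Final(digest, &ctx);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(&hex[2 * i], "%02x", digest[i]);
    return 0;
}

int main(int argc, char **argv)
{
    char hex[2 * SHA_DIGEST_LENGTH + 1] = "";
    if (argc > 1 && file_identifier(argv[1], hex) == 0)
        printf("%s: SHA1 %s\n", argv[1], hex);
    return 0;
}

Because a replica's name is its checksum, identical files deduplicate naturally and any node can verify a replica against its own name.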
Namespace Manager
• Maintains a mirror of the file system layout on the head node.
  – Regular files hold file identifiers (checksums).
• Acts as the global synchronization point for file system updates.
(Diagram: the MDS's Namespace Manager holding the directory tree shown earlier, each file pointing to its identifier.)
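A toy illustration of "regular files hold file identifiers", assuming a namespace/ directory on the head node; none of these paths are Confuga's real layout.

/* Hypothetical sketch: a namespace entry is an ordinary file whose
 * contents are the identifier (checksum) it points to, so ordinary
 * file operations can read and update it. */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    mkdir("namespace", 0750);

    /* "readme.txt --> F1": record the identifier as file content */
    FILE *f = fopen("namespace/readme.txt", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("F1\n", f);
    fclose(f);

    /* resolving the entry is just reading it back */
    char id[64] = "";
    f = fopen("namespace/readme.txt", "r");
    if (f && fgets(id, sizeof id, f)) {
        id[strcspn(id, "\n")] = '\0';
        printf("readme.txt resolves to replica %s\n", id);
    }
    if (f)
        fclose(f);
    return 0;
}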
Job Scheduler
(Diagram: storage nodes S1 {F1, F2}, S2 {F1, F3}, S3 {F3, F2} and the head node.)

Job description:
{
  "command": "blast blast.db > out",
  "inputs":  { "blast": "/users/patrick/blast", "blast.db": "/users/patrick/blast.db" },
  "outputs": { "out": "/users/patrick/out" }
}

Step 1: submit job.
Step 2: copy F3 to S3.
Step 3: ...
Step 4: execute T1.
Step 5: return the result of T1.
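The placement decision behind step 2 can be read as a simple maximization: run the task where the most input data already lives, then copy the rest. A hypothetical sketch follows; node names, sizes, and structures are illustrative, not Confuga's scheduler.

/* Hypothetical sketch of data-locality-aware placement with multiple
 * dependencies: pick the node that already holds the most input bytes. */
#include <stdio.h>

#define NODES  3
#define INPUTS 2

/* resident[n][i]: MB of input i already on storage node n.
 * Inputs: 0 = blast.db (F2, large), 1 = blast (F3, small). */
static const long resident[NODES][INPUTS] = {
    /* S1 */ { 900,  0 },   /* holds F2 */
    /* S2 */ {   0, 50 },   /* holds F3 */
    /* S3 */ { 900,  0 },   /* holds F2 */
};

int main(void)
{
    int best = 0;
    long best_mb = -1;
    for (int n = 0; n < NODES; n++) {
        long mb = 0;
        for (int i = 0; i < INPUTS; i++)
            mb += resident[n][i];
        if (mb > best_mb) {    /* ties broken by lowest index */
            best_mb = mb;
            best = n;
        }
    }
    printf("schedule on S%d (%ld MB of inputs already local)\n",
           best + 1, best_mb);
    return 0;
}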
Job Scheduler: Task Namespace
• Context-free execution.
• Atomic, side-effect-free commits.
Task 1:
  command: "blast blast.db > out"
  namespace:
    blast.db: F2
    blast: F3
    out: OUTPUT

The sandbox as the task sees it:
drwxr-x--- 2 user users 4K 8:00 .
lrwxrwxrwx 1 user users 49 8:00 blast.db -> ../store/F2
lrwxrwxrwx 1 user users 49 8:00 blast -> ../store/F3
-rw-r----- 1 user users  0 8:00 out

$ ./blast blast.db > out

Task 1 result:
  exit status: 0
  namespace:
    out: Fout
(Diagram: storage node S3 afterwards, holding replicas F2 and F1 plus the newly committed Fout.)
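One way such a sandbox could be materialized is with symlinks into the content-addressed store, matching the listing above. A minimal sketch under assumed paths (task1/, ../store/); not Confuga's code.

/* Hypothetical sketch: build a task sandbox whose inputs are symlinks
 * to immutable replicas, named by identifier. */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* sandbox directory for the task */
    if (mkdir("task1", 0750) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }
    /* inputs resolve to replicas in the store */
    symlink("../store/F2", "task1/blast.db");
    symlink("../store/F3", "task1/blast");
    /* the output starts as an empty regular file; on task exit it is
     * checksummed and committed to the store as a new replica (Fout) */
    FILE *out = fopen("task1/out", "w");
    if (out)
        fclose(out);
    return 0;
}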
Exploiting DAG-structured Workflow Semantics
(Diagram: a user machine talking to a storage node. Under POSIX, every open, write, and close is a synchronous round trip; under AFS-style commit-on-close, writes are buffered and reach the storage node at flush + close.)
read-on-exec/commit-on-exit
(Diagram: a user machine issuing repeated open + read + close and open + write + close sequences against a storage node. With read-on-exec/commit-on-exit, inputs are read once when the task starts and outputs are committed once when it exits, which eliminates inter-task synchronization requirements and batches metadata operations.)
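To make the contrast concrete, here is a hypothetical sketch of a task's life under read-on-exec/commit-on-exit: whole inputs are staged before the command runs, and the whole output is committed only after it exits, so no client-server file traffic happens mid-task. Paths mirror the earlier example; this is not Confuga's implementation.

/* Hypothetical read-on-exec/commit-on-exit lifecycle (illustrative). */
#include <stdio.h>
#include <stdlib.h>

/* one-pass whole-file copy, standing in for a file transfer */
static int transfer(const char *src, const char *dst)
{
    char buf[8192];
    size_t n;
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

int main(void)
{
    /* read-on-exec: every input arrives before the task starts */
    transfer("store/F2", "sandbox/blast.db");
    transfer("store/F3", "sandbox/blast");

    int rc = system("cd sandbox && ./blast blast.db > out");

    /* commit-on-exit: the output becomes visible only after success */
    if (rc == 0)
        transfer("sandbox/out", "store/Fout");
    return rc;
}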
Why Confuga?
• Integrates cleanly with current DAG workflows.
  – The task namespace encapsulates data dependencies.
  – Writes and reads resolve at task boundaries.
• A global namespace allows data sharing and workflow checkpointing.
• Tasks can express multiple dependencies.
• Unnecessary metadata interactions are minimized.
Feature Comparison
(Table: Workflow on DFS, Hadoop, and Confuga compared on data-locality, metadata scaling, large-file support, application-as-abstraction, and support for task-based workflows.)
Implementation: Confuga Using Chirp
• Why Chirp?
  – Most of Confuga can be implemented using a (slightly modified) remote file server.
  – Interoperates with existing distributed computation tools.
(Diagram: a user application reaches a Chirp server over network RPC, via libchirp or FUSE; the server enforces an ACL policy over a local FS.)
Extending Chirp
(Diagram: three server stacks built from the same components. Standard Chirp: Chirp RPC with quota and ACL layers over a local FS, reached via libchirp or FUSE. Confuga storage node: the standard stack plus a job component for executing tasks. Confuga head node: adds the Namespace Manager, Replica Manager, and Scheduler, exporting the Confuga FS through the same Chirp interfaces and driving the storage nodes via libchirp.)
Concluding Thoughts
• Smart adaptation to workflow semantics allows the file system to reduce metadata operations and to minimize cluster synchronization steps.
• Task namespace is explicit as part of the job description, allowing the file system to schedule tasks near multiple dependencies.
Questions?
Patrick Donnelly ([email protected])
Douglas Thain ([email protected])
Have a challenging distributed systems problem? Visit our lab at http://www.nd.edu/~ccl/ !
Source Code: http://www.github.com/cooperative-computing-lab/cctools
Why content-addressable-storage?
Integrating Chirp with Makeflow
(Diagram: two arrangements. Today, Makeflow works against a local FS holding ./exe, ./data/a.db, and ./data/b.db, and drives SGE, Condor, Work Queue, etc. with submit, put, and get. Proposed: Makeflow addresses the same files on a Chirp server with an ACL policy, using only stat and batch job submission.)

• Requires Makeflow to abstract access to the workflow namespace.
• Chirp needs to support a job submission interface.
Rearchitect Chirp’s multi interface
• multi volume management in Chirp:
  – Files are striped round-robin across a static set of nodes.
  – No replication.
  – Locating a file requires traversing the namespace.
  – Access is not provided by Chirp itself.
(Diagram: a client's multi library consults the Chirp "head node", whose ./volume/hosts and ./volume/root map a path such as users/pdonnel3/blast to a stripe like S1:/abcd, then fetches ./abcd directly from Chirp server S1.)
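The round-robin placement described above can be sketched in a few lines (host names and counts are made up). Note that nothing here records location or replicates data, which is why lookup requires walking the namespace.

/* Toy sketch of multi-style striping: stripe k lands on host k mod N. */
#include <stdio.h>

int main(void)
{
    const char *hosts[] = { "S1", "S2", "S3" };
    const int nhosts = sizeof hosts / sizeof hosts[0];
    for (int stripe = 0; stripe < 6; stripe++)
        printf("stripe %d -> %s\n", stripe, hosts[stripe % nhosts]);
    return 0;
}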
Changes to Chirp
• Two services within the Confuga back-end file system:
  – Replica Manager
  – Namespace Manager
Publications
• Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop, 2010 IEEE CloudCom.
• Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids, 2012 SWEET at ACM SIGMOD.
• Fine-Grained Access Control in the Chirp Distributed File System, 2012 IEEE CCGrid.
Map-Reduce
• HDFS architecture is influenced by Map-Reduce:
  – Block-oriented; no whole-file access.
  – No multiple-file access.
(Diagram source: yahoo.com)