Towards a High-Performance and Scalable Storage System for Workflow Applications
Emalayan Vairavanathan
The University of British Columbia
Department of Electrical and Computer Engineering
Background: Workflow Applications
[Figure: modFTDock workflow]
• A large number of independent tasks work collectively on a problem
• Common characteristics
  • File-based communication
  • Large number of tasks
  • Large amount of storage I/O
  • Regular data access patterns
Background – ModFTDock on Argonne Blue Gene/P
[Figure: compute nodes, each running an application task with node-local storage, are driven by the workflow runtime engine and share a central storage system (e.g., GPFS, NFS)]
• Scale: 40960 compute nodes
• 1.2 M docking tasks
• File-based communication
• Large I/O volume
• I/O throughput < 1 MBps per core
Z. Zhang et al., SC'12
Background – Central Storage Bottleneck
[Figure: Montage workflow (512 BG/P CPU cores, GPFS)]
Contributions - Alleviating storage I/O bottleneck
• Intermediate Storage System
  • Designed and implemented a prototype
  • Integrated with workflow runtime
  • Evaluated with applications on BG/P
  The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System. S. Al-Kiswany, Emalayan Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu. Submitted - FAST '13.
• Workflow-aware Storage System
  • Identified new data access patterns
  • Studied the viability of a workflow-aware storage
  A Workflow-Aware Storage System: An Opportunity Study. Emalayan Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu. CCGRID '12. Acceptance rate: 27%.
  A case for Workflow-Aware Storage: An Opportunity Study using MosaStore. Emalayan Vairavanathan, S. Al-Kiswany, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu. Submitted - FGCS Journal.
• MosaStore Storage System
  • Experimental platform for other studies
  Predicting Intermediate Storage Performance for Workflow Applications. L. B. Costa, A. Barros, Emalayan Vairavanathan, S. Al-Kiswany, M. Ripeanu. Submitted - CCGRID '13.
Intermediate Storage System
[Figure: application tasks on the compute nodes, each with node-local storage, access a shared intermediate storage through a POSIX API; the workflow runtime engine stages data in from, and out to, the central storage system (e.g., GPFS, NFS)]
Opportunities:
• Underutilized resources on the compute nodes
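The stage-in / stage-out flow around the intermediate storage can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: both storage tiers are modeled as plain directories, and the function name `run_with_intermediate_storage` and the two tasks `dock` and `merge` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_intermediate_storage(central, intermediate, inputs, tasks, outputs):
    """Stage in, run every task against intermediate storage, stage out."""
    # Stage in: copy workflow inputs from central to intermediate storage.
    for name in inputs:
        shutil.copy(central / name, intermediate / name)
    # Tasks exchange files through intermediate storage with ordinary file I/O.
    for task in tasks:
        task(intermediate)
    # Stage out: only the final results are copied back to central storage.
    for name in outputs:
        shutil.copy(intermediate / name, central / name)


def dock(workdir):
    # Hypothetical task: reads the staged input, writes an intermediate file.
    text = (workdir / "input.txt").read_text()
    (workdir / "docked.tmp").write_text(text.upper())


def merge(workdir):
    # Hypothetical task: consumes the intermediate file, writes the result.
    text = (workdir / "docked.tmp").read_text()
    (workdir / "result.txt").write_text(text + "!")


def demo():
    """Run the two-task workflow with temporary directories as storage tiers."""
    with tempfile.TemporaryDirectory() as c, tempfile.TemporaryDirectory() as i:
        central, intermediate = Path(c), Path(i)
        (central / "input.txt").write_text("protein")
        run_with_intermediate_storage(
            central, intermediate, ["input.txt"], [dock, merge], ["result.txt"]
        )
        result = (central / "result.txt").read_text()
        leaked = (central / "docked.tmp").exists()
        return result, leaked
```

In this sketch the intermediate file `docked.tmp` never reaches central storage, which is the point of the design: only workflow inputs and final outputs cross the central storage boundary.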
Evaluation - modFTDock on Blue Gene/P
20-40% improvement
2x improvement
A Workflow-aware Storage System
[Figure: the workflow runtime engine deploys a shared intermediate storage on the compute nodes, schedules tasks, and stages data in and out of the central storage system (e.g., GPFS); application tasks with node-local storage access the intermediate storage through a POSIX API]
Opportunities
• Dedicated intermediate storage
• Exposing data location
• Regular data access patterns
Workflow-aware Intermediate Storage
Data Access Patterns in Workflow Applications
Pattern and the optimization it enables:
• Pipeline: locality and location-aware scheduling
• Broadcast: replication
• Reduce: collocation and location-aware scheduling
• Scatter and Gather: block-level data placement
Wozniak et al., PDSW'09; Katz et al., BlueWater; Shibata et al., HPDC'10
Data Access Patterns in ModFTDock
[Figure: ModFTDock workflow]
• Broadcast pattern
• Reduce pattern
• Pipeline pattern
Evaluation - Baselines
MosaStore, NFS, and node-local storage vs. workflow-aware storage
[Figure: application tasks with node-local storage on the compute nodes, a shared intermediate storage, and stage-in/out to the central storage system (e.g., GPFS, NFS); the compared configurations are MosaStore, NFS, local storage, and workflow-aware storage]
Evaluation - Platform
• Cluster of 20 machines: Intel Xeon 4-core 2.33-GHz CPU, 4-GB RAM, 1-Gbps NIC, and RAID-1 on two 300-GB 7200-rpm SATA disks
• Central storage NFS server: Intel Xeon E5345 8-core 2.33-GHz CPU, 8-GB RAM, 1-Gbps NIC, and six SATA disks in a RAID 5 configuration
The NFS server is better provisioned
Evaluation – Benchmarks and Application
• Synthetic benchmark

Workload   Pipeline                Broadcast      Reduce
Small      100 KB, 200 KB, 10 KB   100 KB, 1 KB   10 KB, 100 KB
Medium     100 MB, 200 MB, 1 MB    100 MB, 1 MB   10 MB, 200 MB
Large      1 GB, 2 GB, 10 MB       1 GB, 10 MB    100 MB, 2 GB

• Application and workflow run-time engine: Montage, modFTDock
Synthetic Benchmark - Pipeline
Average runtime for medium workload
Optimization: Locality and location-aware scheduling
3x improvement in workflow time
Synthetic Benchmarks - Broadcast
Optimization: Replication
Average runtime for medium workload on disk
60% improvement in the runtime
Evaluation – Montage
Montage workflow
Total application time on five different systems
10% improvement in the runtime
Contributions - Alleviating storage I/O bottleneck
• Intermediate Storage System
  • Designed and implemented a prototype
  • Integrated with workflow runtime
  • Evaluated with applications on BG/P
  The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System. S. Al-Kiswany, Emalayan Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu. Submitted - FAST '13.
• Workflow-aware Storage System
  • Identified new data access patterns
  • Studied the viability of a workflow-aware storage
  A Workflow-Aware Storage System: An Opportunity Study. Emalayan Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu. CCGRID '12. Acceptance rate: 27% (one of the top 15 papers).
  A case for Workflow-Aware Storage: An Opportunity Study using MosaStore. Emalayan Vairavanathan, S. Al-Kiswany, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu. Submitted - FGCS Journal.
• MosaStore Storage System
  • Experimental platform for other studies
  Predicting Intermediate Storage Performance for Workflow Applications. L. B. Costa, A. Barros, Emalayan Vairavanathan, S. Al-Kiswany, M. Ripeanu. Submitted - CCGRID '13.
THANK YOU
BACKUP SLIDES
Background – Many-task workflows
Large amount of legacy code
Rapid application development
Portability (workstation – supercomputers)
Easy to debug
Implicit fault-tolerance
Expression of natural parallelism
Background – Motivation
Many-task applications are becoming popular
Better utilization of costly hardware and energy savings (a lot of time is spent executing workflow applications)
Better scalability and higher performance will help solve large problems more accurately
Large number of available workflow applications
Blue Gene/P Architecture
[Figure: Blue Gene/P I/O architecture]
• 40960 compute nodes (160K cores)
• Torus network: 6.4 Gbps per link
• Tree network: 850 MBps x 640
• 640 I/O nodes
• 10 Gbps switch complex
• 10 Gb/s x 128
• GPFS: deployed on 128 file server nodes (3 petabytes of storage capacity)
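A back-of-the-envelope calculation from the figures on this slide shows why per-core I/O throughput is so low at full scale. The sketch below rests on assumptions that are not stated on the slide: that "850 MBps x 640" and "10 Gb/s x 128" are the aggregate bandwidths on the tree-network side and the file-server side respectively, that units are decimal, and that each compute node has 4 cores (which yields the quoted 160K cores).

```python
# Figures quoted on the slide.
COMPUTE_NODES = 40960
IO_NODES = 640
TREE_MBPS_PER_IO_NODE = 850   # "850 MBps x 640"
SERVER_LINKS = 128
SERVER_LINK_GBPS = 10         # "10 Gb/s x 128"

# Assumption: 4 cores per compute node, giving 163,840 (about 160K) cores.
CORES_PER_NODE = 4


def aggregate_bandwidth_mbps():
    """Return (tree_side, server_side) aggregate bandwidth in MB/s."""
    tree_side = TREE_MBPS_PER_IO_NODE * IO_NODES
    # Assumption: decimal units, so 1 Gb/s = 125 MB/s.
    server_side = SERVER_LINK_GBPS * 125 * SERVER_LINKS
    return tree_side, server_side


def per_core_mbps():
    """Upper bound on per-core throughput when every core does I/O at once."""
    cores = COMPUTE_NODES * CORES_PER_NODE
    bottleneck = min(aggregate_bandwidth_mbps())
    return bottleneck / cores
```

Under these assumptions the file-server side (160,000 MB/s) is the narrower of the two, and dividing it across all cores gives just under 1 MB/s per core, which is in line with the sub-1 MBps per-core throughput quoted on the ModFTDock background slide.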
Example Workflow Software Stack
[Figure: software stack, top to bottom]
• Swift script
• Swift compiler
• Intermediate code
• Workflow runtime engine (e.g., Swift)
• Task dispatching service (e.g., Coasters), exchanging tasks / notifications with the workers
• Workers, which perform storage I/O
• Shared storage system
Intermediate Storage System
MosaStore
• Each file is divided into fixed-size chunks
• Chunks are stored on the storage nodes
• The manager maintains a block-map for each file
• A POSIX interface provides access to the system
[Figure: MosaStore distributed storage architecture]
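The chunking and block-map idea above can be illustrated with a small in-memory model. This is only a sketch under assumptions: the class names, the tiny `CHUNK_SIZE`, and the round-robin placement are hypothetical choices made for the example and are not taken from MosaStore.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


@dataclass
class StorageNode:
    chunks: dict = field(default_factory=dict)  # chunk id -> bytes


@dataclass
class Manager:
    # file name -> ordered list of (node index, chunk id)
    block_map: dict = field(default_factory=dict)


def write_file(manager, nodes, name, data):
    """Split data into fixed-size chunks and place them round-robin (assumed policy)."""
    entries = []
    for offset in range(0, len(data), CHUNK_SIZE):
        index = offset // CHUNK_SIZE
        node_index = index % len(nodes)
        chunk_id = f"{name}#{index}"
        nodes[node_index].chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
        entries.append((node_index, chunk_id))
    manager.block_map[name] = entries


def read_file(manager, nodes, name):
    """Look up the block-map, then fetch each chunk from the node that holds it."""
    return b"".join(nodes[i].chunks[cid] for i, cid in manager.block_map[name])
```

The manager holds only metadata here, while the data path goes directly to the storage nodes; a real system would put a POSIX layer in front of `write_file` and `read_file`.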
Contribution - Intermediate Storage System
MosaStore Storage System
• Supports a set of POSIX APIs (random read and write, delete, close)
• Garbage collection
• Replication (eager and lazy)
• Client-side caching
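The slide only names the two replication modes, so the following toy model relies on an assumed reading of them: an eager write creates all replicas before it returns, while a lazy write stores one copy and queues the remaining copies for later. The class `ReplicatedStore` and its methods are hypothetical.

```python
from collections import deque


class ReplicatedStore:
    """Toy chunk store with eager and lazy replication (semantics are assumed)."""

    def __init__(self, node_count, replicas):
        # Assumes replicas <= node_count so every replica lands on a distinct node.
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas
        self.pending = deque()  # (chunk id, source node, target node)

    def write(self, chunk_id, data, primary, eager):
        # The primary copy is always written before the call returns.
        self.nodes[primary][chunk_id] = data
        for step in range(1, self.replicas):
            target = (primary + step) % len(self.nodes)
            if eager:
                # Eager: create the replica as part of the write.
                self.nodes[target][chunk_id] = data
            else:
                # Lazy: remember the copy and perform it later.
                self.pending.append((chunk_id, primary, target))

    def flush(self):
        """Carry out all queued lazy copies."""
        while self.pending:
            chunk_id, source, target = self.pending.popleft()
            self.nodes[target][chunk_id] = self.nodes[source][chunk_id]

    def copies(self, chunk_id):
        """Count how many nodes currently hold the chunk."""
        return sum(chunk_id in node for node in self.nodes)
```

The trade-off this exposes is write latency against the window during which a chunk has fewer copies than requested.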
Viability study – Changes in MosaStore
• Optimized data placement for the pipeline pattern
  • Priority to local writes and reads
• Optimized data placement for the reduce pattern
  • Collocating files on a single benefactor
• Replication mechanism optimized for the broadcast pattern
  • Parallel replication
• Data block placement for the scatter and gather patterns
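How pattern hints could drive placement and scheduling is sketched below. It is a simplified illustration under assumptions, not the MosaStore changes themselves: the function names, the string hints, and the fallback of striping across all nodes are hypothetical, and the scatter and gather case is left out.

```python
from collections import Counter


def place_file(pattern, writer_node, nodes, reduce_node=None, replicas=2):
    """Return the nodes that should hold a file, given its access-pattern hint."""
    if pattern == "pipeline":
        # Priority to local writes: keep the file on the node that produced it.
        return [writer_node]
    if pattern == "reduce":
        # Collocate all inputs of the reduce task on a single node.
        if reduce_node is None:
            raise ValueError("reduce placement needs a target node")
        return [reduce_node]
    if pattern == "broadcast":
        # Replicate so that many readers do not all hit one node.
        start = nodes.index(writer_node)
        count = min(replicas, len(nodes))
        return [nodes[(start + i) % len(nodes)] for i in range(count)]
    # No hint: hypothetical default of spreading the file over all nodes.
    return list(nodes)


def schedule_task(input_files, locations, nodes):
    """Location-aware scheduling: run the task where most of its inputs live."""
    votes = Counter(node for f in input_files for node in locations.get(f, []))
    # Ties, including the no-input case, fall back to the first node in the list.
    return max(nodes, key=lambda n: votes[n])
```

For example, with `locations = {"a": ["n1"], "b": ["n1"], "c": ["n2"]}`, a task reading `a`, `b`, and `c` is scheduled on `n1`, where two of its three inputs already reside.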
Evaluation - Synthetic Benchmark on Blue Gene/P
100% performance gain in the application runtime
[Figure: pipeline benchmark runtime at different scales]
Synthetic Benchmarks - Reduce
Optimization: Collocation and location-aware scheduling
Average runtime for medium workload
2x improvement in the runtime
Synthetic benchmarks – Small workload
[Figures: reduce benchmark; broadcast benchmark]
Evaluation – ModFTDock
ModFTDock workflow
Total application time on three different systems
20% improvement in the runtime
Evaluation – Montage per stage time
Total application time on five different systems