Accelerating Parallel Analysis of Scientific Simulation Data via Zazen
Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw
D. E. Shaw Research
Motivation
Goal: To model biological processes that occur on the millisecond time scale
Approach: A specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement)
Millisecond-scale MD Trajectories
Simulation length: 1 × 10⁻³ s
÷ Output interval: 10 × 10⁻¹² s
= Number of frames: 100 M frames

A biomolecular system: 25 K atoms
× Position and velocity: 24 bytes/atom
= Frame size: 0.6 MB/frame
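As a quick sanity check, these numbers can be reproduced directly (a minimal sketch; only the figures on this slide are assumed):

```python
# Back-of-the-envelope check of the trajectory-size arithmetic above.
simulation_length_s = 1e-3      # millisecond-scale simulation
output_interval_s = 10e-12      # one frame every 10 picoseconds
num_frames = simulation_length_s / output_interval_s    # 1e8 = 100 M frames

atoms_per_frame = 25_000
bytes_per_atom = 24             # position + velocity
frame_size_bytes = atoms_per_frame * bytes_per_atom     # 600,000 B ≈ 0.6 MB

total_bytes = num_frames * frame_size_bytes             # 6e13 B = 60 TB
print(num_frames, frame_size_bytes, total_bytes)
```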
Part I: How We Analyze Simulation Data in Parallel
An MD Trajectory Analysis Example: Ion Permeation
A Hypothetical Trajectory: 20,000 atoms in total; two ions of interest
[Plot: positions of Ion A and Ion B over simulated time]
Ion State Transition
[State diagram — states: Above channel, Inside channel, Below channel; transitions: into channel from above, into channel from below]
Typical Sequential Analysis
Maintain a main-memory resident data structure to record states and positions
Process frames in ascending simulated physical time order
Strong inter-frame data dependence: Data analysis tightly coupled with data acquisition
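A minimal sketch of this sequential pattern (illustrative only; the frame format and the state-classification thresholds are assumptions, not the actual analysis code):

```python
# Hypothetical sequential ion-permeation analysis: one pass over frames in
# ascending simulated-time order, with all bookkeeping state kept in memory.

def classify_position(z, half_length=1.0):
    """Map an ion's z-coordinate to a channel state (illustrative thresholds)."""
    if z > half_length:
        return "above"
    if z < -half_length:
        return "below"
    return "inside"

def analyze(frames):
    """frames: iterable of (time, {ion_id: z}) pairs in ascending time order."""
    ion_states = {}      # ion_id -> last observed state
    transitions = []     # recorded state-change events
    for time, ion_positions in frames:
        for ion_id, z in ion_positions.items():
            new_state = classify_position(z)
            old_state = ion_states.get(ion_id)
            if old_state is not None and new_state != old_state:
                transitions.append((time, ion_id, old_state, new_state))
            ion_states[ion_id] = new_state
    return transitions

# Example: ion "A" crosses from above the channel into it and out below.
print(analyze([(0, {"A": 1.5}), (1, {"A": 0.2}), (2, {"A": -1.7})]))
```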
Problems with Sequential Analysis
Millisecond-scale trajectory size: 60 TB
Local disk read bandwidth: 100 MB/s
Time to fetch data to memory: ~1 week (60 TB ÷ 100 MB/s ≈ 600,000 s)
Analysis time: varies
Time to perform data analysis: weeks
Sequential analysis lacks the necessary computational, memory, and I/O capabilities!
A Parallel Data Analysis Model
Trajectory definition: specify which frames are to be accessed
Stage 1: Per-frame data acquisition
Stage 2: Cross-frame data analysis
Decouple data acquisition from data analysis
Trajectory Definition
[Plot: positions of Ion A and Ion B over simulated time, with the selected frames marked]
Every other frame in the trajectory
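For illustration, a trajectory definition such as "every other frame" can be expressed as a stride over frame sequence numbers (a hypothetical helper, not the HiMach API):

```python
# Hypothetical trajectory definition: select every other frame by its
# integer sequence number (this is an illustration, not the HiMach API).
def define_trajectory(first_seq, last_seq, stride=2):
    """Return the sequence numbers of the frames the analysis will touch."""
    return range(first_seq, last_seq + 1, stride)

frames_to_read = define_trajectory(0, 29)   # every other frame of a 30-frame run
print(list(frames_to_read))                 # [0, 2, 4, ..., 28]
```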
Per-frame Data Acquisition (stage 1)
[Plot: positions of Ion A and Ion B over simulated time, with the frames partitioned between processes P0 and P1]
Cross-frame Data Analysis (stage 2)
[Plot: positions of Ion A and Ion B over simulated time]
Analyze ion A on P0 and ion B on P1 in parallel
Inspiration: Google’s MapReduce
[Diagram: map(...) tasks read input files from the Google File System and emit key-value pairs such as K1: {v1i}, K2: {v2i}; values with the same key are grouped (e.g., K1: {v1i, v1j, v1k} and K2: {v2i, v2j, v2k}) and passed to reduce(K1, ...) and reduce(K2, ...), which write the output files]
Trajectory Analysis Cast Into MapReduce
Per-frame data acquisition (stage 1): map()
Cross-frame data analysis (stage 2): reduce()
Key-value pairs: connecting stage 1 and stage 2
Keys: categorical identifiers or names
Values: include timestamps
Example: key = ion_id_j, value = (t_k, x_jk, y_jk, z_jk)
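A MapReduce-style sketch of the ion-permeation example using these key-value pairs (illustrative only; the real HiMach API and runtime differ):

```python
# map() runs per frame (stage 1) and emits one key-value pair per ion;
# reduce() runs per ion (stage 2) and sees that ion's whole time series.
from collections import defaultdict

def map_frame(frame_time, ion_positions):
    """Stage 1: per-frame data acquisition."""
    for ion_id, (x, y, z) in ion_positions.items():
        yield ion_id, (frame_time, x, y, z)

def reduce_ion(ion_id, samples):
    """Stage 2: cross-frame analysis of one ion's trajectory."""
    samples.sort()                       # order by timestamp
    zs = [z for (_, _, _, z) in samples]
    return ion_id, max(zs) - min(zs)     # e.g., range of motion along z

# Tiny driver standing in for the parallel runtime.
frames = [(0, {"A": (0, 0, 1.5)}), (1, {"A": (0, 0, 0.2)}), (2, {"A": (0, 0, -1.7)})]
grouped = defaultdict(list)
for t, positions in frames:
    for key, value in map_frame(t, positions):
        grouped[key].append(value)
print([reduce_ion(k, v) for k, v in grouped.items()])
```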
The HiMach Library
A MapReduce-style API that allows users to write Python programs to analyze MD trajectories
A parallel runtime that executes HiMach user programs in parallel on a Linux cluster automatically
Performance results on a Linux cluster:
2 orders of magnitude faster on 512 cores than on a single core
Typical Simulation–Analysis Storage Infrastructure
[Diagram: the parallel supercomputer sends output through an I/O node to the file servers; an analysis cluster of analysis nodes, each with local disks, runs the parallel analysis programs]
Part II: How We Overcome the I/O Bottleneck in Parallel Analysis
Trajectory Characteristics
A large number of small frames
Write once, read many
Distinguishable by unique integer sequence numbers
Amenable to out-of-order parallel access in the map phase
Our Main Idea
At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available
At analysis time, fetch data from local disk caches in parallel
Limitations
Requires large aggregate disk capacity on the analysis cluster
Assumes a relatively low average simulation data output rate
An Example
[Diagram: analysis nodes 0 and 1 each cache a subset of the frames of simulations sim0 and sim1 under a local /bodhi directory, with the full data set also held on an NFS server; each node records the frames it holds in a local bitmap and learns the other nodes' holdings via a remote bitmap; OR-ing the bitmaps yields a merged bitmap of all 1s, showing that every frame is cached somewhere]
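A minimal sketch of the bitmap bookkeeping in this example (the frame numbering and bitmap layout are assumptions):

```python
# Simplified bitmap bookkeeping for the example above (layout is illustrative).
# A bitmap has one entry per frame sequence number: 1 = cached on this node.
local_bitmaps = {
    "node0": [0, 1, 0, 1],   # node 0 cached frames 1 and 3
    "node1": [1, 0, 1, 0],   # node 1 cached frames 0 and 2
}

# OR the local bitmaps together to see which frames are cached somewhere.
num_frames = 4
merged = [0] * num_frames
for bitmap in local_bitmaps.values():
    merged = [a | b for a, b in zip(merged, bitmap)]

missing = [seq for seq, bit in enumerate(merged) if bit == 0]
print(merged)    # [1, 1, 1, 1] -> every frame is cached on some node
print(missing)   # [] -> nothing has to be fetched from the file servers
```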
The Zazen Protocol
How to guarantee that each frame is read by one and only one node in the face of node failure and recovery?
The Zazen Protocol
Execute a distributed consensus protocol before performing actual disk I/O
Assign data retrieval tasks in a location-aware manner
Read data from local disks if the data are already cached
Fetch missing data from file servers
No metadata servers to keep a record of who has what
The Zazen Protocol (cont’d)
Bitmaps: a compact structure for recording the presence or absence of a cached copy
All-to-all reduction algorithms: an efficient mechanism for inter-processor collective communication (an MPI library is used in practice)
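A sketch of the bitmap reduction as an MPI collective (the slides only say an MPI library is used; this sketch assumes mpi4py and NumPy and is not the actual Zazen code):

```python
# Sketch of the bitmap all-to-all reduction using MPI (assumes mpi4py and
# NumPy; the actual Zazen implementation may differ in data layout).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
num_frames = 8

# 1 where this node holds a cached copy of the frame, 0 otherwise.
local_bitmap = np.zeros(num_frames, dtype=np.uint8)
local_bitmap[rank::comm.Get_size()] = 1          # toy cache assignment

# Bitwise-OR reduction tells every node which frames are cached somewhere.
merged_bitmap = np.empty_like(local_bitmap)
comm.Allreduce(local_bitmap, merged_bitmap, op=MPI.BOR)

# Frames with a 0 in the merged bitmap are cached nowhere and must be
# fetched from the file servers; cached frames are read from local disk.
missing = np.flatnonzero(merged_bitmap == 0)
mine = np.flatnonzero(local_bitmap == 1)
print(f"rank {rank}: read locally {mine.tolist()}, fetch remotely {missing.tolist()}")
```

Run with, for example, mpiexec -n 4 python bitmap_reduce.py; each rank then knows which frames to read from its local disks and which to fetch from the file servers.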
Implementation
The Bodhi library
The Bodhi server
The Zazen protocol
[Diagram: the parallel supercomputer writes frames through an I/O node running the Bodhi library to the file servers; each analysis node in the Zazen cluster runs a Bodhi server and the Bodhi library, participates in the Zazen protocol, and hosts the parallel analysis programs (HiMach jobs)]
Performance Evaluation
Experiment Setup
A Linux cluster with 100 nodes
Two Intel Xeon 2.33 GHz quad-core processors per node
Four 500 GB 7200-RPM SATA disks organized in RAID 0 per node
16 GB physical memory per node
CentOS 4.6 with Linux kernel 2.6.26
Nodes connected to a Gigabit Ethernet core switch
Common access to NFS directories exported by a number of enterprise storage servers
Fixed-Problem-Size Scalability
[Chart: execution time (s, 0 to 16) vs. number of nodes (1 to 128)]
Execution time of the Zazen protocol to assign the I/O tasks of reading 1 billion frames
Fixed-Cluster-Size Scalability
[Chart: execution time (s, log scale from 10⁻³ to 10²) vs. number of frames (10³ to 10⁹)]
Execution time of the Zazen protocol on 100 nodes assigning different numbers of frames
Efficiency I: Achieving Better I/O Bandwidth
[Charts: aggregate read bandwidth (GB/s, 0 to 25) vs. application read processes per node (1, 2, 4, 8) for file sizes of 1 GB, 256 MB, 64 MB, and 2 MB; left panel: one Bodhi daemon per user process; right panel: one Bodhi daemon per analysis node]
Efficiency II: Comparison with NFS/PFS
NFS (v3) on separate enterprise storage servers
• Dual quad-core 2.8-GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6
• Four 1 GigE connections to the core switch of the 100-node cluster
PVFS2 (2.8.1) on the same 100 analysis nodes
• I/O (data) server and metadata server on all nodes
• File I/O performed via the PVFS2 Linux kernel interface
Hadoop/HDFS (0.19.1) on the same 100 nodes
• Data stored via HDFS's C library interface, block sizes set equal to file sizes, three replicas per file
• Data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations)
Efficiency II: Outperforming NFS/PFS
[Chart: I/O bandwidth (GB/s, 0 to 25) vs. file size for read (2 MB, 64 MB, 256 MB, 1 GB), comparing NFS, PVFS2, Hadoop/HDFS, and Zazen]
I/O bandwidth of reading files of different sizes
Efficiency II: Outperforming NFS/PFS
Time to read one terabyte of data
[Chart: time (s, log scale from 10 to 10⁵) vs. file size for read (2 MB, 64 MB, 256 MB, 1 GB), comparing NFS, PVFS2, Hadoop/HDFS, and Zazen]
Read Performance under Writes (1 GB/s)
[Chart: normalized read performance (0 to 100) vs. file size for reads (2 MB, 64 MB, 256 MB, 1 GB); series for file size for writes: no writes, 1 GB files, 256 MB files, 64 MB files, 2 MB files]
End-to-End Performance
A HiMach analysis program called water residence, run on 100 nodes
2.5 million small frame files (430 KB each)
[Chart: execution time (s, log scale from 100 to 10,000) vs. application processes per node (1, 2, 4, 8), comparing NFS, Zazen, and reading from memory]
Robustness
Worst-case execution time is T(1 + δ·(B/b))
The water-residence program was re-executed with varying numbers of nodes powered off
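One way to read this bound (the slide does not define the symbols; assume T is the failure-free running time, δ the fraction of failed nodes, B the local-disk read bandwidth, and b the bandwidth for re-fetching data from the file servers):

```python
# Illustrative worst-case bound T * (1 + delta * B / b); the symbol meanings
# above are an interpretation, not definitions taken from the slide.
def worst_case_time(T, delta, B, b):
    """T: failure-free time (s); delta: fraction of failed nodes;
    B: local-disk read bandwidth; b: bandwidth to the file servers."""
    return T * (1 + delta * (B / b))

# Example: 400 s failure-free run, 20% of nodes down, local disks 4x faster
# than the file-server path -> the bound allows up to 720 s.
print(worst_case_time(T=400.0, delta=0.2, B=200.0, b=50.0))
```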
[Chart: execution time (s, 200 to 1,600) vs. node failure rate (0% to 50%), comparing the theoretical worst case with the actual running time]
Summary
Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output, and (2) executing an efficient distributed consensus protocol.
Simple and robust
Scalable on a large number of nodes
Much higher performance than NFS/PFS*
* Applicable to a certain class of time-dependent simulation datasets