Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw

D. E. Shaw Research


  • Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

    Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw

    D. E. Shaw Research

  • 2

    Motivation

    Goal: To model biological processes that occur on the millisecond time scale

    Approach: A specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement)

  • 3

    Millisecond-scale MD Trajectories

    Simulation length: 1 × 10⁻³ s
    ÷ Output interval: 10 × 10⁻¹² s
    = Number of frames: 100 M frames

    A biomolecular system: 25 K atoms
    × Position and velocity: 24 bytes/atom
    = Frame size: 0.6 MB/frame
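
    As a sanity check, the arithmetic above can be reproduced in a few lines of Python (a sketch; the constants are just the ones listed on this slide):

    # Back-of-the-envelope check of the numbers above.
    simulation_length_s = 1e-3      # 1 x 10^-3 s of simulated time
    output_interval_s = 10e-12      # one frame every 10 x 10^-12 s
    atoms = 25_000                  # a 25 K-atom biomolecular system
    bytes_per_atom = 24             # position and velocity per atom

    num_frames = simulation_length_s / output_interval_s    # 1e8 = 100 M frames
    frame_size_mb = atoms * bytes_per_atom / 1e6             # 0.6 MB/frame
    trajectory_tb = num_frames * frame_size_mb / 1e6         # ~60 TB in total

    print(f"{num_frames:.0e} frames, {frame_size_mb} MB/frame, {trajectory_tb:.0f} TB")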

  • 4

    Part I: How We Analyze Simulation Data in Parallel

  • 5

    An MD Trajectory Analysis Example: Ion Permeation

  • 6

    A Hypothetical Trajectory

    20,000 atoms in total; two ions of interest

    [Plot: positions of ion A and ion B over the course of the trajectory (vertical axis -2 to 2, horizontal axis 0 to 30)]

  • 7

    Ion State Transition

    [State diagram: an ion is above the channel, inside the channel, or below the channel; it moves into the channel from above or from below, and otherwise self-transitions ("S" arrows)]
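
    Read as a transition table, the diagram amounts to something like the following sketch (the state and event names are ours, chosen for illustration):

    # Illustrative transition table for the two transitions shown in the
    # diagram; the state and event names are ours, not from the paper.
    TRANSITIONS = {
        ("above_channel", "into_channel_from_above"): "inside_channel",
        ("below_channel", "into_channel_from_below"): "inside_channel",
    }

    def next_state(state, event):
        # Any other (state, event) pair is a self-transition (the "S" arrows).
        return TRANSITIONS.get((state, event), state)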

  • 8

    Typical Sequential Analysis

    Maintain a main-memory resident data structure to record states and positions

    Process frames in ascending simulated physical time order

    Strong inter-frame data dependence: Data analysis tightly coupled with data acquisition
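
    In code, a sequential pass has roughly the following shape (a sketch with hypothetical helpers read_frames_in_time_order, ion_positions, and classify; next_state is the table sketched on the previous slide):

    # Sketch of a typical sequential analysis pass; the helpers are hypothetical.
    ion_state = {}   # main-memory record of each ion's current state

    for frame in read_frames_in_time_order(trajectory):        # data acquisition
        for ion_id, z in ion_positions(frame):                  # data analysis
            event = classify(z)                                  # where is the ion now?
            ion_state[ion_id] = next_state(
                ion_state.get(ion_id, "above_channel"), event)   # depends on the prior frame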

  • 9

    Problems with Sequential Analysis

    Millisecond-scale trajectory size: 60 TB

    Local disk read bandwidth: 100 MB/s

    Time to fetch data to memory: 60 TB ÷ 100 MB/s ≈ 6 × 10⁵ s ≈ 1 week

    Analysis time: varied

    Time to perform data analysis: weeks

    Sequential analysis lacks the necessary computational, memory, and I/O capabilities!

  • 10

    A Parallel Data Analysis Model

    Trajectory definition: Specify which frames are to be accessed

    Stage 1: Per-frame data acquisition

    Stage 2: Cross-frame data analysis

    Decouple data acquisition from data analysis

  • 11

    Trajectory Definition

    [Plot: the ion A / ion B trajectory with every other frame selected]

    Every other frame in the trajectory

  • 12

    Per-frame Data Acquisition (stage 1)

    [Plot: the selected frames of the ion A / ion B trajectory divided between processes P0 and P1]

  • 13

    Cross-frame Data Analysis (stage 2)

    [Plot: the same trajectory data regrouped by ion for cross-frame analysis]

    Analyze ion A on P0 and ion B on P1 in parallel

  • 14

    Inspiration: Google’s MapReduce

    [Diagram: map(...) tasks read input files from the Google File System and emit key-value pairs such as K1: {v1i} and K2: {v2i}; the pairs are grouped by key (K1: {v1j, v1i, v1k}, K2: {v2k, v2j, v2i}) and passed to reduce(K1, ...) and reduce(K2, ...), each of which writes an output file]

  • 15

    Trajectory Analysis Cast Into MapReduce

    Per-frame data acquisition (stage 1): map()

    Cross-frame data analysis (stage 2): reduce()

    Key-value pairs: connecting stage 1 and stage 2

    Keys: categorical identifiers or names

    Values: including timestamps

    Example: key = ion_id_j, value = (t_k, x_jk, y_jk, z_jk)
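
    A rough sketch of this cast (not the actual HiMach API; ions_of_interest and count_permeation_events are hypothetical helpers):

    def map_frame(frame):
        # Stage 1: per-frame data acquisition. Runs on frames in any order
        # and emits (key, value) pairs keyed by ion identifier.
        for ion_id, (x, y, z) in ions_of_interest(frame):
            yield ion_id, (frame.time, x, y, z)      # frame.time is hypothetical

    def reduce_ion(ion_id, values):
        # Stage 2: cross-frame data analysis. All values for one key arrive
        # here; sorting by timestamp restores simulated-time order.
        trace = sorted(values)
        return ion_id, count_permeation_events(trace)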

  • 16

    The HiMach Library

    A MapReduce-style API that allows users to write Python programs to analyze MD trajectories

    A parallel runtime that executes HiMach user programs in parallel on a Linux cluster automatically

    Performance results on a Linux cluster:

    2 orders of magnitude faster on 512 cores than on a single core

  • 17

    Typical Simulation–Analysis Storage Infrastructure

    [Diagram: the parallel supercomputer writes simulation output through an I/O node to file servers; the analysis cluster's nodes, each with local disks, run the parallel analysis programs and read the data back from the file servers]

  • 18

    Part II: How We Overcome the I/O Bottleneck in Parallel Analysis

  • 19

    Trajectory Characteristics

    A large number of small frames

    Write once, read many

    Distinguishable by unique integer sequence numbers

    Amenable to out-of-order parallel access in the map phase

  • 20

    Our Main Idea

    At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available

    At analysis time, fetch data from local disk caches in parallel

  • 21

    Limitations

    Require large aggregate disk capacity on the analysis cluster

    Assume relatively low average simulation data output rate

  • 22

    An Example

    [Diagram: two analysis nodes each cache frame files for simulations sim0 and sim1 under a local /bodhi directory (frame files such as f0, f1, f2, f3, f11, f202, indexed by sequence number). Node 0's local bitmap is 0 1 0 1 and node 1's is 1 0 1 0; each node also obtains the other's holdings as a remote bitmap, and the merged bitmap 1 1 1 1 shows that every frame is cached on some node. An NFS server holds the complete data set.]
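
    The per-node bookkeeping can be pictured as follows: each node scans its own cache directory and records the frame sequence numbers it holds in a bitmap (a sketch; the /bodhi/<simulation>/f<seq> layout follows the figure, the code itself is ours):

    import os

    def local_bitmap(sim, total_frames, cache_root="/bodhi"):
        # One byte per frame for clarity; 1 means the frame is cached locally.
        bitmap = bytearray(total_frames)
        sim_dir = os.path.join(cache_root, sim)
        if not os.path.isdir(sim_dir):
            return bitmap
        for name in os.listdir(sim_dir):              # e.g. "f0", "f3", "f202"
            if name.startswith("f") and name[1:].isdigit():
                seq = int(name[1:])                   # frame sequence number
                if seq < total_frames:
                    bitmap[seq] = 1
        return bitmap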

  • 23

    The Zazen Protocol

    How to guarantee that each frame is read by one and only one node in the face of node failure and recovery?

  • 24

    The Zazen Protocol

    Execute a distributed consensus protocol before performing actual disk I/O

    Assign data retrieval tasks in a location-aware manner

    Read data from local disks if the data are already cached

    Fetch missing data from file servers

    No metadata servers to keep record of who has what

  • 25

    The Zazen Protocol (cont’d)

    Bitmaps: a compact structure for recording the presence or absence of a cached copy

    All-to-all reduction algorithms: an efficient mechanism for inter-processor collective communication (an MPI library is used in practice)
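
    A minimal sketch of these two steps, assuming mpi4py and the byte-per-frame bitmaps from the earlier sketch (the real system implements its own all-to-all reduction; this is only an illustration):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Merge the local bitmaps with a bitwise-OR all-reduce: afterwards every
    # node knows which frames are cached somewhere in the cluster.
    local = np.frombuffer(bytes(local_bitmap("sim0", total_frames=1024)), dtype=np.uint8)
    merged = np.empty_like(local)
    comm.Allreduce(local, merged, op=MPI.BOR)

    # Location-aware assignment: read locally cached frames from local disk;
    # frames cached nowhere are split round-robin and fetched from the file
    # servers.  (Frames cached on more than one node would also need a
    # tie-break, which the real protocol provides.)
    read_from_local_disk = [i for i, bit in enumerate(local) if bit]
    fetch_from_servers = [i for i, bit in enumerate(merged)
                          if bit == 0 and i % size == rank]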

  • 26

    Implementation

    The Bodhi library

    The Bodhi server

    The Zazen protocol

    [Diagram: the parallel supercomputer's I/O node runs the Bodhi library and delivers frames to the file servers and to the Bodhi servers on the analysis nodes of the Zazen cluster; the parallel analysis programs (HiMach jobs) on those nodes access the cached frames through the Bodhi library and the Zazen protocol]

  • 27

    Performance Evaluation

  • 28

    Experiment Setup

    A Linux cluster with 100 nodes

    Two Intel Xeon 2.33 GHz quad-core processors per node

    Four 500 GB 7200-RPM SATA disks organized in RAID 0 per node

    16 GB physical memory per node

    CentOS 4.6 with Linux kernel 2.6.26

    Nodes connected to a Gigabit Ethernet core switch

    Common access to NFS directories exported by a number of enterprise storage servers

  • 29

    Fixed-Problem-Size Scalability

    [Chart: execution time (s, 0 to 16) of the Zazen protocol to assign the I/O tasks of reading 1 billion frames, on 1 to 128 nodes]

  • 30

    Fixed-Cluster-Size Scalability

    [Chart: execution time (s, log scale) of the Zazen protocol on 100 nodes assigning different numbers of frames, from 10³ to 10⁹]

  • 31

    Efficiency I: Achieving Better I/O Bandwidth

    [Charts: aggregate read bandwidth (GB/s, 0 to 25) versus application read processes per node (1, 2, 4, 8), for file sizes of 1 GB, 256 MB, 64 MB, and 2 MB; one panel with one Bodhi daemon per user process, one panel with one Bodhi daemon per analysis node]

  • 32

    Efficiency II: Comparison with NFS/PFS

    NFS (v3) on separate enterprise storage servers

    • Dual quad-core 2.8 GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6

    • Four 1 GigE connections to the core switch of the 100-node cluster

    PVFS2 (2.8.1) on the same 100 analysis nodes

    • I/O (data) server and metadata server on all nodes

    • File I/O performed via the PVFS2 Linux kernel interface

    Hadoop/HDFS (0.19.1) on the same 100 nodes

    • Data stored via HDFS's C library interface, block sizes set equal to file sizes, three replicas per file

    • Data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations)

  • 33

    Efficiency II: Outperforming NFS/PFS

    [Chart: I/O bandwidth (GB/s, 0 to 25) of reading files of 2 MB, 64 MB, 256 MB, and 1 GB via NFS, PVFS2, Hadoop/HDFS, and Zazen]

  • 34

    Efficiency II: Outperforming NFS/PFS

    [Chart: time (s, log scale) to read one terabyte of data for read file sizes of 2 MB, 64 MB, 256 MB, and 1 GB via NFS, PVFS2, Hadoop/HDFS, and Zazen]

  • 35

    Read Performance under Writes (1 GB/s)

    [Chart: normalized read performance (0 to 100) for read file sizes of 2 MB, 64 MB, 256 MB, and 1 GB, with no concurrent writes or with concurrent writes of 1 GB, 256 MB, 64 MB, or 2 MB files]

  • 36

    End-to-End Performance

    A HiMach analysis program called water residence, run on 100 nodes

    2.5 million small frame files (430 KB each)

    [Chart: running time (s, log scale) versus application processes per node (1, 2, 4, 8), reading the data via NFS, via Zazen, and from memory]

  • 37

    Robustness

    Worst-case execution time is T(1 + δ(B/b))

    The water-residence program was re-executed with a varying number of nodes powered off

    [Chart: running time (s) versus node failure rate (0% to 50%), comparing the theoretical worst case with the actual running time]

  • 38

    Summary

    Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output, and (2) executing an efficient distributed consensus protocol

    Simple and robust

    Scalable on a large number of nodes

    Much higher performance than NFS/PFS

    Applicable to a certain class of time-dependent simulation datasets
