Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw

D. E. Shaw Research


  • Accelerating Parallel Analysis of Scientific Simulation Data via Zazen

    Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw

    D. E. Shaw Research

  • 2

    Motivation

    Goal: To model biological processes that occur on the millisecond time scale

    Approach: A specialized, massively parallel supercomputer called Anton (2009 ACM Gordon Bell Award for Special Achievement)

  • 3

    Millisecond-scale MD Trajectories

    Simulation length: 1 × 10⁻³ s
    ÷ Output interval: 10 × 10⁻¹² s
    = Number of frames: 100 M frames

    A biomolecular system: 25 K atoms
    × Position and velocity: 24 bytes/atom
    = Frame size: 0.6 MB/frame
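
    As a sanity check, the arithmetic above can be reproduced in a few lines of Python (a sketch; the constants are just the ones listed on this slide):

    # Back-of-the-envelope check of the numbers above.
    simulation_length_s = 1e-3      # 1 x 10^-3 s of simulated time
    output_interval_s = 10e-12      # one frame every 10 x 10^-12 s
    atoms = 25_000                  # a 25 K-atom biomolecular system
    bytes_per_atom = 24             # position and velocity per atom

    num_frames = simulation_length_s / output_interval_s    # 1e8 = 100 M frames
    frame_size_mb = atoms * bytes_per_atom / 1e6             # 0.6 MB/frame
    trajectory_tb = num_frames * frame_size_mb / 1e6         # ~60 TB in total

    print(f"{num_frames:.0e} frames, {frame_size_mb} MB/frame, {trajectory_tb:.0f} TB")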

  • 4

    Part I: How We Analyze Simulation Data in Parallel

  • 5

    An MD Trajectory Analysis Example: Ion Permeation

  • 6

    A Hypothetical Trajectory

    20,000 atoms in total; two ions of interest

    [Plot: positions of ion A and ion B over the course of the trajectory (vertical axis -2 to 2, horizontal axis 0 to 30)]

  • 7

    Ion State Transition

    [State diagram: an ion is above the channel, inside the channel, or below the channel; it moves into the channel from above or from below, and otherwise self-transitions ("S" arrows)]
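
    Read as a transition table, the diagram amounts to something like the following sketch (the state and event names are ours, chosen for illustration):

    # Illustrative transition table for the two transitions shown in the
    # diagram; the state and event names are ours, not from the paper.
    TRANSITIONS = {
        ("above_channel", "into_channel_from_above"): "inside_channel",
        ("below_channel", "into_channel_from_below"): "inside_channel",
    }

    def next_state(state, event):
        # Any other (state, event) pair is a self-transition (the "S" arrows).
        return TRANSITIONS.get((state, event), state)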

  • 8

    Typical Sequential Analysis

    Maintain a main-memory resident data structure to record states and positions

    Process frames in ascending simulated physical time order

    Strong inter-frame data dependence: Data analysis tightly coupled with data acquisition
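
    In code, a sequential pass has roughly the following shape (a sketch with hypothetical helpers read_frames_in_time_order, ion_positions, and classify; next_state is the table sketched on the previous slide):

    # Sketch of a typical sequential analysis pass; the helpers are hypothetical.
    ion_state = {}   # main-memory record of each ion's current state

    for frame in read_frames_in_time_order(trajectory):        # data acquisition
        for ion_id, z in ion_positions(frame):                  # data analysis
            event = classify(z)                                  # where is the ion now?
            ion_state[ion_id] = next_state(
                ion_state.get(ion_id, "above_channel"), event)   # depends on the prior frame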

  • 9

    Problems with Sequential Analysis

    Millisecond-scale trajectory size: 60 TB

    Local disk read bandwidth: 100 MB/s

    Time to fetch data to memory: 60 TB ÷ 100 MB/s ≈ 6 × 10⁵ s ≈ 1 week

    Analysis time: varied

    Time to perform data analysis: weeks

    Sequential analysis lacks the necessary computational, memory, and I/O capabilities!

  • 10

    A Parallel Data Analysis Model

    Trajectory definition: Specify which frames are to be accessed

    Stage 1: Per-frame data acquisition

    Stage 2: Cross-frame data analysis

    Decouple data acquisition from data analysis

  • 11

    Trajectory Definition

    [Plot: the ion A / ion B trajectory with every other frame selected]

    Every other frame in the trajectory

  • 12

    Per-frame Data Acquisition (stage 1)

    [Plot: the selected frames of the ion A / ion B trajectory divided between processes P0 and P1]

  • 13

    Cross-frame Data Analysis (stage 2)

    [Plot: the same trajectory data regrouped by ion for cross-frame analysis]

    Analyze ion A on P0 and ion B on P1 in parallel

  • 14

    Inspiration: Google’s MapReduce

    [Diagram: map(...) tasks read input files from the Google File System and emit key-value pairs such as K1: {v1i} and K2: {v2i}; the pairs are grouped by key (K1: {v1j, v1i, v1k}, K2: {v2k, v2j, v2i}) and passed to reduce(K1, ...) and reduce(K2, ...), each of which writes an output file]

  • 15

    Trajectory Analysis Cast Into MapReduce

    Per-frame data acquisition (stage 1): map()

    Cross-frame data analysis (stage 2): reduce()

    Key-value pairs: connecting stage 1 and stage 2

    Keys: categorical identifiers or names

    Values: including timestamps

    Example: key = ion_id_j, value = (t_k, x_jk, y_jk, z_jk)
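
    A rough sketch of this cast (not the actual HiMach API; ions_of_interest and count_permeation_events are hypothetical helpers):

    def map_frame(frame):
        # Stage 1: per-frame data acquisition. Runs on frames in any order
        # and emits (key, value) pairs keyed by ion identifier.
        for ion_id, (x, y, z) in ions_of_interest(frame):
            yield ion_id, (frame.time, x, y, z)      # frame.time is hypothetical

    def reduce_ion(ion_id, values):
        # Stage 2: cross-frame data analysis. All values for one key arrive
        # here; sorting by timestamp restores simulated-time order.
        trace = sorted(values)
        return ion_id, count_permeation_events(trace)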

  • 16

    The HiMach Library

    A MapReduce-style API that allows users to write Python programs to analyze MD trajectories

    A parallel runtime that executes HiMach user programs in parallel on a Linux cluster automatically

    Performance results on a Linux cluster:

    2 orders of magnitude faster on 512 cores than on a single core

  • 17

    Typical Simulation–Analysis Storage Infrastructure

    [Diagram: the parallel supercomputer writes simulation output through an I/O node to file servers; the analysis cluster's nodes, each with local disks, run the parallel analysis programs and read the data back from the file servers]

  • 18

    Part II: How We Overcome the I/O Bottleneck in Parallel Analysis

  • 19

    Trajectory Characteristics

    A large number of small frames

    Write once, read many

    Distinguishable by unique integer sequence numbers

    Amenable to out-of-order parallel access in the map phase

  • 20

    Our Main Idea

    At simulation time, actively cache frames in the local disks of the analysis nodes as the frames become available

    At analysis time, fetch data from local disk caches in parallel

  • 21

    Limitations

    Require large aggregate disk capacity on the analysis cluster

    Assume relatively low average simulation data output rate

  • 22

    An Example

    [Diagram: two analysis nodes each cache frame files for simulations sim0 and sim1 under a local /bodhi directory (frame files such as f0, f1, f2, f3, f11, f202, indexed by sequence number). Node 0's local bitmap is 0 1 0 1 and node 1's is 1 0 1 0; each node also obtains the other's holdings as a remote bitmap, and the merged bitmap 1 1 1 1 shows that every frame is cached on some node. An NFS server holds the complete data set.]
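
    The per-node bookkeeping can be pictured as follows: each node scans its own cache directory and records the frame sequence numbers it holds in a bitmap (a sketch; the /bodhi/<simulation>/f<seq> layout follows the figure, the code itself is ours):

    import os

    def local_bitmap(sim, total_frames, cache_root="/bodhi"):
        # One byte per frame for clarity; 1 means the frame is cached locally.
        bitmap = bytearray(total_frames)
        sim_dir = os.path.join(cache_root, sim)
        if not os.path.isdir(sim_dir):
            return bitmap
        for name in os.listdir(sim_dir):              # e.g. "f0", "f3", "f202"
            if name.startswith("f") and name[1:].isdigit():
                seq = int(name[1:])                   # frame sequence number
                if seq < total_frames:
                    bitmap[seq] = 1
        return bitmap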

  • 23

    The Zazen Protocol

    How to guarantee that each frame is read by one and only one node in the face of node failure and recovery?

  • 24

    The Zazen Protocol

    Execute a distributed consensus protocol before performing actual disk I/O

    Assign data retrieval tasks in a location-aware manner

    Read data from local disks if the data are already cached

    Fetch missing data from file servers

    No metadata servers to keep record of who has what

  • 25

    The Zazen Protocol (cont’d)

    Bitmaps: a compact structure for recording the presence or absence of a cached copy

    All-to-all reduction algorithms: an efficient mechanism for inter-processor collective communication (an MPI library is used in practice)
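
    A minimal sketch of these two steps, assuming mpi4py and the byte-per-frame bitmaps from the earlier sketch (the real system implements its own all-to-all reduction; this is only an illustration):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Merge the local bitmaps with a bitwise-OR all-reduce: afterwards every
    # node knows which frames are cached somewhere in the cluster.
    local = np.frombuffer(bytes(local_bitmap("sim0", total_frames=1024)), dtype=np.uint8)
    merged = np.empty_like(local)
    comm.Allreduce(local, merged, op=MPI.BOR)

    # Location-aware assignment: read locally cached frames from local disk;
    # frames cached nowhere are split round-robin and fetched from the file
    # servers.  (Frames cached on more than one node would also need a
    # tie-break, which the real protocol provides.)
    read_from_local_disk = [i for i, bit in enumerate(local) if bit]
    fetch_from_servers = [i for i, bit in enumerate(merged)
                          if bit == 0 and i % size == rank]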

  • 26

    Implementation

    The Bodhi library

    The Bodhi server

    The Zazen protocol

    [Diagram: the parallel supercomputer's I/O node runs the Bodhi library and delivers frames to the file servers and to the Bodhi servers on the analysis nodes of the Zazen cluster; the parallel analysis programs (HiMach jobs) on those nodes access the cached frames through the Bodhi library and the Zazen protocol]

  • 27

    Performance Evaluation

  • 28

    Experiment Setup

    A Linux cluster with 100 nodes

    Two Intel Xeon 2.33 GHz quad-core processors per node

    Four 500 GB 7200-RPM SATA disks organized in RAID 0 per node

    16 GB physical memory per node

    CentOS 4.6 with Linux kernel 2.6.26

    Nodes connected to a Gigabit Ethernet core switch

    Common access to NFS directories exported by a number of enterprise storage servers

  • 29

    Fixed-Problem-Size Scalability

    [Chart: execution time (s, 0 to 16) of the Zazen protocol to assign the I/O tasks of reading 1 billion frames, on 1 to 128 nodes]

  • 30

    Fixed-Cluster-Size Scalability

    [Chart: execution time (s, log scale) of the Zazen protocol on 100 nodes assigning different numbers of frames, from 10³ to 10⁹]

  • 31

    Efficiency I: Achieving Better I/O Bandwidth

    [Charts: aggregate read bandwidth (GB/s, 0 to 25) versus application read processes per node (1, 2, 4, 8), for file sizes of 1 GB, 256 MB, 64 MB, and 2 MB; one panel with one Bodhi daemon per user process, one panel with one Bodhi daemon per analysis node]

  • 32

    Efficiency II: Comparison with NFS/PFS

    NFS (v3) on separate enterprise storage servers

    • Dual quad-core 2.8 GHz Opteron processors, 16 GB memory, 48 SATA disks organized in RAID 6

    • Four 1 GigE connections to the core switch of the 100-node cluster

    PVFS2 (2.8.1) on the same 100 analysis nodes

    • I/O (data) server and metadata server on all nodes

    • File I/O performed via the PVFS2 Linux kernel interface

    Hadoop/HDFS (0.19.1) on the same 100 nodes

    • Data stored via HDFS's C library interface, block sizes set equal to file sizes, three replicas per file

    • Data accessed via a read-only Hadoop MapReduce Java program (with a number of best-effort optimizations)

  • 33

    Efficiency II: Outperforming NFS/PFS

    [Chart: I/O bandwidth (GB/s, 0 to 25) of reading files of 2 MB, 64 MB, 256 MB, and 1 GB via NFS, PVFS2, Hadoop/HDFS, and Zazen]

  • 34

    Efficiency II: Outperforming NFS/PFS

    [Chart: time (s, log scale) to read one terabyte of data for read file sizes of 2 MB, 64 MB, 256 MB, and 1 GB via NFS, PVFS2, Hadoop/HDFS, and Zazen]

  • 35

    Read Performance under Writes (1 GB/s)

    [Chart: normalized read performance (0 to 100) for read file sizes of 2 MB, 64 MB, 256 MB, and 1 GB, with no concurrent writes or with concurrent writes of 1 GB, 256 MB, 64 MB, or 2 MB files]

  • 36

    End-to-End Performance

    A HiMach analysis program called water residence, run on 100 nodes

    2.5 million small frame files (430 KB each)

    [Chart: running time (s, log scale) versus application processes per node (1, 2, 4, 8), reading the data via NFS, via Zazen, and from memory]

  • 37

    Robustness

    Worst-case execution time is T(1 + δ(B/b))

    The water-residence program was re-executed with a varying number of nodes powered off

    [Chart: running time (s) versus node failure rate (0% to 50%), comparing the theoretical worst case with the actual running time]

  • 38

    Summary

    Zazen accelerates order-independent, parallel data access by (1) actively caching simulation output, and (2) executing an efficient distributed consensus protocol

    Simple and robust

    Scalable on a large number of nodes

    Much higher performance than NFS/PFS

    Applicable to a certain class of time-dependent simulation datasets
