
University of Illinois at Urbana-Champaign

Memory Architectures for Protein Folding:

MD on million PIM processors

Fort Lauderdale, May 03

L.V. Kale: MD on very Large PIM machines

Overview

EIA-0081307: “ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem”

PIs:

– Josep Torrellas and Laxmikant Kale (University of Illinois)

– Mark Tuckerman (New York University)

– Michael Klein (University of Pennsylvania)

– Also associated: Glenn Martyna (IBM)

Period: 8/00 - 7/03


Project Description

Multidisciplinary project spanning computer architecture, software, and computational biology

Goals:

– Design improved algorithms to help solve the protein folding problem

– Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem


Some Recent Progress: Ideas

Developed REPSWA (Reference Potential Spatial Warping Algorithm), a novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding

– Based on a "spatial warping" variable transformation, designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system

– Result: large gains in sampling efficiency

– Z. Zhu, M. E. Tuckerman, S. O. Samuelson, and G. J. Martyna, "Using novel variable transformations to enhance conformational sampling in molecular dynamics," Phys. Rev. Lett. 88, 100201 (2002)


Some Recent Progress: Tools

Developed LeanMD, a parallel molecular dynamics program that targets very large-scale parallel machines

– Research-quality program based on the Charm++ parallel object-oriented language

– Descendant of NAMD (a parallel molecular dynamics application) that achieved unprecedented speedup on thousands of processors

– LeanMD is designed to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C

– Requires a new parallelization strategy that breaks the simulation problem up in a more fine-grained manner, generating enough parallelism to effectively distribute work across a million processors
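The fine-grained decomposition idea can be illustrated with a small sketch (hypothetical function and object names, not LeanMD's actual API): partition space into small cells and create one compute unit per interacting cell pair, so the number of schedulable work units far exceeds the cell count, and can be made to exceed the processor count.

```python
# Sketch of fine-grained spatial decomposition: a 3-D grid of cells with
# periodic boundaries, plus one "compute" work unit per interacting cell
# pair (including each cell's self-interaction).
from itertools import product

def make_work_units(cells_per_dim):
    """Return cell ids and the list of (cell, neighbor-cell) compute pairs."""
    cells = list(product(range(cells_per_dim), repeat=3))
    index = {c: i for i, c in enumerate(cells)}
    pairs = set()
    for (x, y, z) in cells:
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            n = ((x + dx) % cells_per_dim,
                 (y + dy) % cells_per_dim,
                 (z + dz) % cells_per_dim)
            a, b = sorted((index[(x, y, z)], index[n]))
            pairs.add((a, b))       # unordered, so each pair is counted once
    return cells, sorted(pairs)

cells, pairs = make_work_units(10)  # 10 x 10 x 10 cells
print(len(cells), len(pairs))       # → 1000 14000
```

With 1,000 cells there are already 14,000 pair computes to schedule; shrinking the cells (or splitting pair computes further) multiplies the available parallelism, which is the property needed to keep a million processors busy.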


Some Recent Progress: Tools

Developed a high-performance communication library

– For collective communication operations: all-to-all personalized communication, all-to-all multicast, and all-reduce

These operations can be complex and time-consuming on large parallel machines

Especially costly for applications that involve all-to-all patterns, such as 3-D FFT and sorting

– The library optimizes collective communication operations by combining messages, imposing a virtual topology on the processors

– The overhead of all-to-all communication for 76-byte message exchanges among 2058 processors is in the low tens of milliseconds
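The benefit of message combining over a virtual topology can be seen by counting messages. A minimal sketch (illustrative, not the library's actual strategy or API) using a virtual 2-D mesh: each processor sends combined messages along its row, then along its column, replacing P-1 direct sends with 2(sqrt(P)-1) larger ones.

```python
# Message counts per processor for an all-to-all personalized exchange:
# direct sends vs. combining over a virtual sqrt(P) x sqrt(P) mesh.
import math

def direct_messages(p):
    """Direct all-to-all: each processor sends p - 1 messages."""
    return p - 1

def mesh_messages(p):
    """Combining on a virtual mesh: a row phase then a column phase,
    each sending sqrt(p) - 1 combined messages."""
    side = math.isqrt(p)
    assert side * side == p, "sketch assumes a perfect-square processor count"
    return 2 * (side - 1)

for p in (64, 1024, 4096):
    print(p, direct_messages(p), mesh_messages(p))
# → 64 63 14
# → 1024 1023 62
# → 4096 4095 126
```

Each message travels an extra hop, but for small payloads (like the 76-byte exchanges above) the per-message software overhead dominates, so sending far fewer, larger messages wins.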


Some Recent Progress: People

The following graduate student researchers have been supported:

– Sameer Kumar (University of Illinois)

– Gengbin Zheng (University of Illinois)

– Jun Nakano (University of Illinois)

– Zhongwei Zhu (New York University)


Overview

Rest of the talk:

– Objective: Develop a molecular dynamics program that will run effectively on a million processors, each with a low memory-to-processor ratio

– Method: Use parallel objects methodology

Develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture

– Presenting Today: Simulator details

LeanMD Simulation on BG/L and BG/C


Performance Prediction on Large Machines

Problem:

– How to predict performance of applications on future machines?

– How to do performance tuning without continuous access to a large machine?

Solution:

– Leverage virtualization

– Develop a machine emulator

– Simulator: accurate time modeling

– Run a program on “100,000 processors” using only hundreds of processors
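The virtualization idea behind that last point can be sketched in a few lines (hypothetical names, not the emulator's actual data structures): each emulated processor is an object with its own message queue and virtual clock, and a physical processor simply schedules the many emulated processors assigned to it.

```python
# Minimal sketch of mapping many emulated processors onto few physical
# ones: per-virtual-processor state, round-robin assignment.
class VirtualProc:
    def __init__(self, vp_id):
        self.vp_id = vp_id
        self.inbox = []      # pending messages for this emulated node
        self.clock = 0.0     # this emulated node's virtual time

def map_virtual_to_physical(num_virtual, num_physical):
    """Round-robin assignment of emulated processors to physical ones."""
    assignment = [[] for _ in range(num_physical)]
    for vp in range(num_virtual):
        assignment[vp % num_physical].append(VirtualProc(vp))
    return assignment

assignment = map_virtual_to_physical(100_000, 200)
print(len(assignment[0]))    # → 500 emulated processors per physical one
```

Because each emulated processor's state is self-contained, the same program logic runs whether a node is real or emulated; only the scheduler and the clock handling differ.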


Blue Gene Emulator: functional view

[Diagram: two emulated Blue Gene nodes, each with worker threads and communication threads fed by an inBuff, affinity and non-affinity message queues, and a CorrectionQ, all driven by the Converse scheduler and the Converse queue.]


Emulator to Simulator

Emulator:

– Study the programming model and application development

Simulator:

– Adds performance-prediction capability

– Models communication latency based on a network model

– Does not model on-chip memory access or network contention

Parallel performance is hard to model:

– Communication subsystem: out-of-order messages, communication/computation overlap

– Event dependencies

Parallel Discrete Event Simulation:

– The emulation program executes in parallel, with event time stamp correction

– Exploits the inherent determinacy of the application


How to simulate?

Time stamping events

– Per-thread timer (sharing one physical timer)

– Time-stamp messages: calculate communication latency from the network model

Parallel event simulation

– When a message is sent out, calculate its predicted arrival time at the destination Blue Gene processor

– When a message is received, update the current time as: currTime = max(currTime, recvTime)

– Time stamp correction
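The send and receive rules above can be sketched directly (the latency constants are illustrative placeholders, not Blue Gene parameters):

```python
def send_time(curr_time, msg_bytes, alpha=5e-6, beta=1e-9):
    """Predicted arrival time at the destination under a simple
    linear latency model: alpha = per-message cost (s),
    beta = per-byte cost (s/byte). Both values are illustrative."""
    return curr_time + alpha + beta * msg_bytes

def receive(curr_time, recv_time, exec_time):
    """Advance the receiver's virtual clock: the event cannot start
    before the message arrives (currTime = max(currTime, recvTime))."""
    start = max(curr_time, recv_time)
    return start, start + exec_time

arrival = send_time(curr_time=1.0, msg_bytes=76)
start, done = receive(curr_time=0.5, recv_time=arrival, exec_time=1e-4)
print(start >= arrival)   # → True: execution never begins before arrival
```

The max() is what makes out-of-order delivery visible: if a message arrives carrying a receive time earlier than the receiver's current virtual time, its recorded execution time is wrong and must later be repaired by time stamp correction.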


Parallel correction algorithm

Sort message execution by receive time

Adjust time stamps when needed

Use a correction message to announce the change in an event's startTime

Send correction messages along the path the original message took

Events already placed in the timeline may have to move
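A minimal single-timeline sketch of these steps (illustrative names; the real algorithm runs in parallel and propagates corrections as messages): re-sort events by receive time, recompute start times, and report which events moved.

```python
def rebuild_timeline(events):
    """events: list of (msg_id, recv_time, exec_time) on one emulated
    processor. Re-sort by receive time, recompute each event's start
    time, and return {msg_id: start_time}."""
    timeline = {}
    clock = 0.0
    for msg_id, recv_time, exec_time in sorted(events, key=lambda e: e[1]):
        start = max(clock, recv_time)   # cannot start before arrival
        timeline[msg_id] = start
        clock = start + exec_time       # next event queues behind this one
    return timeline

events = [("M1", 0.0, 1.0), ("M2", 1.0, 1.0), ("M3", 2.0, 1.0)]
before = rebuild_timeline(events)
# A correction message updates M1's receive time; later events may shift.
events[0] = ("M1", 1.5, 1.0)
after = rebuild_timeline(events)
moved = sorted(m for m in before if after[m] != before[m])
print(moved)   # → ['M1', 'M3']
```

Each moved event would, in the parallel algorithm, trigger its own correction messages to the processors that received messages it sent, which is why corrections follow the original message paths.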


Timestamps Correction

[Diagram sequence, slides 14-17: messages M1..M8 are placed on an execution timeline in receive-time order. M8 arrives and is inserted into the timeline at its receive time; a correction message for M4 then changes M4's position, and already-scheduled events (M5, M6) shift on the timeline, with further correction messages propagating the new start times.]


Validation

[Graph: predicted time vs. latency factor]


LeanMD

LeanMD is a molecular dynamics simulation application written in Charm++

Next generation of NAMD

– Gordon Bell Award winner at SC2002

Requires a new parallelization strategy

– Break the problem up in a more fine-grained manner to effectively distribute work across an extremely large number of processors


LeanMD Performance Analysis

Need readable graphs:

One to a page is fine, but with larger fonts and thicker lines
