A Scalable FPGA-based Multiprocessor for Molecular Dynamics Simulation

Arun Patel¹, Christopher A. Madill²,³, Manuel Saldaña¹, Christopher Comis¹, Régis Pomès²,³, Paul Chow¹

Presented by: Arun Patel
Connections 2006: The University of Toronto ECE Graduate Symposium
Toronto, Ontario, Canada, June 9th, 2006

1: Department of Electrical and Computer Engineering, University of Toronto
2: Department of Structural Biology and Biochemistry, The Hospital for Sick Children
3: Department of Biochemistry, University of Toronto
Introduction
– FPGAs can accelerate many computing tasks by two to three orders of magnitude
– Supercomputers and computing clusters have been designed to improve computing performance by scaling to many processors
– Our work focuses on developing a computing cluster based on a scalable network of FPGAs
– The initial design is tailored to performing Molecular Dynamics simulations
Molecular Dynamics
– Combines empirical force calculations with Newton’s equations of motion
– Predicts the time trajectory of small atomic systems
– Computationally demanding
1. Calculate interatomic forces
2. Calculate the net force
3. Integrate Newtonian equations of motion
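In code, one timestep of this three-step loop might look like the following minimal C++ sketch. The Vec3/Atom types, the σ = ε = 1 Lennard-Jones pair force, and the simple Euler integrator are illustrative assumptions, not the simulator described in this talk.

```cpp
#include <cstddef>
#include <vector>

// 3-component vector with just the operations the loop below needs.
struct Vec3 {
    double x = 0.0, y = 0.0, z = 0.0;
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    Vec3 operator*(double s) const { return {x * s, y * s, z * s}; }
};

struct Atom { Vec3 r, v, f; double m = 1.0; };

// Illustrative pair force: Lennard-Jones with sigma = epsilon = 1.
// A production force field would evaluate all the terms shown on the
// next slide (bonds, angles, torsions, van der Waals, electrostatics).
Vec3 pair_force(const Vec3& ri, const Vec3& rj) {
    Vec3 d = ri - rj;
    double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    double inv6 = 1.0 / (r2 * r2 * r2);
    return d * (24.0 * inv6 * (2.0 * inv6 - 1.0) / r2);
}

void md_step(std::vector<Atom>& atoms, double dt) {
    // Steps 1 and 2: interatomic forces, accumulated into net forces.
    for (Atom& a : atoms) a.f = Vec3{};
    for (std::size_t i = 0; i < atoms.size(); ++i) {
        for (std::size_t j = i + 1; j < atoms.size(); ++j) {
            Vec3 fij = pair_force(atoms[i].r, atoms[j].r);
            atoms[i].f = atoms[i].f + fij;   // Newton's third law: equal
            atoms[j].f = atoms[j].f - fij;   // and opposite forces
        }
    }
    // Step 3: integrate the equations of motion (simple Euler update;
    // production MD codes typically use velocity Verlet).
    for (Atom& a : atoms) {
        a.v = a.v + a.f * (dt / a.m);
        a.r = a.r + a.v * dt;
    }
}
```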
Molecular Dynamics

$$U = \sum_{\text{All Bonds}} k_b (l - l_o)^2 + \sum_{\text{All Angles}} k_\theta (\theta - \theta_o)^2 + \sum_{\text{All Torsions}} A\,[1 + \cos(n\tau)] + \sum_{\text{All Pairs}} 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] + \sum_{\text{All Pairs}} \frac{q_1 q_2}{r}$$
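Step 1 of the MD loop obtains each pairwise force as the negative gradient of this potential. For the Lennard-Jones pair term alone, differentiating with respect to r gives (a standard derivation, not taken from the slides):

$$\vec{F}_{LJ}(r) = -\frac{dU_{LJ}}{dr}\,\hat{r} = \frac{24\epsilon}{r}\left[2\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]\hat{r}$$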
Why Molecular Dynamics?
1. Inherently Parallelizable
2. Computationally Demanding (e.g., 30 CPU years)
Motivation for Architecture
• The majority of hardware accelerators achieve a ~10²–10³× improvement over software by:
– Pipelining a serially-executed algorithm, or
– Performing operations in parallel
• Such techniques do not address large-scale computing applications (such as MD):
– Much greater speedups are required (10⁴–10⁵×)
– These are not likely with a single hardware accelerator
• The ideal solution for large-scale computing combines:
– The scalability of modern HPC platforms
– The performance of hardware acceleration
The “TMD” Machine
• An investigation of an FPGA-based architecture:
– Designed for applications that exhibit a high compute-to-communication ratio
– Made possible by the integration of microprocessors and high-speed communication interfaces into modern FPGA packages
Inter-Task Communication
• Based on the Message Passing Interface (MPI):
– A popular message-passing standard for distributed applications
– Implementations are available for virtually every HPC platform
• TMD-MPI:
– A subset of the MPI standard developed for the TMD architecture
– A software library for tasks implemented on embedded microprocessors
– A hardware Message Passing Engine (MPE) for hardware computing tasks
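Because TMD-MPI keeps the standard MPI interface, application code looks like ordinary MPI. The following minimal sketch (illustrative, not from the TMD-MPI distribution) sends a buffer from rank 0 to rank 1 using only the point-to-point primitives at the core of the implemented subset; the payload size is an arbitrary assumption.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> buf(384, 0.0);  // hypothetical message payload

    if (rank == 0) {
        MPI_Send(buf.data(), (int)buf.size(), MPI_DOUBLE,
                 /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), (int)buf.size(), MPI_DOUBLE,
                 /*source=*/0, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with a workstation MPI toolchain for prototyping, the same source can later link against TMD-MPI on an embedded processor, or have its communication handled by the MPE once the task moves to hardware.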
MD Software Implementation
[Figure: the software prototype as a ring of MPI processes, in which Atom Store tasks and Force Engine tasks exchange position (r→) and force (F→) messages over an interconnection network; the prototype is built with mpiCC. A code sketch of this exchange follows below.]
• Design Flow:
– Testing and validation
– Parallel design
– Software-to-hardware transition
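A minimal sketch of the r→/F→ exchange between an Atom Store process and a Force Engine process, under assumed conventions (two ranks, hypothetical tags, a stub force routine); the actual prototype distributes atoms across several processes of each kind.

```cpp
#include <mpi.h>
#include <vector>

enum { TAG_POSITIONS = 1, TAG_FORCES = 2 };   // hypothetical message tags

// Stub standing in for the force-field evaluation.
void compute_forces(const std::vector<double>& r, std::vector<double>& f) {
    for (double& fi : f) fi = 0.0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 3 * 64;                     // 64 atoms, xyz each (assumed)
    std::vector<double> r(n, 0.0), f(n, 0.0);

    if (rank == 0) {                          // Atom Store role
        MPI_Send(r.data(), n, MPI_DOUBLE, 1, TAG_POSITIONS, MPI_COMM_WORLD);
        MPI_Recv(f.data(), n, MPI_DOUBLE, 1, TAG_FORCES,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // ...the Atom Store would now integrate the equations of motion...
    } else if (rank == 1) {                   // Force Engine role
        MPI_Recv(r.data(), n, MPI_DOUBLE, 0, TAG_POSITIONS,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        compute_forces(r, f);
        MPI_Send(f.data(), n, MPI_DOUBLE, 0, TAG_FORCES, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```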
Current Work
[Figure: two XC2VP100 FPGAs, each hosting a PPC-405 embedded processor. Atom Store tasks are built with ppc-g++ plus TMD-MPI and run on the embedded processors, while each Force Engine is translated from C++ to HDL, attached to a TMD-MPE, and synthesized to hardware.]
• Replace software processes with hardware computing engines
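A sketch of what a Force Engine kernel might look like before the C++-to-HDL step: a flat, constant-trip-count loop with no dynamic memory, the style HLS-like flows can turn into a pipelined datapath. The function name, the block size of 64, and the σ = ε = 1 Lennard-Jones force are illustrative assumptions, and the neighbour block is assumed not to contain the target atom itself.

```cpp
// Pairwise-force kernel: one target atom against a block of neighbours,
// accumulating the net force (steps 1 and 2 of the MD loop).
void force_engine(float xi, float yi, float zi,
                  const float xj[64], const float yj[64], const float zj[64],
                  float& fx, float& fy, float& fz) {
    float ax = 0.0f, ay = 0.0f, az = 0.0f;
    for (int j = 0; j < 64; ++j) {           // candidate for loop pipelining
        float dx = xi - xj[j];
        float dy = yi - yj[j];
        float dz = zi - zj[j];
        float r2 = dx * dx + dy * dy + dz * dz;
        float inv2 = 1.0f / r2;
        float inv6 = inv2 * inv2 * inv2;     // 1 / r^6
        float mag = 24.0f * inv6 * (2.0f * inv6 - 1.0f) * inv2;
        ax += mag * dx;                      // accumulate net force on i
        ay += mag * dy;
        az += mag * dz;
    }
    fx = ax; fy = ay; fz = az;
}
```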
Acknowledgements
SOCRN
TMD Group and past members: Dr. Paul Chow, Dr. Régis Pomès, Arun Patel, Christopher Madill, Manuel Saldaña, Christopher Comis, David Chui, Sam Lee, Andrew House, Daniel Nunes, Emanuel Ramalho, Lesley Shannon
Large-Scale Computing Solutions
• Class 1 Machines:
– Supercomputers or clusters of workstations
– ~10–10⁵ interconnected CPUs
• Class 2 Machines:
– A hybrid network of CPU and FPGA hardware
– The FPGA acts as an external co-processor to the CPU
– The programming model is still evolving
• Class 3 Machines:
– A network of FPGA-based computing nodes
– A recent area of academic and industrial focus
TMD Communication Infrastructure
• Tier 1: Intra-FPGA Communication
– Point-to-point FIFOs are used as communication channels
– Asynchronous FIFOs isolate clock domains
– Application-specific network topologies can be defined
• Tier 2: Inter-FPGA Communication
– Multi-gigabit serial transceivers are used for inter-FPGA communication
– A fully-interconnected network topology uses 2N(N-1) pairs of traces (for example, N = 4 FPGAs would require 2·4·3 = 24 pairs)
• Tier 3: Inter-Cluster Communication
– Commercially-available switches interconnect cluster PCBs
– Built-in features for large-scale computing: fault tolerance and scalability
TMD “Computing Tasks” (1/2)
• Computing Tasks:
– Applications are defined as a collection of computing tasks
– Tasks communicate by passing messages
• Task Implementation Flexibility:
– Software processes executing on embedded microprocessors
– Dedicated hardware computing engines
[Figure: a task maps either to a computing engine or an embedded microprocessor on a Class 3 machine, or to a process on a CPU node of a Class 1 machine]
TMD “Computing Tasks” (2/2)
• Computing Task Granularity:
– Tasks can vary in size and complexity
– Not restricted to one task per FPGA
[Figure: tasks A through M mapped onto FPGAs; a single FPGA may host one large task or several smaller ones]
TMD-MPI Software Implementation
The TMD-MPI stack spans four layers, from the application down to the hardware:
• MPI Application Interface
• Point-to-Point MPI Functions
• Send/Receive Implementation
• FSL Hardware Interface
Layer 4: MPI Interface – all MPI functions implemented in TMD-MPI that are available to the application.
Layer 3: Collective Operations – barrier synchronization, data gathering, and message broadcasts.
Layer 2: Communication Primitives – MPI_Send and MPI_Recv methods used to transmit data between processes.
Layer 1: Hardware Interface – low-level methods to communicate with FSLs for both on-chip and off-chip communication.
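An illustrative sketch of this layering (assumed structure, not the actual TMD-MPI source): a Layer 2 send primitive built on a Layer 1 FSL interface. All names and the header format here are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Layer 1: hardware interface. On a real embedded processor this would
// wrap the FSL access primitives (e.g., the MicroBlaze putfsl macro);
// here a host-side stub just logs the write so the sketch is runnable.
void fsl_write_word(int channel, uint32_t word) {
    std::printf("FSL%d <- 0x%08x\n", channel, (unsigned)word);
}

// Layer 2: communication primitive. Frame the payload with a small
// header (destination rank, tag, length) and stream it word by word.
void tmd_send(const uint32_t* buf, uint32_t nwords, uint32_t dest,
              uint32_t tag) {
    const int channel = 0;            // assume one outbound FSL channel
    fsl_write_word(channel, dest);
    fsl_write_word(channel, tag);
    fsl_write_word(channel, nwords);
    for (uint32_t i = 0; i < nwords; ++i)
        fsl_write_word(channel, buf[i]);
}

int main() {
    uint32_t payload[2] = {0xDEADBEEF, 0xCAFEF00D};
    tmd_send(payload, 2, /*dest=*/1, /*tag=*/0);
    return 0;
}
```

Because the same FSL interface serves both on-chip FIFOs and the off-chip serial links, the upper layers need not know where the receiving task lives.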
TMD Application Design Flow
• Step 1: Application Prototyping
– A software prototype of the application is developed
– Profiling identifies compute-intensive routines
• Step 2: Application Refinement
– The application is partitioned into tasks that communicate using MPI
– Each task emulates a computing engine
– Communication patterns are analyzed to determine the network topology
• Step 3: TMD Prototyping
– Tasks are ported to soft processors on the TMD
– The software is refined to use the TMD-MPI library
– The on-chip communication network is verified
• Step 4: TMD Optimization
– Compute-intensive tasks are replaced with hardware engines
– The MPE handles communication for the hardware engines
[Figure: the application prototype becomes processes A, B, and C, which become TMD tasks A, B, and C; finally, task B is replaced by a hardware engine]
Future Work – Phase 2
[Figure: TMD Version 2 prototype]
Future Work – Phase 3
The final TMD architecture will contain a hierarchical network of FPGA chips.