Parallel Simulations on High-Performance Clusters
C.D. Pham
RESAM laboratory
Univ. Lyon 1, [email protected]
Outline
• Background
– Discrete Event Simulation (DES)
– Parallel DES and the synchronization problems
• The CSAM Tool
– Architecture of the simulator kernel
– The communication network model
• Results
– On mono-processor cluster
– On multi-processor cluster
Simulation
• To simulate is to reproduce the behavior of a physical system with a model
• In practice, computers are used to numerically simulate a logical model
• Simulations are used for performance evaluation and prediction of complex systems
– fluid dynamics, chemical reactions (continuous)
– communication network models: routing, congestion avoidance, mobile… (discrete)
• Simulation is more flexible than analytical methods
Discrete Event Simulation (DES)
• DES assumes that a system changes its state only at discrete points in simulation time
[Figure: events a1 to a4 and d1 to d3 placed on a simulation time axis (0, t, 2t, ... 6t, time-step t), with system states S1, S2, S3]
DES concepts
• fundamental concepts:
– system state (variables)
– state transitions (events)
– simulation time: totally ordered set of values representing time in the system being modeled
• the system state can only be modified upon reception of an event
• modeling can be
– event-oriented
– process-oriented
Life cycle of a DES
• a DES system can be viewed as a collection of simulated objects and a sequence of event computations
• each event computation contains a time stamp indicating when that event occurs in the physical system
• each event computation may:
– modify state variables
– schedule new events into the simulated future
• events are stored in a local event list
– events are processed in time stamp order
– usually, no more events = termination
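The life cycle above can be sketched as a very small sequential kernel. This is an illustrative sketch only, not CSAM code: the `Simulator` class, the `arrival` and `departure` handlers, and the service time of 5 are all assumptions made for the example.

```python
import heapq
import itertools


class Simulator:
    """Minimal sequential DES kernel (illustrative sketch, not CSAM code)."""

    def __init__(self):
        self.now = 0                     # current simulation time
        self._events = []                # local event list, kept as a min-heap
        self._seq = itertools.count()    # tie-breaker for equal time stamps

    def schedule(self, timestamp, handler, *args):
        # An event may only be scheduled in the simulated present or future.
        if timestamp < self.now:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self._events, (timestamp, next(self._seq), handler, args))

    def run(self):
        # Process events in time stamp order; an empty list means termination.
        while self._events:
            self.now, _, handler, args = heapq.heappop(self._events)
            handler(self, *args)


trace = []


def arrival(sim, packet):
    # Event computation: modify state (here, the trace) and schedule a new event.
    trace.append((sim.now, "arrival", packet))
    sim.schedule(sim.now + 5, departure, packet)   # hypothetical service time of 5


def departure(sim, packet):
    trace.append((sim.now, "departure", packet))


sim = Simulator()
for t, p in [(12, "P2"), (5, "P1"), (22, "P3")]:   # inserted out of order on purpose
    sim.schedule(t, arrival, p)
sim.run()
print(trace)
```

Whatever the insertion order, the heap always hands back the event with the smallest time stamp, so the trace comes out ordered by simulation time.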
A simple DES model
[Figure: two nodes A and B connected by a link, with a single local event list]
Model parameters:
– link model delay = 5
– send processing time = 5
– receive processing time = 1
– packet arrival: P1 at 5, P2 at 12, P3 at 22
Local event list:
<e1,5> A receives packet P1
<e2,10> A sends P1 to B
<e3,12> A receives packet P2
<e4,15> B receives P1 from A
<e5,16> B sends ACK(P1) to A
<e6,17> A sends P2 to B
<e7,21> A receives ACK(P1)
<e9,22> A receives packet P3
<e8,23> B receives P2 from A
Why does it work?
• events are processed in time stamp order
• an event at time t can only generate future events with timestamp greater or equal to t (no event in the past)
• generated events are inserted into the event list, which is kept sorted by timestamp
– the event with the smallest timestamp is always processed first,
– causality constraints are implicitly maintained.
Why change? It's so simple!
• models become larger and larger
• the simulation time is overwhelming, or the simulation is simply intractable
• examples:
– parallel programs with millions of lines of code,
– mobile networks with millions of mobile hosts,
– ATM networks with hundreds of complex switches,
– multicast models with thousands of sources,
– the ever-growing Internet,
– and much more...
Some figures to convince...
• ATM network models
– simulation at the cell-level,
– 200 switches,
– 1000 traffic sources, 50 Mbits/s,
– 155 Mbits/s links,
– 1 simulation event per cell arrival,
– simulation time increases as link speed increases,
– usually more than 1 event per cell arrival,
– how scalable is traditional simulation?
More than 26 billion events to simulate 1 second!
30 hours if 1 event is processed in 1 µs
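As a quick check on the orders of magnitude, the arithmetic can be written out. The variable names are illustrative. The slide does not say how the event count or the 30 hours were derived, so the second calculation is only one possible reading of those figures.

```python
events = 26e9            # events needed to simulate 1 second (figure quoted above)
cost_per_event = 1e-6    # wall-clock seconds per event (1 microsecond)

wall_clock_s = events * cost_per_event
wall_clock_h = wall_clock_s / 3600
print(f"{wall_clock_s:.0f} s = {wall_clock_h:.1f} h")

# Per-event cost that would give the quoted 30 hours for the same event count.
implied_cost_us = 30 * 3600 / events * 1e6
print(f"{implied_cost_us:.2f} us per event")
```

At exactly 1 µs per event, 26 billion events take about 7.2 hours. The quoted 30 hours would correspond to roughly 4 µs per event, or equivalently to several events per cell at 1 µs each. Either way, simulating one second of network activity costs hours of sequential computation.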
Parallel simulation - principles
• execution of a discrete event simulation on a parallel or distributed system with several physical processors.
• the simulation model is decomposed into several sub-models that can be executed in parallel
– spatial partitioning,
– temporal partitioning,
• radically different from simple simulation replications.
Parallel simulation - pros & cons
• pros
– reduction of the simulation time,
– increase of the model size,
• cons
– causality constraints are difficult to maintain,
– special mechanisms are needed to synchronize the different processors,
– both the model and the simulation kernel become more complex.
• challenges
– ease of use, transparency.
A simple PDES model
[Figure: A and B are simulated on separate processors, each with its own local event list along the time axis t; the figure marks a causality error (violation)]
Model parameters:
– link model delay = 5
– send processing time = 5
– receive processing time = 1
– packet arrival: P1 at 5, P2 at 12, P3 at 22
Events of A:
<e1,5> A rec. packet P1
<e2,10> A sends P1 to B
<e3,12> A rec. packet P2
<e6,17> A sends P2 to B
<e7,21> A rec. ACK(P1)
<e9,22> A rec. packet P3
Events of B:
<e4,15> B rec. P1 from A
<e5,16> B sends ACK(P1)
<e8,23> B rec. P2 from A
Synchronization problems
• fundamental concepts
– each Logical Process (LP) can be at a different simulation time
– local causality constraints: events in each LP must be executed in time stamp order
• synchronization algorithms
– Conservative: avoids local causality violations by waiting until it is safe to proceed
– Optimistic: allows local causality violations, but provides mechanisms to recover from them at runtime
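To make the local causality constraint concrete, the sketch below shows a toy LP that records any event arriving with a time stamp below its local simulation time. The `LogicalProcess` class is hypothetical. The delivery order is also an assumption chosen for illustration: the ACK stamped 21 reaches A only after A has executed its local event at time 22.

```python
class LogicalProcess:
    """Toy LP that records local causality violations (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.local_time = 0      # simulation time reached by this LP
        self.violations = []     # (label, event time stamp, local time at arrival)

    def execute(self, timestamp, label):
        # Local causality constraint: time stamps must never decrease.
        if timestamp < self.local_time:
            self.violations.append((label, timestamp, self.local_time))
            return False
        self.local_time = timestamp
        return True


a = LogicalProcess("A")
# Hypothetical delivery order at A: the remote event e7 (time stamp 21)
# shows up after the local event e9 (time stamp 22) has been executed.
for ts, label in [(5, "e1"), (10, "e2"), (12, "e3"), (17, "e6"), (22, "e9"), (21, "e7")]:
    a.execute(ts, label)
print(a.violations)
```

A conservative kernel would hold e9 back until it knows that nothing earlier can still arrive from B. An optimistic kernel would execute e9 and then roll A back when e7 arrives.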
CSAM (Pham, UCBL)
• CSAM: Conservative Simulator for ATM network Model
• Simulation at the cell-level
• Conservative and/or sequential
• C++ programming-style, predefined generic models of sources, switches, links…
• New models can be easily created by deriving from base classes
• Configuration file that describes the topology
CSAM - Kernel characteristics
• Exploits the lookahead of communication links: transparent for the user
• Virtual Input Channels
– reduces overhead for event manipulation,
– reduces overhead for null-messages handling.
• Cyclic event execution
• Message aggregation
– static aggregation size,
– asymmetric aggregation size on CLUMPS,
– sender-initiated,
– receiver-initiated.
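Message aggregation with a static size and a sender-initiated flush could look roughly like the sketch below. It is not taken from CSAM: `AggregatingChannel` is a hypothetical name, the aggregation size is counted in messages (the slides do not state the unit), and the `send` callable stands in for a real transfer such as an MPI send.

```python
class AggregatingChannel:
    """Buffers outgoing messages and ships them in batches (sketch only)."""

    def __init__(self, send, aggregation_size):
        self._send = send                 # callable performing the real transfer
        self._size = aggregation_size     # static aggregation size, in messages
        self._buffer = []

    def post(self, timestamp, payload):
        # Queue one message; ship the batch once the static size is reached.
        self._buffer.append((timestamp, payload))
        if len(self._buffer) >= self._size:
            self.flush()

    def flush(self):
        # Sender-initiated flush: ship whatever is buffered as one transfer.
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


transfers = []
channel = AggregatingChannel(transfers.append, aggregation_size=4)
for t in range(10):
    channel.post(t, f"cell-{t}")
channel.flush()    # e.g. at the end of a cycle, so no message is held back
print([len(batch) for batch in transfers])
```

Ten messages are carried in three transfers instead of ten. The trade-off is that a buffered message reaches its destination later, which is why an explicit flush is needed.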
CSAM - Life cycle
[Figure: two panels, each showing MPI buffers for input channels 1 to 3 feeding the Future Event List, with events t2 to t10 and the rule safetime = min(last[i]). Panel (a): end of cycle, send a null-message (label t2+L). Panel (b): get new messages, begin new cycle.]
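The rule safetime = min(last[i]) can be sketched as one conservative cycle. The function and argument names are assumptions, and so is the choice of safetime plus the lookahead L as the null-message time stamp. The real kernel also handles MPI buffers and virtual input channels, which are left out here.

```python
import heapq


def run_cycle(future_events, last, lookahead):
    """One conservative cycle: execute safe events, return (executed, null_ts).

    future_events: min-heap of (timestamp, label) tuples
    last[i]: time stamp of the last message received on input channel i
    lookahead: minimum delay L added to anything this LP sends
    """
    safetime = min(last)
    executed = []
    # Only events up to safetime are guaranteed not to be preceded by a
    # message that has not arrived yet.
    while future_events and future_events[0][0] <= safetime:
        executed.append(heapq.heappop(future_events))
    # Promise to neighbours: nothing will be sent with a time stamp below this.
    null_timestamp = safetime + lookahead
    return executed, null_timestamp


fel = [(3, "t3"), (4, "t4"), (5, "t5"), (7, "t7")]
heapq.heapify(fel)
executed, null_ts = run_cycle(fel, last=[5, 6, 9], lookahead=2)
print(executed, null_ts, fel)
```

With last = [5, 6, 9] the safe time is 5, so the events at 3, 4 and 5 are executed and the event at 7 stays in the list until a later cycle raises the safe time.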
Test case: 78-switch ATM network
Distance-Vector Routing with dynamic link cost functions
Connection setup, admission control protocols
Why is it difficult?
• Very small granularity: 1 message represents 1 cell transfer
– high level of message synchronization
– very small computation/communication ratio
• Load imbalance between links
– large number of control messages
– partitioning and load balancing are difficult
Parallel Simulation on High Performance Clusters
• Myrinet-based cluster of 12 Pentium Pro at 200MHz, 64 MBytes, Linux
• Myrinet-based cluster of 4 dual Pentium Pro at 450MHz, 128 MBytes, Linux
• Myrinet board with LANai 4.1, 256KB
• BIP, BIP-SMP, MPI/BIP, MPI/BIP-SMP communication libraries
Speedup on a Myrinet cluster
Pentium Pro 200MHz
More than 53 million events to simulate 0.31s
[Figure: speedup (y-axis, 0 to 7) versus number of processors (x-axis, 2 to 10)]
Speedup with CLUMPS
Dual Pentium Pro 450MHz
[Figure: speedup (y-axis, 0 to 2.5) for each aggregation setting (no aggr, 156, 256, 512, 1024, 256-156, 512-156, 1024-156); series: 2 ext., 2 int., 4 ext., 2x2 int.]
Increasing the model size (CLUMPS)
Dual Pentium Pro 450MHz, 4x2 int
[Figure: speedup (y-axis, 0 to 5) for each aggregation setting (no aggr, 156, 256, 512, 1024, 256-156, 512-156, 1024-156); series: 78 switches, 156 switches]