
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

Creating the Next Generation of Stable, Interoperable and Performant Simulators – A Call for Open Standards

Jeremiah Wilke*, Joseph Kenny*, Robert Clay*, Simon Hammond+, Arun Rodrigues+, Scott Hemmert+, James Ang+, Sudhakar Yalamanchili‡

*Sandia CA, +Sandia ABQ, ‡Georgia Tech

We need models of varying fidelity: how many tools do we need for all of them?
§ High fidelity
  § What you care about only exists at cycle-accurate detail
  § Cache reuse policies in memory
  § Flit-level flow control
  § Validation of lower-fidelity models
§ Medium fidelity
  § Coarse-grained modeling of architecture at system scale
  § Adaptive routing without flit detail
  § Scaling of collectives with network congestion
  § Validation of constitutive models at scale
§ Constitutive models
  § Potentially good accuracy with the right fitting
  § Rapid parameter-space exploration

2  

We need models of varying fidelity: how many tools do we need for all of them?

3  

SST Structural Simulation Toolkit

We need models of varying fidelity: how many tools do we need for all of them?

4  

SST Sonic Screwdriver Toolchain

How do we defeat the Daleks?
§ Common Core
  § Scale up performance = scale up performance simulation = parallel simulation (PDES)
§ Composability
  § Define standards for composing models that speak the same language and share the same notion of time
§ Community
  § Concerted effort to define standards and reuse code

5  


[Figure: LP 0 and LP 1 scheduling future events along virtual time within a time window]

What is the major source of suffering in parallel discrete event simulation (PDES)?
§ Scale up machines = scale up simulations of machines = PDES
§ Need event management and scheduling
  § Avoid time-order violations!

7  

Dependencies between events

What is the major source of suffering in parallel discrete event simulation?

8  

[Figure: LP 0 and LP 1 scheduling future events; after 2 s of wall time, an event arrives out of order – violation!]

What is the major source of suffering in parallel discrete event simulation?

9  

[Figure: LP 0 and LP 1 scheduling future events within the time window, wall time = 2 s]
True for all models and fidelities!

[Figure: LP 0 and LP 1 scheduling future events; the safe time window is set by lookahead]

What is the major source of suffering in parallel discrete event simulation?
§ Parallelism is possible mainly from lookahead (a safe time window) based on the virtual latency between components

10  

Solutions to the problem exist, but are non-trivial: why rewrite them over and over?
§ "Naïve" conservative time-stepping algorithm
  § Global, collective communication
  § Communication optimizations lower the prefactor, but the approach has scalability limits

11  

Figure 3. Snapshot of events and timestamps for an example conservative PDES with LP queues: local queue (LP 0) at t_local = 7 with pending Event C (t = 10) and Event D (t = 21); LP 1 queue holding Event A (t = 8, t_min = 8, t_max = 18); LP 2 queue holding Event B (t = 17, t_min = 17, t_max = 27); lookahead δ = 10; global t_max = 18.

Figure 4. Snapshot of events and timestamps for an example conservative PDES with LP queues: local queue (LP 0) advanced to t_local = 17 with Event D (t = 21) still pending; LP 1 queue holding Event E (t = 19, t_min = 19, t_max = 29); LP 2 queue unchanged (t_min = 17, t_max = 27); lookahead δ = 10; global t_max = 27.

In checkpointing, the entire application state is saved at regular intervals, providing a fallback point. In reverse computation, the entire state is not saved; only an event list is maintained. When violations occur, the events must provide a "reverse" mechanism to undo all state changes. Checkpointing induces a significant overhead even if no time violations occur. Reverse computation only induces overhead when violations occur. Detailed packet-level network simulations have been performed at very large scales using reverse computation with the ROSS simulator []. Both checkpointing and reverse computation present special challenges for our desired simulation mode. There is a large amount (several TB) of application state to save when simulating 100K-1M network endpoints, causing a very high storage and time overhead. Checkpoint strategies may not be feasible. Reverse computation also presents significant challenges since we want to allow arbitrary code to be executed on real software stacks. If the simulation proceeds through a simple and well-defined state machine, programming reverse computation can be straightforward. A packet sent that empties a queue, e.g., can be reinserted into the queue. It is not obvious (or likely even possible) to define reverse computation for arbitrary code. Thus, despite the potential for increased parallelism with optimistic PDES, we pursue a conservative strategy here.

Conservative approaches generally operate through event queues []. Each LP maintains an event queue for every LP it might receive events from (Figures 3, 4). Each event queue has an associated time stamp indicating the last known virtual time at the other LP. As an LP receives events on a queue, the new events advance the timestamp. In Figure 3, the local process (LP 0) has synchronized and received events with the timestamps t = 8 and t = 17 from the remote LPs 1 and 2. Because the lookahead is δ = 10, LP 0 is guaranteed that LP 1 will not deliver anything new with timestamp less than 18 and LP 2 will not deliver anything with timestamp less than 27. The local process can therefore safely run Event C at t = 10 and Event B at t = 17 and advance time until t = 18. The local process now stalls and is unable to run Event D. In the next synchronization, LP 0 receives Event E with time t = 19. It has not received any new messages from LP 2, but it can still advance its time window until t = 27 and can safely run Event E and Event D. This scheme requires an LP to regularly receive events on each queue. If no events are received, the queue time is never updated and an LP will block waiting for updates. If such a block occurs, an LP must send so-called null messages to "request" a time update (although methods for avoiding null messages have been proposed []). Once a new time update is received, the LP can advance its time window and run more events.
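The per-queue bound described above is simply "last known time on that queue plus the lookahead"; an LP may safely run any event below the minimum of those bounds. A minimal C++ sketch of that rule, using the numbers from Figure 3 (the function and containers here are illustrative, not an actual simulator API):

    #include <algorithm>
    #include <cstdio>
    #include <limits>
    #include <map>

    // Conservative bound: an LP may safely run every event with a timestamp
    // strictly below min over its input queues of (last known time + lookahead).
    double safe_horizon(const std::map<int, double>& last_known_time, double lookahead) {
      double horizon = std::numeric_limits<double>::max();
      for (const auto& q : last_known_time)
        horizon = std::min(horizon, q.second + lookahead);
      return horizon;
    }

    int main() {
      const double lookahead = 10.0;                          // delta in Figure 3
      std::map<int, double> queues = {{1, 8.0}, {2, 17.0}};   // LP 1 last seen at t=8, LP 2 at t=17
      // Prints 18: Events C (t=10) and B (t=17) are safe; Event D (t=21) must wait.
      std::printf("safe until t = %g\n", safe_horizon(queues, lookahead));
      return 0;
    }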

This scheme can be highly efficient for certain workloads. Even if there are 100 LPs, each LP might only communicate with, e.g., 3 other neighbors. Only local point-to-point communication is required to synchronize times rather than a global communication. When many events are sent between LPs (i.e., null messages are not needed), it also provides some degree of asynchrony, as LPs do not explicitly block waiting on each other. For our workloads, however, the above algorithm becomes highly inefficient. In many network topologies and models, each LP is connected to every other LP, erasing the benefit of local communication. Furthermore, our simulator emphasizes coarse-grained compute and network models. Large time gaps often exist between consecutive events, often much larger than the lookahead. This leads to a pathological situation in which LPs are consumed sending null messages back and forth.

while t < t_termination do
    Run all events until t + δ
    Log event sends to S = {N_events^LP0, N_bytes^LP0, N_events^LP1, ...}
    MPI_ReduceScatter with SUM on array S
    MPI_ReduceScatter returns N_events^total, N_bytes^total
    N_bytes^left = N_bytes^total
    for msgNum < N_events^total do
        MPI_Recv(void*, MPI_ANY, N_bytes^left)
        N_bytes^left = N_bytes^left - msgSize
    end for
    Determine t_min^local
    MPI_Allreduce(t_min^global, t_min^local)
    t = t_min^global
end while

Here we instead use a global synchronization algorithm. While in some ways the simplest, it actually represents an optimization on the event queue framework for workloads encountered with macroscale SST. The simulation begins at t = 0 with lookahead δ determined from the network parameters. We consider a simulation with N_lp logical processes. Within the time window (0, δ) each LP will call MPI_Isend to deliver events to other LPs. At t = δ, every LP performs two blocking MPI collectives. First, they perform a reduce-scatter with a sum function. Each LP tracks the total number of events and the total number of bytes sent to every other LP. This array of size 2N_lp is passed as input to the reduce-scatter. Every LP receives as input from the reduce-
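A minimal C++/MPI sketch of this global synchronization loop follows. The Window struct and the run_window() stub stand in for the model-specific event execution and MPI_Isend calls; they are placeholders for illustration, not the SST implementation.

    #include <mpi.h>
    #include <limits>
    #include <vector>

    struct Window {
      std::vector<long long> sent;   // {events_to_rank0, bytes_to_rank0, events_to_rank1, ...}
      double next_event_time;        // earliest timestamp still pending locally
    };

    // The model would execute all events in [t, t + lookahead), MPI_Isend any
    // remote events, and record what it sent. Stubbed here to stay self-contained.
    static Window run_window(double /*t*/, double /*lookahead*/, int nlp) {
      Window w;
      w.sent.assign(2 * nlp, 0);
      w.next_event_time = std::numeric_limits<double>::max();
      return w;
    }

    void simulate(double t_termination, double lookahead, MPI_Comm comm) {
      int nlp = 0;
      MPI_Comm_size(comm, &nlp);
      std::vector<int> counts(nlp, 2);              // two slots (events, bytes) per rank
      double t = 0.0;
      while (t < t_termination) {
        Window w = run_window(t, lookahead, nlp);

        // Reduce-scatter the send log: each rank learns how many events and
        // bytes it must receive for this time window.
        long long incoming[2] = {0, 0};
        MPI_Reduce_scatter(w.sent.data(), incoming, counts.data(),
                           MPI_LONG_LONG, MPI_SUM, comm);

        // Drain exactly that many event messages, from any sender.
        long long bytes_left = incoming[1];
        std::vector<char> buf(bytes_left > 0 ? bytes_left : 1);
        for (long long m = 0; m < incoming[0]; ++m) {
          MPI_Status stat;
          MPI_Recv(buf.data(), static_cast<int>(bytes_left), MPI_BYTE,
                   MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &stat);
          int sz = 0;
          MPI_Get_count(&stat, MPI_BYTE, &sz);
          bytes_left -= sz;                         // deserialize/schedule the event here
        }

        // Advance global virtual time to the earliest pending event anywhere.
        double t_global = 0.0;
        MPI_Allreduce(&w.next_event_time, &t_global, 1, MPI_DOUBLE, MPI_MIN, comm);
        t = t_global;
      }
    }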


Solutions to the problem exist, but are non-trivial: why rewrite them over and over?
§ Conservative algorithm with event queues: an optimization to limit communication for LPs that are "connected"
  § Local, point-to-point communication
  § More difficult to implement

12  


What makes us write ad-hoc code instead of leveraging existing code?
§ Lack of standards
§ Lack of documentation
§ Huge monolithic code bases
§ Compatibility across platforms
§ A million and one dependencies
§ Physical models dependent upon the simulation framework

13  

Experiments we care about are model-driven – simulator is a tool to get us what we really care about!

[Figure: LP 0 and LP 1 scheduling future events along virtual time within a time window]

Design considerations for an optimal high-fidelity (cycle-level) simulator
§ Many events per time window
§ No major time gaps (generally always have events)
§ Components with different link latencies and clocks
§ Domain-specific synchronization algorithms

14  

Design considerations for an optimal coarse-grained structural simulator
§ May have a few events per time window – or might have a lot
§ Can have large gaps – time windows with no events in them
§ Huge number of components, but with the same link latency

15  

[Figure: abstract machine model with congestion via buffers and queues – nodes with NICs attached to routers/switches]


16  

[Figure: LP 0 and LP 1 scheduling future events within the time window]

Compute call might take 5ms but link latencies in the network are only 100ns!

MPI calls start generating network traffic


17  

[Figure: compute nodes and routers/switches partitioned across LP 0 through LP 3]

Design considerations for a simulator based on analytical models
§ Only a few events per time window
§ Many different components, but all connected to each other

18  

[Figure: fully connected compute nodes exchanging messages under an analytic delay model]
α = latency, β = inverse bandwidth, N = message size: ΔT = α + βN
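The analytic model above is small enough to state as code. A minimal sketch, with placeholder parameter values rather than measured ones:

    #include <cstdio>

    // Analytic point-to-point delay: dT = alpha + beta * N
    // alpha = latency, beta = inverse bandwidth, N = message size in bytes.
    double message_delay(double alpha_s, double beta_s_per_byte, double n_bytes) {
      return alpha_s + beta_s_per_byte * n_bytes;
    }

    int main() {
      // Example: 1 us latency, 10 GB/s link (beta = 1e-10 s/byte), 1 MB message.
      std::printf("delay = %g s\n", message_delay(1e-6, 1e-10, 1e6));  // about 1.01e-4 s
      return 0;
    }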

Unifying elements across all fidelities
§ Sending network messages
  § Portability layer to network APIs
  § Serialization library for event objects (see the sketch below)
§ Local/global virtual time
  § Event ordering and correctness
  § Model input and statistics collection
§ Partitioning components across parallel workers
  § Mapping of LPs to physical nodes
  § Optimize partition for cheaper communication
§ Managing/scheduling events
  § Map/calendar/heap data structures
  § Cancel events

19  
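One way the "serialization library for event objects" item above could look: events pack themselves into a byte buffer so the core can ship them between ranks without knowing their concrete types. This is a hedged sketch; the Serializer, Event, and NetworkMessage names are invented for illustration, not the SST API.

    #include <cstdint>
    #include <type_traits>
    #include <vector>

    // Toy serializer: append trivially-copyable fields to a byte buffer.
    struct Serializer {
      std::vector<char> buf;
      template <class T> void pack(const T& v) {
        static_assert(std::is_trivially_copyable<T>::value, "this sketch packs PODs only");
        const char* p = reinterpret_cast<const char*>(&v);
        buf.insert(buf.end(), p, p + sizeof(T));
      }
    };

    // Every event knows how to pack itself; the core never sees concrete types.
    struct Event {
      virtual ~Event() = default;
      virtual void serialize(Serializer& s) const = 0;
    };

    struct NetworkMessage : Event {
      std::uint64_t src = 0, dst = 0, bytes = 0;
      double send_time = 0.0;
      void serialize(Serializer& s) const override {
        s.pack(src); s.pack(dst); s.pack(bytes); s.pack(send_time);
      }
    };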


[Figure: the same virtual-time management spanning coarse-grained time and cycle-level time]


[Figure: event scheduling data structures – an O(log N) heap ordering timestamps T = 1, 2, 3, 7, 17, 19, 25, 36, 100 versus an O(1) calendar queue]
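A minimal sketch of the event-management element above: a binary heap keyed on timestamp gives O(log N) scheduling, whereas a calendar queue buckets events by time window for amortized O(1) operation on dense timelines. The classes below are illustrative only, not a simulator's actual scheduler.

    #include <functional>
    #include <queue>
    #include <vector>

    struct ScheduledEvent {
      double time;
      std::function<void()> run;
    };
    struct Later {
      bool operator()(const ScheduledEvent& a, const ScheduledEvent& b) const {
        return a.time > b.time;   // min-heap on timestamp
      }
    };

    // O(log N) heap-based scheduler; a calendar queue would replace the heap
    // with time-bucketed rings to get amortized O(1) insert/pop.
    class EventHeap {
     public:
      void schedule(double t, std::function<void()> fn) { q_.push({t, std::move(fn)}); }
      // Run every event strictly before the safe horizon; return the new local time.
      double run_until(double horizon) {
        while (!q_.empty() && q_.top().time < horizon) {
          ScheduledEvent ev = q_.top();
          q_.pop();
          now_ = ev.time;
          ev.run();
        }
        return now_;
      }
     private:
      double now_ = 0.0;
      std::priority_queue<ScheduledEvent, std::vector<ScheduledEvent>, Later> q_;
    };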

Disunifying elements across fidelities: partitioning strategy and parallel algorithm
§ Partitioning strategy
  § Cycle-level has heterogeneous components – partitioning really requires an intelligent graph partitioner
  § Coarse-grained has many homogeneous components – partitioning still requires an intelligent partitioner, but it is more about minimizing graph connectivity than finding the best link latency
§ Parallel algorithm
  § Global assumptions on the interoperability of components
  § Can any memory subsystem model interact with any processor or network model? Or are event messages only "self-compatible"?

24  

Polymorphic components can be tuned for different problems
§ Components don't need to be aware of the partitioning strategy
§ Components don't need to be aware of the parallel algorithm

25  

 

The simulation framework consists of i) a Core and ii) interacting Components that form the architecture simulation model. The core provides services common across models, such as instantiating components, providing configuration information, partitioning models for parallel execution, coordinating a common concept of time (e.g., in parallel execution), and transparently handling parallel and local communication of events between model components. Model components communicate through Links, which deliver Events between components. Proper temporal sequencing of component execution and event delivery is transparently handled by the core. Thus architecture component models are easily re-used and shared across system-level simulation models. Parallel execution is handled by partitioning the simulation model and assigning components to physical cores in a target parallel machine.

A key principle of the design is that model components are oblivious to sequential or parallel execution, and hence are transparently reusable in multiple simulation scenarios. For example, when events are communicated between components over a link, the component is unaware whether the destination component is local or remote. Delivery is handled by the core, including any serialization necessary for remote communication in the event that the destination core is remotely located.

The simulation model as a whole is specified via a system description that specifies components, their interconnection, and their configuration, including component-level and system-level parameters. The latter are used in transparently partitioning the model for parallel execution. Our current efforts are focused on two main aspects of the preceding framework: i) the core API, and ii) the component API. These efforts have been influenced by several ongoing projects, notably SST, GEM5, and Manifold. Preliminary efforts at migrating models between SST and Manifold support the anticipated benefits of the common API.

A Call To Arms
Our position is that this API is critical for the architecture community, but it requires broad support to be successful. We ask for community input and collaboration on this ongoing design project. Please join in this effort by signing up to our mailing list ______ and visiting our design wiki ______.
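A rough C++ sketch of the Core/Component/Link/Event pattern the excerpt describes follows. The class and method names here are invented for illustration; they are not the proposed standard API.

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <utility>

    struct Event { virtual ~Event() = default; };

    // The core owns links and time; a component only sees handleEvent() callbacks
    // and a send() that hides whether the peer is local or on a remote rank.
    class Link {
     public:
      using Handler = std::function<void(Event*)>;
      void setHandler(Handler h) { handler_ = std::move(h); }
      void send(Event* ev, std::uint64_t /*latency*/) {
        // A real core would serialize if the peer is remote and schedule
        // delivery the given number of virtual time units later.
        if (handler_) handler_(ev);
      }
     private:
      Handler handler_;
    };

    class Component {
     public:
      explicit Component(std::string name) : name_(std::move(name)) {}
      virtual ~Component() = default;
      virtual void setup() {}                              // called once after wiring
      virtual void handleEvent(Link& from, Event* ev) = 0;
     protected:
      std::string name_;
    };

    // Example component: a NIC model that forwards traffic to its router link,
    // oblivious to partitioning and to sequential vs. parallel execution.
    class Nic : public Component {
     public:
      Nic(std::string name, Link& to_router) : Component(std::move(name)), out_(to_router) {}
      void handleEvent(Link& /*from*/, Event* ev) override { out_.send(ev, 100); }
     private:
      Link& out_;
    };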

 

Challenge problem: mixing fidelities

26  

LP with heavy-weight node operating on different time scales, but lookahead determined by coarse-grained links!

[Figure: compute nodes attached to a router/switch]

Challenge problem: mixing fidelities

27  


Partition creates optimal lookahead on high-latency links

Challenge problem: mixing fidelities

28  

Good partitioning balances the number of events per LP

Challenge problem: mixing fidelities

29  


Simulator core problem, not a domain-specific problem

Challenge problem: histogram of message delays in a PDES run
§ Tag each packet with a timestamp going out and coming in
  § Add a single field to the existing network message object
  § Requires a notion of global time
§ Histogram object that reduces/collects data
  § hist->add_one(delay)
  § Object must output a usable data format/figure at the end of the simulation
  § Data must be reduced across LPs (see the sketch below)

30  

[Figure: histograms of message delay for minimal routing vs. adaptive routing]
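A sketch of the delay-histogram challenge above: one extra timestamp field on the message, a per-LP histogram with an add_one(delay)-style call, and a single MPI reduction at the end of the run. The DelayHistogram class is illustrative, not an existing SST statistics API.

    #include <mpi.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Fixed-bin histogram of message delays, kept per LP and reduced at the end.
    class DelayHistogram {
     public:
      DelayHistogram(double bin_width, int nbins) : width_(bin_width), counts_(nbins, 0) {}
      void add_one(double delay) {
        int b = static_cast<int>(delay / width_);
        if (b < 0) b = 0;
        if (b >= static_cast<int>(counts_.size())) b = static_cast<int>(counts_.size()) - 1;
        ++counts_[b];
      }
      // Sum bins across all LPs onto rank 0 and dump a plottable table.
      void finalize(MPI_Comm comm) const {
        std::vector<long long> global(counts_.size(), 0);
        MPI_Reduce(counts_.data(), global.data(), static_cast<int>(counts_.size()),
                   MPI_LONG_LONG, MPI_SUM, 0, comm);
        int rank = 0;
        MPI_Comm_rank(comm, &rank);
        if (rank == 0)
          for (std::size_t b = 0; b < global.size(); ++b)
            std::printf("%g %lld\n", b * width_, global[b]);
      }
     private:
      double width_;
      std::vector<long long> counts_;
    };

    // Usage sketch: stamp msg.send_time when a packet leaves its source, call
    // hist.add_one(now() - msg.send_time) on arrival, and call
    // hist.finalize(MPI_COMM_WORLD) once the simulation ends.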

Challenge problem: histogram of message delays in a PDES run

The majority of the work should already be done for us in the common core!

31  


MODSIM is Camp David, not Sinai

32  

The Structural Simulation Toolkit (SST) at Sandia is a jumping-off point for discussing universal simulation standards, not a C++ framework written on stone tablets!

MODSIM is Camp David, not Sinai

33  

We think our PDES core is mature and lightweight enough to make your life easier – enabling you to use your own code, not forcing you to use ours!

Range of simulation: success stories so far

34  

[Figure: range of simulations demonstrated so far – coarse grain + analytic, coarse grain, cycle level, and mixed fidelity]

 


[Figure: a common core driving clocked component models (logic blocks with clk/cond inputs) plus external integrations such as +Eiger and +GEM5]

MODSIM Summary
§ Gaps:
  § Lack of standards, lack of code reuse
§ Bigger picture and potential collaborators:
  § Everyone? Anyone who wants to scale experiments through PDES or wants to compose models mixing different fidelity/physics
§ What would make it easier to leverage results from other groups?
  § If you find yourself writing a PDES core, who you gonna call...
§ Development/adoption of standards will be driven by collaboration and refined through use cases

35