Multithreaded Parallelization for ns-3

Multithreaded Parallelizationfor ns-3

Mathieu Lacage, Guillaume [email protected]@ens.fr

INRIA,ENS

DREAMTECH, April 1st 2010

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 1 / 36

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion


Outline



Network simulations




Conclusion


A simple simulatorThe APItypedef void (*EventFunction) (void *);void CreateEvent (double delay, EventFunction function, void *context);double Now (void);void Run (void);

The private data structuresstruct Event{double expiration;EventFunction function;void *context;

};static double g_now;static std::list<Event> g_events;


The implementation I

Runvoid Run (void){while (!list.empty ())

{Event event = list.front ();list.pop_front ();g_now = event.expiration;event.function (event.context);

}}


The implementation II

Nowdouble Now (void){return g_now;

}


The implementation IIICreateEventvoid CreateEvent (double delay, EventFunction function, void *context);{Event event;event.expiration = g_now + delay;event.function = function;event.context = context;std::list<Event>::iterator cur = g_list.begin ();while (cur != g_list.end () && event.expiration < cur->expiration)

{i++;

}g_list.insert (cur, event);

}


Outline



Network simulations




Conclusion


Logical processes

LP definition:• Independent pieces of code run separately• Communicate through messages

Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15

• At 10h00, Paris sends/schedules Landing message to Nice for11h15.


Logical processes

LP definition:• Independent pieces of code run separately• Communicate through messages

Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.


Optimistic algorithmsEach LP:

• Process all local events• If receives message in past, rollback

Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00

• At 10h00, Paris sends/schedules Landing message to Nice for11h15.

• At 10h30, Lyon sends/schedules Landing message to Nice for11h00.

• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight




Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for

11h00.

• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight




Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for

11h00.• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight


Problems

Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen

In practice: it’s never implemented


Problems

Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen

In practice: it’s never implemented


Conservative algorithms

Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs

Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB

• B can execute all events such that: event .time < timeA + 1h30

• 1h30 is the LookaheadAB

• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization


Conservative algorithms

Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs

Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB

• B can execute all events such that: event .time < timeA + 1h30• 1h30 is the LookaheadAB

• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization


A simple algorithm

In each LP:while (events left){

Tmin = min (Ti + Li) for all ibarrierS = set of events in Pi with timestamp <= Tminprocess events in Sbarrier

}


Outline



Network simulations




Conclusion


Major characteristicsTarget metrics:

• Packet loss patterns• Packet bit error patterns• Packet throughput patterns• Packet delay patterns• Packet jitter patterns

Relevant model parameters:• Link delay due to limited progation speed• Link delay due to limited bandwidth• Per-packet processing delay due to limited cpu• Per-packet memory usage• Packet queue management policy


A typical network model (wired)

Node 0

Node 1

Node 2

Node 5

Node 3

Node 4


Parallel network simulations

• One logical process: one node• Lookahead: minimum propagation delay on link

The traditional implementation:• message communications through MPI• LP/physical machine partition static (no code/data migration):

• bad performance because the most efficient partition depends oncommunication patterns and they change over simulation time

• bad performance because no one knows how to generate adecent partition

• users have no idea how to generate a partition just to get started

• The goal is to scale to large number of LPs (distributedmemory), not to run fast.


Outline



Network simulations




Conclusion


Requirements

• We don’t care about scaling to large number of LPs• Multicore shared memory systems: at least 8 cores, target 24,

potentially more.• Should go faster: ideally, linear scaling with number of cores• Transparent for the user:

• no partition specification• no changes to the model development API

• No fundamental changes to ns-3


Multithreaded parallelsimulation

• Degenerate partition map: one node/LP <-> one partition• Classic conservative synchronization algorithm with lookahead• One worker thread per physical cpu:

• take partition from worklist• execute partition until conservative boundary

Pros:• No partition specification• Automatic load balancing across CPUS• Measurable speedup on nasty testcases with little parallelization

opportunities


What was really hard

• Need to make ns-3 minimally thread-safe:• schedule events from a thread to another thread• deep copy packets crossing threads (packets are normally COW)• object reference counting

• Hard to do load-balancing:• Execution time of each event is non-uniform• Number of events per partition per iteration varies a lot, fairly

small (10 to 40)

• Granularity of iteration fairly small: barrier synchronizationoverhead high


Conclusion

Running faster than non-parallel but:• hard to get realistic scenario benchmarks• scalability with number of cores is not linear• thread-safe reference counting still costly• load balancing heuristics appear very benchmark-specific• How are we going to scale to 64 cores with shared memory ?


Outline



Network simulations




Conclusion


Reference counting

Refm_count++;

Unrefm_count--;if (m_count == 0)

{delete this;

}


Mutexes

Refpthread_mutex_lock (&m_lock);m_count++;pthread_mutex_unlock (&m_lock);

Unrefpthread_mutex_lock (&m_lock);m_count--;if (m_count == 0)

{delete this;

}pthread_mutex_unlock (&m_lock);


Atomic ops

Refatomic_add (&m_count, 1);

Unrefbool was_one = atomic_dec_and_test (&m_count, 1, 1);if (was_one)

{delete this;

}


Thread Local Storage hash

Refstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count++;if (!found)

SlowRef ();

Unrefstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count--;if (!found)

SlowUnref ();


Performance comparison

0.1

1

10

100

1000

1 2 3 4 5 6

Dur

atio

n (s

econ

ds)

Number of threads

MutexAtomic

TLS hash


Outline



Network simulations




Conclusion


POSIX

Waitpthread_barrier_wait (&barrier)


Global spin

Waitwaiter->sense = -waiter->sense;int waiters = AtomicExchangeAndAdd (&m_waiters, 1);if (waiters + 1 == m_totalWaiters)

{AtomicExchangeAndAdd (&m_waiters, -m_totalWaiters);m_release = waiter->sense;

}else

{while (AtomicGet (&m_release) != waiter->sense)

{}}


Tree spin

Count Count

Count

Thread0 Thread0 Thread0Thread0


Performance comparison

10

100

1000

10000

100000

1e+06

1 2 3 4 5 6 7

Dur

atio

n (s

econ

ds)

Number of threads

POSIXGlobal

Tree


Outline



Network simulations




Conclusion


Lots of open questions

• Reference counting is harder than we thought• Efficient lockless data structure implementations are harder than

we thought• Profiling is hard : oprofile, hardware performance counters,

cache misses• Why is atomic_inc/dec_and_test so slower than ++/– with only 1

thread ?• Impact of core interconnect on performance results ?


References

• Parallel and Distributed Simulation Systems, R. Fujimoto.• Good news for parallel wireless network simulations, P.

Peschlow: same approach, use of rxTxTurnaround timemodeling to speedup wireless simulations

• Multi-core parallelism for ns-3 simulator: internship by GuillaumeSeguin (http://guillaume.segu.in/papers/ns3-multithreading.pdf)


Documents

Multithreaded Parallelization for ns-3