41
Multithreaded Parallelization for ns-3 Mathieu Lacage, Guillaume Seguin [email protected] [email protected] INRIA,ENS DREAMTECH, April 1st 2010 Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 1 / 36

Multithreaded Parallelization for ns-3

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Multithreaded Parallelization for ns-3

Multithreaded Parallelizationfor ns-3

Mathieu Lacage, Guillaume [email protected]@ens.fr

INRIA,ENS

DREAMTECH, April 1st 2010

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 1 / 36

Page 2: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 2 / 36

Page 3: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 3 / 36

Page 4: Multithreaded Parallelization for ns-3

A simple simulatorThe APItypedef void (*EventFunction) (void *);void CreateEvent (double delay, EventFunction function, void *context);double Now (void);void Run (void);

The private data structuresstruct Event{double expiration;EventFunction function;void *context;

};static double g_now;static std::list<Event> g_events;

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 4 / 36

Page 5: Multithreaded Parallelization for ns-3

The implementation I

Runvoid Run (void){while (!list.empty ())

{Event event = list.front ();list.pop_front ();g_now = event.expiration;event.function (event.context);

}}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 5 / 36

Page 6: Multithreaded Parallelization for ns-3

The implementation II

Nowdouble Now (void){return g_now;

}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 6 / 36

Page 7: Multithreaded Parallelization for ns-3

The implementation IIICreateEventvoid CreateEvent (double delay, EventFunction function, void *context);{Event event;event.expiration = g_now + delay;event.function = function;event.context = context;std::list<Event>::iterator cur = g_list.begin ();while (cur != g_list.end () && event.expiration < cur->expiration)

{i++;

}g_list.insert (cur, event);

}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 7 / 36

Page 8: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 8 / 36

Page 9: Multithreaded Parallelization for ns-3

Logical processes

LP definition:• Independent pieces of code run separately• Communicate through messages

Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15

• At 10h00, Paris sends/schedules Landing message to Nice for11h15.

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 9 / 36

Page 10: Multithreaded Parallelization for ns-3

Logical processes

LP definition:• Independent pieces of code run separately• Communicate through messages

Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 9 / 36

Page 11: Multithreaded Parallelization for ns-3

Optimistic algorithmsEach LP:

• Process all local events• If receives message in past, rollback

Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00

• At 10h00, Paris sends/schedules Landing message to Nice for11h15.

• At 10h30, Lyon sends/schedules Landing message to Nice for11h00.

• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36

Page 12: Multithreaded Parallelization for ns-3

Optimistic algorithmsEach LP:

• Process all local events• If receives message in past, rollback

Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for

11h00.

• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36

Page 13: Multithreaded Parallelization for ns-3

Optimistic algorithmsEach LP:

• Process all local events• If receives message in past, rollback

Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for

11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for

11h00.• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36

Page 14: Multithreaded Parallelization for ns-3

Problems

Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen

In practice: it’s never implemented

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 11 / 36

Page 15: Multithreaded Parallelization for ns-3

Problems

Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen

In practice: it’s never implemented

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 11 / 36

Page 16: Multithreaded Parallelization for ns-3

Conservative algorithms

Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs

Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB

• B can execute all events such that: event .time < timeA + 1h30

• 1h30 is the LookaheadAB

• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 12 / 36

Page 17: Multithreaded Parallelization for ns-3

Conservative algorithms

Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs

Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB

• B can execute all events such that: event .time < timeA + 1h30• 1h30 is the LookaheadAB

• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 12 / 36

Page 18: Multithreaded Parallelization for ns-3

A simple algorithm

In each LP:while (events left){

Tmin = min (Ti + Li) for all ibarrierS = set of events in Pi with timestamp <= Tminprocess events in Sbarrier

}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 13 / 36

Page 19: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 14 / 36

Page 20: Multithreaded Parallelization for ns-3

Major characteristicsTarget metrics:

• Packet loss patterns• Packet bit error patterns• Packet throughput patterns• Packet delay patterns• Packet jitter patterns

Relevant model parameters:• Link delay due to limited progation speed• Link delay due to limited bandwidth• Per-packet processing delay due to limited cpu• Per-packet memory usage• Packet queue management policy

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 15 / 36

Page 21: Multithreaded Parallelization for ns-3

A typical network model (wired)

Node 0

Node 1

Node 2

Node 5

Node 3

Node 4

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 16 / 36

Page 22: Multithreaded Parallelization for ns-3

Parallel network simulations

• One logical process: one node• Lookahead: minimum propagation delay on link

The traditional implementation:• message communications through MPI• LP/physical machine partition static (no code/data migration):

• bad performance because the most efficient partition depends oncommunication patterns and they change over simulation time

• bad performance because no one knows how to generate adecent partition

• users have no idea how to generate a partition just to get started

• The goal is to scale to large number of LPs (distributedmemory), not to run fast.

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 17 / 36

Page 23: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 18 / 36

Page 24: Multithreaded Parallelization for ns-3

Requirements

• We don’t care about scaling to large number of LPs• Multicore shared memory systems: at least 8 cores, target 24,

potentially more.• Should go faster: ideally, linear scaling with number of cores• Transparent for the user:

• no partition specification• no changes to the model development API

• No fundamental changes to ns-3

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 19 / 36

Page 25: Multithreaded Parallelization for ns-3

Multithreaded parallelsimulation

• Degenerate partition map: one node/LP <-> one partition• Classic conservative synchronization algorithm with lookahead• One worker thread per physical cpu:

• take partition from worklist• execute partition until conservative boundary

Pros:• No partition specification• Automatic load balancing across CPUS• Measurable speedup on nasty testcases with little parallelization

opportunities

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 20 / 36

Page 26: Multithreaded Parallelization for ns-3

What was really hard

• Need to make ns-3 minimally thread-safe:• schedule events from a thread to another thread• deep copy packets crossing threads (packets are normally COW)• object reference counting

• Hard to do load-balancing:• Execution time of each event is non-uniform• Number of events per partition per iteration varies a lot, fairly

small (10 to 40)

• Granularity of iteration fairly small: barrier synchronizationoverhead high

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 21 / 36

Page 27: Multithreaded Parallelization for ns-3

Conclusion

Running faster than non-parallel but:• hard to get realistic scenario benchmarks• scalability with number of cores is not linear• thread-safe reference counting still costly• load balancing heuristics appear very benchmark-specific• How are we going to scale to 64 cores with shared memory ?

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 22 / 36

Page 28: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 23 / 36

Page 29: Multithreaded Parallelization for ns-3

Reference counting

Refm_count++;

Unrefm_count--;if (m_count == 0)

{delete this;

}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 24 / 36

Page 30: Multithreaded Parallelization for ns-3

Mutexes

Refpthread_mutex_lock (&m_lock);m_count++;pthread_mutex_unlock (&m_lock);

Unrefpthread_mutex_lock (&m_lock);m_count--;if (m_count == 0)

{delete this;

}pthread_mutex_unlock (&m_lock);

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 25 / 36

Page 31: Multithreaded Parallelization for ns-3

Atomic ops

Refatomic_add (&m_count, 1);

Unrefbool was_one = atomic_dec_and_test (&m_count, 1, 1);if (was_one)

{delete this;

}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 26 / 36

Page 32: Multithreaded Parallelization for ns-3

Thread Local Storage hash

Refstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count++;if (!found)

SlowRef ();

Unrefstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count--;if (!found)

SlowUnref ();

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 27 / 36

Page 33: Multithreaded Parallelization for ns-3

Performance comparison

0.1

1

10

100

1000

1 2 3 4 5 6

Dur

atio

n (s

econ

ds)

Number of threads

MutexAtomic

TLS hash

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 28 / 36

Page 34: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 29 / 36

Page 35: Multithreaded Parallelization for ns-3

POSIX

Waitpthread_barrier_wait (&barrier)

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 30 / 36

Page 36: Multithreaded Parallelization for ns-3

Global spin

Waitwaiter->sense = -waiter->sense;int waiters = AtomicExchangeAndAdd (&m_waiters, 1);if (waiters + 1 == m_totalWaiters)

{AtomicExchangeAndAdd (&m_waiters, -m_totalWaiters);m_release = waiter->sense;

}else

{while (AtomicGet (&m_release) != waiter->sense)

{}}

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 31 / 36

Page 37: Multithreaded Parallelization for ns-3

Tree spin

Count Count

Count

Thread0 Thread0 Thread0Thread0

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 32 / 36

Page 38: Multithreaded Parallelization for ns-3

Performance comparison

10

100

1000

10000

100000

1e+06

1 2 3 4 5 6 7

Dur

atio

n (s

econ

ds)

Number of threads

POSIXGlobal

Tree

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 33 / 36

Page 39: Multithreaded Parallelization for ns-3

Outline

Discrete time event-driven simulations

A parallel simulator

Network simulations

ns-3 Multithreaded simulations

Thread-safe reference counting

Efficient barrier implementations

Conclusion

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 34 / 36

Page 40: Multithreaded Parallelization for ns-3

Lots of open questions

• Reference counting is harder than we thought• Efficient lockless data structure implementations are harder than

we thought• Profiling is hard : oprofile, hardware performance counters,

cache misses• Why is atomic_inc/dec_and_test so slower than ++/– with only 1

thread ?• Impact of core interconnect on performance results ?

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 35 / 36

Page 41: Multithreaded Parallelization for ns-3

References

• Parallel and Distributed Simulation Systems, R. Fujimoto.• Good news for parallel wireless network simulations, P.

Peschlow: same approach, use of rxTxTurnaround timemodeling to speedup wireless simulations

• Multi-core parallelism for ns-3 simulator: internship by GuillaumeSeguin (http://guillaume.segu.in/papers/ns3-multithreading.pdf)

Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 36 / 36