Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Multithreaded Parallelizationfor ns-3
Mathieu Lacage, Guillaume [email protected]@ens.fr
INRIA,ENS
DREAMTECH, April 1st 2010
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 1 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 2 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 3 / 36
A simple simulatorThe APItypedef void (*EventFunction) (void *);void CreateEvent (double delay, EventFunction function, void *context);double Now (void);void Run (void);
The private data structuresstruct Event{double expiration;EventFunction function;void *context;
};static double g_now;static std::list<Event> g_events;
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 4 / 36
The implementation I
Runvoid Run (void){while (!list.empty ())
{Event event = list.front ();list.pop_front ();g_now = event.expiration;event.function (event.context);
}}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 5 / 36
The implementation II
Nowdouble Now (void){return g_now;
}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 6 / 36
The implementation IIICreateEventvoid CreateEvent (double delay, EventFunction function, void *context);{Event event;event.expiration = g_now + delay;event.function = function;event.context = context;std::list<Event>::iterator cur = g_list.begin ();while (cur != g_list.end () && event.expiration < cur->expiration)
{i++;
}g_list.insert (cur, event);
}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 7 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 8 / 36
Logical processes
LP definition:• Independent pieces of code run separately• Communicate through messages
Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15
• At 10h00, Paris sends/schedules Landing message to Nice for11h15.
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 9 / 36
Logical processes
LP definition:• Independent pieces of code run separately• Communicate through messages
Example:• Plane takes off from Paris at 10h00• Plane lands in Nice at 11h15• At 10h00, Paris sends/schedules Landing message to Nice for
11h15.
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 9 / 36
Optimistic algorithmsEach LP:
• Process all local events• If receives message in past, rollback
Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00
• At 10h00, Paris sends/schedules Landing message to Nice for11h15.
• At 10h30, Lyon sends/schedules Landing message to Nice for11h00.
• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36
Optimistic algorithmsEach LP:
• Process all local events• If receives message in past, rollback
Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for
11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for
11h00.
• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36
Optimistic algorithmsEach LP:
• Process all local events• If receives message in past, rollback
Example:• Plane takes off from Paris at 10h00• Plane takes off from Lyon at 10h30• Plane from Paris lands in Nice at 11h15• Plane from Lyon lands in Nice at 11h00• At 10h00, Paris sends/schedules Landing message to Nice for
11h15.• At 10h30, Lyon sends/schedules Landing message to Nice for
11h00.• Nice executes Landing event for Paris -> Nice flight• Nice receives message from Lyon later.• Nice rollbacks back Landing event for Paris -> Nice• Nice executes Landing event for Lyon -> Nice flight
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 10 / 36
Problems
Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen
In practice: it’s never implemented
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 11 / 36
Problems
Implementation complexity of rollback mechanism:• anti-event code written by hand• anti-event code generated from modified compiler• event data must be kept in memory until rollback can’t happen
In practice: it’s never implemented
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 11 / 36
Conservative algorithms
Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs
Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB
• B can execute all events such that: event .time < timeA + 1h30
• 1h30 is the LookaheadAB
• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 12 / 36
Conservative algorithms
Enforce local causality constraint in each LP:• Make sure we don’t receive messages in the past• Requires global synchronization among LPs
Use of lookahead:• LP A can send messages to LP B• Minimum flight duration from A to B is 1h30• Current simulation time in A and B is timeA and timeB
• B can execute all events such that: event .time < timeA + 1h30• 1h30 is the LookaheadAB
• Big lookahead == potential for lots of parallelization• Small lookahead == little potential for parallelization
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 12 / 36
A simple algorithm
In each LP:while (events left){
Tmin = min (Ti + Li) for all ibarrierS = set of events in Pi with timestamp <= Tminprocess events in Sbarrier
}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 13 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 14 / 36
Major characteristicsTarget metrics:
• Packet loss patterns• Packet bit error patterns• Packet throughput patterns• Packet delay patterns• Packet jitter patterns
Relevant model parameters:• Link delay due to limited progation speed• Link delay due to limited bandwidth• Per-packet processing delay due to limited cpu• Per-packet memory usage• Packet queue management policy
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 15 / 36
A typical network model (wired)
Node 0
Node 1
Node 2
Node 5
Node 3
Node 4
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 16 / 36
Parallel network simulations
• One logical process: one node• Lookahead: minimum propagation delay on link
The traditional implementation:• message communications through MPI• LP/physical machine partition static (no code/data migration):
• bad performance because the most efficient partition depends oncommunication patterns and they change over simulation time
• bad performance because no one knows how to generate adecent partition
• users have no idea how to generate a partition just to get started
• The goal is to scale to large number of LPs (distributedmemory), not to run fast.
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 17 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 18 / 36
Requirements
• We don’t care about scaling to large number of LPs• Multicore shared memory systems: at least 8 cores, target 24,
potentially more.• Should go faster: ideally, linear scaling with number of cores• Transparent for the user:
• no partition specification• no changes to the model development API
• No fundamental changes to ns-3
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 19 / 36
Multithreaded parallelsimulation
• Degenerate partition map: one node/LP <-> one partition• Classic conservative synchronization algorithm with lookahead• One worker thread per physical cpu:
• take partition from worklist• execute partition until conservative boundary
Pros:• No partition specification• Automatic load balancing across CPUS• Measurable speedup on nasty testcases with little parallelization
opportunities
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 20 / 36
What was really hard
• Need to make ns-3 minimally thread-safe:• schedule events from a thread to another thread• deep copy packets crossing threads (packets are normally COW)• object reference counting
• Hard to do load-balancing:• Execution time of each event is non-uniform• Number of events per partition per iteration varies a lot, fairly
small (10 to 40)
• Granularity of iteration fairly small: barrier synchronizationoverhead high
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 21 / 36
Conclusion
Running faster than non-parallel but:• hard to get realistic scenario benchmarks• scalability with number of cores is not linear• thread-safe reference counting still costly• load balancing heuristics appear very benchmark-specific• How are we going to scale to 64 cores with shared memory ?
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 22 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 23 / 36
Reference counting
Refm_count++;
Unrefm_count--;if (m_count == 0)
{delete this;
}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 24 / 36
Mutexes
Refpthread_mutex_lock (&m_lock);m_count++;pthread_mutex_unlock (&m_lock);
Unrefpthread_mutex_lock (&m_lock);m_count--;if (m_count == 0)
{delete this;
}pthread_mutex_unlock (&m_lock);
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 25 / 36
Atomic ops
Refatomic_add (&m_count, 1);
Unrefbool was_one = atomic_dec_and_test (&m_count, 1, 1);if (was_one)
{delete this;
}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 26 / 36
Thread Local Storage hash
Refstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count++;if (!found)
SlowRef ();
Unrefstruct Entry *entry = HashEntry ();bool found = likely (entry->ptr == this);entry->count--;if (!found)
SlowUnref ();
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 27 / 36
Performance comparison
0.1
1
10
100
1000
1 2 3 4 5 6
Dur
atio
n (s
econ
ds)
Number of threads
MutexAtomic
TLS hash
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 28 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 29 / 36
POSIX
Waitpthread_barrier_wait (&barrier)
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 30 / 36
Global spin
Waitwaiter->sense = -waiter->sense;int waiters = AtomicExchangeAndAdd (&m_waiters, 1);if (waiters + 1 == m_totalWaiters)
{AtomicExchangeAndAdd (&m_waiters, -m_totalWaiters);m_release = waiter->sense;
}else
{while (AtomicGet (&m_release) != waiter->sense)
{}}
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 31 / 36
Tree spin
Count Count
Count
Thread0 Thread0 Thread0Thread0
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 32 / 36
Performance comparison
10
100
1000
10000
100000
1e+06
1 2 3 4 5 6 7
Dur
atio
n (s
econ
ds)
Number of threads
POSIXGlobal
Tree
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 33 / 36
Outline
Discrete time event-driven simulations
A parallel simulator
Network simulations
ns-3 Multithreaded simulations
Thread-safe reference counting
Efficient barrier implementations
Conclusion
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 34 / 36
Lots of open questions
• Reference counting is harder than we thought• Efficient lockless data structure implementations are harder than
we thought• Profiling is hard : oprofile, hardware performance counters,
cache misses• Why is atomic_inc/dec_and_test so slower than ++/– with only 1
thread ?• Impact of core interconnect on performance results ?
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 35 / 36
References
• Parallel and Distributed Simulation Systems, R. Fujimoto.• Good news for parallel wireless network simulations, P.
Peschlow: same approach, use of rxTxTurnaround timemodeling to speedup wireless simulations
• Multi-core parallelism for ns-3 simulator: internship by GuillaumeSeguin (http://guillaume.segu.in/papers/ns3-multithreading.pdf)
Lacage, Seguin (INRIA,ENS) Multithreaded Parallelization for ns-3 DREAMTECH 36 / 36