Analysing Real-Time Behaviour of Collective Communication Patterns in MPI
Alexander Stegmeier, Martin Frieb, Jorg Mische, Theo Ungerer
University of Augsburg, Germany
26th International Conference on Real-Time Networks and Systems
11 October 2018
2018-10-11 Alexander Stegmeier et al. / Real-Time Analysis of Collective Operations 1
Motivation
- increasing performance demands of real-time applications
- timing analysis of multicores with shared memory is difficult
- apply manycores with
  - Network-on-Chip (NoC)
  - local memory per node
  - explicit message passing
- Message Passing Interface (MPI)
  - standard programming model
- special focus on collective communication
  - programming similar to Bulk Synchronous Parallel
Outline
Motivation
Basic Knowledge
Analysis
Evaluation
Conclusion
MPI Collectives
Communication Structure
- based on a central node (MPI_Bcast, MPI_Gather, ...)
  - communication along tree structures
  - investigated structures: pipeline, chains, binary tree, binomial tree
- uniform data exchange (MPI_Allgather, MPI_Barrier, ...)
  - based on point-to-point communication
  - investigated structures: ring, recursive doubling, neighbour exchange, Bruck
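[Figure: recursive doubling with 8 nodes; each row lists the data blocks held by nodes 0 to 7]
start:   0         1         2         3         4         5         6         7
step 1:  01        01        23        23        45        45        67        67
step 2:  0123      0123      0123      0123      4567      4567      4567      4567
step 3:  01234567  01234567  01234567  01234567  01234567  01234567  01234567  01234567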
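The block distribution in the figure can be reproduced with a few lines of Python. This is only a sketch of the recursive doubling exchange pattern, not the ported MPI code: the function name is hypothetical, and the group size is assumed to be a power of two.

```python
def recursive_doubling_allgather(num_nodes):
    """Return the blocks held by every node before and after each step.

    Sketch only: assumes num_nodes is a power of two.
    """
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("this sketch assumes a power-of-two group size")
    held = [{rank} for rank in range(num_nodes)]
    history = [[sorted(blocks) for blocks in held]]
    distance = 1
    while distance < num_nodes:
        # Each node swaps everything it holds with the partner whose
        # rank differs in exactly the bit selected by `distance`.
        held = [held[rank] | held[rank ^ distance] for rank in range(num_nodes)]
        history.append([sorted(blocks) for blocks in held])
        distance *= 2
    return history


if __name__ == "__main__":
    for step, state in enumerate(recursive_doubling_allgather(8)):
        print(step, " ".join("".join(map(str, blocks)) for blocks in state))
```

For 8 nodes the exchange finishes after three steps, because the amount of data held by each node doubles in every step.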
Time-Division Multiplexing
- time-division multiplexing (TDM) for message scheduling
  - fixed time slots for sending
  - prevents conflicts between delivered flits
  - enables upper bounds for releasing and transporting flits
- WCTT for TDM:
  $\mathrm{WCTT} = t_a + t_t$
  $t_a$: admission time
  $t_t$: transportation time
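To make the two terms concrete, the following sketch bounds the admission time from a TDM slot table and adds the transportation time. The slot table layout, the function names and the rule "a flit waits at most until the node's next own slot" are assumptions made for illustration; the slides do not specify how $t_a$ is derived on the evaluated platform.

```python
def worst_case_admission_time(slot_owners, node, slot_length):
    """Bound the wait until `node` may inject a flit (assumed TDM model).

    slot_owners: owner of each slot in one TDM period (hypothetical layout).
    slot_length: duration of one slot in cycles.
    """
    owned = [i for i, owner in enumerate(slot_owners) if owner == node]
    if not owned:
        raise ValueError("node owns no slot in the TDM period")
    period = len(slot_owners)
    # Largest cyclic distance between two consecutive slots of the node;
    # a flit that just missed a slot waits this many slots in the worst case.
    gaps = [
        (owned[(k + 1) % len(owned)] - owned[k] - 1) % period + 1
        for k in range(len(owned))
    ]
    return max(gaps) * slot_length


def wctt(t_a, t_t):
    """WCTT = t_a + t_t, as stated on the slide."""
    return t_a + t_t
```

With one slot per node in a period of four slots, `worst_case_admission_time([0, 1, 2, 3], 2, 1)` yields 4 cycles under these assumptions.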
Timing Analysis
Analysis flow
1. investigation of internal structure
   - separation of code execution and data transfer
   - send/receive operations as boundaries
2. analysis of components (WCET, WCTT)
3. combination according to the communication pattern
Analysis issues
Boundary between WCET and WCTT
[Figure: send timelines showing WCET_s and t_a. (a) send driven by t_a; (b) send driven by WCET_s]
$(f - 1) \cdot \max(\mathrm{WCET}_s, t_a) + \mathrm{WCET}_s + t_a$
- similar for receive
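The bound above can be evaluated directly. The sketch below is a literal transcription of the formula; the function and parameter names are hypothetical, and all times are assumed to be given in cycles.

```python
def send_sequence_bound(flits, wcet_s, t_a):
    """Evaluate (f - 1) * max(WCET_s, t_a) + WCET_s + t_a.

    flits:  number of flits f sent back to back
    wcet_s: WCET of the code that issues one send
    t_a:    admission time of one flit
    """
    if flits < 1:
        raise ValueError("at least one flit is required")
    # The slower of the two stages dominates all but the last flit;
    # the last flit pays for both stages in sequence.
    return (flits - 1) * max(wcet_s, t_a) + wcet_s + t_a
```

For example, 4 flits with WCET_s = 10 and t_a = 25 give 3 * 25 + 10 + 25 = 110 cycles (case (a), send driven by t_a). Swapping the two values gives the same result, now driven by WCET_s (case (b)).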
Analysis issues
Dispatch along multiple nodes
- multiple options to accumulate times
  - identify longest path in terms of time
- three candidates for longest path
[Figure: three timelines over t. (a) send operation takes longest time; (b) receive operation takes longest time; (c) receive and forward takes longest time]
Analysis issues
Consideration of communication pattern
- treatment of tree structures
  - occurrence of leaves at different tree levels
- sending procedure for nodes with multiple children
  - deepest subtree first
- options for longest path regarding time
  - early forwarding + delivery along long subtree
  - late forwarding + delivery along short subtree
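The "deepest subtree first" rule can be sketched as a sort of a node's children by the depth of their subtrees. The dictionary representation of the tree, the function names and the example tree are assumptions for illustration, not the data structures of the analysed MPI implementation.

```python
def subtree_depth(tree, node):
    """Number of edges on the longest path from `node` down to a leaf."""
    children = tree.get(node, [])
    if not children:
        return 0
    return 1 + max(subtree_depth(tree, child) for child in children)


def send_order(tree, node):
    """Children of `node`, deepest subtree first (ties keep input order)."""
    return sorted(
        tree.get(node, []),
        key=lambda child: subtree_depth(tree, child),
        reverse=True,
    )


# Hypothetical tree: root 0 with a long chain 1 -> 3 -> 5 and a short chain 2 -> 4.
example_tree = {0: [2, 1], 1: [3], 3: [5], 2: [4]}
```

Here `send_order(example_tree, 0)` returns `[1, 2]`: the long chain is served first, so it receives each flit early, while the short chain receives it late. Both combinations are candidates for the longest path.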
Applying the procedure
Illustration with example
- broadcast to 5 nodes
- message contains f flits
- chain pattern with 2 chains
[Figure: broadcast tree over nodes 0 to 5 with two chains, and a timeline of the communication details per node over time t]
Applying the procedure
Boundaries between WCET/WCTT
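issue:
[Figure: two send timelines showing WCET_s and t_a]
resulting timing:
$W_s = (ch_i - 1) \cdot \max(\mathrm{WCET}_s, t_a) + \mathrm{WCET}_s + t_a \quad (1)$
$W_{sr} = (ch_i - 1) \cdot \max(\mathrm{WCET}_{sr}, t_a) + \mathrm{WCET}_{sr} + t_a \quad (2)$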
Applying the procedure
Delivery along multiple nodes
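- consideration of 1 flit
  $W_f = W_s + l \cdot W_{sr} + (l + 1) \cdot t_t + \mathrm{WCET}_r \quad (3)$
- consideration of f flits
  $W_a = f \cdot W_s + l \cdot W_{sr} + (l + 1) \cdot t_t + \mathrm{WCET}_r \quad (4)$
  $W_b = W_s + l \cdot W_{sr} + (l + 1) \cdot t_t + f \cdot \mathrm{WCET}_r \quad (5)$
  $W_c = W_s + f \cdot W_{sr} + (l - 1) \cdot W_{sr} + (l + 1) \cdot t_t + \mathrm{WCET}_r \quad (6)$
  $W_{chain} = \max(W_a, W_b, W_c) \quad (7)$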
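Equations (1) to (7) can be combined into a small calculator. The sketch below transcribes the formulas literally. The class and function names are hypothetical, and the reading of the symbols is an assumption, because the slides do not define them: `ch_i` is the factor written ch_i in (1) and (2), `l` is the chain length parameter used in (3) to (6), and WCET_sr is taken to be the WCET of a receive-and-forward step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    wcet_s: int   # WCET of a send operation
    wcet_sr: int  # WCET of a receive-and-forward operation (assumed meaning)
    wcet_r: int   # WCET of the final receive operation
    t_a: int      # admission time
    t_t: int      # transportation time


def per_flit(ch_i, wcet, t_a):
    """Equations (1) and (2): (ch_i - 1) * max(WCET, t_a) + WCET + t_a."""
    return (ch_i - 1) * max(wcet, t_a) + wcet + t_a


def chain_candidates(t, f, l, ch_i):
    """Return (W_a, W_b, W_c) from equations (4), (5) and (6)."""
    w_s = per_flit(ch_i, t.wcet_s, t.t_a)
    w_sr = per_flit(ch_i, t.wcet_sr, t.t_a)
    transport = (l + 1) * t.t_t
    w_a = f * w_s + l * w_sr + transport + t.wcet_r            # send dominates
    w_b = w_s + l * w_sr + transport + f * t.wcet_r            # receive dominates
    w_c = w_s + f * w_sr + (l - 1) * w_sr + transport + t.wcet_r  # forward dominates
    return w_a, w_b, w_c


def chain_bound(t, f, l, ch_i):
    """Equation (7): W_chain = max(W_a, W_b, W_c)."""
    return max(chain_candidates(t, f, l, ch_i))
```

With the purely illustrative values `Timing(10, 20, 15, 12, 5)`, f = 4, l = 2 and ch_i = 2, the candidates are 270, 213 and 324 cycles, so receive-and-forward determines the bound. For f = 1 all three candidates collapse to equation (3).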
Applying the procedure
Consideration of communication pattern
- respect subtrees of different lengths
  1. long chain but early flit supply
  2. short chain but late flit supply
calculate overall timing: combine both cases
$W_{total} = \max(W_{chain}, W'_{chain})$
Evaluation
Assumptions
- platform: RC/MC manycore
  - NoC topology: unidirectional quadratic torus
  - 64 ARMv7 cores
  - local scratchpad memory for each core
- MPI collectives ported from OpenMPI
  - synchronization done in software
- representation for each communication structure
  - MPI_Bcast for tree structures
  - MPI_Allgather for uniform data exchange
- OTAWA for calculation of core-local WCET bounds
Communication based on a central node
MPI_Bcast
[Figure: WCET [cycles] over number of flits (0 to 100) for basic linear, pipeline, binary tree, binomial tree, 2 chains, 8 chains]
- linear growth with respect to flits
- best performance:
  - binary tree
[Figure: WCET [cycles] over group size (0 to 60) for basic linear, pipeline, binary tree, binomial tree, 2 chains, 8 chains]
- influenced by software synchronization
- best performance:
  - basic linear (small groups)
  - binary tree (otherwise)
Communication based on uniform data exchange
MPI_Allgather
[Figure: WCET [cycles] over number of flits (0 to 100) for basic linear, ring, neighbour exchange, recursive doubling, Bruck]
- only marginal differences (except basic linear)
- best performance:
  - Bruck
  - recursive doubling
[Figure: WCET [cycles] over group size (0 to 60) for basic linear, ring, neighbour exchange, recursive doubling, Bruck]
- significant differences
- best performance:
  - Bruck
  - recursive doubling
Summary and Conclusion
Summary
- described timing analysis of collective communication
- focus on combination of code WCET bounds and WCTT
- evaluation on concrete platform
  - MPI collectives as representatives
  - comparison of communication patterns
Results
- high impact of communication patterns
- recommended communication patterns
  - binary tree
  - Bruck
Thank you for your attention.
Backup
Data transfer
- based on flits (equally sized atomic data units)
- send operation: put flit into send buffer
- flits in send buffer:
  - ejected to an appropriate slot in the NoC
- flits at target: stored in receive buffer
- receive operation: handle flit from receive buffer
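The buffer handling listed above can be modelled in a few lines. This is a simplified sketch with hypothetical names: it moves at most one flit per slot and ignores the transportation time through the NoC, which the real platform does not.

```python
from collections import deque


class Node:
    """A node with one send buffer and one receive buffer (simplified)."""

    def __init__(self):
        self.send_buffer = deque()
        self.receive_buffer = deque()

    def send(self, flit, target):
        # Send operation: put the flit into the send buffer.
        self.send_buffer.append((target, flit))

    def receive(self):
        # Receive operation: handle the oldest flit, or None if empty.
        return self.receive_buffer.popleft() if self.receive_buffer else None


def noc_slot(nodes, sender):
    """One slot owned by `sender`: move at most one flit to its target."""
    if not nodes[sender].send_buffer:
        return False
    target, flit = nodes[sender].send_buffer.popleft()
    nodes[target].receive_buffer.append(flit)
    return True
```

The separation matters for the analysis: `send` and `receive` are code whose duration is covered by WCET bounds, while the work done in `noc_slot` is covered by the WCTT.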