Master Degree Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing
Multiprocessor architectures
• Ref: Sections 10.1, 10.2, 15, 16, 17 (except 17.5), 18.
• Background (Appendix): firmware, firmware communications, memories, caching, process management; see also Section 11 for memory and caching.
MCSN - High Performance Computing 2
Contents
Shared memory architecture = multiprocessor
• Multicore technology (Chip MultiProcessor – CMP)
• Functional and performance features of external memory and caching in multiprocessors
• Interconnection networks
• Multiprocessor taxonomy
• Local I/O
Sections 15, 16, 17, 18 contain several 'descriptive-style' parts (classifications, technologies, products, etc.), which the students can easily read on their own. During the lectures we'll concentrate on the most critical conceptual and technical issues, through examples and exercises.
Abstract architecture and physical architecture
[Figure: abstract Processing Elements (PEs), having all the main features of real PEs (processor, assembler, memory hierarchy, external memory, I/O, etc.); one PE for each process. Abstract interconnection network: all the needed direct links corresponding to the interprocess communication channels. It abstracts the physical interconnect, the memory hierarchy, the I/O, the process run-time support, the process mapping onto PEs, etc.]
All the physical details are condensed into a small number of parameters used to evaluate Lcom; calculation times Tcalc are evaluated on the abstract architecture. The result is the cost model of the specific parallel program executed on the specific parallel architecture.
Multiple Instruction stream, Multiple Data stream (MIMD) architectures
Parallelism between processes
Shared memory vs distributed memory
Multiprocessor: currently the main technology for multicore and multicore-based systems.
Multicomputer: currently the main technology for clusters and data centres; its processing nodes are multiprocessors.
We’ll start with shared memory multiprocessors.
Levels and shared memory
[Figure: levels — Applications, Processes, Assembler, Firmware, Hardware.]
Shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory, which contains all the instructions and data of processes, both private and shared.
Levels and shared memory
[Figure: the same levels — Applications, Processes, Assembler, Firmware, Hardware — with process cooperation by message passing (e.g. LC) or shared data at the process level, and the shared physical memory at the firmware level.]
Graphs of cooperating processes are expressed by a concurrent language. The RTS of the concurrent language is based on shared data structures (communication channel descriptors, process descriptors, etc.) exploiting the shared physical memory. Note: 'no sharing' (message passing) is implemented by sharing, and different shared data exist at different levels.
Generic scheme of multiprocessor
[Figure: N Processing Elements PE0 … PEN-1, each containing a CPU (processor units, MMUs, caches, local I/O) and a PE interface unit W; an N x m interconnection network performing routing and flow control; an external shared main memory of m modules M0 … Mm-1. The PE interface unit ('wrapping' unit) decouples the CPUs from the interconnect technology.]
Typical PE
[Figure: a Processing Node (PE) containing the processor units, an interrupt arbiter, MMUs, a primary cache (instructions + data), a secondary cache and local I/O units (UC), plus a PE interface unit with an external-memory interface and an I/O interface towards the interconnection network, i.e. to/from the external memory modules and the other PEs.]
Shared memory basics - 1
Example: an elementary multiprocessor
Just to understand / review basic concepts and techniques, which will be extended to real multiprocessor architectures.
[Figure: abstract PE0 and PE1 running processes P and Q; physical PE0 and PE1, each with processor P, caches C1 and C2 and interface unit W (clock cycle t), connected to a shared memory unit M (access time tM).]
Question: which kinds of requests are sent from a PE to M, and which reply is sent from M to a PE? [true/false]
1. Copy a message from P_msg to Q_vtg
2. Request a message to be assigned to Q_vtg
3. A single-word read
4. A single-word write
5. A C1-block read
6. A C1-block write
Example: an elementary multiprocessor (continued)
Answers to the previous question (cases 4-6): a single-word write, yes if Write-Through; a C1-block read, yes; a C1-block write, yes if Write-Back.
Question: what is the format (configuration of bits) of a request PE-M and of a reply M-PE?
Question: what happens if a request from PE0 and a request from PE1 arrive 'simultaneously' at M?
Behavior of the memory unit M
• The processing module is a unifying concept at the various levels: processing unit at the firmware level, process at the process level.
• The same mechanisms studied for process cooperation (LC) are applied at the firmware level too, though with different implementations and performance.
• Communication through RDY-ACK interfaces.
• Nondeterminism: test simultaneously, in the same clock cycle, all the RDYs of the input interfaces; select one of the ready requests, possibly applying a fair priority strategy.
• Nondeterminism may be implemented as real parallelism in the same clock cycle: if the input requests are compatible and the memory bandwidth is sufficient, multiple requests can be served simultaneously.
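The fair-priority selection among ready inputs can be sketched in software. This is an illustrative model (names and structure are ours, not the course's): a round-robin scan of the RDY bits starting just after the interface served last, so no requester starves.

```python
def select_request(rdy, last_served):
    """Round-robin ('fair priority') choice among ready input interfaces.

    rdy: list of booleans, one RDY bit per input interface.
    last_served: index of the interface served in the previous cycle.
    Returns the index of the chosen interface, or None if no RDY is set.
    """
    n = len(rdy)
    # Scan starting just after the last served interface, wrapping around,
    # so every requester is served within n cycles (no starvation).
    for offset in range(1, n + 1):
        i = (last_served + offset) % n
        if rdy[i]:
            return i
    return None
```

In hardware this corresponds to a priority network evaluated in a single clock cycle; the loop is only a software rendering of that combinational choice.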
Behavior of the memory unit M
• A further feature can be defined for a shared memory unit: indivisible sequences of memory accesses.
• An additional bit (INDIV) is associated with each memory request: if it is set to 1, once the associated request is accepted by M, the other requests are left pending in the input interfaces (a simple waiting-queue mechanism), until INDIV is reset to 0.
• During an indivisible sequence of memory accesses, the M behavior is deterministic.
• At the end, the nondeterministic/parallel behavior is resumed (possibly by serving a waiting request).
• This mechanism is provided by some machines: dedicated instructions (e.g. TEST_AND_SET) or annotations in LOAD and STORE instructions.
In general
[Figure: the generic multiprocessor scheme as before — PEs with CPUs and W units, interconnection network(s), memory modules M0 … Mm-1.]
Nondeterminism and parallelism in the behavior of the memory units and of the network switching units.
Technology overview and multicore
CPU technology
Pipelined and multithreaded processor technology: general view (Sections 12, 13).
[Figure: CPU with processor P, MMU, primary cache C1, secondary cache C2, and external memory interface (MINF).]
In the simplified cost model adopted for this course, this structure is invisible and is abstracted by the equivalent service time per instruction Tinstr (e.g. 2t).
Pipelined / vectorized Execution Unit
[Figure: EU_Master (dispatching, short operations, LOAD, general registers RG, floating-point registers RFP) feeding pipelined functional units — FP pipelined add/sub, INT pipelined mul/div, FP pipelined mul/div — and a collector stage; links from DM, from IU and to IU; plus vectorization facilities and general, floating-point and vector registers.]
Multithreading (‘hardware’ multithreading)
[Figure: example of a 2-thread CPU (e.g. Hyperthreading): two instruction units IU0 and IU1, with EU_Master0 and EU_Master1, share the functional units FU0 … FU3 through a switch; instruction cache IM (C1), data cache DM (C1), a shared C2 and a memory interface MINF on the CPU chip, towards I/O and the external memory.]
Ideally, an equivalent number of q N PEs is available, where q is the multithreading degree. In practice the equivalent number is a q N, with a < 1.
Multicore technology: Chip MultiProcessor (CMP)
[Figure: a single-chip CMP with N PEs (cores) — each core contains a pipelined processor with coprocessors, C1 (instructions + data), C2, local I/O and an interface unit W — connected by an internal interconnect to external memory interfaces (MINF) and I/O interfaces (I/O INF).]
For our purposes, the terms 'multicore' and 'manycore' are synonymous. We use the more general and rigorous term Chip MultiProcessor (CMP).
Internal interconnect examples for CMP
[Figure: a ring of switches (sw), each attached to a PE; a 2D toroidal mesh; a crossbar.]
Switching unit (or, simply, switch): routing and flow control.
Example of single-CMP system
[Figure: a CMP (PEs, internal interconnect, MINF and I/O INF units) connected to a high-bandwidth main memory (interleaved modules M0 … M7 with interface units IM) and, through I/O and networking interconnects, to I/O chassis, RAID subsystems (SCSI), storage servers and routers towards LANs/WANs/other subnets, with Ethernet, Fibre Channel, graphics and video interfaces.]
Example of multiple-CMP system
[Figure: m CMPs (CMP0 … CMPm-1, each with its PEs) connected by an external interconnect to a high-bandwidth shared main memory made of multiple modules.]
Intel Xeon (4-16 PEs) and Tilera Tile64 (64 PEs)
Intel Xeon Phi (64 PEs)
PE: pipelined, in-order, vectorized arithmetic, 4-thread, 2-level cache, ring interface.
Bidirectional ring interconnect.
Internal local memory (GDDR5 technology), up to 16 GB (3rd-level cache-like).
Shared memory basics - 2
Memory bandwidth and latency
High bandwidth of M (BM) is needed to:
1. minimize the latency of cache-block transfers (Tfault);
2. minimize the contention of PEs for memory accesses.
Minimize the latency of cache-block transfers
• If BM = 1 word/tM, the cache is quite useless for programs characterized only (or mainly) by locality.
• BM = s1 words/tM is the best offered bandwidth, exploitable if the remaining subsystems (interconnect, PEs) are able to sustain it.
• Solutions:
1. Interleaved macro-modules (ideally m = s1, e.g. 8)
2. High-bandwidth firmware communications from M to the interconnect and the PEs. Notice: s1-word-wide links are not realistic, so 1-word links are used.
• Pipelined communications and wormhole flow control: next week.
[Figure: interleaved macro-module 0 (M0 M1 … M7), interleaved macro-module 1 (M8 M9 … M15), …]
Cost model of FW communications (Sect. 10.1, 10.2)
Single buffering (figure):
Communication latency: Lcom = 2 (Ttr + t).
This also expresses the communication service time: if Tcalc ≥ Lcom, then Tcom = 0.
[Figure: sender and receiver units connected by RDY/ACK interfaces; t = clock cycle; Ttr = transmission latency (link only); Tcalc = calculation time; Tcom = communication time NOT overlapped with (i.e. not masked by) the internal calculation; service time Tid = Tcalc + Tcom. The timing diagram distinguishes clock cycles for calculation only from clock cycles for calculation and communication, and shows Lcom and Tid.]
Sender::                        Receiver::
wait ACK;                       wait RDY;
write msg into OUT,             use IN,
set RDY, reset ACK, …           set ACK, reset RDY, …
On chip: Ttr = 0.
Double buffering: alternate use of the two interfaces.
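The single-buffering cost model can be turned into a small calculator. This is a sketch under the definitions above (Lcom = 2 (Ttr + t); the communication is fully masked when Tcalc ≥ Lcom); the function name is ours.

```python
def comm_costs(t_tr, t, t_calc):
    """Single-buffering firmware communication cost model.

    t_tr:   transmission latency of the link (0 on chip)
    t:      clock cycle
    t_calc: sender calculation time between two consecutive sends
    Returns (Lcom, Tcom, Tid): communication latency, communication time
    not overlapped with calculation, and service time Tid = Tcalc + Tcom.
    """
    l_com = 2 * (t_tr + t)                # RDY out plus ACK back
    t_com = max(0, l_com - t_calc)        # fully masked if Tcalc >= Lcom
    return l_com, t_com, t_calc + t_com
```

For example, on chip (Ttr = 0) any Tcalc of at least 2t hides the communication completely, so the service time reduces to Tcalc.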
Elementary system example: memory latency
[Figure: the interleaved memory M (modules M0 … M7) with its memory interface unit IM; PE0 and PE1, each with C1, C2 and W; double-buffered links; clock cycle t per unit, access time tM for the memory modules.]
MEMORY ACCESS LATENCY per block: RQ0 is the sum of the request latency through the units, the memory access time tM, and the latency of the stream of s1 words pipelined word by word back through IM - W - C2 - C1.
RQ0 is the BASE memory access latency per block, i.e. without the impact of contention: optimistic, or valid for a single PE.
[Timing diagram (not to scale): the request (e.g. 48-112 bits in parallel) traverses C1, C2, W and IM with latency t per unit plus Ttr per link; M reads the s1 words in tM; the block is then pipelined word by word as a stream of s1 words.]
A more general and accurate, and easier-to-use, cost model will be studied for fully pipelined communications.
Possible optimization (not so popular): the processor could restart as soon as the word of the stream whose address generated the fault arrives.
Elementary system example: memory latency
For all units, except M: service time = t.
Memory service time: tM.
If M (the stream generator) is the bottleneck (example: s1 = m = 8): RQ0 = 68 t.
If IM-net-PE is the bottleneck (example: s1 = m = 8): RQ0 = 98 t.
(The expressions combine tM with terms of the form s1 (t + Ttr).)
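The structure of RQ0 can be explored numerically. The function below is a hypothetical sketch: the hop counts and the exact grouping of terms are illustrative assumptions of ours, not the slides' formula.

```python
def rq0(t_mem, s1, t, t_tr, request_hops=3, reply_hops=3):
    """Hypothetical sketch of the base (no-contention) block access latency:
    a request crossing request_hops units, one memory access of t_mem, then
    the block streamed back word by word at one word per (t + t_tr), after
    filling a reply pipeline of reply_hops units.  All hop counts are
    assumptions for illustration only."""
    request = request_hops * (t + t_tr)              # C2, W, IM on the way in
    stream = reply_hops * (t + t_tr) + s1 * (t + t_tr)  # pipeline fill + s1 words
    return request + t_mem + stream
```

Varying t_mem, s1 and the hop counts shows which term dominates: for large tM the memory access dominates; for long paths the s1 (t + Ttr) stream term does, which is the M-bottleneck vs IM-net-PE-bottleneck distinction above.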
Tfault
[Figure: the same elementary system — M (M0 … M7) with its memory interface unit IM; PE0 and PE1, each with C1, C2 and W.]
UNDER-LOAD memory access latency per block: RQ ≥ RQ0.
From now on: Tfault = RQ. Initially, we assume RQ = RQ0.
Example of the first week (abstract PE0 and PE1 running P and Q): for process Q, with M = 5 Mega, no reuse can be exploited, and RQ = 68 t.
Elementary system example
The evaluation of RQ0 (M bottleneck) is optimistic, because the request part of the timing diagram contains a rough simplification: the request too will be pipelined word by word, since the request (e.g. 48-112 bits) travels on a 1-word-wide link.
Exercises
1. Why is the service time equal to t for all units except M?
2. Explain the RQ0 evaluation in the example when M is not the bottleneck.
In general
[Figure: interleaved macro-modules 0, 1, 2, … (M0 M1 … M7, M8 M9 … M15, …), each with its interface unit IM, connected through the network to the W - C2 - C1 chain of each PE.]
s1 words are read in parallel by the macro-module, and sent in pipeline one word at a time through IM - network - W - C2 - C1. The reverse path is used for block writing.
Target: …
and in general it is: …
Memory bandwidth and contention
[Figure: two externally interleaved macro-modules M(0) (M0 … M7) and M(1) (M8 … M15), each with its interface unit IM, connected through the interconnect to PE0 and PE1 (each with C1, C2 and W).]
Single internally interleaved macro-module: bandwidth … blocks/sec; only one PE at a time can be served.
Two externally interleaved macro-modules (i.e. interleaved with each other), each macro-module internally interleaved (i.e. inside the macro-module): bandwidth … blocks/sec; two PEs at a time can be served, if they are not in conflict for the same macro-module.
W selects the macro-module M(j) according to the index j contained in the physical address.
[Timing diagrams contrast the conflict case, in which the second request waits for the first block stream to complete, with the no-conflict case, in which the two block streams proceed in parallel.]
In general
[Figure: N Processing Elements and m interleaved macro-modules (M0 M1 … M7, M8 M9 … M15, …), each macro-module with its interface unit IM, connected by the network.]
The destination macro-module name belongs to the routing information set (inserted by W).
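How W can derive the destination macro-module index j from the physical address can be sketched as follows; this is an illustration, and the module capacity used for the sequential organization is a hypothetical parameter.

```python
def module_and_offset(phys_addr, m, interleaved=True, module_capacity=2 ** 20):
    """Return (module index j, offset inside the module) for m macro-modules,
    m a power of two.

    Interleaved organization: consecutive addresses fall in consecutive
    modules (j = low-order address bits), so a sequential stream of
    addresses exercises all modules in parallel.
    Sequential organization: each module holds one contiguous range of
    module_capacity words (j = high-order address bits)."""
    if interleaved:
        return phys_addr % m, phys_addr // m
    return phys_addr // module_capacity, phys_addr % module_capacity
```

With m a power of two, the interleaved case reduces in hardware to taking the low-order lg2(m) bits of the address: no arithmetic is actually needed in W.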
A first idea of contention effect
[Plot: interleaved memory bandwidth (m modules, N processors) as a function of N (0 to 64), for m = 4, 8, 16, 32, 64.]
For an (externally) interleaved memory, the probability that a generic processor accesses any given (macro-)module is approximated by 1/m. With this assumption, the probability of having PEs in conflict for the same macro-module is distributed according to the binomial law. We can find (Section 17.3.5) the resulting bandwidth shown in the plot.
Simplified evaluation:
• only a subclass of multiprocessor architectures (SMP),
• no network effect on latency and conflicts,
• no impact of parallel program structures.
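Under the binomial model, a classical closed form gives the expected number of busy modules per memory cycle: m (1 - (1 - 1/m)^N). This standard result is consistent with the assumptions above, though it is not necessarily the exact expression derived in Section 17.3.5.

```python
def interleaved_bandwidth(n_proc, m):
    """Expected number of busy macro-modules per memory cycle when each of
    n_proc processors addresses one of m modules uniformly at random.

    Each module is idle with probability (1 - 1/m)**n_proc (no processor
    picked it), so the expected busy count is m times the complement."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n_proc)
```

The formula reproduces the saturation visible in the plot: for N much larger than m the bandwidth approaches m blocks per memory cycle, so adding processors beyond that point only increases contention.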
A more general client-server model will be derived: contention arises in the memory AND in the network; hence the importance of high-bandwidth and low-latency networks.
Caching
• Caching is even more important in multiprocessors,
– for latency and contention reduction,
– provided that reuse is intensively exploited.
• For shared data, intensive reuse can exist, with a proper design of the process RTS.
• However, the CACHE COHERENCE problem arises (studied in the second part of the semester).
Multiprocessor taxonomy
SMP vs NUMA architectures
[Figure: SMP — N CPUs with interface units W0 … WN-1 connected by an interconnection network to m shared memory modules M0 … Mm-1. NUMA — N nodes, each containing a CPU, a W unit and a local memory Mi, connected by an interconnection network.]
Symmetric MultiProcessor (SMP): the base latency is independent of the specific PE and memory macro-module. Also called UMA (Uniform Memory Access). Target: contention is reduced, at the expense of the base latency for shared data (optimizations are needed).
Non Uniform Memory Access (NUMA): the base latency depends (heavily) on the specific PE and the referenced macro-module. The local memories are shared: each of them can be interleaved, but they are organized sequentially with respect to each other. Local accesses have lower latency than remote ones. All private information is allocated in the local memory.
SMP-like single-CMP architecture
[Figure: the single-CMP system seen before — the CMP's PEs on the internal interconnect, with MINF and I/O INF units, attached to interleaved memory modules M0 … M7 through interface units IM.]
SMP and NUMA multiple-CMP architectures
a) multiple-CMP SMP architecture: [Figure: N CMPs (CMP0 … CMPN-1, each with its PEs and W units) connected by an external interconnect to m shared memory macro-modules with interface units IM0 … IMm-1.]
b) multiple-CMP NUMA architecture: [Figure: N CMPs (CMP0 … CMPN-1), each with its own local memory modules and interface unit IMi, connected by an external interconnect.]
Process-to-Processor Mapping
• Multiprogrammed mapping: several processes dynamically share the same PE; context-switch overhead.
• Exclusive mapping: one-to-one.
• Anonymous processors, dynamic mapping (low-level scheduling): originally SMP; 'traditional' computing servers, data-centres (?), cloud (?).
• Dedicated processors, static mapping: originally NUMA; parallel applications dedicated to specific domains.
Exercise: give an approximate evaluation of the context-switch calculation time.
Interconnection networks
Two extreme cases of networks: old-style bus vs crossbar
• The bus is no longer applicable to highly parallel systems: cheap, but with no parallelism in memory accesses, hence minimum bandwidth and maximum latency.
• The crossbar is fully interconnected, with N^2 dedicated links: maximum parallelism and bandwidth, minimum latency, but applicable to limited parallelism only (e.g., N = 8) for link-cost and pin-count reasons.
• Limited-degree networks for highly parallel systems: much lower cost than the crossbar, obtained by reducing the number of links and interfaces (pin count) at the expense of latency, while the maximum bandwidth can equally be achieved.
'High-performance' networks
• Many of the limited-degree networks studied for multiprocessors are used in distributed-memory systems and in high-performance multicomputers too.
– Notable industrial examples: Infiniband, Myrinet, QS-net, etc.
• The firmware level is the same, or very similar, for the different architectures. The main difference lies in the implementation of the routing and flow-control protocols:
– In multiprocessors and high-performance multicomputers, the primitive protocols at the firmware level are (or can be) used directly in the RTS of applications,
– without additional software layers like the TCP/IP of traditional networks.
– The overhead imposed by traditional TCP/IP implementations amounts to several orders of magnitude (e.g. msecs vs nsecs of latency!):
• no/scarce firmware support (the NIC is used for the physical layers only),
• execution in kernel mode on top of operating systems (e.g., Linux).
• We'll see that the modern network systems cited above also make the primitive firmware protocols visible
– for high-performance distributed applications, unless TCP/IP is forced by binary-portability reasons of 'old'/legacy products.
– Moreover, such networks implement TCP/IP with intensive firmware support (mainly in the NIC) and in user mode: 1-2 orders of magnitude of overhead are saved.
Firmware messages as streams
• Messages are packets transmitted as streams of elementary data units, typically words.
• Example: a cache block transmitted from the main memory as a stream of s1 words.
Evaluation metrics
At least evaluated as orders of magnitude O(f(N)).
• Cost of links: bus O(1); crossbar O(N^2), the absolute maximum; typical limited-degree networks O(1), O(N), O(N lg N).
• Maximum bandwidth: bus O(1); crossbar O(N), the absolute maximum; typical limited-degree networks O(N).
• Complexity of design to achieve the maximum bandwidth (nondeterminism vs parallelism): bus O(1); crossbar O(c^N), the absolute maximum (monolithic design); typical limited-degree networks O(c^2) = O(1) for any N (modular design).
• Latency (distance): bus O(N); crossbar O(1), the absolute minimum; typical limited-degree networks O(√N) or O(lg N), the best except O(1).
From crossbars to limited-degree networks
[Figure: a monolithic (single-unit) N x N crossbar with N bidirectional interfaces; a monolithic 2 x 2 crossbar with two input interfaces and two output interfaces, each output link driven by a MUX selecting between the input links. The 2 x 2 crossbar is assumed as the elementary building block for N x N modular designs.]
Exercise: describe the firmware behavior of the 2 x 2 switch, and prove the value of its maximum bandwidth with single buffering and with double buffering.
Modular design for limited-degree networks
[Figure: a 4 x 4 limited-degree network implemented by the limited-degree interconnection of 2 x 2 elementary crossbars: a binary butterfly with dimension n (n = 2 in the example).]
Notable example of a multi-stage network (2-stage in the example); the network dimension n = number of stages; N = 4.
Modular crossbar as a butterfly
[Figure: butterflies of 2 x 2 switches for n = 1, n = 2 and n = 3.]
'Straight' links: next stage, same level.
'Oblique' links: next stage; the base-2 representations of the source and destination levels differ only in the source-index bit.
k-ary n-fly networks
[Figure: a butterfly of switches (sw) with PEs on one side and memory modules M on the other; a fat tree with PEs at the leaves.]
Ariety k, dimension n:
• Number of processing nodes = 2N, with N = k^n
• Node degree = 2k
• Latency (distance) = n = log_k N
• Number of links and switches = O(N lg N): respectively (n - 1) 2^n and n 2^(n-1) for k = 2
• Maximum bandwidth = O(N)
• Complexity for maximum bandwidth = O(1), once the elementary crossbar is available.
Extendable to any ariety k, though it must be 'low' for limited-degree networks.
Typical utilization: SMP.
Simple deterministic routing algorithm, based on the binary representations of sender and destination, the current stage index, and the straight/oblique link choice.
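The deterministic straight/oblique rule can be sketched as follows. The bit order (most significant bit corrected at the first stage) is an assumption of this sketch; some presentations correct bits in the opposite order.

```python
def butterfly_route(src, dst, n):
    """Deterministic routing in a binary n-dimensional butterfly (2**n levels).

    At stage i the packet takes the 'straight' link if the bit of the
    current level examined at that stage already matches the destination,
    or the 'oblique' link, which flips exactly that bit.  Returns the list
    of link choices, one per stage."""
    level, path = src, []
    for i in range(n):
        bit = n - 1 - i                      # stage i corrects bit (n-1-i)
        if (level >> bit) & 1 == (dst >> bit) & 1:
            path.append('straight')
        else:
            path.append('oblique')           # oblique link flips that bit
            level ^= 1 << bit
    assert level == dst                      # after n stages we have arrived
    return path
```

Each switch decides locally from one destination bit and its own stage index: no global routing tables are needed, which is what makes the algorithm attractive at the firmware level.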
Fat tree
[Figure: the same butterfly and fat tree as in the previous slide, with PEs at the leaves.]
A tree structure (typical for NUMA) has logarithmic mean latency (e.g. n or 2n, with n = lg2(N) tree levels) and other properties similar to those of butterflies.
Routing algorithm: common-ancestor based.
In NUMA, process mapping must be chosen properly, in order to minimize distances.
However, contention in switches is too high with simple trees.
In order to minimize contention, the link and switch bandwidth increases from level to level, e.g. doubles: fat tree.
Problem: the cost and complexity of the switches also increase from level to level! Modular crossbars cannot be used, otherwise the latency increases.
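The common-ancestor routing distance in a binary tree of switches can be sketched as follows; the leaf indexing (PE i under the i-th leaf, highest differing bit giving the ancestor level) is an assumption of this sketch.

```python
def tree_route_hops(src, dst, n):
    """Common-ancestor routing length in a binary tree with n switch levels
    over 2**n leaf PEs: climb to the lowest common ancestor of the two
    leaves, then descend.  Returns the number of links traversed."""
    assert 0 <= src < 2 ** n and 0 <= dst < 2 ** n
    if src == dst:
        return 0
    # Level of the lowest common ancestor = position of the highest
    # differing bit between the two leaf indices.
    up = (src ^ dst).bit_length()   # levels to climb
    return 2 * up                   # same number of levels back down
```

The result grows only logarithmically with N, but neighbouring PEs (small XOR) pay far fewer hops than distant ones, which is why process mapping matters so much in the NUMA use of trees.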
Generalized fat tree
[Figure: a generalized fat tree: first-level, second-level and third-level crossbars, with PEs at the leaves.]
Modest increase of contention.
Suitable both for NUMA and for SMP, if the switches behave according to the butterfly routing or to the tree routing.
k-ary n-cubes
[Figure: a 4-ary 1-cube (ring of switch units), a 4-ary 2-cube and a 4-ary 3-cube. Toroidal structures: rings in every dimension.]
• Number of processing nodes: N = k^n
• Node degree = 2n
• Latency (distance) = O(k n): O(N^(1/n)) for small n, O(lg_k N) for large n. However, process mapping is critical.
• Number of links and switches = O(k^n)
• Maximum bandwidth = O(N)
• Complexity for maximum bandwidth = O(c^n) for minimum latency, otherwise O(1).
• Simple deterministic routing (dimensional).
Local Input-Output
Interprocessor communications
• In a multiprocessor, the main mode of processor cooperation for the process RTS is via shared memory.
• However, there are some cases in which asynchronous events are needed, and they are more efficiently signaled through direct interprocessor communications, i.e. via Input-Output.
• Examples:
– processor synchronization (locking, notify),
– low-level scheduling (process wake-up),
– cache-coherence strategies, etc.
• In such cases, signaling and testing the presence of asynchronous events via shared memory is very time consuming in terms of latency, bandwidth and contention.
Local I/O
[Figure: PEs (CPU0+UC0 … CPUn-1+UCn-1) with interface units W0 … Wn-1 on the interconnection structure. Inside a PE (core): IU and EU, instruction C1 and data C1, C2, and W towards the internal interconnect; the local I/O unit (UC), with its local I/O memory (MUC), connected to the CPU by Load/Store requests, Load data, and an interrupt interface (Int, Ackint, interrupt message); the UC sends output interprocessor messages and receives input interprocessor messages.]
Each PE contains an on-chip local I/O unit (UC), to send and receive interprocessor event messages. The same, or a dedicated, interconnection structure is used. A traditional I/O bus makes no sense for performance reasons: instead, dedicated on-chip links connect the UC with the CPU and W.
• To start an interprocessor communication, a CPU uses the I/O instructions: Memory Mapped I/O.
• The associated UC forwards the event message to the destination PE's UC, in the form of a word stream, through the Ws and the interconnect.
• W is able to distinguish memory access requests/replies from interprocessor communications.
• The receiving UC uses the interrupt mechanism to forward the event message to the destination CPU.
There is no request-reply behavior; instead, it is a purely asynchronous mechanism.
Example 1
Assume that the event message is composed of the event_code and two data words (data_1, data_2), and that the process running on the destination PE inserts the tuple (event_code, data_1, data_2) in a queue associated with the event.
The source CPU executes the following Memory Mapped I/O instructions:
STORE RUC, 0, PE_dest      where RUC means …
STORE RUC, 1, event_code
STORE RUC, 2, data_1
STORE RUC, 3, data_2
Interrupt message from UC to CPU: (event, parameter_1, parameter_2).
The destination CPU executes the following interrupt handler:
HANDLER: …
STORE Rbuffer_ev, Rbuffer_pointer, Revent
STORE Rbuffer_1, Rbuffer_pointer, Rparameter_1
STORE Rbuffer_2, Rbuffer_pointer, Rparameter_2
…
GOTO Rret_interrupt
Exercise:
1. What happens in a Memory Mapped I/O instruction if the I/O unit doesn't contain a physical local memory?
2. Can the STORE instructions executed by the source CPU be replaced by LOAD instructions?
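The MMIO send sequence of Example 1 can be mimicked with a toy model. All class and function names, and the four-word message layout, are hypothetical illustrations following the example's STOREs, not the machine's actual interface.

```python
class LocalIOUnit:
    """Toy model of the local I/O unit (UC): the source CPU writes the
    destination PE and the message words with Memory Mapped I/O STOREs at
    offsets 0..3; once the last word arrives, the UC forwards the tuple
    (event_code, data_1, data_2) to the destination UC, which queues it
    for the interrupt handler."""

    def __init__(self):
        self.regs = {}           # MMIO-addressable words, offsets 0..3
        self.in_queue = []       # event messages received from other PEs

    def mmio_store(self, offset, value, network):
        self.regs[offset] = value
        if offset == 3:          # last word: the message is complete, send it
            dest_uc = network[self.regs[0]]
            dest_uc.in_queue.append((self.regs[1], self.regs[2], self.regs[3]))

def send_event(network, src, dest, event_code, data_1, data_2):
    """The four STOREs of the slide, in order: PE_dest, event_code, data."""
    uc = network[src]
    for offset, word in enumerate([dest, event_code, data_1, data_2]):
        uc.mmio_store(offset, word, network)
```

Note the purely asynchronous behavior: the sender returns immediately after the last STORE, and the message sits in the receiver's queue until the interrupt (or the busy-waiting loop of Example 2) consumes it.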
Example 2
Alternative behavior: the process running on the destination PE is in a busy-waiting condition on the event message, executing the special instruction:
WAITINT Rmask, Revent, Rparameter_1, Rparameter_2
or, if the WAITINT instruction is not primitive, a simple busy-waiting loop like:
MASKINT Rmask
WAIT: GOTO WAIT
EI
(no real handler)
Synchronous vs asynchronous event notification
[Figure: a timeline of process instructions. Example 1 (asynchronous wait): the interrupt triggers the interrupt handler, which performs the event registration. Example 2 (synchronous wait): the process waits synchronously for the event message.]