Master Degree Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing
Multiprocessor architectures
Ref: Sections 10.1, 10.2, 15, 16, 17 (except 17.5), 18.
Background (Appendix): firmware, firmware communications, memories, caching, process management; see also Section 11 for memory and caching.





Master Degree Program (Laurea Magistrale) in Computer Science and Networking

High Performance Computing

Multiprocessor architectures

• Ref: Sections 10.1, 10.2, 15, 16, 17 (except 17.5), 18.
• Background (Appendix): firmware, firmware communications, memories, caching, process management; see also Section 11 for memory and caching.


MCSN - High Performance Computing 2

Contents

Shared memory architecture = multiprocessor

• Multicore technology (Chip MultiProcessor – CMP)

• Functional and performance features of external memory and caching in multiprocessors

• Interconnection networks

• Multiprocessor taxonomy

• Local I/O

Sections 15, 16, 17 and 18 contain several 'descriptive-style' parts (classifications, technologies, products, etc.), which students can easily read on their own. During the lectures we'll concentrate on the most critical issues from the conceptual and technical point of view, through examples and exercises.



Abstract architecture and physical architecture

MCSN - High Performance Computing

• Abstract Processing Elements (PEs), having all the main features of real PEs (processor, assembler, memory hierarchy, external memory, I/O, etc.); one PE for each process.

• Abstract interconnection network: all the needed direct links corresponding to the interprocess communication channels.

• Abstraction of the physical interconnect, memory hierarchy, I/O, process run-time support, process mapping onto PEs, etc.

• All the physical details are condensed into a small number of parameters used to evaluate Lcom.

• Result: a cost model of the specific parallel program executed on the specific parallel architecture, and the evaluation of the calculation times Tcalc.


MCSN - High Performance Computing 4

Multiple Instruction Stream, Multiple Data Stream (MIMD) architectures

Parallelism between processes


MCSN - High Performance Computing 5

Shared memory vs distributed memory

Multiprocessor: currently the main technology for multicore and multicore-based systems.

Multicomputer: currently the main technology for clusters and data-centres. Processing nodes are multiprocessors.

We’ll start with shared memory multiprocessors.


MCSN - High Performance Computing 6

Levels and shared memory

Hardware

Applications

Processes

Assembler

Firmware

Shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory

The shared physical memory contains all instructions and data of processes, both private and shared.


MCSN - High Performance Computing 7

Levels and shared memory

Hardware

Applications

Processes

Assembler

Firmware

message-passing (e.g. LC) or shared data

Shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory

Graphs of cooperating processes expressed by a concurrent language

The shared physical memory contains all instructions and data of processes, both private and shared.

RTS (concurrent language): based on shared data structures (communication channel descriptors, process descriptors, etc.), exploiting the shared physical memory.

'No sharing' (message passing) at the process level is implemented by sharing at the physical level.

Different shared data at different levels.


MCSN - High Performance Computing 8

Generic scheme of multiprocessor

[Figure: generic multiprocessor scheme. External shared main memory modules M0 … Mj … Mm-1; N x m interconnection network(s), performing routing and flow control; Processing Elements PE0 … PEi … PEN-1, each made of a CPU (processor units, MMUs, caches, plus local I/O) and a PE interface unit W ('wrapping' unit) that decouples the CPU from the interconnect technology.]


MCSN - High Performance Computing 9

Typical PE

[Figure: Processing Node (PE). CPU with processor units, MMUs, primary cache (instr. + data), secondary cache, interrupt arbiter; PE interface unit W, with external memory interface and I/O interface towards the interconnection network (to/from external memory modules and other PEs); local I/O units (UC).]


MCSN - High Performance Computing 10

Shared memory basics - 1


MCSN - High Performance Computing 11

Example: an elementary multiprocessor. Just to understand / review basic concepts and techniques, which will be extended to real multiprocessor architectures.

[Figure: two abstract PEs running processes P and Q; physical architecture: PE0 and PE1 (each P, C1, C2, W, clock cycle t) connected to a shared external memory unit M (access time tM).]

Question: which kinds of requests are sent from a PE to M, and which reply is sent from M to a PE? [true/false]
1. Copy a message from P_msg to Q_vtg
2. Request a message to be assigned to Q_vtg
3. A single-word read
4. A single-word write
5. A C1-block read
6. A C1-block write


MCSN - High Performance Computing 12

Example: an elementary multiprocessor (continued), with the same scheme: abstract PE0/PE1 running P and Q; physical PEs (P, C1, C2, W) connected to the shared memory unit M.

Question: what is the format (configuration of bits) of a request PE-to-M and of a reply M-to-PE?

Question: what happens if a request from PE0 and a request from PE1 arrive 'simultaneously' at M?

Answers to the true/false question of the previous slide: a single-word write is sent to M if the cache is Write-Through; a C1-block read is always sent; a C1-block write is sent if the cache is Write-Back.


MCSN - High Performance Computing 13

Behavior of the memory unit M

• Processing module as a unifying concept at the various levels = processing unit at firmware level, process at the process level.

• All the same mechanisms studied for process cooperation (LC) are applied at the firmware level too, though with different implementations and performance.

• Communication through RDY-ACK interfaces.

• Nondeterminism: test simultaneously, in the same clock cycle, all the RDYs of the input interfaces; select one of the ready requests, possibly applying a fair priority strategy.

• Nondeterminism may be implemented as real parallelism in the same clock cycle: if the input requests are compatible and the memory bandwidth is sufficient, multiple requests can be served simultaneously.
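A minimal Python sketch (my own modeling, not taken from the course text) of this nondeterministic choice: in each 'clock cycle' all RDY bits of the input interfaces are tested together, and one ready request is selected with a fair round-robin priority strategy.

```python
# Sketch of a memory unit M selecting among ready input interfaces.
# Illustrative assumption: round-robin scanning as the 'fair priority strategy'.

class MemoryUnit:
    def __init__(self, n_interfaces):
        self.rdy = [False] * n_interfaces   # RDY bit of each input interface
        self.last = n_interfaces - 1        # last interface served (for fairness)

    def select(self):
        """Return the index of the interface to serve, or None if none is ready."""
        n = len(self.rdy)
        # scan starting just after the last served interface: round-robin fairness
        for offset in range(1, n + 1):
            i = (self.last + offset) % n
            if self.rdy[i]:
                self.last = i
                self.rdy[i] = False         # request accepted in this cycle
                return i
        return None

m = MemoryUnit(4)
m.rdy[0] = m.rdy[2] = True
print(m.select())  # → 0
print(m.select())  # → 2  (fairness: scanning resumes after interface 0)
print(m.select())  # → None
```

In hardware all RDYs are tested in the same clock cycle; the sequential scan above only emulates the priority order of that combinational choice.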


MCSN - High Performance Computing 14

Behavior of the memory unit M

• A further feature can be defined for a shared memory unit: indivisible sequences of memory accesses.

• An additional bit (INDIV) is associated with each memory request: if it is set to 1, once the associated request is accepted by M, the other requests are left pending in the input interfaces (a simple waiting-queue mechanism), until INDIV is reset to 0.

• During an indivisible sequence of memory accesses, the M behavior is deterministic.

• At the end, the nondeterministic/parallel behavior is resumed (possibly by serving a waiting request).

• This mechanism is provided by some machines: dedicated instructions (e.g. TEST_AND_SET) or an annotation in LOAD and STORE instructions.


MCSN - High Performance Computing 15

In general

[Figure: the generic multiprocessor scheme again: memory modules M0 … Mm-1, interconnection network(s), PE0 … PEN-1 with their W units.]

Nondeterminism and parallelism in the behavior of memory units and of network switching units.


MCSN - High Performance Computing 16

Technology overview and multicore


MCSN - High Performance Computing 17

CPU technology

Pipelined and multithreaded processor technology: general view (Sections 12, 13).

[Figure: CPU structure: processor P, MMU, first-level cache C1, second-level cache C2, external memory interface (MINF).]

In the simplified cost model adopted for this course, this structure is invisible and abstracted by the equivalent service time per instruction Tinstr (e.g. 2t).


MCSN - High Performance Computing 18

Pipelined / vectorized Execution Unit

[Figure: EU_Master (distribution; short operations; LOAD; general registers RG; floating-point registers RFP) feeding the pipelined functional units (FP pipelined Add/Sub, INT pipelined Mul/Div, FP pipelined Mul/Div) and a collector; inputs from DM and from IU, output to IU.]

+ Vectorization facilities: general, floating-point and vector registers.


MCSN - High Performance Computing 19

Multithreading ('hardware' multithreading)

[Figure: example of a 2-thread CPU chip: instruction cache IM (C1) and data cache DM (C1); two instruction units IU0, IU1; two units EU_Master0, EU_Master1 with a switch towards the functional units FU0 … FU3; shared C2; memory interface MINF towards I/O and external memory.]

E.g. Hyperthreading. Ideally, an equivalent number q N of PEs is available, where q is the multithreading degree. In practice, the equivalent number is a q N, with a < 1.


MCSN - High Performance Computing 20

Multicore technology: Chip MultiProcessor (CMP)

[Figure: single-chip CMP: N PEs/cores (each with a pipelined processor plus coprocessor, C1 instr + data, C2, W, local I/O) connected by an internal interconnect to the memory interfaces (MINF) and the I/O interfaces (I/O INF).]

For our purposes, the terms 'multicore' and 'manycore' are synonymous. We use the more general and rigorous term 'Chip MultiProcessor' (CMP).


MCSN - High Performance Computing 21

Internal interconnect examples for CMP

[Figure: three examples: a Ring of PEs; a 2D toroidal mesh of switches (sw), each attached to a PE; a Crossbar.]

Switching Unit (or, simply, Switch): routing and flow control.


MCSN - High Performance Computing 22

Example of single-CMP system

[Figure: a CMP (PE 0 … PE N-1, internal interconnect, MINF and I/O INF interfaces) connected to a high-bandwidth main memory (interleaved modules M0 … M7 with IM units) and, through I/O chassis and interconnects, to local I/O and networking: RAID subsystems (SCSI), storage servers, routers towards LANs/WANs/other subnets, Ethernet, Fibre Channel, graphics, video.]


MCSN - High Performance Computing 23

Example of multiple-CMP system

[Figure: CMP0 … CMPm-1 (each with its PEs) connected by an external interconnect to a high-bandwidth shared main memory made of many modules M.]


MCSN - High Performance Computing 24

Intel Xeon (4-16 PEs) and Tilera Tile64 (64 PEs)


MCSN - High Performance Computing 25

Intel Xeon Phi (64 PEs)

PE: pipelined, in-order, vectorized arithmetic, 4-thread, 2-level cache, ring interface.

Bidirectional ring interconnect.

Internal local memory (GDDR5 technology), up to 16 GB (3rd-level cache-like).


MCSN - High Performance Computing 26

Shared memory basics - 2


MCSN - High Performance Computing 27

Memory bandwidth and latency

High bandwidth of M (BM) is needed for:

1. Minimize the latency of cache-block transfers (Tfault)

2. Minimize contention of PEs for memory accesses



MCSN - High Performance Computing 29

Minimize the latency of cache-block transfers

• If BM = 1 word/tM, the cache is almost useless for programs characterized by locality only (or mainly by locality)

• BM = s1 words/tM is the best offered bandwidth: exploitable if the remaining subsystems (interconnect, PEs) are able to sustain it.

• Solutions:

1. Interleaved macro-modules (hopefully m = s1, e.g. m = s1 = 8)

2. High-bandwidth firmware communications from M to the interconnect and the PEs. Notice: s1-wide links are not realistic, hence 1-word links are used.

• Pipelined communications and wormhole flow-control: next week

[Figure: interleaved macro-modules: macro-module 0 = M0 M1 … M7, macro-module 1 = M8 M9 … M15, …]
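A small sketch of the interleaving idea (the helper below and the low-order mapping are illustrative assumptions of mine): with m modules, consecutive word addresses map to distinct modules, so the s1 words of a cache block can be accessed in parallel.

```python
# Sketch of word interleaving across m memory modules.
# Illustrative assumption: low-order interleaving (module = addr mod m).

M = 8  # number of interleaved modules (hopefully m = s1)

def module_and_offset(addr):
    """Module index and local offset of a word address under interleaving."""
    return addr % M, addr // M

# The 8 words of a block at addresses 16..23 hit 8 distinct modules:
modules = [module_and_offset(a)[0] for a in range(16, 24)]
print(modules)  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

With this mapping any aligned s1-word block spans all s1 modules, which is exactly what lets the macro-module offer BM = s1 words/tM.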


MCSN - High Performance Computing 30

Cost model of FW communications (Sect. 10.1, 10.2)

Single buffering (figure):Communication latency:

This expresses also the communication service time:

Communication Latency Lcom = 2 (Ttr + t) Tcalc ≥ Lcom Tcom = 0

ACKRDY RDY

Ttrt

Tcalc Tcom

Sender

Receiver

Clock cycle for calculation only

Clock cycle for calculation and communication

Transmission Latency (Link only)

Communication time NOT overlapped to (i.e. not masked by) internal calculation

Calculation timeService time Tid = Tcalc + Tcom

Lcom

Tid

Sender:: Receiver::

wait ACK; wait RDY;write msg into OUT, use IN, set RDY, reset ACK, … set ACK, reset RDY, …

On chip Ttr = 0

Double buffering:

Alternate use of the two interfaces
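The cost model above can be turned into a toy calculation. This is a sketch under my own reading of the model (an assumption, not the book's formula): with double buffering, Tcom is the part of Lcom not masked by Tcalc; with single buffering, Lcom is never masked.

```python
# Sketch of the firmware communication cost model: Lcom = 2*(Ttr + t),
# Tid = Tcalc + Tcom. Interpretation (my assumption): with double buffering
# Tcom = max(0, Lcom - Tcalc), i.e. fully masked when Tcalc >= Lcom.

def service_time(t_calc, t_tr, t, double_buffering=True):
    l_com = 2 * (t_tr + t)                  # communication latency
    if double_buffering:
        t_com = max(0, l_com - t_calc)      # masked by calculation if possible
    else:
        t_com = l_com                       # single buffering: never masked
    return t_calc + t_com                   # Tid = Tcalc + Tcom

t = 1  # express everything in units of the clock cycle t
print(service_time(t_calc=10, t_tr=0, t=t))                          # → 10
print(service_time(t_calc=10, t_tr=0, t=t, double_buffering=False))  # → 12
```

On chip (Ttr = 0) Lcom = 2t, so with double buffering any Tcalc ≥ 2t hides the communication completely.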


MCSN - High Performance Computing 31

Elementary system example: memory latency

[Figure: PE0 and PE1 (each C1, C2, W, unit latency t) connected through double-buffered links to the memory interface unit IM (latency t) of the interleaved memory M = M0 … M7 (access time tM). The request (e.g. 48-112 bits) is sent in parallel; the block comes back as a stream of s1 words, pipelined word-by-word (timing diagram not in scale).]

MEMORY ACCESS LATENCY per block: RQ0 is the sum of the request-path latencies, the memory access time tM, and the pipelined transfer of the s1-word stream.

RQ0 is the BASE memory access latency per block, i.e. without the impact of contention: optimistic, or valid for a single PE.

A more general, more accurate, and easier-to-use cost model will be studied for fully pipelined communications.

Possible optimization (not so popular): the processor could re-start as soon as the first word of the stream arrives, if that word has the address that generated the fault.


MCSN - High Performance Computing 32

Elementary system example: memory latency (continued)

For all units except M, the service time per word is t + Ttr; the s1-word stream therefore takes s1 (t + Ttr). The service time of M per block is tM.

If M (the stream generator) is the bottleneck: tM ≥ s1 (t + Ttr). Example: s1 = m = 8 ⇒ RQ0 = 68t.

If the IM-network-PE path is the bottleneck: s1 (t + Ttr) ≥ tM. Example: s1 = m = 8 ⇒ RQ0 = 98t.


MCSN - High Performance Computing 33

Tfault

[Figure: the same elementary system (PE0 and PE1, each C1, C2, W, connected through the memory interface unit IM to the interleaved memory M = M0 … M7), together with the abstract PE0/PE1 running P and Q.]

UNDER-LOAD memory access latency per block: RQ ≥ RQ0.

From now on: Tfault = RQ. Initially, we assume RQ = RQ0.

Example of the first week, for process Q: with M = 5 Mega, no reuse can be exploited ⇒ RQ = RQ0 = 68t.


MCSN - High Performance Computing 34

Elementary system example (continued)

The evaluation of RQ0 (M bottleneck) is optimistic, because the request part of the timing diagram contains a rough simplification: the request too will be pipelined word-by-word, since the request link is 1-word wide (while the request is e.g. 48-112 bits).

Exercises:
1. Explain why the service time per word is the same for all units except M.
2. Explain the RQ0 evaluation in the example when M is not the bottleneck.



In general

MCSN - High Performance Computing

[Figure: interleaved macro-modules (macro-module 0 = M0 M1 … M7, macro-module 1 = M8 M9 … M15, …), each with its IM unit, connected through the network to the PEs (W, C2, C1).]

s1 words are read in parallel by the macro-module, and sent in pipeline one word at a time through IM - network - W - C2 - C1. Reverse path for block writing.

Target: the pipelined transfer path sustains the bandwidth offered by the macro-module.


MCSN - High Performance Computing 36

Memory bandwidth and latency

High bandwidth of M (BM) is needed for:

1. Minimize the latency of cache-block transfers (Tfault)

2. Minimize contention of PEs for memory accesses


Memory bandwidth and contention

MCSN - High Performance Computing 37

Single internally interleaved macro-module: a bandwidth of one block per block service time (blocks/sec); only one PE at a time can be served.

[Figure: two macro-modules M(0) = M0 … M7 and M(1) = M8 … M15, each with its own IM, connected through an interconnect to PE0 and PE1 (each W, C2, C1).]

Two externally interleaved macro-modules (interleaved with each other), each internally interleaved (inside the macro-module): twice the bandwidth in blocks/sec; two PEs at a time can be served, if they are not in conflict for the same macro-module.

W selects the macro-module M(j) according to the index j contained in the physical address.

[Timing diagrams: with a conflict for the same macro-module, the two block transfers are serialized; with no conflict, they proceed in parallel.]



In general

MCSN - High Performance Computing

[Figure: m interleaved macro-modules (macro-module 0 = M0 M1 … M7, macro-module 1 = M8 M9 … M15, …), each with its IM unit, connected through the network to the N Processing Elements.]

The destination macro-module name belongs to the routing information set (inserted by W).


MCSN - High Performance Computing 39

A first idea of contention effect

[Plot: interleaved memory bandwidth (m modules, N processors) as a function of N (0 … 64), for m = 4, 8, 16, 32, 64.]

For an (externally) interleaved memory, the probability that a generic processor accesses any (macro-)module is approximated by 1/m. With this assumption, the probability of having PEs in conflict for the same macro-module is distributed according to the binomial law. We can find (Section 17.3.5) :

Simplified evaluation:
• only a subclass of multiprocessor architectures (SMP),
• no network effect on latency and conflicts,
• no impact of parallel program structures.
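Under the binomial assumption described above, a commonly used closed-form estimate of the effective bandwidth is m (1 − (1 − 1/m)^N) accepted requests per cycle. The sketch below uses that estimate as an assumption on my part; the exact expression is the one derived in Section 17.3.5.

```python
# Sketch of the interleaved-memory contention estimate.
# Assumption (mine): each of N processors addresses one of m (macro-)modules
# uniformly at random; the expected number of distinct busy modules is then
# m * (1 - (1 - 1/m)**N), which approximates the accepted-requests bandwidth.

def interleaved_bandwidth(n_proc, m_modules):
    """Expected number of distinct modules referenced by n_proc processors."""
    return m_modules * (1 - (1 - 1 / m_modules) ** n_proc)

for m in (4, 8, 16):
    print(m, round(interleaved_bandwidth(16, m), 2))
```

For N = 16 the estimate saturates near m when m is small (almost every module busy, heavy conflicts) and grows towards N as m increases, which is the qualitative shape of the curves in the plot above.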


MCSN - High Performance Computing 40

A more general client-server model will be derived

Contention in memory AND in the network ⇒ the importance of high-bandwidth and low-latency networks.


MCSN - High Performance Computing 41

Caching

• Caching is even more important in multiprocessors:
– for latency and contention reduction,
– provided that reuse is intensively exploited.

• For shared data, intensive reuse can exist, with a proper design of the process RTS.

• However, the CACHE COHERENCE problem arises (studied in the second part of the semester).


MCSN - High Performance Computing 42

Multiprocessor taxonomy


MCSN - High Performance Computing 43

SMP vs NUMA architectures

[Figure, SMP: memory modules M0 … Mj … Mm-1 reachable by every CPU through the interconnection network via the W units.]

Symmetric MultiProcessor (SMP): the base latency is independent of the specific PE and memory macro-module. Also called UMA (Uniform Memory Access).

Target: contention is reduced, at the expense of the base latency for shared data (optimizations are needed).

Non-Uniform Memory Access (NUMA): the base latency depends (heavily) on the specific PE and the referred macro-module.

Local memories are shared. Each of them can be interleaved, but with respect to each other they are organized sequentially (each local memory covers a contiguous range of the physical address space).

Local accesses have lower latency than remote ones. All private information is allocated in the local memory.

[Figure, NUMA: each node has its CPU, its W unit and its local memory Mi; the nodes are connected by the interconnection network.]


MCSN - High Performance Computing 44

SMP-like single-CMP architecture

[Figure: a single CMP (PE 0 … PE N-1, internal interconnect) whose MINF interfaces connect to external interleaved memory (modules M0 … M7 with IM units), plus I/O INF interfaces.]


MCSN - High Performance Computing 45

SMP and NUMA multiple-CMP architectures

a) Multiple-CMP SMP architecture: [figure: CMP0 … CMPN-1 connected by an external interconnect to shared interleaved memory macro-modules through IM0 … IMm-1.]

b) Multiple-CMP NUMA architecture: [figure: each CMPi has its own local memory modules through IMi; the CMPs are connected by an external interconnect.]


MCSN - High Performance Computing 46

Process-to-Processor Mapping

• Anonymous processors, dynamic mapping (low-level scheduling): multiprogrammed mapping, i.e. several processes dynamically share the same PE, with context-switch overhead. Originally SMP. Typical of 'traditional' computing servers, data-centres (?), cloud (?).

• Dedicated processors, static mapping: exclusive, one-to-one mapping. Originally NUMA. Typical of parallel applications dedicated to specific domains.

Exercise: give an approximate evaluation of the context-switch calculation time.


MCSN - High Performance Computing 47

Interconnection networks


MCSN - High Performance Computing 48

Two extreme cases of networks: old-style bus vs crossbar

• The bus is no longer applicable to highly parallel systems: cheap, but no parallelism in memory accesses ⇒ minimum bandwidth and maximum latency.

• Crossbar = fully interconnected, with N² dedicated links: maximum parallelism and bandwidth, minimum latency; but applicable to limited parallelism only (e.g., N = 8), because of link cost and pin-count reasons.

• Limited-degree networks for highly parallel systems: much lower cost than the crossbar, by reducing the number of links and interfaces (pin count), at the expense of latency; but the maximum bandwidth can still be achieved.


MCSN - High Performance Computing 49

‘High-performance’ networks

• Many of the limited-degree networks studied for multiprocessors are used in distributed-memory systems and in high-performance multicomputers too.
– Notable industrial examples: InfiniBand, Myrinet, QsNet, etc.

• The firmware level is the same, or very similar, across the different architectures. The main difference lies in the implementation of the routing and flow-control protocols:
– in multiprocessors and high-performance multicomputers, the primitive protocols at the firmware level are (can be) used directly in the RTS of applications,
– without the additional software layers, like TCP-IP, of traditional networks.
– The overhead imposed by traditional TCP-IP implementations amounts to several orders of magnitude (e.g. msecs vs nsecs of latency!):
• no/scarce firmware support (the NIC is used for the physical layers only),
• execution in kernel mode on top of operating systems (e.g., Linux).

• We'll see that the modern network systems cited above also expose the primitive firmware protocols:
– for high-performance distributed applications, unless TCP-IP is forced by binary-portability requirements of 'old'/legacy products.
– Moreover, such networks implement TCP-IP with intensive firmware support (mainly in the NIC) and in user mode: 1-2 orders of magnitude of overhead are saved.


MCSN - High Performance Computing 50

Firmware messages as streams

• Messages are packets transmitted as streams of elementary data units, typically words.

• Example: a cache block transmitted from the main memory as a stream of s1 words.


MCSN - High Performance Computing 51

Evaluation metrics, evaluated at least as orders of magnitude O(f(N))

• Cost of links:
– bus O(1); crossbar O(N²): absolute maximum;
– typical limited-degree networks: O(1), O(N), O(N lg N).

• Maximum bandwidth:
– bus O(1); crossbar O(N): absolute maximum;
– typical limited-degree networks: O(N).

• Complexity of design to achieve the maximum bandwidth (nondeterminism vs parallelism):
– bus O(1); crossbar O(cN): absolute maximum (monolithic design);
– typical limited-degree networks: O(c²) ≈ O(1) for any N (modular design).

• Latency (≈ distance):
– bus O(N); crossbar O(1): absolute minimum;
– typical limited-degree networks: O(N) or O(lg N), the best except O(1).


MCSN - High Performance Computing 52

From crossbars to limited-degree networks

[Figure: an N x N monolithic (single-unit) crossbar with N bidirectional interfaces; a monolithic 2 x 2 crossbar built from input interfaces, output multiplexers and output links, assumed as the elementary building block for N x N modular designs.]

Exercise: describe the firmware behavior of the 2 x 2 switch, and prove the maximum bandwidth it achieves with single buffering and with double buffering.


MCSN - High Performance Computing 53

Modular design for limited-degree networks

[Figure: a 4 x 4 limited-degree network implemented by the limited-degree interconnection of four 2 x 2 elementary crossbars.]

Binary butterfly with dimension n (= 2 in the example): a notable example of multi-stage network (2-stage in the example). Network dimension n = number of stages; N = 4.


MCSN - High Performance Computing 54

Modular crossbar as a butterfly

[Figure: binary butterflies of dimension n = 1, 2, 3, built from 2 x 2 crossbars: n stages of 2^(n−1) switches each.]

‘straight’ links: next stage, same level.

‘oblique’ links: next stage; the base-2 representations of the source and destination levels differ only in the source-index bit.
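The straight/oblique rule determines a unique path: at stage i the message either keeps its level (straight link) or flips one bit (oblique link), so after n stages the level equals the destination. A minimal sketch, assuming stages correct bits least-significant first (a real design may fix a different bit order):

```python
def butterfly_route(src: int, dst: int, n: int) -> list:
    """Levels visited from level src to level dst in a binary
    butterfly of dimension n (2**n levels, n stages).
    At stage i: straight link if bit i already matches dst,
    oblique link (flip bit i) otherwise."""
    level = src
    path = [level]
    for i in range(n):
        mask = 1 << i
        if (level ^ dst) & mask:   # bit i differs -> oblique link
            level ^= mask
        path.append(level)
    return path

# n = 3: route from level 2 (010) to level 5 (101); each stage fixes one bit.
print(butterfly_route(2, 5, 3))    # -> [2, 3, 1, 5]
```

One stage per bit gives exactly the n-stage latency stated for butterflies.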


k-ary n-fly networks

[Figure: a binary butterfly connecting PEs to memory modules (M) through switches (sw), and a fat tree connecting PEs through switches (sw).]

Arity k, dimension n:

• Number of processing nodes = 2N, with N = k^n

• Node degree = 2k

• Latency ∝ distance = n = log_k N

• Number of links and switches = O(N lg N); for k = 2, respectively (n − 1)·2^n and n·2^(n−1)

• Maximum bandwidth = O(N)

• Complexity for maximum bandwidth = O(1), once the elementary crossbar is available.

Extendable to any arity k, though it must be ‘low’ for limited-degree networks.

Typical utilization: SMP.

Simple deterministic routing algorithm, based on the binary representations of sender and destination, the current stage index, and the straight/oblique link choice.
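The counting formulas above can be checked mechanically. A small sketch (the helper name `nfly_properties` is illustrative, not from the course material; the exact link/switch counts are filled in only for the binary case k = 2):

```python
def nfly_properties(k: int, n: int) -> dict:
    """Counting formulas for a k-ary n-fly network:
    N = k**n endpoints per side, 2N processing nodes in total."""
    N = k ** n
    return {
        "processing_nodes": 2 * N,        # N sources + N destinations
        "node_degree": 2 * k,             # k inputs + k outputs per switch
        "latency_stages": n,              # n = log_k N
        "links": (n - 1) * 2 ** n if k == 2 else None,      # binary case only
        "switches": n * 2 ** (n - 1) if k == 2 else None,   # binary case only
    }

# 8 x 8 binary butterfly (k = 2, n = 3)
print(nfly_properties(2, 3))
```

For k = 2, n = 3 this gives 16 processing nodes, 16 internal links and 12 switches, matching the formulas (n − 1)·2^n and n·2^(n−1).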


Fat tree

[Figure: the butterfly and fat-tree topologies of the previous slide: PEs at the leaves, switches (sw) at the internal levels, memory modules (M) attached to the butterfly.]

A tree structure (typical for NUMA) has logarithmic mean latency (e.g. n or 2n, with n = lg₂ N the number of tree levels), and other properties similar to those of butterflies.

Routing algorithm: common-ancestor based.

In NUMA, process mapping must be chosen properly, in order to minimize distances.

However, contention in switches is too high with simple trees.

In order to minimize contention, the link and switch bandwidth increases from level to level, e.g. it doubles: fat tree.

Problem: the cost and complexity of the switches also increase from level to level! Modular crossbars cannot be used, otherwise the latency increases.
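Common-ancestor routing can be sketched as follows: climbing one tree level halves the leaf index, so the lowest common ancestor is found by repeated halving, and the route length is twice the ancestor's level (up, then down). A minimal sketch, assuming a complete binary tree with leaves numbered left to right:

```python
def common_ancestor_level(a: int, b: int) -> int:
    """Levels to climb from leaves a and b to their lowest common
    ancestor in a complete binary tree; the common-ancestor route
    takes 2 * level hops (climb, then descend)."""
    level = 0
    while a != b:          # climb until both leaves fall in the same subtree
        a //= 2
        b //= 2
        level += 1
    return level

# Leaves 0 and 1 share a parent (level 1); leaves 0 and 7 meet at the root.
print(common_ancestor_level(0, 1), common_ancestor_level(0, 7))  # -> 1 3
```

Nearby leaves meet low in the tree, which is why process mapping that keeps communicating processes close minimizes distances in NUMA.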


Generalized fat tree

[Figure: a generalized fat tree: first-level, second-level and third-level crossbars interconnecting eight PEs.]

Modest increase of contention.

Suitable both for NUMA and for SMP, if the switches behave according to the butterfly routing or to the tree routing, respectively.


k-ary n-cubes

[Figure: a 4-ary 1-cube (ring), a 4-ary 2-cube, and a 4-ary 3-cube of switch units: toroidal structures built from rings in each dimension.]

• Number of processing nodes = N = k^n

• Node degree = 2n

• Latency ∝ distance = O(k·n)
  = O(ⁿ√N) for small n
  = O(lg N) for large n

• However, process mapping is critical.

• Number of links and switches = O(k^n) = O(N)

• Maximum bandwidth = O(N)

• Complexity for maximum bandwidth = O(c·n) for minimum latency, otherwise O(1).

• Simple deterministic routing (dimensional).
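Dimensional routing resolves one coordinate at a time; on a toroidal k-ary n-cube each dimension contributes at most ⌊k/2⌋ hops thanks to the wraparound links, which gives the O(k·n) latency above. A sketch of the resulting distance (names are illustrative):

```python
def torus_hops(src: tuple, dst: tuple, k: int) -> int:
    """Dimensional-routing distance on a k-ary n-cube (torus):
    one dimension at a time, taking the shorter of the two
    ring directions in each dimension."""
    total = 0
    for s, d in zip(src, dst):
        delta = abs(s - d)
        total += min(delta, k - delta)   # toroidal wraparound
    return total

# 4-ary 2-cube: (0,0) -> (3,2) takes 1 wraparound hop + 2 hops = 3.
print(torus_hops((0, 0), (3, 2), 4))     # -> 3
```

The network diameter is n·⌊k/2⌋: O(√N) for a 2D torus, O(lg N) for a binary hypercube (k = 2).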


Local Input-Output


Interprocessor communications

• In a multiprocessor, the main mode of processor cooperation for the process run-time support (RTS) is via shared memory.

• However, there are some cases in which asynchronous events are needed and more efficiently signaled through direct interprocessor communications, i.e. via Input-Output.

• Examples: – processor synchronization (locking, notify),

– low-level scheduling (process wake-up),

– cache coherence strategies, etc.

• In such cases, signaling and testing the presence of asynchronous events via shared memory is very time consuming in terms of latency, bandwidth and contention.


Local I/O

[Figure: n PEs connected to the Interconnection Structure through interface units W_0 … W_(n−1); each PE pairs CPU_i with its local I/O unit UC_i.]

Each PE contains an on-chip local I/O unit (UC) to send and receive interprocessor event messages. The same interconnection structure, or a dedicated one, is used. A traditional I/O bus makes no sense for performance reasons: instead, dedicated on-chip links connect the UC to the CPU and to W.

[Figure: internal structure of a PE (core): the CPU (IU, EU) with first-level instruction and data caches (C1) and a second-level cache (C2); the interface unit W toward the internal interconnect; the local I/O unit (UC) with its local I/O memory (MUC); the interrupt interface carrying Int, Ackint and the interrupt message; Load/Store requests and input/output interprocessor messages.]

• To start an interprocessor communication, a CPU uses I/O instructions: Memory Mapped I/O.

• The associated UC forwards the event message to the UC of the destination PE, in the form of a word stream through the Ws and the interconnect.

• W is able to distinguish memory access requests/replies from interprocessor communications.

• The receiving UC uses the interrupt mechanism to forward the event message to the destination CPU.

There is no request-reply behavior: it is a purely asynchronous mechanism.


Example 1

Assume that the event message is composed of the event_code and two data words (data_1, data_2), and that the process running on the destination PE inserts the tuple (event_code, data_1, data_2) in a queue associated to the event.

The source CPU executes the following Memory Mapped I/O instructions (where RUC means ...):

STORE RUC, 0, PE_dest
STORE RUC, 1, event_code
STORE RUC, 2, data_1
STORE RUC, 3, data_2

Interrupt message from UC to CPU: (event, parameter_1, parameter_2)

The destination CPU executes the following interrupt handler:

HANDLER: …
STORE Rbuffer_ev, Rbuffer_pointer, Revent
STORE Rbuffer_1, Rbuffer_pointer, Rparameter_1
STORE Rbuffer_2, Rbuffer_pointer, Rparameter_2
…
GOTO Rret_interrupt

Exercise:
1. What happens in a Memory Mapped I/O instruction if the I/O unit doesn’t contain a physical local memory?
2. Can the STORE instructions executed by the source CPU be replaced by LOAD instructions?
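The four memory-mapped STOREs and the event registration performed by the interrupt handler can be mimicked in a toy model. All names here (LocalIOUnit, mmio_store) are illustrative assumptions, not part of the course's firmware interface:

```python
from collections import deque

class LocalIOUnit:
    """Toy model of the UC behavior in Example 1: the source 'CPU'
    issues four memory-mapped stores; once the message is complete,
    the destination UC raises an 'interrupt' whose handler enqueues
    the tuple (event_code, data_1, data_2)."""
    def __init__(self):
        self.event_queue = deque()   # queue associated to the event
        self.buffer = []             # words of the message being assembled

    def mmio_store(self, word):
        # Each STORE RUC, i, value is modeled as one word written to the UC.
        self.buffer.append(word)
        if len(self.buffer) == 4:    # PE_dest + event_code + data_1 + data_2
            _, event_code, d1, d2 = self.buffer
            self.interrupt(event_code, d1, d2)
            self.buffer = []

    def interrupt(self, event_code, d1, d2):
        # Interrupt handler of Example 1: register the event tuple.
        self.event_queue.append((event_code, d1, d2))

uc = LocalIOUnit()
for w in ("PE7", "EV_WAKEUP", 10, 20):   # the four MMIO stores
    uc.mmio_store(w)
print(uc.event_queue)                     # -> deque([('EV_WAKEUP', 10, 20)])
```

Note that, as in the slide, the sender gets no reply: the model is purely asynchronous, the receiver only observes the queue.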


Example 2

Alternative behavior: the process running on the destination PE is busy-waiting for the event message, executing the special instruction:

WAITINT Rmask, Revent, Rparameter_1, Rparameter_2

or, if the WAITINT instruction is not primitive, a simple busy waiting loop like:

MASKINT Rmask
WAIT: GOTO WAIT
EI

(no real handler)


Synchronous vs asynchronous event notification

[Figure: timeline of process instructions with an incoming interrupt. Example 1: asynchronous wait; the interrupt handler performs the event registration while the process continues. Example 2: synchronous wait; the process waits until the interrupt arrives.]