
PCOD: Scalable Parallelism (ICs)

Per Stenström (c) 2008, Sally A. McKee (c) 2011

Scalable Multiprocessors

What is a scalable design? (7.1)
Realizing programming models (7.2)
Scalable communication architectures (SCAs)
  Message-based SCAs (7.3-7.5)
  Shared-memory based SCAs (7.6)

Read Dubois/Annavaram/Stenström Chapter 5.5-5.6 (COMA architectures could be a paper topic)
Read Dubois/Annavaram/Stenström Chapter 6


Scalability

Goals (P is the number of processors):
  Bandwidth: scales linearly with P
  Latency: short and independent of P
  Cost: low fixed cost, scaling linearly with P

Example: a bus-based multiprocessor
  Bandwidth: constant
  Latency: short and constant
  Cost: high fixed cost for the infrastructure, then linear


Organizational Issues

The network is composed of switches, for both performance and cost
Many concurrent transactions are allowed
Distributed memory can bring down bandwidth demands

Bandwidth scaling: no global arbitration or ordering; broadcast bandwidth is fixed and expensive

[Figure: two organizations built around a scalable network of switches. In the dance-hall organization, processors with caches sit on one side of the network and the memory modules on the other. In the distributed memory organization, each node couples a processor, cache, memory, and a communication assist (CA) to the network.]


Scaling Issues

Latency scaling: T(n) = Overhead + Channel Time + Routing Delay
  Channel Time is a function of bandwidth
  Routing Delay is a function of the number of hops in the network

Cost scaling: Cost(p,m) = Fixed cost + Incremental cost(p,m)
A design is cost-effective if speedup(p,m) > costup(p,m), where costup(p,m) = Cost(p,m)/Cost(1,m)
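A small worked example of the cost-effectiveness test (the numbers below are assumed, purely for illustration):

```latex
% Illustrative numbers: fixed cost 100, incremental cost 10 per node.
\[
\mathrm{costup}(32) \;=\; \frac{\mathrm{Cost}(32)}{\mathrm{Cost}(1)}
  \;=\; \frac{100 + 32 \cdot 10}{100 + 1 \cdot 10} \;\approx\; 3.8
\]
% The 32-processor design is cost-effective for a workload only if
% speedup(32) > 3.8, i.e. a parallel efficiency of only about 12%
% already justifies the extra hardware.
```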


Physical Scaling

Chip-, board-, and system-level partitioning has a big impact on scaling; however, there is little consensus on how to do it

[Figure: physical partitioning at the system and node level (the Thinking Machines CM-5 organization). Three networks (diagnostics, control, data) connect processing partitions, control processors, and an I/O partition. Each processing node combines a SPARC processor on an MBUS with an FPU, a cache controller plus SRAM, DRAM controllers with DRAM and vector units, and a network interface (NI) to the data and control networks.]


Network Transaction Primitives

Primitives used to implement the programming model on a scalable machine

[Figure: a serialized message travels one way from the output buffer of the source node, across the communication network, into the input buffer of the destination node.]

One-way transfer between source and destination
Resembles a bus transaction, but much richer in variety

Examples: a message send transaction; a write transaction in a SAS machine
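As a concrete illustration (the struct and field names below are assumptions for this sketch, not an interface defined in the lecture), a network transaction can be viewed as a small self-describing packet that the source builds in its output buffer and the destination's communication assist interprets from its input buffer:

```c
#include <stdint.h>

/* Hypothetical layout of a one-way network transaction. */
typedef struct {
    uint16_t dest_node;   /* destination naming: where the network routes it   */
    uint16_t src_node;    /* needed if a separate response transaction follows */
    uint8_t  kind;        /* e.g. READ_REQ, READ_RESP, WRITE_REQ, MSG_SEND     */
    uint8_t  user_level;  /* protection: user vs. system transaction           */
    uint16_t length;      /* payload size in bytes (fixed or variable)         */
    uint64_t address;     /* global address or message tag, depending on kind  */
    uint8_t  payload[];   /* serialized data streamed through the network      */
} net_txn_t;
```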


Bus vs. Network Transactions

Design issue                   Bus transaction             Network transaction
Protection                     V->P address translation    Done at multiple points
Format                         Fixed                       Flexible
Output buffering               Simple                      Support flexible in format
Media arbitration              Global                      Distributed
Destination name & routing     Direct                      Via several switches
Input buffering                One source                  Several sources
Action                         Response                    Rich diversity
Completion detection           Simple                      Response transaction
Transaction ordering           Global order                No global order


SAS Transactions

Issues: fixed or variable size transfers; deadlock avoidance when input buffers fill

[Figure: time line of a remote read. The source issues Load r <- [Global address]; a read request crosses the network, memory is accessed at the destination, and a read response returns while the source waits.]

(1) Initiate memory access
(2) Address translation
(3) Local/remote check
(4) Request transaction
(5) Remote memory access
(6) Reply transaction
(7) Complete memory access
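A minimal sketch of those seven steps as seen from the requesting node (the helpers `translate`, `home_node`, `ca_send`, and `ca_wait_reply` are hypothetical names for the communication-assist interface, not a real machine's API):

```c
#include <stdint.h>

uint64_t sas_read(uint64_t global_va)              /* (1) initiate memory access  */
{
    uint64_t pa   = translate(global_va);          /* (2) address translation     */
    int      home = home_node(pa);                 /* (3) local/remote check      */

    if (home == my_node())
        return local_mem_read(pa);                 /* ordinary local access       */

    ca_send(home, READ_REQ, pa, /*len=*/8);        /* (4) request transaction     */
    /* The home node performs (5) the remote memory access and                    */
    /* (6) sends the reply transaction back across the network.                   */
    return ca_wait_reply(READ_RESP, pa);           /* (7) complete memory access  */
}
```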


Sequential Consistency

[Figure: (a) P1, P2, and P3, each with a local memory, connected by an interconnection network; P1 executes "A=1; flag=1;" while P3 executes "while (flag==0); print A;", with A (initially 0) and flag (0 -> 1) homed in different memories. (b) The write 1: A=1 travels a congested path and is delayed, so 2: flag=1 and 3: load A can complete first, which would let P3 read the old value of A.]

Issues: writes need acks to signal completion; SC may cause extreme waiting times
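The example from the figure, written out as code (a sketch: `A` and `flag` are assumed to live in different memory modules, which is exactly what lets the congested path reorder the writes unless each write is acknowledged before the next one is issued):

```c
#include <stdio.h>

/* Shared variables, possibly homed in different memory modules. */
volatile int A = 0, flag = 0;

void producer(void)          /* runs on P1 */
{
    A = 1;                   /* 1: may take the congested path to A's home   */
    flag = 1;                /* 2: may reach flag's home first               */
}

void consumer(void)          /* runs on P3 */
{
    while (flag == 0)        /* 3: spins until flag becomes 1                */
        ;
    printf("%d\n", A);       /* SC requires this to print 1; without an ack */
}                            /* for "A = 1" it could observe a stale 0      */
```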


Message Passing

Multiple flavors of synchronization semantics

Blocking versus non-blocking
  Blocking send/recv returns when the operation completes
  Non-blocking returns immediately (a probe function tests completion)
Synchronous
  Send completes after the matching receive has executed
  Receive completes after the data transfer from the matching send completes
Asynchronous (buffered, in MPI terminology)
  Send completes as soon as the send buffer may be reused
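In MPI terms these flavors map onto different send calls; the sketch below uses standard MPI routines, assumes a matching receiver exists on rank `dest`, and is meant only to illustrate the semantics:

```c
#include <mpi.h>

void send_flavors(int dest, double *buf, int n)
{
    MPI_Request req;
    int done = 0;

    /* Blocking, standard mode: returns once buf may be reused,
     * possibly before the matching receive has executed.         */
    MPI_Send(buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);

    /* Blocking, synchronous: completes only after the matching
     * receive has started (the rendezvous semantics above).      */
    MPI_Ssend(buf, n, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);

    /* Non-blocking: returns immediately; a probe/test call
     * checks for completion later.                               */
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 2, MPI_COMM_WORLD, &req);
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}
```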


Synchronous MP Protocol

Alternative: Keep match table at the sender, enabling a two-phase receive-initiated protocol

[Figure: time line of the sender-initiated synchronous protocol. The source executes Send Pdest, local VA, len and issues a send-ready request, then waits; the destination executes Recv Psrc, local VA, len, performs the tag check, and answers with a recv-ready reply; the source then issues the bulk data-transfer request (source VA -> dest VA or ID).]

(1) Initiate send
(2) Address translation on Psrc
(3) Local/remote check
(4) Send-ready request
(5) Remote check for posted receive (assume success)
(6) Reply transaction
(7) Bulk data transfer (source VA -> dest VA or ID)
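A compact sketch of the three network transactions (handler and helper names such as `ca_send`, `match_posted_recv`, and `ca_bulk_put` are hypothetical):

```c
#include <stddef.h>

/* Source side: after the local checks, issue the send-ready and wait. */
void sync_send(int dest, void *src_va, size_t len, int tag)
{
    ca_send(dest, SEND_RDY_REQ, tag, len);            /* (4) send-ready request  */
    remote_dest_t d = wait_for(RECV_RDY_REPLY, tag);  /* source waits            */
    ca_bulk_put(dest, src_va, d.dest_va, len);        /* (7) bulk data transfer  */
}

/* Destination side: run by the communication assist on SEND_RDY_REQ. */
void on_send_rdy(int src, int tag, size_t len)
{
    recv_post_t *r = match_posted_recv(src, tag);     /* (5) tag check           */
    if (r)                                            /* assume receive posted   */
        ca_reply(src, RECV_RDY_REPLY, tag, r->dest_va, len);  /* (6) reply       */
}
```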


Asynchronous Optimistic MP Protocol

Issues: copying overhead at the receiver, from the temporary buffer to user space; huge buffer space needed at the receiver to cope with the worst case

[Figure: time line of the optimistic protocol. The source executes Send (Pdest, local VA, len) and immediately pushes a data-transfer request; the destination attempts the tag match and, if no receive is posted, allocates a buffer to hold the data until Recv Psrc, local VA, len is executed.]

(1) Initiate send
(2) Address translation
(3) Local/remote check
(4) Send data
(5) Remote check for posted receive; on failure, allocate a data buffer
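A receiver-side sketch of the optimistic protocol, showing where the extra buffer space and the copy overhead come from (all helper names are assumptions):

```c
#include <stddef.h>
#include <string.h>

/* Run by the communication assist when an unsolicited data-transfer
 * request arrives (step 5).                                          */
void on_eager_data(int src, int tag, const void *data, size_t len)
{
    recv_post_t *r = match_posted_recv(src, tag);   /* tag match              */
    if (r) {
        memcpy(r->dest_va, data, len);              /* receive already posted */
        complete_recv(r);
    } else {
        void *tmp = alloc_temp_buffer(len);         /* must be sized for the  */
        memcpy(tmp, data, len);                     /* worst case; data gets  */
        enqueue_unexpected(src, tag, tmp, len);     /* copied again when the  */
    }                                               /* Recv is finally posted */
}
```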


Asynchronous Robust MP Protocol

Note: after handshake, send and recv buffer addresses are known, so data transfer can be performed with little overhead

[Figure: time line of the robust protocol. The source executes Send Pdest, local VA, len, issues a send-ready request, and returns to compute; the destination performs the tag check, records the send-ready when no receive is posted, and later, when Recv Psrc, local VA, len executes, issues a receive-ready request that triggers the bulk data reply (source VA -> dest VA or ID).]

(1) Initiate send
(2) Address translation on Pdest
(3) Local/remote check
(4) Send-ready request
(5) Remote check for posted receive (assume failure); record send-ready
(6) Receive-ready request
(7) Bulk data reply (source VA -> dest VA or ID)
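The robust variant changes only what happens when the check at step (5) fails: the send-ready is recorded rather than any data being pushed, and the later Recv pulls the data. A short sketch with hypothetical helpers:

```c
#include <stddef.h>

/* Destination: SEND_RDY_REQ arrives, but no matching receive is posted. */
void on_send_rdy_robust(int src, int tag, size_t len)
{
    record_send_ready(src, tag, len);        /* (5) remember the sender's offer */
}

/* Destination: the application later posts Recv Psrc, local VA, len. */
void robust_recv(int src, void *dest_va, size_t len, int tag)
{
    if (lookup_send_ready(src, tag))
        ca_send(src, RECV_RDY_REQ, tag, dest_va, len);  /* (6) receive-ready    */
    /* The source answers with (7) the bulk data reply directly into dest_va,   */
    /* which is cheap because both buffer addresses are now known.              */
}
```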


Active Messages

User-level analog of network transactions: transfer a data packet and invoke a handler that extracts it from the network and integrates it with the on-going computation

[Figure: a request active message invokes a handler at the destination; that handler may in turn issue a reply active message, which invokes a handler back at the source.]
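A minimal sketch of the idea (a generic, hypothetical interface; real active-message layers differ in detail): the message carries the index of a user-level handler that the destination invokes on the payload:

```c
#include <stddef.h>

typedef void (*am_handler_t)(int src, void *payload, size_t len);

extern am_handler_t handler_table[];   /* handlers registered at program start */

/* Run at user level when an active message arrives. */
void am_handle_incoming(int src, int handler_idx, void *payload, size_t len)
{
    am_handler_t h = handler_table[handler_idx];
    h(src, payload, len);   /* extracts the data from the network and integrates
                               it with the on-going computation; it may in turn
                               send a reply active message back to src           */
}
```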


Challenges Common to SAS and MP

Input buffer overflow: how to signal buffer space is exhausted

Solutions:
  ACK at the protocol level
  Back-pressure flow control
  Special ACK path, or drop packets (requires a time-out)

Fetch deadlock (revisited): a request often generates a response that can form dependence cycles in the network

Solutions:
  Two logically independent request/response networks
  NACK requests at the receiver to free space
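One common way to realize back-pressure is credit-based flow control; a small sender-side sketch (the constants, `msg_t`, and the `ca_*`/`poll_network` helpers are assumptions):

```c
#define MAX_NODES        1024   /* illustrative system size                      */
#define CREDITS_PER_DEST 8      /* input-buffer slots we own at each destination */

static int credits[MAX_NODES];  /* assume each entry starts at CREDITS_PER_DEST  */

void flow_controlled_send(int dest, msg_t *m)
{
    while (credits[dest] == 0)
        poll_network();         /* keep draining incoming traffic (returned
                                   credits, requests) -- this also avoids
                                   fetch deadlock while we are stalled            */
    credits[dest]--;
    ca_send_msg(dest, m);
}

void on_credit_return(int src)  /* destination freed one of our buffer slots     */
{
    credits[src]++;
}
```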


Spectrum of Designs

None, physical bit stream: blind, physical DMA (nCUBE, iPSC, ...)
User/System:
  User-level port (CM-5, *T)
  User-level handler (J-Machine, Monsoon, ...)
Remote virtual address: processing, translation (Paragon, Meiko CS-2)
Global physical address: proc + memory controller (RP3, BBN, T3D)
Cache-to-cache: cache controller (Dash, KSR, Flash)

Moving down the list: increasing HW support, specialization, intrusiveness, performance (???)


MP Architectures

Design tradeoff: how much processing in the CA vs. in P, and how much interpretation of the network transaction
  Physical DMA (7.3)
  User-level access (7.4)
  Dedicated message processing (7.5)

[Figure: node architecture. Each node (P, M) attaches to the scalable network through a communication assist (CA). Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]


Physical DMA

The node processor packages messages in user/system mode
DMA is used to copy between the network and system buffers
Problem: there is no way to distinguish user from system messages, which results in much overhead because the node processor must be involved

[Figure: DMA-based node pair. The sending node's processor builds a command (dest, data) in memory and programs the outbound DMA channel (Addr, Length, Rdy); the receiving node's inbound DMA channel deposits the message into a system buffer and signals via status/interrupt.]

Example: nCUBE/2, IBM SP1
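A sketch of the kind of descriptor the node processor might write to the outbound DMA channel, loosely following the Addr/Length/Rdy registers in the figure (the layout and names are assumptions, not a real machine interface):

```c
#include <stdint.h>

/* Hypothetical memory-mapped outbound DMA channel of the network interface. */
typedef struct {
    volatile uint64_t addr;     /* physical address of the system send buffer  */
    volatile uint32_t length;   /* number of bytes to stream onto the network  */
    volatile uint32_t rdy;      /* written last: marks the descriptor as valid */
} dma_out_t;

void dma_send(dma_out_t *ch, uint64_t buf_pa, uint32_t len)
{
    /* The node processor has already packaged the message (dest, data)
     * into a system buffer; the DMA engine just copies it to the network. */
    ch->addr   = buf_pa;
    ch->length = len;
    ch->rdy    = 1;
}
```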


User-Level Access

The network interface is mapped into the user address space
The communication assist does protection checks, translation, etc.

No intervention by kernel except for interrupts

[Figure: user-level network port. The source processor writes the destination and data directly to its memory-mapped network interface (messages tagged user/system); the destination processor reads arriving messages from its interface, with status/interrupt signalling.]

Example: CM-5


Dedicated Message Processing

The message processor (MP):
  Interprets messages
  Supports message operations
  Off-loads P by providing a clean message abstraction

[Figure: each node contains a compute processor P (user level) and a dedicated message processor MP (system level) that share memory (M) and a network interface (NI); the MPs exchange messages across the network on behalf of the compute processors.]

Issues: P and MP communicate via shared memory, which generates coherence traffic; the MP can become a bottleneck because it handles all the concurrent actions


Shared Physical Address Space

Remote reads/writes are performed by pseudo processors and pseudo memories
Cache coherence issues are treated in Ch. 8

[Figure: each node contains a processor P with memory M plus a pseudo memory and a pseudo processor attached to the scalable network; a remote access leaves the requesting node through its pseudo memory and is performed at the home node by that node's pseudo processor.]