04/19/23, slide 1: PCOD: Scalable Parallelism (ICs)
Per Stenström (c) 2008, Sally A. McKee (c) 2011
Scalable Multiprocessors
- What is a scalable design? (7.1)
- Realizing programming models (7.2)
- Scalable communication architectures (SCAs):
  - message-based SCAs (7.3-7.5)
  - shared-memory based SCAs (7.6)
Read Dubois/Annavaram/Stenström Chapters 5.5-5.6 (COMA architectures could be a paper topic) and Chapter 6.
Scalability
Goals (P is the number of processors):
- Bandwidth: scales linearly with P
- Latency: short and independent of P
- Cost: low fixed cost, scaling linearly with P
Example: a bus-based multiprocessor
- Bandwidth: constant
- Latency: short and constant
- Cost: high fixed cost for the infrastructure, then linear
Organizational Issues
- Network composed of switches for performance and cost
- Many concurrent transactions allowed
- Distributed memory can bring down bandwidth demands
Bandwidth scaling:
- no global arbitration and ordering
- broadcast bandwidth is fixed and expensive
[Figure: dance-hall memory organization (processors with caches on one side of a scalable network of switches, memories on the other) versus distributed memory organization (each node couples P, $, and M to the network through a communication assist, CA).]
Scaling Issues
Latency scaling: T(n) = Overhead + Channel Time + Routing Delay
- Channel Time is a function of bandwidth
- Routing Delay is a function of the number of hops in the network
Cost scaling: Cost(p, m) = Fixed Cost + Incremental Cost(p, m)
- a design is cost-effective if speedup(p, m) > costup(p, m)
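As a concrete reading of these two formulas, here is a small Python sketch; all function names and parameter values are invented for illustration and are not from the course material:

```python
# Sketch of the slide's latency and cost scaling models.
def latency(n_bytes, overhead, bandwidth, hops, delay_per_hop):
    """T(n) = Overhead + Channel Time + Routing Delay."""
    return overhead + n_bytes / bandwidth + hops * delay_per_hop

def cost(p, fixed_cost, cost_per_node):
    """Cost(p) = Fixed Cost + Incremental Cost."""
    return fixed_cost + p * cost_per_node

def is_cost_effective(speedup, p, fixed_cost, cost_per_node):
    """A design is cost-effective if speedup(p) > costup(p)."""
    costup = cost(p, fixed_cost, cost_per_node) / cost(1, fixed_cost, cost_per_node)
    return speedup > costup
```

With a large fixed cost, costup grows slowly with p, so even modest speedups can make a design cost-effective; with a small fixed cost, costup approaches p and the bar is much higher.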
Physical Scaling
Chip-, board-, and system-level partitioning has a big impact on scaling. However, there is little consensus.
[Figure: CM-5 organization: data, control, and diagnostics networks connect processing partitions, control processors, and an I/O partition. Each node attaches a SPARC processor, FPU, cache controller with SRAM, DRAM controllers with DRAM, vector units, and a network interface (NI) over the MBUS.]
Network Transaction Primitives
Primitives to implement the programming model on a scalable machine.
[Figure: a serialized message travels from the source node's output buffer across the communication network to the destination node's input buffer.]
- One-way transfer between source and destination
- Resembles a bus transaction, but much richer in variety
Examples:
- a message send transaction
- a write transaction in a SAS machine
Bus vs. Network Transactions
Design issues compared:

Issue                        Bus transaction           Network transaction
---------------------------  ------------------------  -------------------------
Protection                   V->P address translation  Done at multiple points
Format                       Fixed                     Flexible
Output buffering             Simple                    Supports flexible formats
Media arbitration            Global                    Distributed
Destination name & routing   Direct                    Via several switches
Input buffering              One source                Several sources
Action                       Response                  Rich diversity
Completion detection         Simple                    Response transaction
Transaction ordering         Global order              No global order
SAS Transactions
Issues:
- fixed or variable size transfers
- deadlock avoidance when the input buffer is full
Timeline (source on the left, destination on the right; the source waits between request and response):
(1) Initiate memory access: load r <- [global address]
(2) Address translation
(3) Local/remote check
(4) Request transaction: the read request travels to the destination
(5) Remote memory access
(6) Reply transaction: the read response returns to the source
(7) Complete memory access
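Steps (1)-(7) can be mimicked in a toy Python model; the node layout, the word-interleaved address translation, and all names below are assumptions for illustration, and the network round trip is abstracted into a direct reply:

```python
# Toy model of a remote read in a shared-address-space machine.
class Node:
    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = memory            # local memory: list of words

class Machine:
    def __init__(self, nodes, words_per_node):
        self.nodes = nodes
        self.words_per_node = words_per_node

    def load(self, requester, global_addr):
        # (2) address translation: global address -> (home node, local addr)
        home = global_addr // self.words_per_node
        local = global_addr % self.words_per_node
        # (3) local/remote check
        if home == requester.node_id:
            return requester.memory[local]      # local memory access
        # (4) request, (5) remote memory access, (6) reply, (7) complete
        return self.nodes[home].memory[local]

n0 = Node(0, [10, 11])
n1 = Node(1, [20, 21])
machine = Machine([n0, n1], words_per_node=2)
```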
Sequential Consistency
[Figure (a): P1, P2, P3, each with memory, on an interconnection network; P1 runs "A=1; flag=1;", P3 runs "while (flag==0); print A;", with A:0 and flag:0->1 in memory.]
[Figure (b): transactions 1: A=1, 2: flag=1, 3: load A; the write to A is delayed on a congested path, so P3 can observe flag==1 while A is still 0.]
Issues:
- writes need acks to signal completion
- SC may cause extreme waiting times
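A tiny Python sketch makes the hazard concrete: if the write to A takes the congested path (the delays below are invented), the write to flag arrives first, and a reader can observe flag==1 while A is still 0. This is exactly why a write must be acknowledged before the next write is made visible:

```python
# Model unordered delivery: apply writes in arrival order, not issue order.
def apply_writes(writes):
    """writes: list of (network_delay, location, value) issued by P1."""
    memory = {"A": 0, "flag": 0}
    trace = []
    for _, loc, val in sorted(writes, key=lambda w: w[0]):
        memory[loc] = val
        trace.append((loc, dict(memory)))   # snapshot after each arrival
    return trace

# P1 issues A=1 (congested path, delay 5) then flag=1 (fast path, delay 1).
writes = [(5, "A", 1), (1, "flag", 1)]
trace = apply_writes(writes)
# After flag=1 arrives, A is still 0: P3 may read the stale A.
```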
Message Passing
Multiple flavors of synchronization semantics.
Blocking versus non-blocking:
- a blocking send/recv returns when the operation completes
- a non-blocking send/recv returns immediately (a probe function tests completion)
Synchronous:
- send completes after the matching receive has executed
- receive completes after the data transfer from the matching send completes
Asynchronous (buffered, in MPI terminology):
- send completes as soon as the send buffer may be reused
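The blocking/non-blocking distinction can be sketched over a simple mailbox; the class and method names below are invented (real MPI exposes this as MPI_Recv, MPI_Irecv, and MPI_Probe):

```python
import queue

# Minimal mailbox sketch of the receive-side semantics.
class Mailbox:
    def __init__(self):
        self.q = queue.Queue()

    def send(self, msg):
        self.q.put(msg)            # buffered send: returns once enqueued

    def recv_blocking(self):
        return self.q.get()        # blocks until a message has arrived

    def probe(self):
        return not self.q.empty()  # non-blocking completion test
```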
Synchronous MP Protocol
Timeline (source on the left, destination on the right):
(1) Initiate send: send(Pdest, local VA, len)
(2) Address translation on Psrc
(3) Local/remote check
(4) Send-ready request travels to the destination; the source waits
(5) Remote check for a posted receive, recv(Psrc, local VA, len), with tag check (assume success)
(6) Reply transaction: recv-ready reply returns to the source
(7) Bulk data transfer: data-xfer request from source VA to dest VA or ID
Alternative: keep the match table at the sender, enabling a two-phase receive-initiated protocol.
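A minimal event-level sketch of the three-phase exchange, assuming the matching receive is already posted; the function and message names are invented, and the network is modeled as a shared list of transactions:

```python
# Three-phase synchronous protocol: send-rdy, recv-rdy, data-xfer.
def synchronous_send(network, src, dst, data):
    network.append(("send-rdy", src, dst))          # (4) send-ready request
    # (5) destination checks for a posted receive (assumed to succeed here)
    network.append(("recv-rdy", dst, src))          # (6) recv-ready reply
    network.append(("data-xfer", src, dst, data))   # (7) bulk data transfer
    return data

net = []
payload = synchronous_send(net, 0, 1, b"hello")
# The three protocol phases appear on the network in order.
```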
Asynchronous Optimistic MP Protocol
Timeline:
(1) Initiate send: send(Pdest, local VA, len)
(2) Address translation
(3) Local/remote check
(4) Send data: data-xfer request
(5) Remote check for a posted receive; on fail, allocate a data buffer and tag-match when recv(Psrc, local VA, len) is later posted
Issues:
- copying overhead at the receiver from the temp buffer to user space
- huge buffer space needed at the receiver to cope with the worst case
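A sketch of the receiver side of the optimistic protocol, showing both the extra copy through a temp buffer and the matched fast path; the class, method, and tag names are invented for illustration:

```python
# Receiver-side bookkeeping for the optimistic protocol.
class OptimisticNode:
    def __init__(self):
        self.posted = {}      # tag -> user buffer (posted receives)
        self.unexpected = {}  # tag -> temp buffer (early arrivals)

    def arrive(self, tag, data):
        if tag in self.posted:
            self.posted.pop(tag).extend(data)   # matched: deliver directly
        else:
            self.unexpected[tag] = list(data)   # unmatched: allocate temp buffer

    def recv(self, tag, user_buf):
        if tag in self.unexpected:
            user_buf.extend(self.unexpected.pop(tag))  # extra copy to user space
        else:
            self.posted[tag] = user_buf         # post receive for later match
```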
Asynchronous Robust MP Protocol
Timeline:
(1) Initiate send: send(Pdest, local VA, len)
(2) Address translation on Pdest
(3) Local/remote check
(4) Send-ready request; the source returns and computes
(5) Remote check for a posted receive (assume fail); record the send-ready; tag check when recv(Psrc, local VA, len) is posted
(6) Receive-ready request back to the source
(7) Bulk data reply: data-xfer from source VA to dest VA or ID
Note: after the handshake, the send and recv buffer addresses are known, so the data transfer can be performed with little overhead.
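The destination's bookkeeping in the robust protocol can be sketched as follows; all names are invented, and the fetch callback stands in for the receive-ready request and the bulk data reply:

```python
# Destination side of the robust protocol: record send-ready, pull data at recv.
class RobustDest:
    def __init__(self):
        self.pending_sends = {}   # tag -> source node

    def send_ready(self, tag, src):
        self.pending_sends[tag] = src        # (5) no receive posted: record it

    def recv(self, tag, fetch_from_source):
        src = self.pending_sends.pop(tag)    # (6) receive-ready request to src
        return fetch_from_source(src, tag)   # (7) bulk data reply

dest = RobustDest()
dest.send_ready("msg", src=0)
data = dest.recv("msg", lambda src, tag: ("payload", src, tag))
```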
Active Messages
User-level analog of network transactions: transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation.
[Figure: a request invokes a handler at the destination; the reply invokes a handler back at the source.]
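A minimal sketch of active-message dispatch, assuming a handler table keyed by name; the registration API and the packet format are invented for illustration:

```python
# Each packet names a handler that is invoked on arrival to integrate
# the payload into the ongoing computation.
handlers = {}

def register(name, fn):
    handlers[name] = fn

def deliver(packet):
    name, payload = packet
    return handlers[name](payload)   # invoke handler straight off the network

counters = {"sum": 0}
register("add", lambda x: counters.__setitem__("sum", counters["sum"] + x))
deliver(("add", 5))
deliver(("add", 7))
```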
Challenges Common to SAS and MP
Input buffer overflow: how to signal that buffer space is exhausted.
Solutions:
- ACK at the protocol level (back-pressure flow control)
- a special ACK path, or drop packets (requires a time-out)
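Back-pressure flow control is often realized with credits; here is a sketch under that assumption (the class and method names are invented): a sender may only inject a message while the receiver still advertises buffer space, and draining a buffer slot returns a credit.

```python
# Credit-based back-pressure flow control over one link.
class CreditLink:
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # advertised receiver buffer space
        self.in_flight = []

    def try_send(self, msg):
        if self.credits == 0:
            return False              # back pressure: sender must wait
        self.credits -= 1
        self.in_flight.append(msg)
        return True

    def consume(self):
        msg = self.in_flight.pop(0)   # receiver drains a buffer slot
        self.credits += 1             # and returns a credit (ACK)
        return msg
```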
Fetch deadlock (revisited): a request often generates a response, which can form dependence cycles in the network.
Solutions:
- two logically independent request/response networks
- NACK requests at the receiver to free space
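The two-network solution can be sketched as two queues where responses are always drained before requests, so a full request queue can never block a reply and the dependence cycle is broken (names are invented):

```python
from collections import deque

# A node endpoint with logically separate request and response networks.
class TwoNetworkNode:
    def __init__(self):
        self.requests = deque()
        self.responses = deque()

    def poll(self):
        if self.responses:            # responses never wait behind requests
            return ("resp", self.responses.popleft())
        if self.requests:
            return ("req", self.requests.popleft())
        return None
```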
Spectrum of Designs
- None (physical bit stream): blind, physical DMA (nCUBE, iPSC, ...)
- User/System:
  - user-level port (CM-5, *T)
  - user-level handler (J-Machine, Monsoon, ...)
- Remote virtual address: processing, translation (Paragon, Meiko CS-2)
- Global physical address: processor + memory controller (RP3, BBN, T3D)
- Cache-to-cache: cache controller (Dash, KSR, Flash)
(Axis: increasing HW support, specialization, intrusiveness, performance (???))
MP Architectures
Design tradeoff: how much processing in the CA versus the P, and how much interpretation of the network transaction.
- Physical DMA (7.3)
- User-level access (7.4)
- Dedicated message processing (7.5)
[Figure: node architecture: each node (P, M) attaches to the scalable network through a communication assist (CA) that handles the message. Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]
Physical DMA
- The node processor packages messages in user/system mode.
- DMA is used to copy between the network and system buffers.
- Problem: there is no way to distinguish user from system messages, so the node processor must be involved, which costs much overhead.
[Figure: the sender programs a DMA channel with cmd, dest, data, addr, length, rdy; the receiver's DMA channel raises status/interrupt with addr, length, rdy.]
Examples: nCUBE/2, IBM SP1
User-Level Access
- The network interface is mapped into the user address space.
- The communication assist does protection checks, translation, etc.
- No kernel intervention except for interrupts.
[Figure: the sender writes dest and data directly to the user/system-mapped NI; the receiver sees status/interrupt.]
Example: CM-5
Dedicated Message Processing
The message processor (MP) does the following:
- interprets messages
- supports message operations
- off-loads P with a clean message abstraction
[Figure: each node pairs a compute processor P and a message processor MP over shared memory (M) and a network interface (NI), with a user/system split; messages carry a dest field across the network.]
Issues:
- P and MP communicate via shared memory: coherence traffic
- the MP can become a bottleneck due to all its concurrent actions
Shared Physical Address Space
- Remote reads/writes are performed by pseudo processors.
- Cache coherence issues are treated in Ch. 8.
[Figure: each node's P and M attach to the scalable network through a pseudo memory (which exports local memory to remote requests) and a pseudo processor (which issues remote accesses on the node's behalf).]