Node-to-Network Interface in Scalable Multiprocessors
CS 258, Spring 99
David E. Culler
Computer Science Division
U.C. Berkeley
Recap: Common Challenges
• Input buffer overflow
– N-1 queue over-commitment => must slow sources
– reserve space per source (credit; sketched below)
» when available for reuse? Ack or higher level
– refuse input when full
» backpressure in reliable network
» tree saturation
» deadlock free
» what happens to traffic not bound for the congested dest?
– reserve ack back channel
– drop packets
– utilize higher-level semantics of the programming model
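A minimal sketch of the per-source credit scheme above, assuming a fixed-size input queue at each destination partitioned among senders; all names (credits[], raw_send, on_credit_return) are illustrative, not taken from any particular machine.

/* Per-source credit flow control: a sender holds credits equal to its
 * reserved slots in the destination's input queue and spends one per
 * message; the destination returns a credit (ack) when a slot frees. */
#include <stdbool.h>

#define MAX_NODES        64
#define CREDITS_PER_DEST  4   /* slots reserved for this source at each dest */

struct msg { int payload; };

extern void raw_send(int dest, const struct msg *m);  /* assumed injection primitive */

static int credits[MAX_NODES];

void credits_init(void) {
    for (int i = 0; i < MAX_NODES; i++)
        credits[i] = CREDITS_PER_DEST;
}

bool try_send(int dest, const struct msg *m) {
    if (credits[dest] == 0)
        return false;          /* no reserved space left: source must slow down */
    credits[dest]--;
    raw_send(dest, m);
    return true;
}

void on_credit_return(int dest) {  /* destination acknowledged reuse of a slot */
    credits[dest]++;
}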
Recap: Challenges (cont)
• Fetch deadlock
– for the network to remain deadlock free, nodes must continue accepting messages, even when they cannot source messages
– what if the incoming transaction is a request?
» each may generate a response, which cannot be sent!
» what happens when internal buffering is full?
• logically independent request/reply networks
– physical networks
– virtual channels with separate input/output queues
• bound requests and reserve input buffer space (sketched below)
– K(P-1) requests + K responses per node
– service discipline to avoid fetch deadlock?
• NACK on input buffer full
– NACK delivery?
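A hedged sketch of the "bound requests and reserve input buffer space" rule: with at most K outstanding requests per node and P nodes, K(P-1) request slots plus K response slots suffice, and draining replies first gives a service discipline that avoids fetch deadlock. All types and primitives (pop, send_request, send_reply, complete_request) are illustrative.

/* Bounded request/reply buffering to avoid fetch deadlock (sketch). */
#include <stdbool.h>

#define P 64   /* nodes */
#define K  4   /* max outstanding requests per node */

struct txn { int src, addr; };

/* assumed queue and transport primitives */
extern bool pop(struct txn *q, struct txn *out);
extern void send_request(int dest, int addr);
extern void send_reply(int dest, const struct txn *t);
extern void complete_request(const struct txn *t);

static struct txn req_q[K * (P - 1)];  /* room for K requests from each other node */
static struct txn rsp_q[K];            /* room for replies to our own K requests   */
static int outstanding;                /* our requests still awaiting replies      */

void service(void) {
    struct txn t;
    /* Replies first: they retire outstanding requests and never
     * generate further traffic, so they always make progress. */
    while (pop(rsp_q, &t)) { complete_request(&t); outstanding--; }
    /* Each request may generate a reply, so handle requests only
     * after the reply path has been drained. */
    while (pop(req_q, &t)) send_reply(t.src, &t);
}

bool issue_request(int dest, int addr) {
    if (outstanding >= K) return false;  /* throttle: keeps rsp_q bounded */
    outstanding++;
    send_request(dest, addr);
    return true;
}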
Network Transaction Processing
• Key design issues:
– How much interpretation of the message?
– How much dedicated processing in the Comm. Assist?
[Figure: processor/memory (PM) nodes attach to a scalable network through a communication assist (CA), which handles the message. Output processing: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]
Spectrum of Designs
• None: physical bit stream
– blind, physical DMA: nCUBE, iPSC, . . .
• User/System
– user-level port: CM-5, *T
– user-level handler: J-Machine, Monsoon, . . .
• Remote virtual address
– processing, translation: Paragon, Meiko CS-2
• Global physical address
– proc + memory controller: RP3, BBN, T3D
• Cache-to-cache
– cache controller: Dash, KSR, Flash
Increasing HW support, specialization, intrusiveness, performance (???)
Net Transactions: Physical DMA
• DMA controlled by regs, generates interrupts
• Physical addresses => OS initiates transfers
• Send side
– construct system "envelope" around user data in kernel area (see the sketch below)
• Receive side
– must receive into system buffer, since there is no interpretation in the CA
[Figure: physical-DMA interface. The sending node's processor/memory writes a command (dest, data addr, length) and a ready flag to its DMA channel registers; the receiving node's DMA channels have their own addr/length/ready registers plus status/interrupt, and check sender authentication and destination address.]
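A hedged sketch of the send side described above: the OS copies user data into a kernel buffer behind a system envelope and then programs the DMA registers. The register names, envelope layout, and my_node() helper are all illustrative, not any real machine's interface.

/* Physical-DMA send sketch: the OS builds a system envelope around the
 * user data in a kernel buffer, then kicks off the transfer by writing
 * hypothetical memory-mapped DMA registers. */
#include <stdint.h>
#include <string.h>

struct envelope {                 /* illustrative system header */
    uint32_t src_node, dest_node, length, type;
};

extern int my_node(void);                                            /* assumed helper */
extern volatile uint32_t *DMA_DEST, *DMA_ADDR, *DMA_LEN, *DMA_RDY;   /* assumed regs   */

void sys_send(uint32_t dest, const void *user_buf, uint32_t len) {
    static uint8_t kbuf[4096];                         /* kernel staging buffer      */
    struct envelope *env = (struct envelope *)kbuf;

    env->src_node  = my_node();
    env->dest_node = dest;
    env->length    = len;
    env->type      = 0;
    memcpy(kbuf + sizeof *env, user_buf, len);         /* user data behind envelope  */

    *DMA_DEST = dest;
    *DMA_ADDR = (uint32_t)(uintptr_t)kbuf;             /* physical address assumed   */
    *DMA_LEN  = sizeof *env + len;
    *DMA_RDY  = 1;                                     /* go; interrupt on completion */
}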
nCUBE Network Interface
• independent DMA channel per link direction
– leave input buffers always open
– segmented messages
• routing interprets envelope
– dimension-order routing on hypercube
– bit-serial with 36-bit cut-through
[Figure: nCUBE node: processor and memory on the memory bus; per-link DMA channels (addr, length registers) feed the input and output ports of the router switch.]
Os: 16 instructions, 260 cycles, 13 µs.  Or: 18 instructions, 200 cycles, 15 µs (includes interrupt).
Conventional LAN NI
[Figure: conventional LAN NI. Host memory holds chains of buffer descriptors (addr, len, status, next) and the data they point to; the NIC controller (DMA addr, len, transmit/receive) sits on the I/O bus, which a bus adapter bridges to the memory bus and processor.]
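The addr/len/status/next entries in the figure are buffer descriptors. A hedged sketch of such a descriptor and the host-side receive path follows; field names and the ownership encoding are illustrative, not any particular NIC's layout.

/* Buffer-descriptor ring sketch. */
#include <stdint.h>

#define OWN_NIC  1u   /* descriptor owned by the NIC (awaiting a packet)  */
#define OWN_HOST 0u   /* descriptor owned by the host (packet delivered)  */

struct rx_desc {
    uint32_t addr;             /* physical address of the data buffer */
    uint16_t len;              /* buffer length / received length     */
    uint16_t status;           /* ownership and error bits            */
    struct rx_desc *next;      /* next descriptor in the ring         */
};

extern void deliver_packet(void *buf, uint16_t len);   /* assumed upcall */

/* Host receive path: consume every descriptor the NIC has handed back,
 * then return it to the NIC for reuse. */
void poll_rx(struct rx_desc *cur) {
    while ((cur->status & 1u) == OWN_HOST) {
        deliver_packet((void *)(uintptr_t)cur->addr, cur->len);
        cur->status = OWN_NIC;
        cur = cur->next;
    }
}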
User Level Ports
• initiate transaction at user level
• deliver to user without OS intervention
• network port in user space
• User/system flag in envelope
– protection check, translation, routing, media access in src CA
– user/sys check in dest CA, interrupt on system
[Figure: user-level port. The source processor writes dest and data straight into its network output port; the destination processor sees status and an interrupt, with a user/system flag carried in the envelope.]
User Level Network ports
• Appears to user as logical message queues plus status
• What happens if no user pop?
[Figure: the network output port and input port appear as queues in the user virtual address space, alongside the processor's program counter, registers, and status.]
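A hedged sketch of send and receive against such user-level ports, assuming the FIFOs and a status word are memory-mapped into the user address space; the register names, addresses, and status-bit layout are illustrative.

/* User-level network port sketch: sends and receives are ordinary
 * loads and stores on mapped locations, with no OS involvement. */
#include <stdint.h>
#include <stdbool.h>

extern volatile uint32_t *NET_OUT;      /* mapped output-port data register  */
extern volatile uint32_t *NET_IN;       /* mapped input-port data register   */
extern volatile uint32_t *NET_STATUS;   /* mapped status register            */
#define OUT_FULL  0x1u
#define IN_EMPTY  0x2u

bool user_send(uint32_t dest, const uint32_t *words, int n) {
    if (*NET_STATUS & OUT_FULL) return false;     /* caller retries later     */
    *NET_OUT = dest;                              /* envelope: destination    */
    for (int i = 0; i < n; i++) *NET_OUT = words[i];
    return true;
}

bool user_recv(uint32_t *word) {
    if (*NET_STATUS & IN_EMPTY) return false;     /* no user pop: message waits */
    *word = *NET_IN;
    return true;
}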
Example: CM-5
• Input and output FIFO for each network
• 2 data networks
• tag per message
– index NI mapping table
• context switching?
• *T integrated NI on chip
• iWARP also
[Figure: CM-5 organization. Control processors, processing partitions, and an I/O partition are connected by the data network, control network, and diagnostics network. Each node: SPARC processor and FPU on the MBUS, cache controller with SRAM, NI to the data and control networks, DRAM controllers (optionally with vector units) driving DRAM.]
Os: 50 cycles (1.5 µs).  Or: 53 cycles (1.6 µs).  Interrupt: 10 µs.
User Level Handlers
• Hardware support to vector to address specified in message
– message ports in registers
[Figure: user-level handler. The source writes dest, data, and an address into its port; message ports appear in registers at the destination, with a user/system flag.]
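A hedged sketch of vectoring to the address carried in the message (active-message style). The message layout and the recv_word() primitive are illustrative, not the interface of any of the machines named on the next slides.

/* User-level handler dispatch sketch: the first word of each message
 * names the handler to run on arrival. */
#include <stdint.h>

typedef void (*handler_t)(uint32_t arg0, uint32_t arg1);

extern int recv_word(uint32_t *w);       /* assumed: pop one word from the input port */

void dispatch_loop(void) {
    uint32_t h, a0, a1;
    for (;;) {
        if (!recv_word(&h))
            continue;                    /* nothing arrived yet                   */
        recv_word(&a0);                  /* remaining words of the message follow */
        recv_word(&a1);
        ((handler_t)(uintptr_t)h)(a0, a1);  /* vector straight to the named handler */
    }
}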
J-Machine: Msg-Driven Processor
• Each node a small msg driven processor
• HW support to queue msgs and dispatch to msg handler task
Monsoon: Explicit Token-Store
*T: Network Co-Processor
iWARP: Systolic Computation
• Nodes integrate communication with computation on systolic basis
• Msg data direct to register
• Stream into memory
[Figure: iWARP host and interface unit.]
Dedicated processing without dedicated hardware design
Dedicated Message Processor
• General Purpose processor performs arbitrary output processing (at system level)
• General Purpose processor interprets incoming network transactions (at system level)
• User Processor <–> Msg Processor share memory
• Msg Processor <–> Msg Processor via system network transaction
[Figure: each node pairs a compute processor (P) with a message processor (MP) sharing memory; the MP drives the network interface (NI), splitting work between user and system levels.]
Levels of Network Transaction
• User processor stores cmd / msg / data into shared output queue
– must still check for output queue full (or make elastic); see the sketch below
• Communication assists make the transaction happen
– checking, translation, scheduling, transport, interpretation
• Effect observed on destination address space and/or events
• Protocol divided between two layers
[Figure: same node organization as above: compute processor and message processor sharing memory, with the MP attached to the NI.]
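A hedged sketch of the shared output queue between the user processor (producer) and the message processor (consumer): a single-producer, single-consumer ring in shared memory with the "must check for full" step. The layout and names are illustrative, not any machine's actual queue format.

/* Shared output queue sketch. */
#include <stdbool.h>
#include <stdint.h>

#define OQ_SLOTS 128

struct oq_entry { uint32_t cmd, dest, len; uint32_t data[8]; };

struct out_queue {
    volatile uint32_t head;      /* advanced by the message processor */
    volatile uint32_t tail;      /* advanced by the user processor    */
    struct oq_entry slot[OQ_SLOTS];
};

/* User-processor side: post an entry, failing if the queue is full. */
bool oq_post(struct out_queue *q, const struct oq_entry *e) {
    uint32_t next = (q->tail + 1) % OQ_SLOTS;
    if (next == q->head)
        return false;            /* queue full: caller backs off or blocks */
    q->slot[q->tail] = *e;
    q->tail = next;              /* publish; the MP picks it up and sends  */
    return true;
}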
Example: Intel Paragon
[Figure: Intel Paragon node. Two i860XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI), one acting as message processor, share memory and the NI over a 64-bit, 400 MB/s memory bus; send (sDMA) and receive (rDMA) engines; 175 MB/s duplex 16-bit network links carry packets of route, variable data, and EOP into the MP handler; 2048 B transfer unit; separate I/O and service nodes attach devices.]
User Level Abstraction (Lok Liu)
• Any user process can post a transaction for any other in protection domain
– communication layer moves OQsrc –> IQdest
– may involve indirection: VASsrc –> VASdest
[Figure: each process has an output queue (OQ), input queue (IQ), and virtual address space (VAS); the communication layer moves entries from the source OQ to the destination IQ, possibly indirecting through the VASs.]
Msg Processor Events
[Figure: message processor event dispatcher. Event sources include the user output queues, send FIFO ~empty, receive FIFO ~full, send DMA, receive DMA, DMA done, and system events from the compute processor / kernel.]
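A hedged sketch of the message processor's dispatch loop implied by the figure: poll each event source and invoke its handler, keeping the inbound path moving ahead of the outbound one. Every predicate and handler name here is illustrative.

/* Message-processor dispatch loop sketch. */
extern int user_oq_nonempty(void), send_fifo_room(void), rcv_fifo_nonempty(void),
           send_dma_done(void), rcv_dma_done(void), system_event_pending(void);
extern void push_to_net(void), pull_from_net(void), finish_send_dma(void),
            finish_rcv_dma(void), handle_system_event(void);

void mp_dispatch(void) {
    for (;;) {
        if (rcv_fifo_nonempty())     pull_from_net();   /* keep inbound flow moving  */
        if (user_oq_nonempty() && send_fifo_room())
                                     push_to_net();     /* outbound only if room     */
        if (send_dma_done())         finish_send_dma();
        if (rcv_dma_done())          finish_rcv_dma();
        if (system_event_pending())  handle_system_event();
    }
}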
Basic Implementation Costs: Scalar
• Cache-to-cache transfer (two 32 B lines, quad-word ops)
– producer: read(miss,S), chk, write(S,WT), write(I,WT), write(S,WT)
– consumer: read(miss,S), chk, read(H), read(miss,S), read(H), write(S,WT)
• to NI FIFO: read status, chk, write, . . .
• from NI FIFO: read status, chk, dispatch, read, read, . . .
[Figure: scalar transfer path: compute processor (CP) -> user OQ -> message processor (MP) registers/cache -> net FIFO -> network -> destination MP -> user IQ -> CP. Annotations: MP / CP / Net, 2 / 1.5 / 2, 7 words, 4.4 µs, 5.4 µs, 10.5 µs, 250 ns + H*40 ns.]
Virtual DMA -> Virtual DMA
• Send MP segments the transfer into 8 KB pages and does VA –> PA (see the sketch below)
• Recv MP reassembles, does dispatch and VA –> PA per page
[Figure: virtual-DMA transfer path: as above, but send DMA (sDMA) and receive DMA (rDMA) engines move 2048 B blocks between memory and the net FIFOs; memory buses at 400 MB/s, network link at 175 MB/s.]
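A hedged sketch of the send-side segmentation into page-sized DMA operations with per-page translation. The page size constant, translate(), and start_sdma() are illustrative stand-ins, not the Paragon's actual interfaces.

/* Send MP: segment a virtual transfer into per-page DMA operations. */
#include <stdint.h>
#include <stddef.h>

#define PAGE 8192u

extern uint64_t translate(uintptr_t va);                    /* assumed VA -> PA lookup */
extern void start_sdma(uint64_t pa, size_t len, int dest);  /* assumed DMA kickoff     */

void send_virtual(uintptr_t va, size_t len, int dest) {
    while (len > 0) {
        size_t in_page = PAGE - (va & (PAGE - 1));      /* bytes left in this page  */
        size_t chunk = len < in_page ? len : in_page;
        start_sdma(translate(va), chunk, dest);         /* one DMA per page segment */
        va += chunk;
        len -= chunk;
    }
}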
Single Page Transfer Rate
[Plot: bandwidth (MB/s, 0-400) vs. transfer size (0-8000 B); curves for total MB/s and burst MB/s. Actual buffer size: 2048; effective buffer size: 3232.]
Msg Processor Assessment
• Concurrency intensive
– need to keep inbound flows moving while outbound flows stalled
– Large transfers segmented
• Reduces overhead but adds latency
[Figure: message processor dispatcher events as above, plus the user input queues and VAS.]
Case Study: Meiko CS2 Concept
• Circuit-switched network transaction
– source-dest circuit held open for request-response
– limited cmd set executed directly on NI
• Dedicated communication processor for each step in flow
[Figure: Meiko CS-2 concept. Each node's processor and memory reach the network through dedicated engines Pcmd, Pout, Pin, Preply, and Pevent, with a VP unit in the path.]
Case Study: Meiko CS2 Organization
[Figure: Meiko CS-2 communication processor organization. The memory interface accepts a SWAP: CMD, Addr from the processor; Pcmd can respond by raising an interrupt, running a thread, or starting a DMA. Pthread: RISC instruction set, 64 K nonpreemptive threads, constructs arbitrary net transactions, runs the output protocol. PDMA: walks DMA descriptors and user data in memory, issuing write_block transactions (50 µs limit). Preply: generates set-event and 3 x write_word. Pinput handles arriving transactions. Output control executes net transactions: requests from Pthread, write_blocks from PDMA, set-event and write_word from Preply.]
Shared Physical Address Space
• NI emulates memory controller at source
• NI emulates processor at dest (see the sketch below)
– must be deadlock free
[Figure: shared physical address space. A load (Ld R, Addr) passes the cache and MMU; the pseudo-memory at the source turns it into a read transaction (Dest, Read, Addr, Src, Tag) on the scalable network; the pseudo-processor at the destination parses it, performs the memory access through its MMU and memory, and returns a response (Data, Tag, Rrsp, Src) that completes the read at the source's communication assist.]
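A hedged sketch of the destination side of such a remote read: the NI's pseudo-processor parses the request, performs the memory access, and sends the response. The transaction layout, phys_read64(), and net_send() are illustrative.

/* Destination-side handling of a remote read (sketch). */
#include <stdint.h>

struct read_req { uint32_t dest, addr, src, tag; };
struct read_rsp { uint32_t src, tag; uint64_t data; };

extern uint64_t phys_read64(uint32_t addr);              /* assumed local memory access */
extern void net_send(uint32_t node, const void *t, int len);

void pseudo_processor_handle(const struct read_req *rq) {
    struct read_rsp rsp;
    rsp.src  = rq->dest;               /* responder identifies itself          */
    rsp.tag  = rq->tag;                /* tag matches the response to the load */
    rsp.data = phys_read64(rq->addr);  /* perform the memory access            */
    net_send(rq->src, &rsp, sizeof rsp);  /* reply must always be deliverable  */
}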
Case Study: Cray T3D
• Build up info in ‘shell’
• Remote memory operations encoded in address (see the sketch below)
[Figure: Cray T3D node. 150 MHz DEC Alpha (64-bit), 8 KB instruction + 8 KB data caches, 43-bit virtual address, prefetch, load-locked/store-conditional; 32-bit physical address extended through the DTB with PE# + function code; prefetch queue (16 x 64), message queue (4,080 x 4 x 64), special registers (swaperand, fetch&add, barrier); DMA / block-transfer engine; 32- and 64-bit memory and byte operations, nonblocking stores and memory barrier; request and response paths into a 3D torus of pairs of PEs that share the network and BLT, up to 2,048 PEs with 64 MB each.]
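A hedged sketch of encoding a remote memory operation in the address: high bits select the destination PE and a function code, low bits give the offset in that PE's memory, and an ordinary store to the constructed address becomes a remote write that the shell routes. The bit layout and function codes here are illustrative, not the T3D's actual DTB-annex encoding.

/* Remote operation encoded in the address (sketch). */
#include <stdint.h>

#define PE_SHIFT    32            /* illustrative split of the address       */
#define FC_SHIFT    44            /* function code field                     */
#define FC_PLAIN     0            /* plain read/write                        */
#define FC_FETCHADD  1            /* e.g. fetch&add (illustrative code)      */

static inline uint64_t remote_addr(unsigned pe, unsigned fc, uint64_t offset) {
    return ((uint64_t)fc << FC_SHIFT) | ((uint64_t)pe << PE_SHIFT) | offset;
}

/* A plain store to the constructed address becomes a remote write;
 * the shell (NI) is assumed to route it to the named PE. */
static inline void remote_store64(unsigned pe, uint64_t offset, uint64_t val) {
    *(volatile uint64_t *)(uintptr_t)remote_addr(pe, FC_PLAIN, offset) = val;
}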
Case Study: NOW
• General purpose processor embedded in NIC
[Figure: NOW node. UltraSPARC host with L2 cache and memory, bus adapter to the 25 MHz SBUS; Myricom LANai NIC (37.5 MHz processor, 256 KB SRAM, 3 DMA units: host DMA, send DMA, receive DMA) with bus and link interfaces; 160 MB/s bidirectional links into eight-port wormhole Myrinet crossbar switches.]
Message Time Breakdown
• Communication pipeline
[Figure: communication pipeline. Machine resources (source processor, communication assist, network, communication assist, destination processor) are plotted against the time of the message: sending overhead (Os) on the source processor, receiving overhead (Or) on the destination, the observed network latency (L) in between, and the total communication latency spanning the whole path.]
Message Time Comparison
[Chart: message time comparison (microseconds, 0-14) for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D. One set of bars breaks per-message time into processing overhead on the sending side (Os), processing overhead on the receiving side (Or), and communication latency (L); the other shows the time per message for a pipelined sequence of request-response operations (g).]
SAS Time Comparison
[Chart: SAS time comparison (microseconds, 0-25) for CM-5, Paragon, Meiko CS-2, NOW Ultra, and T3D; bars break down into issue, latency, and gap.]
Message-Passing Time vs Size
[Plot: message-passing time (µs, log scale to 1,000,000) vs. message size (bytes, 1 to 1,000,000) for iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos*, Cray T3D, SGI Challenge, NOW, and Sun E5000. *The Sunmos operating system is used for the benchmark.]
Message-Passing Bandwidth vs Size
[Plot: message-passing bandwidth (MB/s, 0-180) vs. message size (bytes, 1 to 1,000,000) for iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos, Cray T3D, SGI Challenge, NOW, and Sun E6000.]
Application Performance on LU
[Charts: speedup on LU-A (0-125) vs. number of processors for NOW, SP-2, T3D, and the ideal line; bar chart of MFLOPS on LU-A using four processors (0-250) for T3D, SP-2, and NOW.]
Application Performance on BT
[Charts: speedup on BT-A (0-100) vs. number of processors for NOW, SP-2, T3D, and the ideal line; bar chart of BT MFLOPS using 25 processors (0-1,400) for T3D, SP-2, and NOW.]
Message Profile on BT
[Plot: message size (KB, 0-40) vs. time (ms, 0-2,500) over the BT run.]
Reflective Memory
• Writes to local region reflected to remote (see the sketch below)
[Figure: reflective memory. Each node's virtual address space contains transmit regions (T0-T3) and receive regions (R0-R3) mapped through the physical address space and I/O; a write into a transmit region on one node is reflected into the corresponding receive region on the other nodes.]
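A hedged usage sketch: a write into a mapped transmit region is propagated by the interconnect into the matching receive region on other nodes, and the reader simply polls ordinary memory. The mapped-region names are illustrative, and the sketch assumes the interconnect reflects writes in order.

/* Reflective-memory usage sketch. */
#include <stdint.h>

extern volatile uint64_t *tx_region;   /* assumed: mapped transmit window */
extern volatile uint64_t *rx_region;   /* assumed: mapped receive window  */

void publish(uint64_t seq, uint64_t value) {
    tx_region[1] = value;              /* reflected to rx_region[1] remotely      */
    tx_region[0] = seq;                /* written last as a ready flag; assumes   */
}                                      /* writes are reflected in program order   */

int poll_latest(uint64_t expected_seq, uint64_t *value) {
    if (rx_region[0] != expected_seq)
        return 0;                      /* nothing new yet */
    *value = rx_region[1];
    return 1;
}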
Case Study: DEC Memory Channel
• See also Shrimp
[Figure: DEC Memory Channel. An AlphaServer SMP node (Alpha processor, cache, memory) connects through a bus adapter to a 33 MHz PCI adapter containing the PCT, transmit/receive control, receive DMA, bus interface, and link interface, onto the 100 MB/s Memory Channel interconnect.]