
Page 1: Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches

Eric Freudenthal and Allan Gottlieb
{freudenthal, gottlieb}@nyu.edu

Page 2: Talk Summary

• Review of combining networks
  – A MIMD architecture expected to provide high performance for hot-spot traffic and centralized coordination
• Duplicating & debunking
  – High hot-spot latency, slow centralized coordination
  – Why?
• Improvements to the architecture
  – Larger buffers, adaptive queue capacity
  – Improved hot-spot & coordination performance

Page 3: PRAM and Fetch & Add

Fetch & Add is an atomic primitive equivalent to:

    int FAA(int *v, int addend) {
        int r = *v;
        *v = r + addend;
        return r;
    }

Serialization-free, FAA-based centralized coordination:
• Queues, barriers
• Shared locks
  – Counting semaphores, reader/writer locks
  – (Not binary semaphores)
  – A single memory reference if uncontended

These algorithms generate hot-spot memory traffic, which may serialize in the memory system.
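For flavor, here is a minimal sketch of a counting semaphore in this style. It is an illustration of the approach, not the exact Ultracomputer algorithm (which adds a polling loop to limit retry traffic); note that an uncontended acquire is a single FAA:

    /* Sketch of an FAA-based counting semaphore (illustrative). */
    int FAA(int *v, int addend);   /* the primitive defined above */

    void sem_wait(int *s) {
        while (FAA(s, -1) <= 0)    /* claim a unit: one memory ref if uncontended */
            FAA(s, 1);             /* no unit was available: undo and retry */
    }

    void sem_signal(int *s) {
        FAA(s, 1);                 /* release a unit */
    }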

[Figure: the idealized PRAM model. PE0 … PEn connect to an idealized multi-port shared memory.]

The NYU Ultracomputer approximates this model.

Page 4: A 2^3-PE Computer with an Omega Network

[Figure: eight processing elements (PE0–PE7) connected through three stages of 2×2 switches (SW) to eight memory modules (MM0–MM7).]

Routing: each stage consumes one destination-address bit (2^2, then 2^1, then 2^0), as sketched below.

NUMA connections:
• “Dance hall”: all processors equally distant from all memory modules.
• “Boudoir”: processors and memory modules can be co-resident.
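To make the routing rule concrete, here is a small hedged sketch (a hypothetical helper, not code from the talk) of bit-controlled routing through the omega network:

    /* Sketch: in a log2(N)-stage omega network, the switch at stage s
       (counting from the PE side) selects its output port using
       destination-address bit log2(N)-1-s. */
    int output_port(int dest_mm, int stage, int log_n) {
        return (dest_mm >> (log_n - 1 - stage)) & 1;  /* 0 = upper, 1 = lower */
    }

    /* Example: with log_n = 3, a message for MM5 (binary 101) exits
       port 1 at stage 0, port 0 at stage 1, and port 1 at stage 2. */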

Page 5: Network Congestion Due to Polling of a Hot Spot in MM3

• Each PE has a single outstanding reference to the same variable
  – Low offered load
• If switches simply route messages
  – Polling requests serialize at MM3
  – Switch queues in the “funnel” near MM3 fill
• If switches combine references to the same variable
  – A single MM operation satisfies multiple requests
  – Lower network congestion & access latency

[Figure: the 8-PE omega network again, with the funnel of switches leading to MM3 highlighted.]

Page 6: Combining of Fetch & Add

Start: X = 0. Four PEs issue FAA(X,1), FAA(X,2), FAA(X,4), and FAA(X,8).

[Figure: FAA(X,1) and FAA(X,2) combine at one switch into FAA(X,3); FAA(X,4) and FAA(X,8) combine at another into FAA(X,12); at the next stage these combine into FAA(X,15), which reaches the MM. Each combining switch saves the addend needed for decombining in its wait buffer; the labels 1U, 4U, and 12L give the saved addend and port (e.g. upper port first, its addend = 4). The MM returns the old value X = 0 and sets X = 15. On the return path, each switch answers one waiting request with the value it received and the other with that value plus the saved addend, producing the results 0, 4, 12, and 13.]

End: X = 15. The semantics are equivalent to some serialization of the four FAAs.
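The bookkeeping behind this example is small; below is an illustrative C sketch (the names and the single-entry wait buffer are simplifications, not the actual switch design) of the combine and decombine steps:

    /* Illustrative sketch of FAA combining at a 2x2 switch. */
    typedef struct { int addr; int addend; } Req;

    /* Forward path: merge two requests to the same address into one,
       saving the first request's addend in the wait buffer
       (e.g. the "4U" entry in the figure). */
    Req combine(Req first, Req second, int *wait_buffer) {
        *wait_buffer = first.addend;
        Req forwarded = { first.addr, first.addend + second.addend };
        return forwarded;
    }

    /* Return path: the memory's reply r (the old value of X) satisfies
       both originals, as if first executed immediately before second. */
    void decombine(int r, int wait_buffer, int *reply_first, int *reply_second) {
        *reply_first  = r;                 /* first sees the old value */
        *reply_second = r + wait_buffer;   /* second sees it after first's addend */
    }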

Page 7: NYU Combining Queue Design

Background: the Guibas & Liang systolic FIFO.

[Figure: a Guibas & Liang systolic FIFO (in → out), and the Ultracomputer combining queue: a chain of cells, each with an in/out path, a “chute,” and an ALU.]

No associative memory is required.

Page 8: Prior Results

• The architecture is reasonable and well motivated
  – Switches not prohibitively expensive
  – Serialization-free coordination algorithms
• Queues in switches permit high bandwidth
  – Low latency for random & mixed hot-spot traffic
• Hot-spot congestion remains problematic
  – Queues in switches near hot memory fill
  – High latency for all overlapping traffic
  – Ultra-3 flow control believed helpful

Page 9: Rest of This Talk

• Duplication of old results:
  – Low average latency for low hot-spot fractions
  – High latency for hot-spot polling
• New results
  – Debunking: high latency despite Ultra-3 flow control
    • Distributed synchronization algorithms superior to centralized ones
  – Deconstructing: understanding the high latency
    • Reduced combining due to wait-buffer exhaustion
    • Queuing delays in the network; reduced queue capacity helps
  – Debugging: improvements to the combining switches
    • Larger wait buffer needed
    • Adaptive reduction of queue capacity when combining occurs
  – Duplication: centralized algorithms competitive
    • Much superior for concurrent-access locks

Page 10: Ultra III “Baseline” Switches: Memory Latency, One Request per PE

[Plot: memory latency vs. accepted load for hot-spot fractions of 0–10%, 40%, and 100%, plus a 100% hot-spot no-combining curve and the ideal. Annotations mark latencies of roughly 2× and 4× ideal.]

Page 11: Two “Fixes” to the Ultra III Switch Design

• Problem: full wait buffers reduce combining
  – Ultra III flow control:
    • A full wait buffer blocks the to-MM input ports
    • Switches in the funnel are starved of “combinable” traffic
  – “Sufficient” capacity → 45% latency reduction
• Problem: congestion in the “combining funnel”
  – Combined messages fill queues near the MMs
  – Closed system: |PEs| messages, most of them near the MMs
  – Few messages in the other switches; combining unlikely
  – Shortened queues → backpressure
    • Lower per-stage queuing delays, more combining
    • Reduces latency another 30%; centralized algorithms become competitive

Page 12: Design Tension I: “Best” Queue Length

• Problem
  – Non-hot-spot latency benefits from large queues
  – Hot-spot combining benefits from small queues
• Solution (sketched after this list)
  – Detect switches engaged in combining
    • Multiple combined messages awaiting transmission
  – Adaptively reduce the capacity of these switches
    • Other switches unaffected
• Results
  – Reduced polling latency, good non-polling latency
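A minimal sketch of the detect-and-reduce rule (the constants and the detection threshold are illustrative, not the simulated design):

    /* Sketch: adaptively shrink a switch's effective queue capacity
       when it appears to be combining. */
    #define FULL_CAPACITY    16   /* illustrative value */
    #define REDUCED_CAPACITY  4   /* illustrative value */

    int effective_capacity(int combined_msgs_waiting) {
        /* Multiple combined messages awaiting transmission suggest this
           switch sits in a combining funnel. A shorter queue exerts
           backpressure that keeps messages upstream, where they have
           more chances to combine. */
        return (combined_msgs_waiting > 1) ? REDUCED_CAPACITY : FULL_CAPACITY;
    }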

Page 13: Other Combining-Switch Design Tensions

Single-input queues are simpler:
• With a dual-input queue it is hard to double the clock rate
• A dual-input combining queue can be built from two single-input queues
• But messages from different ports are then ineligible for combining

Decoupled ALUs:
• Decoupling allows faster clock speeds: max(transmission, ALU) delay rather than their sum
• But the head item cannot combine, so three enqueued messages are required for combining

[Figure: queue organizations: two single-input queues with per-queue ALUs feeding a mux, vs. a dual-input queue with decoupled ALUs.]

Page 14: Memory Latency, 1024-PE Systems, over a Range of Accepted Load

• Baseline Ultra III switch
  – Limited wait buffer
  – Fixed queue size
• Waitbuf100
  – Baseline, plus a sufficient wait buffer
• Improved
  – Waitbuf100, plus adaptive queue length
• Aggressive
  – Improved, plus combining from both ports and on the first slice
  – Assumes no reduction in clock rate (optimistic)

[Plots: latency vs. accepted load under uniform, 20% hot-spot, and 100% hot-spot traffic.]

Page 15: Mellor-Crummey & Scott (MCS): Local-Spin Coordination

• No hot-spot polling (sketched after this list)
  – Each PE spins on a distinct shared variable in a co-located MM
  – Other parts of the algorithm may generate hot-spot traffic
• Serialization-free barriers
  – Barrier satisfaction is “disseminated” without generating hot-spot traffic
  – Each processor has log2(N) rendezvous
• Locks: global state in hot-spot variables
  – Heads of linked lists (blocked requestors)
  – Count of readers
  – These hot-spot accesses benefit from combining
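The local-spin idea itself fits in a few lines; the sketch below illustrates only the principle (MCS's real lock and barrier algorithms add queueing and hand-off logic on top of it):

    /* Sketch: each PE waits on a flag placed in its own co-located MM,
       so the polling generates no network traffic. */
    typedef struct { volatile int go; } LocalFlag;  /* one per PE, in its local MM */

    void local_spin(LocalFlag *mine) {
        while (!mine->go)   /* every polling reference is local */
            ;
        mine->go = 0;       /* reset for the next use */
    }

    void release(LocalFlag *theirs) {
        theirs->go = 1;     /* the only remote reference: one write */
    }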

Page 16: Synchronization: Barriers (the MCS barrier is also serialization-free)

• IntenseLoop:
  – barrier
• RealisticLoop:
  – Reference 15 or 30 shared variables
  – barrier

[Plot: barrier rates for the two loops; higher is faster.]
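For contrast, the centralized competitor is FAA-based. Below is a minimal sketch of a centralized barrier in that style (the sense-reversing structure and all names are illustrative, not the paper's exact code); every PE increments and then polls shared variables, which is exactly the hot-spot traffic that benefits from combining:

    /* Sketch of a centralized FAA barrier (illustrative). */
    int FAA(volatile int *v, int addend);  /* the hardware primitive */

    volatile int arrived = 0;      /* hot-spot counter */
    volatile int generation = 0;   /* bumped when the barrier opens */

    void barrier(int n_pes) {
        int my_gen = generation;
        if (FAA(&arrived, 1) == n_pes - 1) {  /* last PE to arrive */
            arrived = 0;                      /* reset for reuse */
            FAA(&generation, 1);              /* release the waiters */
        } else {
            while (generation == my_gen)      /* hot-spot polling, combinable */
                ;
        }
    }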

Page 17: Reader-Writer Experiment

• Loop (sketched after this list):
  – Determine whether to act as a reader or a writer
  – Sleep for 100 cycles
  – Lock
  – Reference 10 shared variables
  – Unlock
• Reader-writer mix
  – All readers, all writers
  – 1 or 10 expected writers (P(writer) = 1/N gives one expected writer)
• Plots on the next slides
  – Rate at which reader and writer locks are granted (unit: grants per kilocycle)
  – Greater values indicate greater progress
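A hedged reconstruction of the per-PE loop (every name here, such as rw_read_lock and coin_flip, is a hypothetical stand-in for the simulator's code):

    /* Illustrative reconstruction of the experiment's per-PE loop. */
    extern double p_writer;                    /* e.g. 1/N for one expected writer */
    extern int    coin_flip(double p);         /* returns 1 with probability p */
    extern void   delay_cycles(int n);
    extern void   reference_shared_vars(int n);
    extern void   rw_read_lock(void),  rw_read_unlock(void);
    extern void   rw_write_lock(void), rw_write_unlock(void);

    void experiment_loop(void) {
        for (;;) {
            int is_writer = coin_flip(p_writer);
            delay_cycles(100);                 /* sleep for 100 cycles */
            if (is_writer) {
                rw_write_lock();
                reference_shared_vars(10);     /* reference 10 shared variables */
                rw_write_unlock();
            } else {
                rw_read_lock();
                reference_shared_vars(10);
                rw_read_unlock();
            }
        }
    }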

Page 18: All Readers / All Writers

• All readers
  – Combining helps MCS
  – The serialization-free FAA algorithm is faster
• All writers
  – Essentially a semaphore
  – Little network traffic
  – MCS is fastest under contention

[Plots: lock-grant rates; higher is faster.]

Page 19: One Expected Writer

• Rate at which readers proceed
  – FAA is faster
  – MCS benefits from combining
• Rate at which writers proceed
  – FAA is faster on the best architecture
  – MCS benefits from combining

[Plots: grant rates; higher is faster.]

Page 20: Conclusions

• Architecture
  – Large wait buffers decrease hot-spot latency
  – Adaptive queue capacity decreases latency further
    • A general technique?
• Performance of centralized algorithms
  – Centralized reader/writer locks are competitive with the MCS alternative
    • Much superior when readers dominate
    • Require combining
  – Centralized barrier
    • Almost as fast as distributed with the “improved Ultra-3” switches
    • Faster than distributed with the “new” switch design
    • Benefits diminish as superstep size increases

Page 21: Relevance & Future Work

• Large shared-memory systems are manufactured
  – Combining is possible on all topologies
    • Return messages must be routed to the combining sites
  – Combining demonstrated to be useful for inter-process coordination
• Application of adaptive queue-capacity modulation to other domains
  – Such as responding to flash-flood & DoS traffic
• An analytic model of queuing delays for hot-spot combining is under development