Architectural Interactions in High Performance Clusters

RTPP98 - 1

RTPP 98

David E. Culler

Computer Science Division

University of California, Berkeley

RTPP98 - 2

Run-Time Framework

[Figure: several nodes, each a stack of Parallel Program, RunTime, and Machine Architecture, connected by a Network.]

RTPP98 - 3

Two Example RunTime Layers

• Split-C: thin global address space abstraction over Active Messages

– get, put, read, write

• MPI: thicker message passing abstraction over Active Messages

– send, receive

RTPP98 - 4

Split-C over Active Messages

• Read, Write, Get, Put built on small Active Message request / reply (RPC)

• Bulk-Transfer (store & get)

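[Figure: a Request message invokes a handler on the remote node, which issues a Reply that invokes a handler on the requesting node.]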
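The request / reply pattern behind a Split-C read can be sketched as a small in-process simulation. This is only an illustration: the class names, handler names, and the synchronous delivery are assumptions made for the sketch, not the Split-C or GAM interfaces.

```python
# Hypothetical in-process simulation of an Active Message request/reply read.
# None of these names come from Split-C or the GAM library.

class Node:
    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network
        self.memory = {}      # this node's slice of the global address space
        self.pending = {}     # token -> value filled in by the reply handler
        self.next_token = 0

    # Request handler: runs on the node that owns the data.
    def read_request_handler(self, src, token, addr):
        value = self.memory[addr]
        self.network.send(src, "read_reply_handler", token, value)

    # Reply handler: runs back on the requesting node.
    def read_reply_handler(self, token, value):
        self.pending[token] = value

    def read(self, owner, addr):
        token = self.next_token
        self.next_token += 1
        self.network.send(owner, "read_request_handler",
                          self.node_id, token, addr)
        # Delivery is synchronous in this sketch, so the reply is already here.
        return self.pending.pop(token)


class Network:
    def __init__(self):
        self.nodes = {}
        self.messages_sent = 0

    def add_node(self, node_id):
        self.nodes[node_id] = Node(node_id, self)
        return self.nodes[node_id]

    def send(self, dest, handler_name, *args):
        # An active message names the handler to run on arrival.
        self.messages_sent += 1
        getattr(self.nodes[dest], handler_name)(*args)


net = Network()
n0, n1 = net.add_node(0), net.add_node(1)
n1.memory[0x10] = 42
result = n0.read(owner=1, addr=0x10)
```

One remote read costs two small messages here, one request and one reply, which is why per-message overhead matters so much for read-based programs.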

RTPP98 - 5

Model Framework: LogP

[Figure: P processor/memory (P, M) modules attached to an Interconnection Network, annotated with o (overhead), g (gap), L (latency), and a limited network volume (L/g messages in flight to a processor).]

• L: latency in sending a (small) message between modules

• o: overhead felt by the processor on sending or receiving a message

• g: gap between successive sends or receives (1/rate)

• P: number of processors

Round Trip time: 2 x ( 2o + L)
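As a worked example of the round-trip expression, the sketch below evaluates it for one set of parameters. The numeric values are made up for illustration and are not measurements from the talk; the ceiling in the capacity bound follows the usual LogP statement of the limit.

```python
import math

def one_way_time(L, o):
    """Send overhead, then network latency, then receive overhead."""
    return o + L + o

def round_trip_time(L, o):
    """Request plus reply: 2 x (2o + L)."""
    return 2 * one_way_time(L, o)

def max_in_flight(L, g):
    """LogP capacity limit: at most ceil(L/g) messages in transit to a processor."""
    return math.ceil(L / g)

# Hypothetical parameters in microseconds (illustrative only).
L_us, o_us, g_us = 5.0, 3.0, 6.0
rtt = round_trip_time(L_us, o_us)
capacity = max_in_flight(L_us, g_us)
```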

RTPP98 - 6

LogP Summary of Current Machines

Max MB/s: 38, 141, 47

[Figure: bar chart of g, L, Or, and Os (receive and send overhead) for each machine; y-axis 0 to 16 µs.]

RTPP98 - 7

Methodology

• Apparatus:

– 35 Ultra 170s (64 MB, 0.5 MB L2, Solaris 2.5)

– M2F Lanai + Myricom Net in Fat Tree variant

– GAM + Split-C

• Modify the Active Message layer to inflate L, o, g, or G independently

• Execute a diverse suite of applications and observe effect

• Evaluate against natural performance models

RTPP98 - 8

Adjusting L, o, and g (and G) in situ

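[Figure: two Host Workstations, each with an AM lib and a Lanai network interface, connected by Myrinet; annotations show where each delay is injected.]

• o: stall the Ultra on msg write (send side) and on msg read (receive side)

• g: delay the Lanai after msg injection (after each fragment for bulk transfers)

• L: defer marking a msg as valid until Rx + L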
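The latency knob can be pictured as a receive queue that hides a message until its arrival time plus the added L has passed. The sketch below uses an explicit simulated clock; the class and its methods are hypothetical and do not reflect the actual Lanai firmware or AM library code.

```python
from collections import deque

class DelayedReceiveQueue:
    """Holds received messages and exposes each one only after Rx + added L."""

    def __init__(self, added_latency_us):
        self.added_latency_us = added_latency_us
        self._queue = deque()  # (valid_at_us, payload), in arrival order

    def on_receive(self, rx_time_us, payload):
        # The message is stored immediately but is not yet marked valid.
        self._queue.append((rx_time_us + self.added_latency_us, payload))

    def poll(self, now_us):
        # The host sees a message only once its validity time has passed.
        if self._queue and self._queue[0][0] <= now_us:
            return self._queue.popleft()[1]
        return None


q = DelayedReceiveQueue(added_latency_us=50.0)
q.on_receive(rx_time_us=10.0, payload="msg-a")
early = q.poll(now_us=30.0)   # still hidden: 30 < 10 + 50
late = q.poll(now_us=60.0)    # visible: 60 >= 10 + 50
```

Because the delay is applied to visibility rather than by stalling the host or the injection path, this style of knob aims to raise L without also raising o or g.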

RTPP98 - 9

Calibration

[Figure: three calibration panels plotting measured o, g, and L (µs) against the desired value of L, O, and g respectively; x-axes 0 to 100 µs.]

RTPP98 - 10

Application Characteristics

• Message Frequency

• Write-based vs. Read-based

• Short vs. Bulk Messages

• Synchronization

• Communication Balance

RTPP98 - 11

Applications used in the Study

Program  Description           Input                          Time(16)  Time(32)  Msg Interval (µs)
radix    Int Radix Sort        16 M 32-bit keys               17.0 s    9.8 s     7.6
em3d     EM Wave Prop.         8 K nodes, deg. 10, 100 steps  101 s     44.3 s    10.2
sample   Int Sample Sort       32 M 32-bit keys               24.3 s    15.9 s    14.0
ebarnes  Hierarchical N-body   1 M bodies                     81.2 s    46.6 s    60.1
p-ray    Ray Tracer            1 M pixel image, 16 k objs     25.8 s    17.8 s    108.1
murphi   Protocol Verifier     SCI, 2 procs, 1 line, 1 mem    71.3 s    37.9 s    219.4
connect  Connected Components  4 M nodes, 2D, 30 %            2.5 s     1.45 s    282.9
radb     Bulk Radix            16 M 32-bit keys               6.5 s     3.95 s    1260.0

RTPP98 - 12

Baseline Communication

Program  µs/msg  ms/Barrier  Avg Msg/Proc  Max Msg/Proc  Reads  Bulk Msg
radix    7.6     895         2,228,364     2,229,106     0.0%   0.0%
em3d     10.2    324         9,953,384     9,974,265     0.0%   0.0%
sample   14.0    2499        1,966,199     2,319,362     0.0%   0.0%
ebarnes  60.1    233         1,351,194     1,400,601     9.6%   23.5%
p-ray    108.1   1465        216,869       353,640       47.7%  47.9%
murphi   219.4   36054       328,699       332,054       0.0%   51.2%
connect  282.9   101         8,912         9,221         34.1%  0.1%
radb     1260.0  80          5,520         5,927         0.0%   43.7%

RTPP98 - 13

Application Sensitivity to Communication Performance

RTPP98 - 14

Sensitivity to Overhead

[Figure: slowdown (0 to 60) versus overhead (0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

RTPP98 - 15

Sensitivity to gap (1/msg rate)

[Figure: slowdown (0 to 20) versus gap (0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort.]

RTPP98 - 16

Sensitivity to Latency

[Figure: slowdown (0 to 9) versus latency (0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort.]

RTPP98 - 17

Sensitivity to bulk BW (1/G)

[Figure: slowdown (0 to 2.5) versus bulk bandwidth (0 to 40 MB/s) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

RTPP98 - 18

Modeling Effects of Overhead

• Tpred = Torig + 2 × (max #msgs) × o

– request / response

– proc with most msgs limits overall time

• Why does this model under-predict?

Overhead (µs)  radix  sample  em3d  ebarnes  p-ray  connect  murphi  radb
0              1.00   1.00    1.00  1.00     1.00   1.00     1.00    1.00
1              1.03   1.00    1.14  1.02     1.03   1.02     1.00    1.01
2              1.02   1.00    1.26  1.00     1.04   1.04     1.02    1.00
4              1.02   1.00    1.50  0.90     1.07   1.07     1.04    1.00
5              1.01   0.99    1.59  0.88     1.06   1.12     1.03    1.00
10             1.01   0.99    2.01  4.83     1.21   1.30     1.09    0.99
20             1.02   0.98    2.53  N/A      1.31   1.32     1.16    0.99
50             1.03   0.98    3.22  N/A      1.50   1.60     1.30    1.02
100            1.04   0.98    3.61  N/A      1.58   1.85     1.46    1.07
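The overhead model is simple enough to evaluate directly. The sketch below plugs in the radix figures from the earlier tables (17.0 s run time and 2,229,106 maximum messages per processor). Pairing the 16-processor time with those message counts is an assumption, though it is consistent with the listed 7.6 µs message interval; reading o as the overhead added per send or receive is also an assumption. The function name is hypothetical.

```python
def predict_time_with_overhead(t_orig_s, max_msgs_per_proc, added_overhead_us):
    """Tpred = Torig + 2 x (max #msgs) x o, with o given in microseconds.

    The factor of 2 charges the overhead once for the request and once for
    the response; the busiest processor is assumed to bound the run time.
    """
    return t_orig_s + 2 * max_msgs_per_proc * added_overhead_us * 1e-6

# radix figures taken from the tables above (assumed to be the 16-processor run).
t_orig = 17.0
max_msgs = 2_229_106

predicted = predict_time_with_overhead(t_orig, max_msgs, added_overhead_us=10.0)
slowdown = predicted / t_orig
```

Under these assumptions, 10 µs of added overhead contributes roughly 44.6 s on top of the 17.0 s baseline, so a frequently communicating program like radix is predicted to slow down several-fold.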

RTPP98 - 19

Modeling Effects of gap

• Uniform communication model

Tpred = Torig, if g < I (I = average msg interval)

Tpred = Torig + m (g - I), otherwise

• Bursty communication

Tpred = Torig + m g
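The two gap models differ only in whether the application's own message spacing absorbs part of the gap. A minimal sketch follows, assuming m is the message count of the busiest processor and that g and I are in microseconds; the function names are hypothetical, and the radix inputs come from the earlier tables.

```python
def predict_uniform(t_orig_s, m, gap_us, interval_us):
    """Uniform model: the gap only costs time once it exceeds the interval I."""
    if gap_us < interval_us:
        return t_orig_s
    return t_orig_s + m * (gap_us - interval_us) * 1e-6

def predict_bursty(t_orig_s, m, gap_us):
    """Bursty model: every message pays the full gap."""
    return t_orig_s + m * gap_us * 1e-6

# radix: 17.0 s, 2,229,106 messages on the busiest processor, I = 7.6 µs.
t_orig, m, interval = 17.0, 2_229_106, 7.6

below_interval = predict_uniform(t_orig, m, gap_us=5.0, interval_us=interval)
uniform_50 = predict_uniform(t_orig, m, gap_us=50.0, interval_us=interval)
bursty_50 = predict_bursty(t_orig, m, gap_us=50.0)
```

The bursty prediction is always at least as large as the uniform one, so the pair can be read as rough lower and upper estimates for a given gap.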

RTPP98 - 20

Extrapolating to Low Overhead

[Figure: slowdown (0 to 5) versus overhead (0 to 15) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

RTPP98 - 21

MPI over AM: ping-pong bandwidth

[Figure: bandwidth (0 to 70 MB/s) versus message size (10 to 1,000,000 bytes) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D.]

RTPP98 - 22

MPI over AM: start-up

[Figure: bar chart of start-up cost (0 to 90 microseconds) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D.]

RTPP98 - 23

NPB2 Speedup: NOW vs SP2

RTPP98 - 24

NOW vs. Origin

RTPP98 - 25

Single Processor Performance

Benchmark   Origin  Ultra 170  SP2
BT          2488    4178       2574
SP          1652    2897       1817
LU          1373    2470       1871
MG          53      90         53
IS          37      41         29
FT          133     131        139
SPECfp95    19      9.4        9.7
SPECint95   9.5     5.6        3.2
Triad MB/s  317     254        655

RTPP98 - 26

Understanding Speedup

SpeedUp(p) = T1 / MAXp (Tcompute + Tcomm + Twait)

Tcompute = (work/p + extra) x efficiency
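To make the speedup decomposition concrete, the sketch below evaluates it for a hypothetical four-processor run. All timings are invented for illustration, and the function names are not from the talk.

```python
def speedup(t1, per_proc_times):
    """SpeedUp(p) = T1 / max over processors of (Tcompute + Tcomm + Twait)."""
    slowest = max(t_compute + t_comm + t_wait
                  for t_compute, t_comm, t_wait in per_proc_times)
    return t1 / slowest

def compute_time(work, p, extra, efficiency):
    """Tcompute = (work/p + extra) x efficiency."""
    return (work / p + extra) * efficiency

# Hypothetical per-processor (Tcompute, Tcomm, Twait) in seconds.
per_proc = [(20.0, 3.0, 2.0), (22.0, 3.0, 0.0), (21.0, 2.0, 4.0), (20.0, 4.0, 1.0)]
s = speedup(t1=100.0, per_proc_times=per_proc)

# With no extra work and an efficiency factor of 1, compute time scales as work/p.
ideal = compute_time(work=100.0, p=4, extra=0.0, efficiency=1.0)
```

The slowest processor (27 s in this made-up case) sets the speedup, so imbalance in any one of the three terms is enough to pull the curve down.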

RTPP98 - 27

Performance Tools for Clusters

• Independent data collection on every node

– Timing

– Sampling

– Tracing

• Little perturbation of global effects

RTPP98 - 28

Where the Time Goes: LU-a

[Figure: total time (0 to 3000) on 4, 8, 16, and 32 processors, broken down into Compute, Send, Receive, and Wait.]

RTPP98 - 29

Where the Time Goes: BT-a

[Figure: total time (3400 to 4400) on 4, 9, 16, 25, and 36 processors, broken down into Compute, Send, Receive, and Wait.]

RTPP98 - 30

Constant Problem Size Scaling

[Figure: data labels 4, 8, 16, 32, 64, 128, 256.]

RTPP98 - 31

Communication Scaling

[Figure: two panels, Normalized Msgs per Proc and Average Message Size, each with an x-axis from 0 to 40 and series FT, IS, LU, MG, SP, and BT; one y-axis spans 1.00E+02 to 1.00E+07, the other 0 to 8.]

RTPP98 - 32

Communication Scaling: Volume

[Figure: two panels with x-axes from 0 to 40 and series FT, IS, LU, MG, SP, and BT: Bytes per Processor (y-axis 0 to 1.2) and Total Bytes (y-axis 0 to 9.00E+09).]

RTPP98 - 33

Extra Work

RTPP98 - 34

Cache Working Sets: LU

8-fold reduction in miss rate from 4 to 8 proc

RTPP98 - 35

Cache Working Sets: BT

RTPP98 - 36

Cycles per Instruction

RTPP98 - 37

MPI Internal Protocol

[Figure: protocol timeline between Sender and Receiver.]

RTPP98 - 38

Revised Protocol

[Figure: revised protocol timeline between Sender and Receiver.]

RTPP98 - 39

Sensitivity to Overhead

[Figure: slowdown (0.98 to 1.14) versus added overhead (0 to 400) for P=2, P=4, P=8, P=16, and P=32.]

RTPP98 - 40

Conclusions

• Run Time systems for Parallel Programs must deal with a host of architectural interactions

– communication

– computation

– memory system

• Build a performance model of your RTPP

– only way to recognize anomalies

• Build tools along with the RT to reflect characteristics and sensitivity back to PP

• Much can lurk beneath a perfect speedup curve
