RTPP98 - 1 Architectural Interactions in High Performance Clusters RTPP 98 David E. Culler Computer Science Division University of California, Berkeley

Architectural Interactions in High Performance Clusters

Page 1: Architectural Interactions in High Performance Clusters

RTPP98 - 1

Architectural Interactions in High Performance Clusters

RTPP 98

David E. Culler

Computer Science Division

University of California, Berkeley

Page 2: Architectural Interactions in High Performance Clusters

RTPP98 - 2

Run-Time Framework

[Diagram: on each node a Parallel Program sits on a RunTime layer, which sits on the Machine Architecture; the nodes (° ° °) are connected by a Network.]

Page 3: Architectural Interactions in High Performance Clusters

RTPP98 - 3

Two Example RunTime Layers

• Split-C

– thin global address space abstraction over Active Messages

– get, put, read, write

• MPI

– thicker message passing abstraction over Active Messages

– send, receive

Page 4: Architectural Interactions in High Performance Clusters

RTPP98 - 4

Split-C over Active Messages

• Read, Write, Get, Put built on small Active Message request / reply (RPC)

• Bulk-Transfer (store & get)

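[Diagram: a Request message runs a handler on the destination node; the Reply message runs a handler back on the requesting node.]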
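The request / reply pattern on this slide can be sketched as a toy simulation. This is only an illustration, not the Split-C or Active Message implementation: the `Node`, `Network`, and `remote_read` names are hypothetical, and the "network" is a single in-process queue.

```python
from collections import deque


class Node:
    """Toy node: every incoming message names a handler to run on arrival."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network
        self.memory = {}   # this node's slice of the global address space
        self.results = {}  # values delivered by reply handlers

    def read_request(self, src, addr):
        # Request handler: runs on the node that owns the data,
        # and answers with a small reply message.
        self.network.send(src, "read_reply", self.node_id, addr, self.memory[addr])

    def read_reply(self, src, addr, value):
        # Reply handler: runs back on the node that issued the read.
        self.results[(src, addr)] = value


class Network:
    """Hypothetical in-process stand-in for the interconnect."""

    def __init__(self):
        self.nodes = {}
        self.queue = deque()

    def add_node(self, node_id):
        self.nodes[node_id] = Node(node_id, self)
        return self.nodes[node_id]

    def send(self, dest, handler, *args):
        self.queue.append((dest, handler, args))

    def run(self):
        # Deliver messages in order, invoking the named handler on each.
        while self.queue:
            dest, handler, args = self.queue.popleft()
            getattr(self.nodes[dest], handler)(*args)


def remote_read(network, requester, owner, addr):
    """Global-address-space read built from one request and one reply."""
    network.send(owner, "read_request", requester, addr)
    network.run()
    return network.nodes[requester].results[(owner, addr)]
```

A write or put would follow the same shape, with the request carrying the value and the reply acting as an acknowledgement.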

Page 5: Architectural Interactions in High Performance Clusters

RTPP98 - 5

Model Framework: LogP

[Diagram: P processor/memory modules (P, M ° ° °) attached to an Interconnection Network, annotated with o (overhead), g (gap), L (latency), and Limited Volume (L/g to a proc).]

• L: latency in sending a (small) message between modules

• o: overhead felt by the processor on sending or receiving a msg

• g: gap between successive sends or receives (1/rate)

• P: processors

Round Trip time: 2 x (2o + L)
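As a worked example of the round-trip formula, here is a minimal sketch of the LogP cost of small messages. The function names are illustrative, and `pipelined_send_time` uses the usual LogP assumption that a sender can start a new message every max(o, g) time units, which this slide does not spell out.

```python
def one_way_time(L, o):
    """Send overhead + network latency + receive overhead."""
    return o + L + o


def round_trip_time(L, o):
    """Request plus reply: 2 x (2o + L), as on the slide."""
    return 2 * one_way_time(L, o)


def pipelined_send_time(n, L, o, g):
    """Time until the last of n back-to-back small messages is received.

    Assumes successive sends are issued max(o, g) apart (an assumption
    of this sketch, following the standard LogP model).
    """
    return (n - 1) * max(o, g) + one_way_time(L, o)


# With L = 5 and o = 3 (arbitrary units), a round trip costs 2 x (6 + 5) = 22.
print(round_trip_time(L=5, o=3))
```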

Page 6: Architectural Interactions in High Performance Clusters

RTPP98 - 6

LogP Summary of Current Machines

[Bar chart: LogP parameters g, L, Or, and Os for three machines, y-axis 0 to 16 µs. Max MB/s: 38, 141, 47.]

Page 7: Architectural Interactions in High Performance Clusters

RTPP98 - 7

Methodology

• Apparatus:

– 35 Ultra 170s (64 MB, .5 MB L2, Solaris 2.5)

– M2F Lanai + Myricom Net in Fat Tree variant

– GAM + Split-C

• Modify the Active Message layer to inflate L, o, g, or G independently

• Execute a diverse suite of applications and observe effect

• Evaluate against natural performance models

Page 8: Architectural Interactions in High Performance Clusters

RTPP98 - 8

Adjusting L, o, and g (and G) in situ

[Diagram: two Host Workstations, each with an AM lib and a Lanai interface, connected by Myrinet.]

• o: stall the Ultra on msg write and on msg read

• g: delay the Lanai after msg injection (after each fragment for bulk transfers)

• L: defer marking a msg as valid until Rx + L
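The latency knob (hold a received message invisible until Rx + L) can be illustrated with a small sketch. This is not the Lanai firmware or the AM library; `DelayedInbox` and its methods are hypothetical names, and time is passed in explicitly rather than read from a hardware clock.

```python
import heapq


class DelayedInbox:
    """Holds arrived messages back until rx_time + added latency."""

    def __init__(self, added_latency_us):
        self.added_latency_us = added_latency_us
        self._pending = []  # min-heap of (valid_at, sequence number, msg)
        self._seq = 0       # tie-breaker that preserves arrival order

    def on_receive(self, msg, rx_time_us):
        # The message has physically arrived, but it is not marked
        # valid until the extra latency has elapsed.
        valid_at = rx_time_us + self.added_latency_us
        heapq.heappush(self._pending, (valid_at, self._seq, msg))
        self._seq += 1

    def poll(self, now_us):
        # Return every message whose validity time has passed.
        ready = []
        while self._pending and self._pending[0][0] <= now_us:
            ready.append(heapq.heappop(self._pending)[2])
        return ready
```

Because the delay is applied after arrival, it adds latency without consuming processor time (o) or slowing message injection (g), which is what lets the parameters be varied independently.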

Page 9: Architectural Interactions in High Performance Clusters

RTPP98 - 9

Calibration

[Figure: three calibration panels plotting the measured o, g, and L (µs) against the desired L, the desired o, and the desired g, each swept from 0 to 100 µs.]

Page 10: Architectural Interactions in High Performance Clusters

RTPP98 - 10

Applications Characteristics

• Message Frequency

• Write-based vs. Read-based

• Short vs. Bulk Messages

• Synchronization

• Communication Balance

Page 11: Architectural Interactions in High Performance Clusters

RTPP98 - 11

Applications used in the Study

Program   Description           Input                          Time(16)  Time(32)  Msg Interval (µs)
radix     Int Radix Sort        16 M 32-bit keys               17.0 s    9.8 s     7.6
em3d      EM Wave Prop.         8 K nodes, deg. 10, 100 steps  101 s     44.3 s    10.2
sample    Int Sample Sort       32 M 32-bit keys               24.3 s    15.9 s    14.0
ebarnes   Hierarchical N-body   1 M bodies                     81.2 s    46.6 s    60.1
p-ray     Ray Tracer            1 M pixel image, 16 k objs     25.8 s    17.8 s    108.1
murphi    Protocol Verifier     SCI, 2 procs, 1 line, 1 mem    71.3 s    37.9 s    219.4
connect   Connected Components  4 M nodes, 2D, 30 %            2.5 s     1.45 s    282.9
radb      Bulk Radix            16 M 32-bit keys               6.5 s     3.95 s    1260.0

Page 12: Architectural Interactions in High Performance Clusters

RTPP98 - 12

Baseline Communication

Program   µs/msg   ms/Barrier   Avg Msg/Proc   Max Msg/Proc   Reads   Bulk Msg
radix     7.6      895          2,228,364      2,229,106      0.0%    0.0%
em3d      10.2     324          9,953,384      9,974,265      0.0%    0.0%
sample    14.0     2499         1,966,199      2,319,362      0.0%    0.0%
ebarnes   60.1     233          1,351,194      1,400,601      9.6%    23.5%
p-ray     108.1    1465         216,869        353,640        47.7%   47.9%
murphi    219.4    36054        328,699        332,054        0.0%    51.2%
connect   282.9    101          8,912          9,221          34.1%   0.1%
radb      1260.0   80           5,520          5,927          0.0%    43.7%

Page 13: Architectural Interactions in High Performance Clusters

RTPP98 - 13

Application Sensitivity to Communication Performance

Page 14: Architectural Interactions in High Performance Clusters

RTPP98 - 14

Sensitivity to Overhead

[Plot: slowdown (y-axis, 0 to 60) versus overhead (x-axis, 0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

Page 15: Architectural Interactions in High Performance Clusters

RTPP98 - 15

Sensitivity to gap (1/msg rate)

[Plot: slowdown (y-axis, 0 to 20) versus gap (x-axis, 0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort.]

Page 16: Architectural Interactions in High Performance Clusters

RTPP98 - 16

Sensitivity to Latency

[Plot: slowdown (y-axis, 0 to 9) versus latency (x-axis, 0 to 110) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, and NOWsort.]

Page 17: Architectural Interactions in High Performance Clusters

RTPP98 - 17

Sensitivity to bulk BW (1/G)

[Plot: slowdown (y-axis, 0 to 2.5) versus bulk bandwidth (x-axis, 0 to 40 MB/s) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

Page 18: Architectural Interactions in High Performance Clusters

RTPP98 - 18

Modeling Effects of Overhead

• Tpred = Torig + 2 x max #msgs x o

– request / response

– proc with most msgs limits overall time

• Why does this model under-predict?

Overhead (µs)  radix  sample  em3d  ebarnes  p-ray  connect  murphi  radb
0              1.00   1.00    1.00  1.00     1.00   1.00     1.00    1.00
1              1.03   1.00    1.14  1.02     1.03   1.02     1.00    1.01
2              1.02   1.00    1.26  1.00     1.04   1.04     1.02    1.00
4              1.02   1.00    1.50  0.90     1.07   1.07     1.04    1.00
5              1.01   0.99    1.59  0.88     1.06   1.12     1.03    1.00
10             1.01   0.99    2.01  4.83     1.21   1.30     1.09    0.99
20             1.02   0.98    2.53  N/A      1.31   1.32     1.16    0.99
50             1.03   0.98    3.22  N/A      1.50   1.60     1.30    1.02
100            1.04   0.98    3.61  N/A      1.58   1.85     1.46    1.07
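A minimal sketch of the overhead model, to make the arithmetic concrete. The function name is illustrative, and two things are assumptions of this sketch rather than statements from the slides: that o in the formula is the overhead added per message, and that the radix message count from the Baseline Communication slide belongs to the 32-processor run.

```python
def predict_overhead_time(t_orig_s, max_msgs, added_o_us):
    """Tpred = Torig + 2 x max #msgs x o.

    t_orig_s   : baseline run time in seconds
    max_msgs   : message count of the busiest processor
    added_o_us : added overhead per message, in microseconds (assumed meaning of o)
    The factor 2 charges the overhead to both the request and the response.
    """
    return t_orig_s + 2 * max_msgs * added_o_us * 1e-6


# radix: Torig = 9.8 s (Time(32)), max msgs/proc = 2,229,106, o = 10 µs.
# 2 x 2,229,106 x 10 µs is about 44.6 s, so Tpred is roughly 54.4 s.
print(predict_overhead_time(9.8, 2_229_106, 10))
```

The model charges all added overhead to the processor with the most messages, so it ignores any extra waiting that the slowdown causes on other processors.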

Page 19: Architectural Interactions in High Performance Clusters

RTPP98 - 19

Modeling Effects of gap

• Uniform communication model

Tpred = Torig, if g < I (the average msg interval)

      = Torig + m (g - I), otherwise

• Bursty communication

Tpred = Torig + m g
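The two gap models can be written as a pair of small functions. This is a sketch: the function names are illustrative, m is taken to be a message count (the slide does not define it), and g and the interval I must be in the same time unit as Torig.

```python
def predict_gap_uniform(t_orig, m, g, interval):
    """Uniform model: the gap only hurts once it exceeds the average
    message interval I, and then costs (g - I) per message."""
    if g < interval:
        return t_orig
    return t_orig + m * (g - interval)


def predict_gap_bursty(t_orig, m, g):
    """Bursty model: messages are sent back to back, so every message
    pays the full gap."""
    return t_orig + m * g


# Example with made-up numbers: 1000 messages, 10 s baseline, I = 10 ms.
print(predict_gap_uniform(10.0, 1000, 0.005, 0.010))  # gap hidden by the interval
print(predict_gap_uniform(10.0, 1000, 0.020, 0.010))  # gap exceeds the interval
print(predict_gap_bursty(10.0, 1000, 0.020))
```

For the same g, the bursty prediction is never smaller than the uniform one, so the two models bracket applications whose traffic is partly bursty.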

Page 20: Architectural Interactions in High Performance Clusters

RTPP98 - 20

Extrapolating to Low Overhead

[Plot: slowdown (y-axis, 0 to 5) versus overhead (x-axis, 0 to 15) for Barnes, Radix, EM3D(write), EM3D(read), Sample, P-Ray, Murphi, Connect, NOWsort, and RadB.]

Page 21: Architectural Interactions in High Performance Clusters

RTPP98 - 21

MPI over AM: ping-pong bandwidth

[Plot: bandwidth (y-axis, 0 to 70 MB/s) versus message size (x-axis, 10 to 1,000,000 bytes) for SGI Challenge, Meiko CS2, NOW, IBM SP2, and Cray T3D.]

Page 22: Architectural Interactions in High Performance Clusters

RTPP98 - 22

MPI over AM: start-up

[Bar chart: start-up cost in microseconds (y-axis, 0 to 90) for SGI Challenge, Meiko, NOW, IBM SP2, and Cray T3D.]

Page 23: Architectural Interactions in High Performance Clusters

RTPP98 - 23

NPB2 Speedup: NOW vs SP2

Page 24: Architectural Interactions in High Performance Clusters

RTPP98 - 24

NOW vs. Origin

Page 25: Architectural Interactions in High Performance Clusters

RTPP98 - 25

Single Processor Performance

             Origin  Ultra 170  SP2
BT           2488    4178       2574
SP           1652    2897       1817
LU           1373    2470       1871
MG           53      90         53
IS           37      41         29
FT           133     131        139
SPECfp95     19      9.4        9.7
SPECint95    9.5     5.6        3.2
Triad MB/s   317     254        655

Page 26: Architectural Interactions in High Performance Clusters

RTPP98 - 26

Understanding Speedup

SpeedUp(p) = T1 / MAXp (Tcompute + Tcomm + Twait)

Tcompute = (work/p + extra) x efficiency
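A small sketch of this speedup decomposition, under the assumption that the per-processor times have already been measured. The function names and the sample numbers are illustrative, not taken from the talk.

```python
def compute_time(work, p, extra, efficiency):
    """Tcompute = (work/p + extra) x efficiency, as on the slide."""
    return (work / p + extra) * efficiency


def speedup(t1, per_proc_times):
    """SpeedUp(p) = T1 / max over processors of (Tcompute + Tcomm + Twait).

    per_proc_times is a list of (t_compute, t_comm, t_wait) tuples,
    one per processor; the slowest processor sets the run time.
    """
    slowest = max(tc + tm + tw for tc, tm, tw in per_proc_times)
    return t1 / slowest


# Two processors with made-up timings: the slower one takes 25 time units,
# so a 100-unit sequential run yields a speedup of 4.
print(speedup(100.0, [(20.0, 3.0, 2.0), (18.0, 2.0, 1.0)]))
```

Splitting the denominator this way shows how a change in one term (for example, a lower Tcompute from better cache behavior) can mask growth in another.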

Page 27: Architectural Interactions in High Performance Clusters

RTPP98 - 27

Performance Tools for Clusters

• Independent data collection on every node

– Timing

– Sampling

– Tracing

• Little perturbation of global effects
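Per-node timing of this kind can be sketched with a small accumulator that each node keeps locally. This is an illustration only, not the tool used in the study: `NodeTimer` is a hypothetical name, and the phase labels are whatever the caller chooses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class NodeTimer:
    """Accumulates elapsed time per named phase on one node."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                  # injectable for testing
        self.totals = defaultdict(float)    # phase name -> seconds

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Charge the elapsed time to the phase even if the body raises.
            self.totals[name] += self.clock() - start


# Usage: wrap each region of the program in the phase it belongs to.
timer = NodeTimer()
with timer.phase("compute"):
    sum(range(1000))
with timer.phase("wait"):
    pass
print(dict(timer.totals))
```

Each node only touches its own counters, so collection needs no communication during the run; totals can be gathered after the program finishes.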

Page 28: Architectural Interactions in High Performance Clusters

RTPP98 - 28

Where the Time Goes: LU-a

[Stacked bar chart: total time (y-axis, 0 to 3000) on 4, 8, 16, and 32 processors, broken down into Compute, Send, Receive, and Wait.]

Page 29: Architectural Interactions in High Performance Clusters

RTPP98 - 29

Where the Time Goes: BT-a

[Stacked bar chart: total time (y-axis, 3400 to 4400) on 4, 9, 16, 25, and 36 processors, broken down into Compute, Send, Receive, and Wait.]

Page 30: Architectural Interactions in High Performance Clusters

RTPP98 - 30

Constant Problem Size Scaling

[Figure: plot labeled with the values 4, 8, 16, 32, 64, 128, and 256.]

Page 31: Architectural Interactions in High Performance Clusters

RTPP98 - 31

Communication Scaling

[Figure: two panels, Normalized Msgs per Proc and Average Message Size, each with an x-axis from 0 to 40 and series for FT, IS, LU, MG, SP, and BT. One y-axis runs from 1.00E+02 to 1.00E+07, the other from 0 to 8.]

Page 32: Architectural Interactions in High Performance Clusters

RTPP98 - 32

Communication Scaling: Volume

[Figure: two panels, each with an x-axis from 0 to 40 and series for FT, IS, LU, MG, SP, and BT. Bytes per Processor: y-axis 0 to 1.2. Total Bytes: y-axis 0.00E+00 to 9.00E+09.]

Page 33: Architectural Interactions in High Performance Clusters

RTPP98 - 33

Extra Work

Page 34: Architectural Interactions in High Performance Clusters

RTPP98 - 34

Cache Working Sets: LU

8-fold reduction in miss rate from 4 to 8 proc

Page 35: Architectural Interactions in High Performance Clusters

RTPP98 - 35

Cache Working Sets: BT

Page 36: Architectural Interactions in High Performance Clusters

RTPP98 - 36

Cycles per Instruction

Page 37: Architectural Interactions in High Performance Clusters

RTPP98 - 37

MPI Internal Protocol

Sender Receiver

Page 38: Architectural Interactions in High Performance Clusters

RTPP98 - 38

Revised Protocol

Sender Receiver

Page 39: Architectural Interactions in High Performance Clusters

RTPP98 - 39

Sensitivity to Overhead

[Plot: slowdown (y-axis, 0.98 to 1.14) versus added overhead (x-axis, 0 to 400) for P=2, P=4, P=8, P=16, and P=32.]

Page 40: Architectural Interactions in High Performance Clusters

RTPP98 - 40

Conclusions

• Run Time systems for Parallel Programs must deal with a host of architectural interactions

– communication

– computation

– memory system

• Build a performance model of your RTPP

– only way to recognize anomalies

• Build tools along with the RT to reflect characteristics and sensitivity back to PP

• Much can lurk beneath a perfect speedup curve