Lecture 13 (part 1): Thread Level Parallelism (6). EEC 171 Parallel Architectures, John Owens, UC Davis


Page 1

Lecture 13 (part 1): Thread Level Parallelism (6)

EEC 171 Parallel Architectures, John Owens

UC Davis

Page 2

Credits• © John Owens / UC Davis 2007–8.

• Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Kathy Yelick / UCB 2007, © Wen-Mei Hwu / David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–7, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanovic / Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

Page 3

Outline

• Interconnection Networks

• Grab bag:

• Amdahl’s Law

• Novices & Parallel Programming

• Interconnect Technologies

Page 4

Interconnection Networks: © Timothy Mark Pinkston and José Duato... with major presentation contribution from José Flich

Network Topology: Preliminaries and Evolution

• One switch suffices to connect a small number of devices
  – Number of switch ports limited by VLSI technology, power consumption, packaging, and other such cost constraints
• A fabric of interconnected switches (i.e., switch fabric or network fabric) is needed when the number of devices is much larger
  – The topology must make a path(s) available for every pair of devices—property of connectedness or full access (What paths?)
• Topology defines the connection structure across all components
  – Bisection bandwidth: the minimum bandwidth of all links crossing a network split into two roughly equal halves
  – Full bisection bandwidth:
    › Network BW_Bisection = Injection (or Reception) BW_Bisection = N/2
  – Bisection bandwidth mainly affects performance
• Topology is constrained primarily by local chip/board pin-outs; secondarily (if at all) by global bisection bandwidth
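To see the bisection definitions above in action, the sketch below (my own illustration, not part of the original slides, assuming unit-bandwidth links and the standard bisection formulas) compares a few 16-node topologies against the full-bisection target of N/2 links.

```python
# My own sketch: bisection width of a few N-node topologies, in unit-bandwidth
# links, compared against the full-bisection target of N/2.
import math

def bisection_links(topology: str, n: int) -> int:
    if topology == "bidirectional ring":
        return 2                        # any balanced cut crosses exactly 2 links
    if topology == "2D mesh":
        return math.isqrt(n)            # cut one column boundary of a k x k mesh
    if topology == "2D torus":
        return 2 * math.isqrt(n)        # wraparound doubles the cut
    if topology == "hypercube":
        return n // 2                   # one link per node pair across the cut
    raise ValueError(topology)

N = 16
full = N // 2
for t in ("bidirectional ring", "2D mesh", "2D torus", "hypercube"):
    b = bisection_links(t, N)
    print(f"{t:18s}: bisection = {b:2d} links, meets full bisection (N/2 = {full})? {b >= full}")
```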

Page 5


Network Topology: Centralized Switched (Indirect) Networks

• Crossbar network
  – Crosspoint switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N²)
  – Has the property of being non-blocking

[Figure: two 8 × 8 crossbar network diagrams, input ports 0–7 connected to output ports 0–7]

Page 6


Network Topology: Centralized Switched (Indirect) Networks

• Multistage interconnection networks (MINs)
  – Crossbar split into several stages consisting of smaller crossbars
  – Complexity grows as O(N × log N), where N is # of end nodes
  – Inter-stage connections represented by a set of permutation functions

Omega topology, perfect-shuffle exchange

[Figure: 8-port Omega network built from 2 × 2 switches, ports 0–7 on each side]
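To make the O(N²) vs. O(N × log N) cost contrast concrete, here is a small sketch (my own illustration, assuming 2 × 2 switches and the standard perfect-shuffle wiring; not from the slides) that counts crossbar crosspoints vs. Omega switches and computes the perfect-shuffle permutation used between Omega stages.

```python
# My own sketch: switch cost of an N x N crossbar vs. an Omega MIN built from
# 2 x 2 switches, plus the perfect-shuffle permutation between Omega stages.
import math

def crossbar_crosspoints(n: int) -> int:
    return n * n                        # grows as O(N^2)

def omega_2x2_switches(n: int) -> int:
    stages = round(math.log2(n))        # log2(N) stages of N/2 switches each
    return stages * (n // 2)            # grows as O(N log N)

def perfect_shuffle(port: int, n: int) -> int:
    """Left-rotate the log2(N)-bit address of `port` by one bit."""
    bits = round(math.log2(n))
    return ((port << 1) | (port >> (bits - 1))) & (n - 1)

for n in (8, 64, 1024):
    print(n, crossbar_crosspoints(n), omega_2x2_switches(n))
print([perfect_shuffle(p, 8) for p in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```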

Page 7


Network Topology: Centralized Switched (Indirect) Networks

16 port, 4 stage Butterfly network

[Figure: 16-port, 4-stage butterfly network with end nodes labeled 0000–1111 on each side]

Page 8


Network Topology: Centralized Switched (Indirect) Networks

• Reduction in MIN switch cost comes at the price of performance
  – Network has the property of being blocking
  – Contention is more likely to occur on network links
    › Paths from different sources to different destinations share one or more links

[Figure: blocking MIN topology (conflicting paths marked with an X) vs. non-blocking crossbar topology, ports 0–7]

Page 9


Network Topology: Centralized Switched (Indirect) Networks

• How to reduce blocking in MINs? Provide alternative paths!
  – Use larger switches (can equate to using more switches)
    › Clos network: minimally three stages (non-blocking)
      » A larger switch in the middle of two other switch stages provides enough alternative paths to avoid all conflicts
  – Use more switches
    › Add log_k N - 1 stages, mirroring the original topology
      » Rearrangeably non-blocking
      » Allows for non-conflicting paths
      » Doubles network hop count (distance), d
      » Centralized control can rearrange established paths
    › Benes topology: 2(log2 N) - 1 stages (rearrangeably non-blocking)
      » Recursively applies the three-stage Clos network concept to the middle-stage set of switches to reduce all switches to 2 x 2
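A minimal sketch of the stage-count arithmetic on this slide (my own, assuming 2 × 2 switches); for N = 16 it reproduces the 7-stage Benes networks shown on the next few slides.

```python
# My own sketch of the stage-count formulas above, assuming 2 x 2 switches.
import math

def benes_stages(n: int) -> int:
    return 2 * round(math.log2(n)) - 1          # 2 log2(N) - 1 stages

def mirrored_min_stages(n: int, k: int = 2) -> int:
    base = round(math.log(n, k))                # original log_k(N) stages
    return base + (base - 1)                    # add log_k(N) - 1 mirrored stages

print(benes_stages(16))           # 7 stages, as in the "16 port, 7 stage" figures
print(mirrored_min_stages(16))    # same count, with roughly doubled hop distance
```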

Page 10


Network Topology: Centralized Switched (Indirect) Networks

16 port, 7 stage Clos network = Benes topology

[Figure: 16-port, 7-stage Clos (Benes) network; ports 0–15 on each side]

Page 11


Network Topology: Centralized Switched (Indirect) Networks

Alternative paths from 0 to 1. 16 port, 7 stage Clos network = Benes topology

[Figure: alternative paths from port 0 to port 1 highlighted in the 16-port, 7-stage Benes network]

Page 12


Network Topology: Centralized Switched (Indirect) Networks

[Figure: alternative paths from port 4 to port 0 highlighted in the 16-port, 7-stage Benes network]

Alternative paths from 4 to 0. 16 port, 7 stage Clos network = Benes topology

Page 13


Network Topology

Myrinet-2000 Clos Network for 128 Hosts

• Backplane of the M3-E128 Switch
• M3-SW16-8F fiber line card (8 ports)

http://myri.com

Page 14


Network Topology: Distributed Switched (Direct) Networks

• Bidirectional ring networks
  – N switches (3 × 3) and N bidirectional network links
  – Simultaneous packet transport over disjoint paths
  – Packets must hop across intermediate nodes
  – Shortest direction usually selected (N/4 hops, on average)
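The N/4 average-hop claim is easy to check numerically; the sketch below (my own, assuming shortest-direction routing on a bidirectional ring) computes the exact average and compares it with N/4.

```python
# My own check of the ~N/4 average hop count on a bidirectional ring with
# shortest-direction routing.
def ring_avg_hops(n: int) -> float:
    return sum(min(d, n - d) for d in range(1, n)) / (n - 1)

for n in (8, 16, 64):
    print(n, round(ring_avg_hops(n), 2), n / 4)
```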

Page 15


Network Topology: Distributed Switched (Direct) Networks

• Fully connected and ring topologies delimit the two extremes
• The ideal topology:
  – Cost approaching a ring
  – Performance approaching a fully connected (crossbar) topology
• More practical topologies:
  – k-ary n-cubes (meshes, tori, hypercubes)
    › k nodes connected in each dimension, with n total dimensions
    › Symmetry and regularity
      » network implementation is simplified
      » routing is simplified
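As a small illustration of the k-ary n-cube notation (my own sketch, using the standard torus formulas and treating k = 2 as the hypercube special case), the following computes node count, link count, diameter, and bisection width for the 16-node examples on the next slide.

```python
# My own sketch of basic k-ary n-cube (torus) parameters; k = 2 is treated as the
# hypercube special case, where the two wraparound directions coincide.
def kary_ncube(k: int, n: int):
    nodes = k ** n
    links = n * (nodes if k > 2 else nodes // 2)     # bidirectional links
    diameter = n * (k // 2)                          # with wraparound, k even
    bisection = (2 if k > 2 else 1) * k ** (n - 1)   # links cut by a balanced split
    return nodes, links, diameter, bisection

print("4-ary 2-cube (2D torus of 16):", kary_ncube(4, 2))   # (16, 32, 4, 8)
print("2-ary 4-cube (hypercube of 16):", kary_ncube(2, 4))  # (16, 32, 4, 8)
```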

Page 16


Network Topology: Distributed Switched (Direct) Networks

[Figure: 2D mesh or grid of 16 nodes, 2D torus of 16 nodes, and hypercube of 16 nodes (16 = 2^4, so n = 4), each shown with its network bisection]

Network bisection ≤ full bisection bandwidth!

“Performance Analysis of k-ary n-cube Interconnection Networks,” W. J. Dally, IEEE Trans. on Computers, Vol. 39, No. 6, pp. 775–785, June 1990.

Page 17


Network Topology

Topological Characteristics of Commercial Machines
(Each row lists: company and system [network] name; max. number of nodes [× # CPUs]; basic network topology; injection [reception] node BW in MB/s; # of data bits per link per direction; raw network link BW per direction in MB/s; raw network bisection BW (bidir) in GB/s.)

• Intel ASCI Red Paragon: 4,510 [× 2]; 2-D mesh 64 × 64; 400 [400]; 16 bits; 400; 51.2
• IBM ASCI White SP Power3 [Colony]: 512 [× 16]; BMIN w/ 8-port bidirectional switches (fat tree or Omega); 500 [500]; 8 bits (+1 bit of control); 500; 256
• Intel Thunder Itanium2 Tiger4 [QsNetII]: 1,024 [× 4]; fat tree w/ 8-port bidirectional switches; 928 [928]; 8 bits (+2 control for 4b/5b enc); 1,333; 1,365
• Cray XT3 [SeaStar]: 30,508 [× 1]; 3-D torus 40 × 32 × 24; 3,200 [3,200]; 12 bits; 3,800; 5,836.8
• Cray X1E: 1,024 [× 1]; 4-way bristled 2-D torus (~23 × 11) with express links; 1,600 [1,600]; 16 bits; 1,600; 51.2
• IBM ASC Purple pSeries 575 [Federation]: >1,280 [× 8]; BMIN w/ 8-port bidirectional switches (fat tree or Omega); 2,000 [2,000]; 8 bits (+2 bits of control); 2,000; 2,560
• IBM Blue Gene/L eServer Sol. [Torus Net]: 65,536 [× 2]; 3-D torus 32 × 32 × 64; 612.5 [1,050]; 1 bit (bit serial); 175; 358.4

Page 18


Routing, Arbitration, and Switching: Routing

• Performed at each switch, regardless of topology
• Defines the “allowed” path(s) for each packet (Which paths?)
• Needed to direct packets through network to intended destinations
• Ideally:
  – Supply as many routing options to packets as there are paths provided by the topology, and evenly distribute network traffic among network links using those paths, minimizing contention
• Problems: situations that cause packets never to reach their dest.
  – Livelock
    › Arises from an unbounded number of allowed non-minimal hops
    › Solution: restrict the number of non-minimal (mis)hops allowed
  – Deadlock
    › Arises from a set of packets being blocked waiting only for network resources (i.e., links, buffers) held by other packets in the set
    › Probability increases with increased traffic & decreased availability

Page 19


Routing, Arbitration, and Switching: Routing

• Common forms of deadlock:
  – Routing-induced deadlock

[Figure: routing of packets p1–p5 in a 2D mesh (sources s1–s5, destinations d1–d5) and the corresponding channel dependency graph over channels c0–c12]

Routing of packets in a 2D mesh; channel dependency graph

ci = channel i, si = source node i, di = destination node i, pi = packet i

“A Formal Model of Message Blocking and Deadlock Resolution in Interconnection Networks,” S. Warnakulasuriya and T. Pinkston, IEEE Trans. on Parallel and Distributed Systems, Vol. 11, No. 3, pp. 212–229, March 2000.
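Routing-induced deadlock is commonly reasoned about as a cycle in the channel dependency graph shown above; the sketch below is a hypothetical example (the channel names and dependencies are made up, not taken from the figure) of detecting such a cycle with a depth-first search.

```python
# Hypothetical example (channel names and dependencies are made up): deadlock can
# be diagnosed as a cycle in the channel dependency graph, where an edge ci -> cj
# means a packet holding channel ci may wait for channel cj.
def has_cycle(deps: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c: str) -> bool:
        color[c] = GRAY
        for nxt in deps.get(c, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:                 # back edge: cycle found
                return True
            if state == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

cyclic  = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["c1"]}
acyclic = {"c1": ["c2"], "c2": ["c3"], "c3": [],     "c4": ["c1"]}
print(has_cycle(cyclic), has_cycle(acyclic))   # True False
```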

Page 20


On-Chip Networks (OCNs)

Examples of Interconnection Networks
(Each row lists: institution and processor [network] name; year built; number of network ports [cores or tiles + other ports]; basic network topology; link bandwidth [link clock speed]; # of chip metal layers, flow control, # VCs; # of data bits per link per direction; routing, arbitration, switching.)

• MIT Raw [General Dynamic Network]: 2002; 16 ports [16 tiles]; 2-D mesh 4 × 4; 0.9 GBps [225 MHz, clocked at proc speed]; 6 layers, credit-based, no VCs; 32 bits; XY DOR w/ request-reply deadlock recovery, RR arbitration, wormhole
• IBM POWER5: 2004; 7 ports [2 PE cores + 5 other ports]; crossbar; [1.9 GHz, clocked at proc speed]; 7 layers, handshaking, no virtual channels; 256 b inst fetch, 64 b for stores, 256 b LDs; shortest-path, non-blocking, circuit switch
• U.T. Austin TRIPS EDGE [Operand Network]: 2005; 25 ports [25 execution unit tiles]; 2-D mesh 5 × 5; 5.86 GBps [533 MHz clk scaled by 80%]; 7 layers, on/off flow control, no VCs; 110 bits; XY DOR, distributed RR arbitration, wormhole
• U.T. Austin TRIPS EDGE [On-Chip Network]: 2005; 40 ports [16 L2 tiles + 24 network interface tiles]; 2-D mesh 10 × 4; 6.8 GBps [533 MHz clk scaled by 80%]; 7 layers, credit-based flow control, 4 VCs; 128 bits; XY DOR, distributed RR arbitration, VCT switched
• Sony, IBM, Toshiba Cell BE [Element Interconnect Bus]: 2005; 12 ports [1 PPE and 8 SPEs + 3 other ports for memory, I/O interface]; ring (4 total, 2 in each direction); 25.6 GBps [1.6 GHz, clocked at half the proc speed]; 8 layers, credit-based flow control, no VCs; 128 bits data (+16 bits tag); shortest-path, tree-based RR arbitration (centralized), pipelined circuit switch
• Sun UltraSPARC T1 processor: 2005; up to 13 ports [8 PE cores + 4 L2 banks + 1 shared I/O]; crossbar; 19.2 GBps [1.2 GHz, clocked at proc speed]; 9 layers, handshaking, no VCs; 128 b both for the 8 cores and the 4 L2 banks; shortest-path, age-based arbitration, VCT switched

Page 21


Cell Broadband Engine Element Interconnect Bus

• Cell BE is successor to PlayStation 2’s Emotion Engine
  – 300 MHz MIPS-based
  – Uses two vector elements
  – 6.2 GFLOPS (Single Precision)
  – 72 KB Cache + 16 KB Scratch Pad RAM
  – 240 mm2 on 0.25-micron process
• PlayStation 3 uses the Cell BE*
  – 3.2 GHz POWER-based
  – Eight SIMD (Vector) Processor Elements
  – >200 GFLOPS (Single Precision)
  – 544 KB cache + 2 MB Local Store RAM
  – 235 mm2 on 90-nanometer SOI process

*Sony has decided to use only 7 SPEs for the PlayStation 3 to improve yield. Eight SPEs will be assumed for the purposes of this discussion.

Examples of Interconnection Networks

Page 22


Cell Broadband Engine Element Interconnect Bus

• Cell Broadband Engine (Cell BE): 200 GFLOPS
  – 12 elements (devices) interconnected by EIB:
    › One 64-bit Power processor element (PPE) with aggregate bandwidth of 51.2 GB/s
    › Eight 128-bit SIMD synergistic processor elements (SPE) with local store, each with a bandwidth of 51.2 GB/s
    › One memory interface controller (MIC) element with memory bandwidth of 25.6 GB/s
    › Two configurable I/O interface elements: 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth
  – Element Interconnect Bus (EIB):
    › Four unidirectional rings (two in each direction), each connecting the 12 heterogeneous elements (end node devices)
    › Data links: 128 bits wide @ 1.6 GHz; data bandwidth: 25.6 GB/s
    › Provides coherent and non-coherent data transfer
    › Should optimize network traffic flow (throughput) and utilization while minimizing network latency and overhead

Examples of Interconnection Networks

Page 23


Examples of Interconnection Networks

Page 24


Examples of Interconnection Networks

Cell Broadband Engine Element Interconnect Bus

• Element Interconnect Bus (EIB)
  – Packet size: 16 B – 128 B (no headers); pipelined circuit switching
  – Credit-based flow control (command bus central token manager)
  – Two-stage, dual round-robin centralized network arbiter
  – Allows up to 64 outstanding requests (DMA)
    › 64 Request Buffers in the MIC; 16 Request Buffers per SPE
  – Latency: 1 cycle/hop; transmission time (largest packet) 8 cycles
  – Effective bandwidth: peak 307.2 GB/s, max. sustainable 204.8 GB/s
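A hedged back-of-envelope check of the EIB numbers quoted above: 128-bit links at 1.6 GHz give the stated 25.6 GB/s per element port, and the quoted peak and sustainable figures correspond to 12 and 8 concurrent 16-byte-per-cycle transfers, respectively (that decomposition is my reading of the numbers, not a statement on the slide).

```python
# My back-of-envelope check of the EIB figures above (the 12x/8x decomposition of
# the peak and sustainable numbers is my reading, not stated on the slide).
link_bytes = 128 // 8            # 128-bit data links
clock_ghz = 1.6
per_port_GBps = link_bytes * clock_ghz
print(per_port_GBps)             # 25.6 GB/s per element port
print(per_port_GBps * 12)        # 307.2 GB/s quoted peak
print(per_port_GBps * 8)         # 204.8 GB/s quoted max. sustainable
```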

Page 25


Examples of Interconnection Networks

Blue Gene/L 3D Torus Network

• 360 TFLOPS (peak)
• 2,500 square feet
• Connects 65,536 dual-processor nodes and 1,024 I/O nodes
  – One processor for computation; other meant for communication

Page 26


Examples of Interconnection Networks

Blue Gene/L 3D Torus Network

• Chip (node): 2 processors, 2.8/5.6 GF/s, 512 MB
• Compute card: 2 chips (1 × 2 × 1), 5.6/11.2 GF/s, 1.0 GB
• Node card: 32 chips (4 × 4 × 2), 16 compute + 0–2 I/O cards, 90/180 GF/s, 16 GB
• Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
• System: 64 racks (64 × 32 × 32), 180/360 TF/s, 32 TB

Node distribution: two nodes on a 2 × 1 × 1 compute card, 16 compute cards + 2 I/O cards on a 4 × 4 × 2 node board, 16 node boards on an 8 × 8 × 8 midplane, 2 midplanes on a 1,024-node rack; 8.6 meters maximum physical link length

www.ibm.com

Page 27


Examples of Interconnection Networks

Blue Gene/L 3D Torus Network

• Main network: 32 × 32 × 64 3-D torus
  – Each node connects to six other nodes
  – Full routing in hardware
• Links and bandwidth
  – 12 bit-serial links per node (6 in, 6 out)
  – Torus clock speed runs at 1/4th of processor rate
  – Each link is 1.4 Gb/s at target 700-MHz clock rate (175 MB/s)
  – High internal switch connectivity to keep all links busy
    › External switch input links: 6 at 175 MB/s each (1,050 MB/s aggregate)
    › External switch output links: 6 at 175 MB/s each (1,050 MB/s aggregate)
    › Internal datapath crossbar input links: 12 at 175 MB/s each
    › Internal datapath crossbar output links: 6 at 175 MB/s each
    › Switch injection links: 7 at 175 MB/s each (2 cores, each with 4 FIFOs)
    › Switch reception links: 12 at 175 MB/s each (2 cores, each with 7 FIFOs)
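A small arithmetic check (mine, not from the slide) of the Blue Gene/L link figures above: 1.4 Gb/s per bit-serial link is 175 MB/s, and six such links per direction give the 1,050 MB/s aggregate.

```python
# My arithmetic check of the Blue Gene/L link numbers above.
link_gbps = 1.4                      # bit-serial link, per direction
link_MBps = link_gbps * 1000 / 8     # 175 MB/s
print(link_MBps)                     # 175.0
print(6 * link_MBps)                 # 1050.0 MB/s aggregate over 6 links
```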

Page 28

Encountering Amdahl’s Law

• Speedup due to enhancement E is

  Speedup with E = Exec time w/o E / Exec time with E

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

  ExTime w/ E = ExTime w/o E × ((1 - F) + F/S)

  Speedup w/ E = 1 / ((1 - F) + F/S)
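A minimal sketch of the speedup formula above as a reusable function (my own illustration, not from the slides):

```python
# Minimal sketch of the formula above.
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl_speedup(0.5, 10))   # ~1.82x overall despite a 10x-faster enhanced part
```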

Page 29

Challenges of Parallel Processing

• Application parallelism ⇒ primarily via new

algorithms that have better parallel performance

• Long remote latency impact ⇒ both by architect and

by the programmer

• For example, reduce frequency of remote accesses either by

• Caching shared data (HW)

• Restructuring the data layout to make more accesses local (SW)

Page 30

Examples: Amdahl’s Law

• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.

• Speedup w/ E =

• What if it’s usable only 15% of the time?

• Speedup w/ E =

• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!

• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less

Speedup w/ E = 1 / ((1-F) + F/S)
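Plugging the slide’s numbers into that formula (my arithmetic; the slide intentionally leaves the blanks for the reader):

```python
# My arithmetic for the two blanks above and the 100-processor claim.
def speedup(f, s):
    return 1 / ((1 - f) + f / s)

print(round(speedup(0.25, 20), 2))   # ~1.31x when the enhancement is usable 25% of the time
print(round(speedup(0.15, 20), 2))   # ~1.17x when usable only 15% of the time

# Speedup of 99 on 100 processors: solve 99 = 1 / ((1 - F) + F/100) for F.
F = (1 - 1 / 99) / (1 - 1 / 100)
print(F, 1 - F)                      # parallel fraction ~0.9999, serial fraction ~0.01%
```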

Page 31

Challenges of Parallel Processing

• Second challenge is long latency to remote memory

• Suppose 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (Remote access = 200/0.5 = 400 clock cycles.)

• What is performance impact if 0.2% instructions involve remote access?

• 1.5X

• 2.0X

• 2.5X
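A quick check of this example (my arithmetic): the remote accesses add 0.2% × 400 = 0.8 cycles per instruction to the base CPI of 0.5, so the machine runs about 2.6× slower, which makes 2.5× the closest of the listed choices.

```python
# My arithmetic for the remote-access example above.
base_cpi = 0.5
remote_fraction = 0.002              # 0.2% of instructions
remote_cycles = 400                  # 200 ns / 0.5 ns clock

cpi = base_cpi + remote_fraction * remote_cycles   # 0.5 + 0.8 = 1.3
print(cpi, cpi / base_cpi)                         # slowdown ~2.6x (closest choice: 2.5X)
```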

Page 32

Challenges of Parallel Processing

• Application parallelism ⇒ primarily via new

algorithms that have better parallel performance

• Long remote latency impact ⇒ both by architect and

by the programmer

• For example, reduce frequency of remote accesses either by

• Caching shared data (HW)

• Restructuring the data layout to make more accesses local (SW)

Page 33

How hard is parallel programming anyway?

• Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers

• Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, Victor Basili, Jeffrey K. Hollingsworth, Marvin V. Zelkowitz

• Supercomputing 2005

Page 34

Why use students for testing?

• First, multiple students are routinely given the same assignment to perform, and thus we are able to conduct experiments in a way to control for the skills of specific programmers.
• Second, graduate students in a HPC class are fairly typical of a large class of novice HPC programmers who may have years of experience in their application domain but very little in HPC-style programming.
• Finally, due to the relatively low costs, student studies are an excellent environment to debug protocols that might be later used on practicing HPC programmers.

Page 35

Tests run

A final important independent variable is programmer experience. In the studies reported here, the majority of subjects were novice HPC developers (not surprisingly, as the studies were run in a classroom environment).

4.2 Dependent Variables

Our studies measured the following as outcomes of the HPC development practices applied:

A primary measure of the quality of the HPC solution was the speedup achieved, that is, the relative execution time of a program running on a multi-processor system compared to a uniprocessor system. In this paper, all values reported for speedup were measured when the application was run on 8 parallel processors, as this was the largest number of processors that was feasible for use in our classroom environments.

A primary measure of cost was the amount of effort required to develop the final solution, for a given problem and given approach. The effort undertaken to develop a serial solution included the following activities: thinking/planning the solution, coding, and testing. The effort undertaken to develop a parallel solution included all of the above as well as tuning the parallel code (i.e. improving performance through optimizing the parallel instructions). HPC development for the studies presented in this paper was always done having a serial version available.

Another outcome variable studied was the code expansion factor of the solutions. In order to take full advantage of the parallel processors, HPC codes can be expected to include many more lines of code (LOC) than serial solutions. For example, message-passing approaches such as MPI require a significant amount of code to deal with communication across different nodes. The expansion factor is the ratio of LOC in a parallel solution to LOC in a serial solution of the same problem.

Finally, we look at the cost per LOC of solutions in each of the various approaches. This value is another measure (in person-hours) of the relative cost of producing code in each of the HPC approaches.
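To make the four dependent variables above concrete, here is a small sketch with made-up inputs (the LOC, time, and person-hour values are hypothetical, not data from the study):

```python
# Hypothetical numbers only (not the study's data), to show how the dependent
# variables described above are computed.
def speedup(serial_time_s: float, parallel_time_8p_s: float) -> float:
    return serial_time_s / parallel_time_8p_s

def expansion_factor(parallel_loc: int, serial_loc: int) -> float:
    return parallel_loc / serial_loc

def cost_per_loc(person_hours: float, loc: int) -> float:
    return person_hours / loc

serial_loc, mpi_loc = 150, 400                 # hypothetical game-of-life solutions
print(speedup(80.0, 16.0))                     # 5.0x on 8 processors
print(expansion_factor(mpi_loc, serial_loc))   # ~2.67x more code for MPI
print(cost_per_loc(20.0, mpi_loc))             # person-hours per line of code
```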

4.3 Studies Described in this Paper

To validate our methodology, we selected a subset of the cells in the table for which we have already been able to gather sufficient data to draw conclusions (i.e., the gray-shaded cells in Table 1), where the majority of our data lies at this time.

Studies analyzed in this paper include:

C0A1. This data was collected in Fall 2003, from a graduate-level course with 16 students. Subjects were asked to implement the “game of life” program in C on a cluster of PCs, first using a serial solution and then parallelizing the solution with MPI.

C1A1. This data was from a replication of the C0A1 assignment in a different graduate-level course at the same university in Spring 2004. 10 subjects participated.

C2A2. This data was collected in Spring 2004, from a graduate-level course with 27 students. Subjects were asked to parallelize the “grid of resistors” problem in a combination of Matlab and C on a cluster of PCs. Given a serial version, sub-

Table 1: Matrix describing the problem space of HPC studies being run. Columns show the parallel programming model used (Serial, MPI, OpenMP, Co-Array Fortran, StarP, XMT). Rows show the assignment, grouped by communication pattern required. Each study is indicated with a label CxAy, identifying the participating class (C) and the assignment (A). Studies analyzed in this paper are grey-shaded.

Nearest-Neighbor Type Problems
• Game of Life: C3A3, C3A3, C0A1, C1A1, C3A3
• Grid of Resistors: C2A2, C2A2, C2A2, C2A2
• Sharks & Fishes: C6A2, C6A2, C6A2
• Laplace’s Eq.: C2A3, P2A3
• SWIM: C0A2

Broadcast Type Problems
• LU Decomposition: C4A1
• Parallel Mat-vec: C3A4
• Quantum Dynamics: C7A1

Embarrassingly Parallel Type Problems
• Buffon-Laplace Needle: C2A1, C3A1, C2A1, C3A1, C2A1, C3A1

(Miscellaneous Problem Types)
• Parallel Sorting: C3A2, C3A2, C3A2
• Array Compaction: C5A1
• Randomized Selection: C5A2

Page 36

What They Learned

• Novices are able to achieve speedup on a parallel machine.

• MPI and OpenMP both require more {code, cost per line, effort} than serial implementations

• MPI takes more effort than OpenMP

jects were asked to produce an HPC solution first in MPI and then in OpenMP.

C3A3. This data was collected in Spring 2004, from a graduate-level course with 16 students. Subjects were asked to implement the “game of life” program in C on a cluster of PCs, first as a serial and then as an MPI and OpenMP version.

5 Hypotheses Investigated

The data collected from the classroom studies allow us to address a number of issues about how novices learn to develop HPC codes, by looking within and across cells on our matrix. In Sections 5.1-5.4 we compare the two parallel programming models, MPI and OpenMP, to serial development to derive a better understanding of the relationship between serial development and parallel development. Then in Section 5.5 we compare MPI and OpenMP to each other to get a better understanding of their relationship with regard to programmer productivity.

In the analysis presented in Sections 5.2-5.5, we used the paired t-test [7] to compare MPI to OpenMP. For example, we used a paired t-test to investigate whether there was any difference in the LOC required to implement the same solution in different HPC approaches for the same subject. This statistical test calculates the signed difference between two values for the same subject (e.g. the LOC required for an OpenMP implementation and an MPI implementation of the same problem). The output of the test is based on the mean of the differences for all subjects: If this mean value is large it will tend to indicate a real and significant difference across the subjects for the two different approaches. Conversely, if the mean is close to zero then it would tend to indicate that both approaches performed about the same (i.e. that there was no consistent effect due to the different approaches). By making a within-subject comparison we avoid many threats to validity, by holding factors such as experience level, background, general programming ability, etc., constant.

5.1 Achieving Speedup

A key question was whether novice developers could achieve speedup at all. It is highly relevant for characterizing HPC development, because it addresses the question of whether the benefits of HPC solutions can be widely achieved or will only be available to highly skilled developers. A survey of HPC experts (conducted at the DARPA-sponsored HPCS project meeting in January 2004) indicated that 87% of the experts surveyed felt that speedup could be achieved by novices, although no rigorous data was cited to bolster this assertion. Based on this information, we posed the following hypothesis:

H1 Novices are able to achieve speedup on a parallel machine

To evaluate this hypothesis, we evaluated speedup in two ways: 1) comparing the parallel version of the code to a serial version of the same application, or 2) comparing the parallel version of the code run on multiple processors to the same version of the code run on one processor.

For the game of life application, we have data from 3 classroom studies for both MPI and OpenMP approaches. These data are summarized in Table 2, which shows the parallel programming model, the application being developed, and the programming language used in each study for which data was collected. The OpenMP programs were run on shared memory machines. The MPI programs were run on clusters.

Table 2: Mean, standard deviation, and number of subjects for computing speedup. All data sets are for C implementations of the Game of Life. (Columns: data set; programming model; speedup on 8 processors.)

Speedup w.r.t. serial version
• C1A1, MPI: mean 4.74, sd 1.97, n = 2
• C3A3, MPI: mean 2.8, sd 1.9, n = 3
• C3A3, OpenMP: mean 6.7, sd 9.1, n = 2

Speedup w.r.t. parallel version run on 1 processor
• C0A1, MPI: mean 5.0, sd 2.1, n = 13
• C1A1, MPI: mean 4.8, sd 2.0, n = 3
• C3A3, MPI: mean 5.6, sd 2.5, n = 5
• C3A3, OpenMP: mean 5.7, sd 3.0, n = 4

Although there are too few data points to say with much rigor, the data we do have supports our hypothesis H1. Novices seem able to achieve about 4.5x speedup (on 8 processors) for the game of life application over the serial version. Speedup on 8 processors is consistently about 5 compared to the parallel version run on one processor, for this application.

In addition to the specific hypothesis evaluated, we also observed that OpenMP seemed to result in speedup values near the top of that range, but too few data points are available to make a meaningful comparison and too few processors were used to compare the parallelism between programming models.

5.2 Code expansion factor

Although the number of lines of code in a given solution is not important on its own terms, it is a useful descriptive measure of the solution produced. Different programming paradigms are likely to require different types of optimization in their solution; for example, OpenMP typically seems to require adding only a few key lines of code to a serial program, while MPI typically requires much more significant changes to produce a working solution. Therefore we can pose the following hypotheses:

H2 An MPI implementation will require more code than its corresponding serial implementation

H3 An OpenMP implementation will require more code than its corresponding serial implementation.

We can use the data from both C2A2 and C3A3 (the first and second rows of the matrix) to investigate how the size of the solution (measured in LOC) changes from serial to various HPC approaches. A summary of the datasets is presented in Table 3. (In the C2A2 assignment, students were given an existing serial solution to use as their starting point. It is in-

Page 37

Ethernet Performance

• Achieves close to theoretical bw even with default 1500 B/message

• Broadcom part: 31 µs latency

Performance Characteristics of Dual-Processor HPC Cluster Nodes Based on 64-bit Commodity Processors, Purkayastha et al.
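As a rough way to relate the latency and bandwidth numbers on these interconnect slides, the sketch below uses a simple stop-and-wait model, effective BW = message bits / (latency + transfer time). This is my own first-order approximation, not from the paper; streamed benchmarks such as the Ethernet result above pipeline messages and so approach the peak sooner. The ~31 µs latency and ~1000 Mbps peak are example inputs loosely based on the Gigabit Ethernet figures on this slide.

```python
# My first-order stop-and-wait model: effective bandwidth vs. message size for a
# link with per-message latency `latency_us` and peak bandwidth `peak_mbps`.
def effective_mbps(msg_bytes: int, latency_us: float, peak_mbps: float) -> float:
    bits = msg_bytes * 8
    transfer_us = bits / peak_mbps          # bits / (Mbit/s) = microseconds
    return bits / (latency_us + transfer_us)

# Example inputs loosely based on the Gigabit Ethernet numbers above.
for size in (64, 1500, 64_000, 1_000_000):
    print(size, round(effective_mbps(size, 31.0, 1000.0), 1))
```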

Page 38

Myrinet

• Interconnect designed with clustering in mind

• 2 Gbps links, possibly 2 physical links (so 4 Gb/s)

• $850/node up to 128 nodes, $1737/node up to 1024 nodes

• MPI latency 6–7 µs

• TCP/IP latency 27–30 µs

• Strong support!

[Figure 2: Myrinet Performance. Bandwidth (Mbps) vs. message size (bytes, 1 B to 100 MB) for MPI dual port, TCP dual port, MPI single port, TCP single port, and Dell built-in Gig E.]

[Figure 3: SCI Performance]

The big payoff is that the switch elements can be very simple since they do not perform any routing calculations. In addition the software design is based on an OS-bypass like interface to provide low latency and low CPU overhead.

Current Myrinet hardware utilizes a 2 Gbps link speed with PCI-X based NICs providing one or two optical links. The dual-port NIC virtualizes the two ports to provide an effective 4 Gbps channel. The downside to the two-port NIC is cost: both the cost of the NIC and the extra switch port it requires. Myrinet is designed to be scalable. Until recently the limit has been 1024 nodes, but that limit has been removed in the latest software. However, there are reports that the network mapping is not yet working reliably with more than 1024 nodes. The cost of Myrinet runs around $850/node up to 128 nodes; beyond that a second layer of switches must be added, increasing the cost to $1737/node for 1024 nodes. The dual-port NICs effectively double the infrastructure and thus add at least $800 per node to the cost.

Myricom provides a full open-source based software stack with support for a variety of OSes and architectures, though some architectures, such as PPC64, are not tuned as well as others. One of the significant plusses for Myrinet is its strong support for TCP/IP. Generally TCP/IP performance has nearly matched the MPI performance, albeit with higher latency.

Figure 2 illustrates the performance of both the single and dual port NICs with the Broadcom based Gigabit Ethernet data included from Figure 1 for reference. In both cases the MPI latencies are very good at 6-7 µseconds and TCP latencies of 27-30 µseconds. Also both NICs achieve around 90% of their link bandwidth with over 3750 Mbps on the dual NIC and 1880 Mbps on the single port NIC. TCP/IP performance is also excellent with peak performance of 1880 Mbps on the single port NIC and over 3500 Mbps on the dual port NIC.

2.3. Scalable Coherent Interface

The Scalable Coherent Interface (SCI) from Dolphin solutions is the most unique interconnect in this study in that it is the only interconnect that is not switched. Instead the nodes are connected in either a 2D wrapped mesh or a 3D torus depending on the total size of the cluster. The NIC includes intelligent ASICs that handle all aspects of pass-through routing, thus pass-through is very efficient and does not impact the host CPU at all. However, one downside is that when a node goes down (powered off) its links must be routed around, thus impacting messaging in the remaining system.

Since the links between nodes are effectively shared, the system size is limited by how many nodes you can effectively put in a loop before it is saturated. Currently that number is in the range of 8-10, leading to total scalability in the range of 64-100 nodes for a 2D configuration and 640-1000 nodes for the 3D torus. Because there are no switches the cost of the system scales linearly with the number of nodes: $1095 for the 2D NIC and $1595 for the 3D NIC including cables. Unfortunately cable length is a significant limitation with a preferred length of 1 m or less, though 3-5 m cables can be used if necessary. This poses quite a challenge in cabling up a system since each node must be connected to 4 or 6 other nodes.

Dolphin provides an open source driver and a 3rd party MPICH based MPI is under development. However,

Page 39

Scalable Coherent Interface

• Not a “switched” interconnect

• Max scalability: 64–100 nodes for 2D torus ($1095/node), 640–1000 for 3D torus ($1595/node)

• MPI: 4 µs latency, 1830 Mbps

• TCP/IP: 900 Mbps


[Figure 3: SCI Performance. Bandwidth (Mbps) vs. message size (bytes) for Myrinet MPI single port, SCI MPI (Tyan 2466N), SCI TCP (Tyan 2466N), and Dell/Broadcom Gigabit.]


Page 40

Quadrics

• “a premium interconnect choice on high-end systems such as the Compaq AlphaServer SC”

• Max 4096 ports

• $2400/port for small system up to $4078/port for 1024 node system

• MPI: 2–3 µs latency, 6370 Mbps BW

[Figure 4: Quadrics Performance. Bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, Myrinet MPI single port, and Dell/Broadcom Gigabit.]

currently we have not gotten the MPI to function correctly in some cases. An alternative to the open source stack is a package from Scali. This adds $70 per node, but does provide good performance. Unfortunately the Scali packages are quite tied to the Red Hat and Suse distributions that they support. Indeed it is difficult to get the included driver to work on kernels other than the default distribution kernels.

Figure 3 illustrates the performance of SCI on a Tyan 2466N (dual AMD Athlon) based node. Since Dolphin does not currently offer a PCI-X based NIC it is unlikely that the performance would be substantially different in the Dell 2650 nodes used for the other tests. The Scali MPI and driver were used for these tests. The MPI performance is quite good with a latency of 4 µseconds and peak performance of 1830 Mbps, nearly matching that of the single port Myrinet NIC, which is a PCI-X based adapter. However, the TCP/IP performance is much less impressive as it barely gets above 900 Mbps and, more importantly, proved unreliable in our tests.

2.4. Quadrics

The Quadrics QSNET network has been known mostly as a premium interconnect choice on high-end systems such as the Compaq AlphaServer SC. On systems such as the SC the nodes tend to be larger SMPs with a higher per-node cost than the typical cluster. Thus the premium cost of Quadrics has posed less of a problem. Indeed some systems are configured with dual Quadrics NICs to further increase performance and to get around the performance limitation of a single PCI slot.

The QSNET system basically consists of intelligent NICs with an on-board IO processor connected via copper cables (up to 10 m) to 8-port switch chips arranged in a fat tree. Quadrics has recently released an updated version of their interconnect called QSNet II based on ELAN4 ASICs. Along with the new NICs Quadrics has introduced new smaller switches, which has brought down the entry point substantially. In addition the limit on the total port count has been increased to 4096 from the original 1024. Still it remains a premium option with per-port costs starting at $2400 for a small system, $2800 per port for a 64-way system, up to $4078 for a 1024 node system.

On the software side Quadrics provides an open source software stack including the driver, userland and MPI. The DMA engine offloads most of the communications onto the IO processor on the NIC. This includes the ability to perform DMA on paged virtual memory addresses, eliminating the need to register and pin memory regions. Unfortunately their supported software configuration also requires a licensed, non-open source resource manager (RMS). In our experience the RMS system was by far the hardest part to get working.

Figure 4 illustrates the performance of Quadrics using the MPI interface. TCP/IP is also provided, but we were unable to get it to build on our system in time for this paper due to incompatibilities with our compiler version. The MPI performance is extremely good with latencies of 2-3 µseconds and peak performance of 6370 Mbps. Indeed this is the lowest latency we have seen on these nodes.

2.5. Infiniband

Infiniband has received a great deal of attention over the past few years even though actual products are just beginning to appear. Infiniband was designed by an industry consortium to provide high bandwidth communications for a wide range of purposes. These purposes range from a low-level system bus to a general purpose interconnect to be used for storage as well as inter-node communications. This range of purposes leads to the hope that Infiniband will enjoy commodity-like pricing due to its use by many segments of the computer market. Another promising development is the use of Infiniband as the core system bus. Such a system could provide external Infiniband ports that would connect directly to the core system bridge chip, bypassing the PCI bus altogether. This would not only lower the cost, but also provide a significantly lower latency. Another significant advantage for Infiniband is that it is designed with scalable performance. The basic Infiniband link speed is 2.5 Gbps (known as a 1X link). However the links are designed to be bonded into higher bandwidth connections with 4 link channels (aka 4X links) providing 10 Gbps and 12 link channels (aka 12X links) providing 30 Gbps.

Page 41

Infiniband

• Defined by industry consortium

• Scalable—base is 2.5 Gb/s link, scales to 30 Gb/s

• Current parts are 4x (10 Gb/s)

• $1200–1600/node

• Also used as system interconnect

• 6750 Mb/s, 6–7 µs

[Figure 5: Infiniband Performance. Bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, Infiniband Infinicon MPI, Infiniband OSU MVICH, Infiniband LAM MPI, Infiniband Infinicon SDP-TCP, Myrinet MPI single port, and Dell/Broadcom Gigabit.]

[Figure 6: 10 Gigabit Ethernet Performance. Bandwidth (Mbps) vs. message size (bytes) for Quadrics MPI, Infiniband Infinicon MPI, Infiniband Infinicon SDP-TCP, 10 Gigabit Ethernet, Myrinet TCP dual port, Myrinet MPI single port, and Dell/Broadcom Gigabit.]

[Paper text garbled in this transcription. Recoverable gist: current Infiniband implementations ship as PCI-X HCAs and switch chips built around 4X links, with most current silicon coming from Mellanox; pricing, helped by aggressive venture funding, runs roughly $1200–1600 per node depending on vendor and cluster size. Vendors differentiate mainly through their software stacks, many of which are still in development and not fully open source; for these tests an older open-source MPI and Infinicon's non-open-source stack were used after the newest open-source code proved unstable. Figure 5 shows substantial differences between the Infiniband MPI implementations: the Infinicon MPI peaks highest, all three drop off sharply once the message size exceeds the on-card cache, and Infiniband latencies (6–7 µs, per the slide above) are a bit higher than the other OS-bypass interconnects.]

2.6. 10 Gigabit Ethernet

[Paper text garbled in this transcription. Recoverable gist: 10 Gigabit Ethernet inherits the advantages and disadvantages of earlier Ethernet generations, namely interoperability, wide portability, and a ubiquitous interface on one hand, and relatively high latency, CPU load, and expensive switches on the other, with a good portion of the cost driven by the optics. Full 10 Gigabit Ethernet switch ports remain expensive (the next slide quotes about $10K per port) and port counts are low, so 10 GigE clusters are unlikely in the near term, though these issues should ease as the technology commoditizes. Figure 6 plots two Intel 10 Gigabit Ethernet ports against the other technologies; the performance is disappointing, topping out well below the link rate.]

Page 42

10 Gb Ethernet

• Similar tradeoffs to previous Ethernets

• $10k per switch port

• Only 3700 Mb/s

Page 43

Application Results

[Paper text garbled in this transcription. Recoverable gist: the ~3700 Mbps ceiling seen for 10 Gigabit Ethernet appears to be a common TCP/IP limit across these technologies, and so may be partly due to TCP/IP overhead itself.]

3. Application Performance

[Garbled. Recoverable gist: raw network performance is not the whole story; how the software and hardware handle more processes than processors matters, especially for applications whose auxiliary data-server processes block in a receive call, because many MPI implementations poll and steal CPU time from the compute processes. The authors' dominant local application is the quantum chemistry package GAMESS, so they ran a communication-intensive GAMESS calculation on each interconnect. The timings in Table 1 cover Gigabit Ethernet, Myrinet MPI, Myrinet TCP, Infiniband, and Quadrics on 1, 2, and 4 nodes, first with two processes per CPU (a compute process and a data server on each CPU, overloading the node) and then with one process per CPU. The fastest raw network does not necessarily give the fastest application time: with unloaded CPUs the interconnects perform similarly, but with overloaded CPUs the polling receive behavior of many MPI implementations and the overhead of multiplexing many simultaneously active processes cost substantial performance, and the Myrinet TCP run comes out faster than the others on multiple nodes.]

4. Conclusions

[Garbled. Recoverable gist: there are now several good cluster interconnect choices across a range of prices. Gigabit Ethernet is the cheapest and slowest, but its solid, ubiquitous software support makes it a good choice when cutting-edge communication performance is not required. SCI and Myrinet sit in the middle on cost and performance; SCI's flat per-node cost is attractive but its software stacks have issues, while Myrinet offers the most solid open-source software stack surveyed and good performance at reasonable cost. Infiniband shows a great deal of promise but still has rough edges, chiefly its varied software stacks. 10 Gigabit Ethernet will likely matter for clusters in the longer term, once prices become competitive.]

Table 1: GAMESS results (the timing values are illegible in this transcription)