©2003 Dror Feitelson
Parallel Computing Systems – Part I: Introduction
Dror Feitelson
Hebrew University
©2003 Dror Feitelson
Topics
• Overview of the field
• Architectures: vectors, MPPs, SMPs, and clusters
• Networks and routing
• Scheduling parallel jobs
• Grid computing
• Evaluating performance
©2003 Dror Feitelson
Today (and next week?)
• What is parallel computing
• Some history
• The Top500 list
• The fastest machines in the world
• Trends and predictions
©2003 Dror Feitelson
What is a Parallel System?
In particular, what is the difference between parallel and distributed computing?
©2003 Dror Feitelson
What is a Parallel System?
Chandy: it is related to concurrency.
• In distributed computing, concurrency is part of the problem.
• In parallel computing, concurrency is part of the solution.
©2003 Dror Feitelson
Distributed Systems
• Concurrency because of physical distribution
  – Desktops of different users
  – Servers across the Internet
  – Branches of a firm
  – Central bank computer and ATMs
• Need to coordinate among autonomous systems
• Need to tolerate failures and disconnections
©2003 Dror Feitelson
Parallel Systems
• High-performance computing: solve problems that are too big for a single machine
  – Get the solution faster (weather forecast)
  – Get a better solution (physical simulation)
• Need to parallelize the algorithm
• Need to control overhead
• Can assume a friendly system?
©2003 Dror Feitelson
The Convergence
Use distributed resources for parallel processing
• Networks of workstations – use available desktop machines within organization
• Grids – use available resources (servers?) across organizations
• Internet computing – use personal PCs across the globe (SETI@home)
©2003 Dror Feitelson
Some History
©2003 Dror Feitelson
Early HPC
• Parallel systems in academia/research
  – 1974: C.mmp
  – 1974: Illiac IV
  – 1978: Cm*
  – 1983: Goodyear MPP
©2003 Dror Feitelson
Illiac IV
• 1974
• SIMD: all processors execute the same instruction
• Numerical calculations at NASA
• Now in the Boston Computer Museum
©2003 Dror Feitelson
The Illiac IV in Numbers
• 64 processors arranged as an 8×8 grid
• Each processor has 10^4 ECL transistors
• Each processor has 2K 64-bit words (total is 8 Mbit)
• Arranged in 210 boards
• Packed in 16 cabinets
• 500 Mflops peak performance
• Cost: $31 million
©2003 Dror Feitelson
Sustained vs. Peak
• Peak performance: product of clock rate and number of functional units
• Sustained rate: what you actually achieve on a real application
• Sustained is typically much lower than peak
  – Application does not require all functional units
  – Need to wait for data to arrive from memory
  – Need to synchronize
  – Best for dense matrix operations (Linpack)

(Peak: "a rate that the vendor guarantees will not be exceeded")
©2003 Dror Feitelson
Early HPC
• Parallel systems in academia/research
  – 1974: C.mmp
  – 1974: Illiac IV
  – 1978: Cm*
  – 1983: Goodyear MPP
• Vector systems by Cray and Japanese firms
  – 1976: Cray 1 rated at 160 Mflops peak
  – 1982: Cray X-MP, later Y-MP, C90, …
  – 1985: Cray 2, NEC SX-2
©2003 Dror Feitelson
Cray’s Achievements
• Architectural innovations
  – Vector operations on vector registers
  – All memory is equally close: no cache
  – Trade off accuracy and speed
• Packaging
  – Short and equally long wires
  – Liquid cooling systems
• Style
©2003 Dror Feitelson
Vector Supercomputers
• Vector registers store whole vectors of values for fast access
• Vector instructions operate on whole vectors of values
  – Overhead of instruction decode only once per vector
  – Pipelined execution of the instruction on the vector elements: one result per clock tick (at least after the pipeline is full)
  – Possible to chain vector operations: start feeding the second functional unit before finishing the first one (see the loop sketch below)
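A minimal sketch (not from the original slides) of the kind of loop this machinery targets: a DAXPY-style C loop whose body a vectorizing compiler can map onto vector loads, a chained multiply/add, and vector stores. The function name and arguments are made up for illustration.

  /* Hypothetical example: a dense loop suitable for vectorization.
     The multiply and the add can be chained, so once the pipelines
     are full the hardware retires roughly one y[i] per clock tick. */
  void daxpy(int n, double a, const double *x, double *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }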
©2003 Dror Feitelson
Cray 1
• 1975
• 80 MHz clock
• 160 Mflops peak
• Liquid cooling
• World's most expensive love seat
• Power supply and cooling under the seat
• Available in red, blue, black…
• No operating system
©2003 Dror Feitelson
Cray 1 Wiring
• Round configuration for small and uniform distances
• Longest wire: 4 feet
• Wires connected manually by extra-small engineers
©2003 Dror Feitelson
Cray X-MP
• 1982
• 1 Gflop
• Multiprocessor with 2 or 4 Cray1-like processors
• Shared memory
©2003 Dror Feitelson
Cray X-MP
©2003 Dror Feitelson
Cray 2
• 1985
• Smaller and more compact than Cray 1
• 4 (or 8) processors
• Total immersion liquid cooling
©2003 Dror Feitelson
Cray Y-MP
• 1988
• 8 proc’s
• Achieved 1 Gflop
©2003 Dror Feitelson
Cray Y-MP – Opened
©2003 Dror Feitelson
Cray Y-MP – From Back
Power supply and cooling
©2003 Dror Feitelson
Cray C90
• 1992
• 1 Gflop per processor
• 8 or more processors
©2003 Dror Feitelson
The MPP Boom
• 1985: Thinking Machines introduces the Connection Machine CM-1
  – 16K single-bit processors, SIMD
  – Followed by CM-2, CM-200
  – Similar machines by MasPar
• mid ’80s: hypercubes become successful
• Also: Transputers used as building blocks
• Early '90s: big companies join
  – IBM, Cray
©2003 Dror Feitelson
SIMD Array Processors
• '80s favorites
  – Connection Machine
  – MasPar
• Very many single-bit processors with attached memory – proprietary hardware
• Single control unit: everything is totally synchronized (SIMD = single instruction, multiple data) – sketched below
• Massive parallelism even with “correct counting” (i.e. divide by 32)
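A conceptual sketch (not actual CM-1 or MasPar code) of SIMD semantics: one instruction stream is applied to every data element, with a mask standing in for processors that sit out a step. All names here are illustrative only.

  /* Conceptual SIMD step: the same operation is applied to all
     elements; mask[i] == 0 models a PE that is switched off for
     this instruction, as under a single control unit. */
  void simd_masked_add(int n, const int *mask, int *a, const int *b)
  {
      for (int i = 0; i < n; i++)   /* conceptually: all i at once */
          if (mask[i])
              a[i] += b[i];
  }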
©2003 Dror Feitelson
Connection Machine CM-2
• Cube of 64K proc's
• Acts as backend
• Hyper-cube topology
• Data vault for parallel I/O
©2003 Dror Feitelson
Hypercubes
• Early '80s: Caltech 64-node Cosmic Cube
• Mid to late '80s: commercialized by several companies
  – Intel iPSC, iPSC/2, iPSC/860
  – nCUBE, nCUBE 2 (later turned into a VoD server…)
• Early '90s: replaced by mesh/torus
  – Intel Paragon – i860 processors
  – Cray T3D, T3E – Alpha processors
©2003 Dror Feitelson
Transputers
• A microprocessor with built-in support for communication
• Programmed using Occam
• Used in Meiko and other systems
Occam example – synchronous communication over channel c acts as an assignment across processes:

  PAR
    SEQ
      x := 13
      c ! x
    SEQ
      c ? y
      z := y   -- z is 13
©2003 Dror Feitelson
Attack of the Killer Micros
• Commodity microprocessors advance at a faster rate than vector processors
• Takeover point was around year 2000
• Even before that, using many together could provide lots of power
  – 1992: TMC uses SPARC in CM-5
  – 1992: Intel uses i860 in Paragon
  – 1993: IBM SP uses RS/6000, later PowerPC
  – 1993: Cray uses Alpha in T3D
  – Berkeley NoW project
©2003 Dror Feitelson
Connection Machine CM-5
• 1992
• SPARC-based
• Fat-tree network
• Dominant in early '90s
• Featured in Jurassic Park
• Support for gang scheduling!
©2003 Dror Feitelson
Intel Paragon
• 1992
• 2 i860 proc's per node:
  – Compute
  – Communication
• Mesh interconnect with spiffy display
©2003 Dror Feitelson
Cray T3D/T3E
• 1993 – Cray T3D
• Uses commodity microprocessors (DEC Alpha)
• 3D Torus interconnect
• 1995 – Cray T3E
©2003 Dror Feitelson
IBM SP
• 1993
• 16 RS/6000 processors per rack
• Each runs AIX (full Unix)
• Multistage network
• Flexible configurations
• First large IUCC machine
©2003 Dror Feitelson
Berkeley NoW
• The building is the computer
• Just need some glue software…
©2003 Dror Feitelson
Not Everybody is Convinced…
• Japan’s computer industry continues to build vector machines
• NEC – SX series of supercomputers
• Hitachi – SR series of supercomputers
• Fujitsu – VPP series of supercomputers
• Albeit with less style
©2003 Dror Feitelson
Fujitsu VPP700
©2003 Dror Feitelson
NEC SX-4
©2003 Dror Feitelson
More Recent History
• 1994–1995 slump
  – Cold War is over
  – Thinking Machines files for Chapter 11
  – Kendall Square Research (KSR) files for Chapter 11
• Late '90s much better
  – IBM, Cray retain the parallel machine market
  – Later also SGI, Sun, especially with SMPs
  – ASCI program is started
• 21st century: clusters take over
  – Based on SMPs
©2003 Dror Feitelson
SMPs
• Machines with several CPUs
• Initially small scale: 8-16 processors
• Later achieved large scale of 64-128 processors
• Global shared memory accessed via a bus
• Hard to scale further due to shared memory and cache coherence
©2003 Dror Feitelson
SGI Challenge
• 1 to 16 processors
• Bus interconnect
• Dominated low end of Top500 list in mid ’90s
• Not only graphics…
©2003 Dror Feitelson
SGI Origin
An Origin 2000 installed at IUCC
• MIPS processors
• Remote memory access
©2003 Dror Feitelson
Architectural Convergence
• Shared memory used to be uniform (UMA)
  – Based on bus or crossbar
  – Conventional load/store operations
• Distributed memory used message passing
• Newer machines support remote memory access
  – Nonuniform (NUMA): access to remote memory costs more
  – Put/get operations (but handled by the NIC) – a sketch follows below
  – Cray T3D/T3E, SGI Origin 2000/3000
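As a rough, portable analogue of the put/get style described above (not the machines' native put/get interfaces), MPI-2 one-sided communication lets one process write directly into another's exposed memory with no matching receive. A minimal sketch, error handling omitted; run with at least two processes.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each rank exposes one double in a "window". */
      double local = 0.0;
      MPI_Win win;
      MPI_Win_create(&local, sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);
      if (rank == 0) {
          double val = 3.14;
          /* Put: write val into rank 1's window, one-sided. */
          MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
      }
      MPI_Win_fence(0, win);   /* after this, rank 1's "local" holds 3.14 */

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }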
©2003 Dror Feitelson
The ASCI Program
• 1996: nuclear test ban leads to need for simulation of nuclear explosions
• Accelerated Strategic Computing Initiative: Moore’s law not fast enough…
• Budget of a billion dollars
©2003 Dror Feitelson
The Vision
[Diagram: performance vs. time, contrasting market-driven progress with ASCI requirements; the gap is bridged by a "path forward" and technology transfer.]
©2003 Dror Feitelson
ASCI Milestones
• 1996 – ASCI Red: 1 TF Intel
• 1998 – ASCI Blue Mountain: 3 TF
• 1998 – ASCI Blue Pacific: 3 TF
• 2001 – ASCI White: 10 TF
• 2003 – ASCI Purple: 30 TF?
so far two thirds delivered
©2003 Dror Feitelson
The ASCI Red Machine
• 9260 processors – PentiumPro 200
• Arranged as 4-way SMPs in 86 cabinets
• 573 GB memory total
• 2.25 TB disk space total
• 2 miles of cables
• 850 KW peak power consumption
• 44 tons (+300 tons air conditioning equipment)
• Cost: $55 million
©2003 Dror Feitelson
Clusters vs. MPPs
• Mix-and-match approach
  – PCs/SMPs/blades used as processing nodes
  – Fast switched network for interconnect
  – Linux on each node
  – MPI for software development (see the sketch after this list)
  – Something for management
• Lower cost to set up
• Non-trivial to operate effectively
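For the "MPI for software development" item above, a minimal, hedged example of the message-passing style used on such clusters; the message contents are arbitrary. Compile with mpicc and run with at least two processes (e.g. mpirun -np 2 ./a.out).

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          int msg = 42;                 /* arbitrary payload */
          MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          int msg;
          MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", msg);
      }

      MPI_Finalize();
      return 0;
  }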
©2003 Dror Feitelson
SMP Nodes
• PCs, workstations, or servers with several CPUs
• Small scale (4-8) used as nodes in MPPs or clusters
• Access to shared memory via shared L2 cache
• SMP support (cache coherence) built into modern microprocessors
©2003 Dror Feitelson
Myrinet
• 1995
• Switched gigabit LAN
  – As opposed to Ethernet, which is a broadcast medium
• Programmable NIC
  – Offloads communication operations from the CPU
• Allows clusters to achieve the communication rates of MPPs
• Very expensive
• Later: gigabit Ethernet
©2003 Dror Feitelson
Blades
• PCs/SMPs require resources
  – Floor space
  – Cables for interconnect
  – Power supplies and fans
• This is meaningful if you have thousands
• Blades provide dense packaging
• With vertical mounting get < 1U on average
• The hot new thing in 2002
©2003 Dror Feitelson
SunFire Servers
16 servers in a rack-mounted box
Used to be called “single-board computers” in the ’80s (Makbilan)
©2003 Dror Feitelson
The Cray Name
• 1972: Cray Research founded
  – Cray 1, X-MP, Cray 2, Y-MP, C90…
  – From 1993: MPPs T3D, T3E
• 1989: Cray Computer founded
  – GaAs efforts, closed
• 1996: SGI acquires Cray Research
  – Attempt to merge T3E and Origin
• 2000: sold to Tera
  – Uses the name to bolster the MTA
• 2002: Cray sells the Japanese NEC SX-6
• 2002: announces the new X1 supercomputer
©2003 Dror Feitelson
Vectors are not Dead!
• 1994: Cray T90
  – Continues the Cray C90 line
• 1996: Cray J90
  – Continues the Cray Y-MP line
• 2000: Cray SV1
• 2002: Cray X1
  – Only "Big-Iron" company left
©2003 Dror Feitelson
Cray J90
• 1996
• Very popular continuation of Y-MP
• 8, then 16, then 32 processors
• One installed at IUCC
©2003 Dror Feitelson
Cray X1
• 2002
• Up to 1024 nodes
• 4 custom vector proc's per node
• 12.8 Gflops peak each
• Torus interconnect
©2003 Dror Feitelson
Confused?
©2003 Dror Feitelson
The Top500 List
• List of the 500 most powerful computer installations in the world
• Separates academic chic from real impact
• Measured using Linpack
  – Dense matrix operations
  – Might not be representative of real applications
©2003 Dror Feitelson
The Competition
How to achieve a rank:
• Few vector processors
  – Maximize power per processor
  – High efficiency
• Many commodity processors
  – Ride the technology curve
  – Power in numbers
  – Low efficiency
©2003 Dror Feitelson
Vector Programming
• Conventional Fortran
• Automatic vectorization of loops by the compiler
• Autotasking uses processors that happen to be available at runtime to execute chunks of loop iterations
• Easy for application writers
• Very high efficiency
©2003 Dror Feitelson
MPP Programming
• Library added to the programming language
  – MPI for distributed memory
  – OpenMP for shared memory (sketch below)
• Applications need to be partitioned manually
• Many possible efficiency losses
  – Fragmentation in allocating processors
  – Stalls waiting for memory and communication
  – Imbalance among threads
• Hard for programmers
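A minimal OpenMP sketch of the shared-memory side mentioned above (the array and loop are made up): the directive splits the loop iterations among the available threads, and the "imbalance among threads" loss shows up when iterations have uneven cost.

  #include <omp.h>

  /* Each thread gets a chunk of the iteration space; with uneven
     per-iteration work some threads finish early and sit idle. */
  void scale(int n, double a, double *x)
  {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          x[i] *= a;
  }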
©2003 Dror Feitelson
Also National Competition
• Japan Inc. is "more Cray than Cray"
  – Computers based on few but powerful proprietary vector processors
  – Numerical Wind Tunnel at rank 1 from 1993 to 1995
  – CP-PACS at rank 1 in 1996
  – Earth Simulator at rank 1 from 2002
• US industry switched to commodity microprocessors
  – Even Cray did
  – ASCI machines at rank 1 in 1997-2001
©2003 Dror Feitelson
Vectors vs. MPPs – 1994
Feitelson, Int. J. High-Perf. Comput. App., 1999
©2003 Dror Feitelson
Vectors vs. MPPs – 1997
©2003 Dror Feitelson
Vectors vs. MPPs – 1997
©2003 Dror Feitelson
The Current Situation
©2003 Dror Feitelson
Real Usage
• Control functions for telecomm companies
• Reservoir modeling for oil companies
• Graphic rendering for Hollywood
• Financial modeling for Wall Street
• Drug design for pharmaceuticals
• Weather prediction
• Airplane design for Boeing and Airbus
• Hush-hush activities
©2003 Dror Feitelson
The Earth Simulator
• Operational in late 2002
• Top rank in Top500 list
• Result of 5-year design and implementation effort
• Equivalent power to top 15 US machines (including all ASCI machines)
• Really big
©2003 Dror Feitelson
The Earth Simulator in Numbers
• 640 nodes
• 8 vector processors per node, 5120 total
• 8 Gflops per processor, 40 Tflops total
• 16 GB memory per node, 10 TB total
• 2800 km of cables
• 320 cabinets (2 nodes each)
• Cost: $ 350 million
©2003 Dror Feitelson
Trends
©2003 Dror Feitelson
Exercise
• Look at 10 years of Top500 lists and try to say something non-trivial about trends
• Are there things that grow?
• Are there things that stay the same?
• Can you make predictions?
©2003 Dror Feitelson
Distribution of Vendors – 1994
Feitelson, Int. J. High-Perf. Comput. App., 1999
©2003 Dror Feitelson
Distribution of Vendors – 1997
©2003 Dror Feitelson
IBM in the Lists
Arrows are the ANL SP1 with 128 processors
Rank doubles each year
©2003 Dror Feitelson
Minimal Parallelism
©2003 Dror Feitelson
Min vs. Max
©2003 Dror Feitelson
Power with Time
• Rmax of the last machine doubles each year
  – This is 8-fold in three years
• Degree of parallelism doubles every three years
• So power of each processor increases 4-fold in three years (=doubles in 18 months)
• Which is Moore’s Law…
©2003 Dror Feitelson
Distribution of Power
• The rank of a given machine doubles each year
• The power of the rank-500 machine doubles each year
• So the rank-250 machine this year has double the power of the rank-500 machine this year
• And the rank-125 machine has double the power of the rank-250 machine
• In short, power decreases polynomially with rank: halving the rank doubles the power, so Rmax is roughly proportional to 1/rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
The slope is becoming flatter
Log-log fit: log(Rmax(rank)) = log(Rmax(1)) - α·log(rank), i.e. Rmax(rank) ≈ Rmax(1) / rank^α. Fitted slope α by year:
1994 0.978
1995 0.865
1996 0.839
1997 0.816
1998 0.777
1999 0.761
2000 0.800
2001 0.753
2002 0.746
©2003 Dror Feitelson
Machine Ages in Lists
©2003 Dror Feitelson
New Machines
©2003 Dror Feitelson
Industry Share
©2003 Dror Feitelson
Vector Share
©2003 Dror Feitelson
Summary
Invariants of the last few years:
• Power grows exponentially with time
• Parallelism grows exponentially with time
• But maximal usable parallelism is ~10000
• Power drops polynomially with rank
• Age in the list drops exponentially
• About 300 new machines each year
• About 50% of machines in industry
• About 15% of power due to vector processors