©2003 Dror Feitelson
Parallel Computing Systems – Part I: Introduction
Dror Feitelson
Hebrew University
©2003 Dror Feitelson
Topics
• Overview of the field
• Architectures: vectors, MPPs, SMPs, and clusters
• Networks and routing
• Scheduling parallel jobs
• Grid computing
• Evaluating performance
©2003 Dror Feitelson
Today (and next week?)
• What is parallel computing
• Some history
• The Top500 list
• The fastest machines in the world
• Trends and predictions
©2003 Dror Feitelson
What is a Parallel System?
In particular, what is the difference between parallel and distributed computing?
©2003 Dror Feitelson
What is a Parallel System?
Chandy: it is related to concurrency.
• In distributed computing, concurrency is part of the problem.
• In parallel computing, concurrency is part of the solution.
©2003 Dror Feitelson
Distributed Systems
• Concurrency because of physical distribution
  – Desktops of different users
  – Servers across the Internet
  – Branches of a firm
  – Central bank computer and ATMs
• Need to coordinate among autonomous systems
• Need to tolerate failures and disconnections
©2003 Dror Feitelson
Parallel Systems
• High-performance computing: solve problems that are too big for a single machine
  – Get the solution faster (weather forecast)
  – Get a better solution (physical simulation)
• Need to parallelize the algorithm
• Need to control overhead
• Can assume a friendly system?
©2003 Dror Feitelson
The Convergence
Use distributed resources for parallel processing
• Networks of workstations – use available desktop machines within organization
• Grids – use available resources (servers?) across organizations
• Internet computing – use personal PCs across the globe (SETI@home)
©2003 Dror Feitelson
Some History
©2003 Dror Feitelson
Early HPC
• Parallel systems in academia/research
  – 1974: C.mmp
  – 1974: Illiac IV
  – 1978: Cm*
  – 1983: Goodyear MPP
©2003 Dror Feitelson
Illiac IV
• 1974
• SIMD: all processors execute the same instruction
• Numerical calculations at NASA
• Now in the Boston Computer Museum
©2003 Dror Feitelson
The Illiac IV in Numbers
• 64 processors arranged as an 8×8 grid
• Each processor has 10^4 ECL transistors
• Each processor has 2K 64-bit words (total is 8 Mbit)
• Arranged in 210 boards
• Packed in 16 cabinets
• 500 Mflops peak performance
• Cost: $31 million
©2003 Dror Feitelson
Sustained vs. Peak
• Peak performance: product of clock rate and number of functional units
• Sustained rate: what you actually achieve on a real application
• Sustained is typically much lower than peak
  – Application does not require all functional units
  – Need to wait for data to arrive from memory
  – Need to synchronize
  – Best for dense matrix operations (Linpack)

(Peak: "a rate that the vendor guarantees will not be exceeded")
©2003 Dror Feitelson
Early HPC
• Parallel systems in academia/research
  – 1974: C.mmp
  – 1974: Illiac IV
  – 1978: Cm*
  – 1983: Goodyear MPP
• Vector systems by Cray and Japanese firms
  – 1976: Cray 1 rated at 160 Mflops peak
  – 1982: Cray X-MP, later Y-MP, C90, …
  – 1985: Cray 2, NEC SX-2
©2003 Dror Feitelson
Cray’s Achievements
• Architectural innovations
  – Vector operations on vector registers
  – All memory is equally close: no cache
  – Trade off accuracy and speed
• Packaging
  – Short and equally long wires
  – Liquid cooling systems
• Style
©2003 Dror Feitelson
Vector Supercomputers
• Vector registers store whole vectors of values for fast access
• Vector instructions operate on whole vectors of values
  – Overhead of instruction decode only once per vector
  – Pipelined execution of the instruction on the vector elements: one result per clock tick (at least after the pipeline is full)
  – Possible to chain vector operations: start feeding the second functional unit before finishing the first one (see the loop sketch below)
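A minimal sketch (not from the original slides) of the kind of loop this machinery targets: a DAXPY-style C loop whose body a vectorizing compiler can map onto vector loads, a chained multiply/add, and vector stores. The function name and arguments are made up for illustration.

  /* Hypothetical example: a dense loop suitable for vectorization.
     The multiply and the add can be chained, so once the pipelines
     are full the hardware retires roughly one y[i] per clock tick. */
  void daxpy(int n, double a, const double *x, double *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }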
©2003 Dror Feitelson
Cray 1
• 1975
• 80 MHz clock
• 160 Mflops peak
• Liquid cooling
• World's most expensive love seat
• Power supply and cooling under the seat
• Available in red, blue, black…
• No operating system
©2003 Dror Feitelson
Cray 1 Wiring
• Round configuration for small and uniform distances
• Longest wire: 4 feet
• Wires connected manually by extra-small engineers
©2003 Dror Feitelson
Cray X-MP
• 1982
• 1 Gflop
• Multiprocessor with 2 or 4 Cray1-like processors
• Shared memory
©2003 Dror Feitelson
Cray X-MP
©2003 Dror Feitelson
Cray 2
• 1985
• Smaller and more compact than Cray 1
• 4 (or 8) processors
• Total immersion liquid cooling
©2003 Dror Feitelson
Cray Y-MP
• 1988
• 8 proc’s
• Achieved 1 Gflop
©2003 Dror Feitelson
Cray Y-MP – Opened
©2003 Dror Feitelson
Cray Y-MP – From Back
Power supply and cooling
©2003 Dror Feitelson
Cray C90
• 1992
• 1 Gflop per processor
• 8 or more processors
©2003 Dror Feitelson
The MPP Boom
• 1985: Thinking Machines introduces the Connection Machine CM-1
  – 16K single-bit processors, SIMD
  – Followed by CM-2, CM-200
  – Similar machines by MasPar
• mid ’80s: hypercubes become successful
• Also: Transputers used as building blocks
• Early '90s: big companies join
  – IBM, Cray
©2003 Dror Feitelson
SIMD Array Processors
• '80s favorites
  – Connection Machine
  – MasPar
• Very many single-bit processors with attached memory – proprietary hardware
• Single control unit: everything is totally synchronized (SIMD = single instruction, multiple data) – sketched below
• Massive parallelism even with “correct counting” (i.e. divide by 32)
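A conceptual sketch (not actual CM-1 or MasPar code) of SIMD semantics: one instruction stream is applied to every data element, with a mask standing in for processors that sit out a step. All names here are illustrative only.

  /* Conceptual SIMD step: the same operation is applied to all
     elements; mask[i] == 0 models a PE that is switched off for
     this instruction, as under a single control unit. */
  void simd_masked_add(int n, const int *mask, int *a, const int *b)
  {
      for (int i = 0; i < n; i++)   /* conceptually: all i at once */
          if (mask[i])
              a[i] += b[i];
  }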
©2003 Dror Feitelson
Connection Machine CM-2
• Cube of 64K proc's
• Acts as backend
• Hyper-cube topology
• Data vault for parallel I/O
©2003 Dror Feitelson
Hypercubes
• Early '80s: Caltech 64-node Cosmic Cube
• Mid to late '80s: commercialized by several companies
  – Intel iPSC, iPSC/2, iPSC/860
  – nCUBE, nCUBE 2 (later turned into a VoD server…)
• Early '90s: replaced by mesh/torus
  – Intel Paragon – i860 processors
  – Cray T3D, T3E – Alpha processors
©2003 Dror Feitelson
Transputers
• A microprocessor with built-in support for communication
• Programmed using Occam
• Used in Meiko and other systems
Occam example – synchronous communication over channel c acts as an assignment across processes:

  PAR
    SEQ
      x := 13
      c ! x
    SEQ
      c ? y
      z := y   -- z is 13
©2003 Dror Feitelson
Attack of the Killer Micros
• Commodity microprocessors advance at a faster rate than vector processors
• Takeover point was around year 2000
• Even before that, using many together could provide lots of power
  – 1992: TMC uses SPARC in CM-5
  – 1992: Intel uses i860 in Paragon
  – 1993: IBM SP uses RS/6000, later PowerPC
  – 1993: Cray uses Alpha in T3D
  – Berkeley NoW project
©2003 Dror Feitelson
Connection Machine CM-5
• 1992
• SPARC-based
• Fat-tree network
• Dominant in early '90s
• Featured in Jurassic Park
• Support for gang scheduling!
©2003 Dror Feitelson
Intel Paragon
• 1992
• 2 i860 proc's per node:
  – Compute
  – Communication
• Mesh interconnect with spiffy display
©2003 Dror Feitelson
Cray T3D/T3E
• 1993 – Cray T3D
• Uses commodity microprocessors (DEC Alpha)
• 3D Torus interconnect
• 1995 – Cray T3E
©2003 Dror Feitelson
IBM SP
• 1993
• 16 RS/6000 processors per rack
• Each runs AIX (full Unix)
• Multistage network
• Flexible configurations
• First large IUCC machine
©2003 Dror Feitelson
Berkeley NoW
• The building is the computer
• Just need some glue software…
©2003 Dror Feitelson
Not Everybody is Convinced…
• Japan’s computer industry continues to build vector machines
• NEC – SX series of supercomputers
• Hitachi – SR series of supercomputers
• Fujitsu – VPP series of supercomputers
• Albeit with less style
©2003 Dror Feitelson
Fujitsu VPP700
©2003 Dror Feitelson
NEC SX-4
©2003 Dror Feitelson
More Recent History
• 1994–1995 slump
  – Cold War is over
  – Thinking Machines files for Chapter 11
  – Kendall Square Research (KSR) files for Chapter 11
• Late '90s much better
  – IBM, Cray retain the parallel machine market
  – Later also SGI, Sun, especially with SMPs
  – ASCI program is started
• 21st century: clusters take over
  – Based on SMPs
©2003 Dror Feitelson
SMPs
• Machines with several CPUs
• Initially small scale: 8-16 processors
• Later achieved large scale of 64-128 processors
• Global shared memory accessed via a bus
• Hard to scale further due to shared memory and cache coherence
©2003 Dror Feitelson
SGI Challenge
• 1 to 16 processors
• Bus interconnect
• Dominated low end of Top500 list in mid ’90s
• Not only graphics…
©2003 Dror Feitelson
SGI Origin
An Origin 2000 installed at IUCC
• MIPS processors
• Remote memory access
©2003 Dror Feitelson
Architectural Convergence
• Shared memory used to be uniform (UMA)
  – Based on bus or crossbar
  – Conventional load/store operations
• Distributed memory used message passing
• Newer machines support remote memory access
  – Nonuniform (NUMA): access to remote memory costs more
  – Put/get operations (but handled by the NIC) – a sketch follows below
  – Cray T3D/T3E, SGI Origin 2000/3000
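As a rough, portable analogue of the put/get style described above (not the machines' native put/get interfaces), MPI-2 one-sided communication lets one process write directly into another's exposed memory with no matching receive. A minimal sketch, error handling omitted; run with at least two processes.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each rank exposes one double in a "window". */
      double local = 0.0;
      MPI_Win win;
      MPI_Win_create(&local, sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);
      if (rank == 0) {
          double val = 3.14;
          /* Put: write val into rank 1's window, one-sided. */
          MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
      }
      MPI_Win_fence(0, win);   /* after this, rank 1's "local" holds 3.14 */

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }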
©2003 Dror Feitelson
The ASCI Program
• 1996: nuclear test ban leads to need for simulation of nuclear explosions
• Accelerated Strategic Computing Initiative: Moore’s law not fast enough…
• Budget of a billion dollars
©2003 Dror Feitelson
The Vision
[Diagram: performance vs. time, contrasting market-driven progress with ASCI requirements; the gap is bridged by a "path forward" and technology transfer.]
©2003 Dror Feitelson
ASCI Milestones
• 1996 – ASCI Red: 1 TF Intel
• 1998 – ASCI Blue Mountain: 3 TF
• 1998 – ASCI Blue Pacific: 3 TF
• 2001 – ASCI White: 10 TF
• 2003 – ASCI Purple: 30 TF?
so far two thirds delivered
©2003 Dror Feitelson
The ASCI Red Machine
• 9260 processors – PentiumPro 200
• Arranged as 4-way SMPs in 86 cabinets
• 573 GB memory total
• 2.25 TB disk space total
• 2 miles of cables
• 850 KW peak power consumption
• 44 tons (+300 tons air conditioning equipment)
• Cost: $55 million
©2003 Dror Feitelson
Clusters vs. MPPs
• Mix-and-match approach
  – PCs/SMPs/blades used as processing nodes
  – Fast switched network for interconnect
  – Linux on each node
  – MPI for software development (see the sketch after this list)
  – Something for management
• Lower cost to set up
• Non-trivial to operate effectively
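For the "MPI for software development" item above, a minimal, hedged example of the message-passing style used on such clusters; the message contents are arbitrary. Compile with mpicc and run with at least two processes (e.g. mpirun -np 2 ./a.out).

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          int msg = 42;                 /* arbitrary payload */
          MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          int msg;
          MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", msg);
      }

      MPI_Finalize();
      return 0;
  }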
©2003 Dror Feitelson
SMP Nodes
• PCs, workstations, or servers with several CPUs
• Small scale (4-8) used as nodes in MPPs or clusters
• Access to shared memory via shared L2 cache
• SMP support (cache coherence) built into modern microprocessors
©2003 Dror Feitelson
Myrinet
• 1995
• Switched gigabit LAN
  – As opposed to Ethernet, which is a broadcast medium
• Programmable NIC
  – Offloads communication operations from the CPU
• Allows clusters to achieve the communication rates of MPPs
• Very expensive
• Later: gigabit Ethernet
©2003 Dror Feitelson
Blades
• PCs/SMPs require resources
  – Floor space
  – Cables for interconnect
  – Power supplies and fans
• This is meaningful if you have thousands
• Blades provide dense packaging
• With vertical mounting get < 1U on average
• The hot new thing in 2002
©2003 Dror Feitelson
SunFire Servers
16 servers in a rack-mounted box
Used to be called “single-board computers” in the ’80s (Makbilan)
©2003 Dror Feitelson
The Cray Name
• 1972: Cray Research founded
  – Cray 1, X-MP, Cray 2, Y-MP, C90…
  – From 1993: MPPs T3D, T3E
• 1989: Cray Computer founded
  – GaAs efforts, closed
• 1996: SGI acquires Cray Research
  – Attempt to merge T3E and Origin
• 2000: sold to Tera
  – Uses the name to bolster the MTA
• 2002: Cray sells the Japanese NEC SX-6
• 2002: announces the new X1 supercomputer
©2003 Dror Feitelson
Vectors are not Dead!
• 1994: Cray T90
  – Continues the Cray C90 line
• 1996: Cray J90
  – Continues the Cray Y-MP line
• 2000: Cray SV1
• 2002: Cray X1
  – Only "Big-Iron" company left
©2003 Dror Feitelson
Cray J90
• 1996
• Very popular continuation of Y-MP
• 8, then 16, then 32 processors
• One installed at IUCC
©2003 Dror Feitelson
Cray X1
• 2002
• Up to 1024 nodes
• 4 custom vector proc's per node
• 12.8 Gflops peak each
• Torus interconnect
©2003 Dror Feitelson
Confused?
©2003 Dror Feitelson
The Top500 List
• List of the 500 most powerful computer installations in the world
• Separates academic chic from real impact
• Measured using Linpack
  – Dense matrix operations
  – Might not be representative of real applications
©2003 Dror Feitelson
The Competition
How to achieve a rank:
• Few vector processors
  – Maximize power per processor
  – High efficiency
• Many commodity processors
  – Ride the technology curve
  – Power in numbers
  – Low efficiency
©2003 Dror Feitelson
Vector Programming
• Conventional Fortran
• Automatic vectorization of loops by the compiler
• Autotasking uses processors that happen to be available at runtime to execute chunks of loop iterations
• Easy for application writers
• Very high efficiency
©2003 Dror Feitelson
MPP Programming
• Library added to the programming language
  – MPI for distributed memory
  – OpenMP for shared memory (sketch below)
• Applications need to be partitioned manually
• Many possible efficiency losses
  – Fragmentation in allocating processors
  – Stalls waiting for memory and communication
  – Imbalance among threads
• Hard for programmers
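A minimal OpenMP sketch of the shared-memory side mentioned above (the array and loop are made up): the directive splits the loop iterations among the available threads, and the "imbalance among threads" loss shows up when iterations have uneven cost.

  #include <omp.h>

  /* Each thread gets a chunk of the iteration space; with uneven
     per-iteration work some threads finish early and sit idle. */
  void scale(int n, double a, double *x)
  {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          x[i] *= a;
  }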
©2003 Dror Feitelson
Also National Competition
• Japan Inc. is "more Cray than Cray"
  – Computers based on few but powerful proprietary vector processors
  – Numerical Wind Tunnel at rank 1 from 1993 to 1995
  – CP-PACS at rank 1 in 1996
  – Earth Simulator at rank 1 from 2002
• US industry switched to commodity microprocessors
  – Even Cray did
  – ASCI machines at rank 1 in 1997-2001
©2003 Dror Feitelson
Vectors vs. MPPs – 1994
Feitelson, Int. J. High-Perf. Comput. App., 1999
©2003 Dror Feitelson
Vectors vs. MPPs – 1997
©2003 Dror Feitelson
Vectors vs. MPPs – 1997
©2003 Dror Feitelson
The Current Situation
©2003 Dror Feitelson
Real Usage
• Control functions for telecomm companies
• Reservoir modeling for oil companies
• Graphic rendering for Hollywood
• Financial modeling for Wall Street
• Drug design for pharmaceuticals
• Weather prediction
• Airplane design for Boeing and Airbus
• Hush-hush activities
©2003 Dror Feitelson
The Earth Simulator
• Operational in late 2002
• Top rank in Top500 list
• Result of 5-year design and implementation effort
• Equivalent power to top 15 US machines (including all ASCI machines)
• Really big
©2003 Dror Feitelson
The Earth Simulator in Numbers
• 640 nodes
• 8 vector processors per node, 5120 total
• 8 Gflops per processor, 40 Tflops total
• 16 GB memory per node, 10 TB total
• 2800 km of cables
• 320 cabinets (2 nodes each)
• Cost: $ 350 million
©2003 Dror Feitelson
Trends
©2003 Dror Feitelson
Exercise
• Look at 10 years of Top500 lists and try to say something non-trivial about trends
• Are there things that grow?
• Are there things that stay the same?
• Can you make predictions?
©2003 Dror Feitelson
Distribution of Vendors – 1994
Feitelson, Int. J. High-Perf. Comput. App., 1999
©2003 Dror Feitelson
Distribution of Vendors – 1997
©2003 Dror Feitelson
IBM in the Lists
Arrows are the ANL SP1 with 128 processors
Rank doubles each year
©2003 Dror Feitelson
Minimal Parallelism
©2003 Dror Feitelson
Min vs. Max
©2003 Dror Feitelson
Power with Time
• Rmax of the last machine doubles each year
  – This is 8-fold in three years
• Degree of parallelism doubles every three years
• So power of each processor increases 4-fold in three years (=doubles in 18 months)
• Which is Moore’s Law…
©2003 Dror Feitelson
Distribution of Power
• The rank of a given machine doubles each year
• The power of the rank-500 machine doubles each year
• So the rank-250 machine this year has double the power of the rank-500 machine this year
• And the rank-125 machine has double the power of the rank-250 machine
• In short, power decreases polynomially with rank: halving the rank doubles the power, so Rmax is roughly proportional to 1/rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
©2003 Dror Feitelson
Power and Rank
The slope is becoming flatter
Log-log fit: log(Rmax(rank)) = log(Rmax(1)) - α·log(rank), i.e. Rmax(rank) ≈ Rmax(1) / rank^α. Fitted slope α by year:
1994 0.978
1995 0.865
1996 0.839
1997 0.816
1998 0.777
1999 0.761
2000 0.800
2001 0.753
2002 0.746
©2003 Dror Feitelson
Machine Ages in Lists
©2003 Dror Feitelson
New Machines
©2003 Dror Feitelson
Industry Share
©2003 Dror Feitelson
Vector Share
©2003 Dror Feitelson
Summary
Invariants of the last few years:
• Power grows exponentially with time
• Parallelism grows exponentially with time
• But maximal usable parallelism is ~10000
• Power drops polynomially with rank
• Age in the list drops exponentially
• About 300 new machines each year
• About 50% of machines in industry
• About 15% of power due to vector processors