1
CS 594 Spring 2002, Lecture 4
Jack Dongarra, University of Tennessee
2
Plan For Today
- Dr. David Cronk on Homework #2
- Finish lecture: Parallel Architectures and Programming
- Floating point arithmetic
Programming Model 3: Data Parallel
- Single sequential thread of control consisting of parallel operations
- Parallel operations applied to all (or a defined subset) of a data structure
- Communication is implicit in parallel operators and "shifted" data structures
- Elegant and easy to understand and reason about
- Not all problems fit this model
- Like marching in a regiment

  A = array of all data
  fA = f(A)
  s = sum(fA)

- Think of Matlab
Model 3: Vector Computing
- One instruction executed across all the data in a pipelined fashion
- Parallel operations applied to all (or a defined subset) of a data structure
- Communication is implicit in parallel operators and "shifted" data structures
- Elegant and easy to understand and reason about
- Not all problems fit this model
- Like marching in a regiment

  A = array of all data
  fA = f(A)
  s = sum(fA)

- Think of Matlab
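The A / fA / sum pattern above can be written directly in this style; a minimal Python sketch, where `f` is a hypothetical elementwise operation and plain lists stand in for the model's parallel arrays:

```python
# A minimal data-parallel sketch: one logical operation is applied
# to the whole array at once, then reduced.
def f(x):
    return x * x  # hypothetical elementwise operation

A = [1.0, 2.0, 3.0, 4.0]   # A  = array of all data
fA = [f(x) for x in A]     # fA = f(A): conceptually one parallel operation
s = sum(fA)                # s  = sum(fA): a parallel reduction
print(s)  # 30.0
```

In Matlab or NumPy the middle line really is a single whole-array expression, which is exactly the point of the model.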
5
Machine Model 3
- An SIMD (Single Instruction Multiple Data) machine
- A large number of small processors
- A single "control processor" issues each instruction
  » each processor executes the same instruction
  » some processors may be turned off on any instruction
[Diagram: control processor connected over an interconnect to processors P1 … Pn, each with its own memory and network interface (NI)]
- Machines not popular (CM2), but the programming model is
  » implemented by mapping n-fold parallelism to p processors
  » mostly done in the compilers (HPF = High Performance Fortran)
6
Machine Model 4
- Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
- CLUMP = Cluster of SMPs
- Shared memory within one SMP, message passing outside: clusters, ASCI Red (Intel), ...
- Programming model?
  » Treat the machine as "flat" and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
  » Expose two layers: shared memory (OpenMP) and message passing (MPI); higher performance, but ugly to program
Programming Model 5: Bulk Synchronous Processing (BSP) – L. Valiant
- Used within the message passing or shared memory models as a programming convention
- Phases separated by global barriers
  » Compute phases: all operate on local data (in distributed memory), or have read access to global data (in shared memory)
  » Communication phases: all participate in rearrangement or reduction of global data
- Generally all doing the "same thing" in a phase: all do f, but may all do different things within f
- Simplicity of data parallelism without its restrictions
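A minimal sketch of the phase structure, using Python threads and a barrier purely as an illustration (a real BSP program would more likely use MPI across nodes; the data and thread count here are made up):

```python
# BSP sketch: a compute phase on local data, a global barrier, then a
# communication/reduction phase.
import threading

NTHREADS = 4
barrier = threading.Barrier(NTHREADS)
local = [1.0, 2.0, 3.0, 4.0]   # each thread's local data
partial = [0.0] * NTHREADS     # per-thread results
total = [0.0]                  # filled in by rank 0 after the barrier

def worker(rank):
    # Compute phase: each thread operates only on its local data.
    partial[rank] = local[rank] * local[rank]
    # Global barrier separates compute from communication.
    barrier.wait()
    # Reduction phase: rank 0 combines the partial results.
    if rank == 0:
        total[0] = sum(partial)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total[0])  # 30.0 (1 + 4 + 9 + 16)
```

The barrier is what makes the program "bulk synchronous": no thread starts the reduction until every thread has finished computing.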
8
Summary so far
- Historically, each parallel machine was unique, along with its programming model and programming language
- You had to throw away your software and start over with each new kind of machine - ugh
- Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines
  » MPI is now the most portable option, but can be tedious
- Writing portably fast code requires tuning for the architecture
  » The algorithm design challenge is to make this process easy
  » Example: picking a blocksize, not rewriting the whole algorithm
9
Recap
- Parallel computer architecture is driven by familiar technological and economic forces
  » application/platform cycle, but focused on the most demanding applications
  » hardware/software learning curve
- More attractive than ever because the 'best' building block - the microprocessor - is also the fastest BB
- The history of microprocessor architecture is parallelism: it translates area and density into performance
- The future is higher levels of parallelism
- Parallel architecture concepts apply at many levels
- Communication is also on an exponential curve
- => Quantitative engineering approach
[Diagram: cycle of new applications driving demand for more performance, delivered as speedup]
13
Performance Numbers on RISC Processors Using Linpack Benchmark

Machine            MHz    Linpack n=100    Ax=b n=1000    Peak
                          Mflop/s          Mflop/s        Mflop/s
Intel P4           2200   1033 (47%)       1911 (86%)     2200
Compaq Alpha       1000    824 (41%)       1542 (77%)     2000
Intel/HP Itanium    800    600 (19%)       2382 (74%)     3200
AMD Athlon         1200    558 (23%)        998 (42%)     2400
HP PA               550    468 (21%)       1583 (71%)     2200
IBM Power 3         375    424 (28%)       1208 (80%)     1500
Intel P3            933    234 (25%)        514 (55%)      933
PowerPC G4          533    231 (22%)        478 (45%)     1066
SUN Ultra 80        450    208 (23%)        607 (67%)      900
SGI Origin 2K       300    173 (29%)        553 (92%)      600
Cray T90            454    705 (39%)       1603 (89%)     1800
Cray C90            238    387 (41%)        902 (95%)      952
Cray Y-MP           166    161 (48%)        324 (97%)      333
Cray X-MP           118    121 (51%)        218 (93%)      235
Cray J-90           100    106 (53%)        190 (95%)      200
Cray 1               80     27 (17%)        110 (69%)      160
14
Consider Scientific Supercomputing
- Proving ground and driver for innovative architecture and techniques
- Market smaller relative to commercial as MPs become mainstream
- Dominated by vector machines starting in the 70s
- Microprocessors have made huge gains in floating-point performance
  » high clock rates
  » pipelined floating point units (e.g., multiply-add every cycle)
  » instruction-level parallelism
  » effective use of caches (e.g., automatic blocking)
- Plus economics
- Large-scale multiprocessors replace vector supercomputers
15
Architectures
[Chart: number of Top500 systems (0-500) by architecture class over time: Single Processor, SMP, MPP, SIMD, Constellation, Cluster - NOW; labeled systems include Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3. Constellation: # of processors per node >= # of nodes.]
16
Chip Technology
[Chart: number of Top500 systems (0-500) by processor family over time: Alpha, Power, HP, intel, MIPS, Sparc, other COTS, proprietary]
17
Manufacturer
[Chart: number of Top500 systems (0-500) by manufacturer over time: Cray, SGI, IBM, Sun, HP, TMC, Intel, Fujitsu, NEC, Hitachi, others]
IBM 32%, HP 30%, SGI 8%, Cray 8%, SUN 6%, Fuji 4%, NEC 3%, Hitachi 3%
18
High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
- COTS PC nodes: Pentium, Alpha, PowerPC, SMP
- COTS LAN/SAN interconnect: Ethernet, Myrinet, Giganet, ATM
- Open source Unix: Linux, BSD
- Message passing computing: MPI, PVM, HPF

Advantages:
- Best price-performance
- Low entry-level cost
- Just-in-place configuration
- Vendor invulnerable
- Scalable
- Rapid technology tracking

Enabled by PC hardware and networks, and by operating systems achieving the capabilities of scientific workstations at a fraction of the cost, plus the availability of industry standard message passing libraries. However, much more of a contact sport.
19
- Peak performance
- Interconnection
- http://clusters.top500.org
- Benchmark results to follow in the coming months
20
Distributed and Parallel Systems
[Diagram: spectrum from distributed, heterogeneous systems to massively parallel, homogeneous systems; examples along the axis include SETI@home, Entropia, Grid Computing, Beowulf, Berkeley NOW, SNL Cplant, ASCI Tflops, and parallel distributed-memory machines]

Distributed systems (heterogeneous):
- Gather (unused) resources
- Steal cycles
- System SW manages resources
- System SW adds value
- 10% - 20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared

Massively parallel systems (homogeneous):
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- System SW gets in the way
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
21
Different Parallel Architectures
- Parallel computing: single systems with many processors working on the same problem
- Distributed computing: many systems loosely coupled by a scheduler to work on related problems
- Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
22
Historical Development
[Diagrams: crossbar connecting processors (P), memories (M), and I/O; bus-based system with per-processor caches ($)]

"Mainframe" approach:
- Motivated by multiprogramming
- Extends crossbar used for Mem and I/O
- Processor cost-limited => crossbar
- Bandwidth scales with p
- High incremental cost
  » use multistage instead

"Minicomputer" approach:
- Almost all microprocessor systems have a bus
- Motivated by multiprogramming, TP
- Used heavily for parallel computing
- Called symmetric multiprocessor (SMP)
- Latency larger than for a uniprocessor
- Bus is the bandwidth bottleneck
  » caching is key: coherence problem
- Low incremental cost
23
Shared Virtual Address Space
- Process = address space plus thread of control
- Virtual-to-physical mapping can be established so that processes share portions of the address space
  » user-kernel or multiple processes
- Multiple threads of control in one address space
  » popular approach to structuring OSs
  » now a standard application capability (e.g., POSIX threads)
- Writes to shared addresses are visible to other threads
  » natural extension of the uniprocessor model
  » conventional memory operations for communication
  » special atomic operations for synchronization (also load/stores)
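A small sketch of this model, with Python's threading module standing in for POSIX threads (the counter and iteration counts are illustrative):

```python
# Threads share one address space: ordinary loads/stores communicate,
# and a lock stands in for the "special atomic operations" used for
# synchronization.
import threading

counter = 0                 # lives in the single shared address space
lock = threading.Lock()

def worker():
    global counter
    for _ in range(1000):
        with lock:          # without this, increments could be lost
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

The write each thread makes to `counter` is immediately visible to the others, which is exactly the communication mechanism the slide describes; the lock is the synchronization side of the model.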
24
Engineering: Intel Pentium Pro Quad
- All coherence and multiprocessing glue in the processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth
[Diagram: four P-Pro modules, each with CPU, 256-KB L2 cache, interrupt controller, bus interface, and MIU, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller drives 1-, 2-, or 4-way interleaved DRAM; two PCI bridges connect to PCI buses with I/O cards]
25
Engineering: SUN Enterprise
- Proc + mem card, or I/O card; 16 cards of either type
- All memory accessed over the bus, so symmetric
- Higher bandwidth, higher latency bus
[Diagram: CPU/mem cards (two processors, each with caches, plus a memory controller) and I/O cards (SBUS slots, 2 Fibre Channel, 100bT, SCSI) connected via bus interfaces to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]
26
Scaling Up
- Problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but lower cost than crossbar
  » latencies to memory uniform, but uniformly large
- Distributed memory or non-uniform memory access (NUMA)
  » construct a shared address space out of simple message transactions across a general-purpose network (e.g., read-request, read-response)
- Caching shared (particularly nonlocal) data?
[Diagrams: "dance hall" (all processors and caches on one side of the network, all memories on the other) vs. distributed memory (a memory attached to each processor)]
27
Engineering: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- Memory controller generates request messages for non-local references
- No hardware mechanism for coherence
  » SGI Origin etc. provide this
[Diagram: node with processor and cache, memory controller and NI, local memory, and external I/O, attached to a switch with X, Y, Z links]
28
Diminishing Role of Topology
- Shift to general links
  » DMA, enabling non-blocking ops
  » buffered by system at destination until recv
- Store-and-forward routing: time ~ H x (T0 + n/B)
- Any-to-any pipelined routing: time ~ T0 + H + n/B
  » node-to-network interface dominates communication time
- Simplifies programming
- Allows richer design space
  » grids vs hypercubes
- (H = number of hops, n = message size, B = link bandwidth, T0 = per-message overhead)
- Example: Intel iPSC/1 -> iPSC/2 -> iPSC/860
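The two cost models above can be compared with a quick calculation; the numbers here are hypothetical, chosen only to show why hop count stops mattering once routing is pipelined:

```python
# Comparing store-and-forward vs. pipelined (cut-through) routing cost.
# T0 = startup overhead, B = link bandwidth, H = hops, n = message size;
# all units are illustrative, not from any real machine.
T0, B = 10.0, 1.0
H, n = 5, 100

store_and_forward = H * (T0 + n / B)   # whole message relayed at each hop
pipelined = T0 + H + n / B             # extra hops cost only per-hop latency
print(store_and_forward)               # 550.0
print(pipelined)                       # 115.0
```

With pipelined routing the H term is additive rather than multiplicative, so topology (which determines H) has far less effect on communication time, as the slide argues.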
29
Example: Intel Paragon
[Diagram: Paragon node with two i860 processors (each with L1 cache), DMA engine, NI, driver, and memory controller on a 64-bit, 50 MHz memory bus, with 4-way interleaved DRAM; nodes joined by a 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node attached to every switch]
Sandia's Intel Paragon XP/S-based supercomputer
30
Building on the mainstream: IBM SP-2
- Made out of essentially complete RS6000 workstations
- Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)
[Diagram: SP-2 node with Power 2 CPU, L2 cache, memory bus, memory controller, and 4-way interleaved DRAM; a NIC (with i860, NI, DMA, and DRAM) sits on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]
31
A Little History
- Von Neumann and Goldstine, 1947: "Can't expect to solve most big [n>15] linear systems without carrying many decimal digits [d>8], otherwise the computed answer would be completely inaccurate." - WRONG!
- Turing, 1949: "Carrying d digits is equivalent to changing the input data in the d-th place and then solving Ax=b. So if A is only known to d digits, the answer is as accurate as the data deserves."
- Backward error analysis rediscovered in 1961 by Wilkinson and publicized
- Starting in the 1960s, many papers doing backward error analysis of various algorithms
- Many years where each machine did FP arithmetic slightly differently
  » both rounding and exception handling differed
  » hard to write portable and reliable software
  » motivated the search for an industry-wide standard, beginning in the late 1970s
  » first implementation: Intel 8087
- ACM Turing Award 1989 to W. Kahan for design of the IEEE Floating Point Standards 754 (binary) and 854 (decimal)
- Nearly universally implemented in general purpose machines
32
Defining Floating Point Arithmetic
- Representable numbers
  » scientific notation: +/- d.d…d x r^exp
  » sign bit +/-
  » radix r (usually 2 or 10, sometimes 16)
  » significand d.d…d (how many base-r digits d?)
  » exponent exp (range?)
  » others?
- Operations:
  » arithmetic: +, -, x, /, … (how to round the result to fit in the format)
  » comparison (<, =, >)
  » conversion between different formats (short to long FP numbers, FP to integer)
  » exception handling (what to do for 0/0, 2*largest_number, etc.)
  » binary/decimal conversion (for I/O, when the radix is not 10)
- Language/library support for these operations
33
IEEE Floating Point Arithmetic Standard 754 - Normalized Numbers
- Normalized nonzero representable numbers: +- 1.d…d x 2^exp
- Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation
- OV = overflow threshold = largest number
- UN = underflow threshold = smallest number
- +- Zero: +-, significand and exponent all zero (why bother with -0? later)

Format    #bits   #significand bits   macheps               #exponent bits   exponent range
Single    32      23+1                2^-24 (~10^-7)        8                2^-126 to 2^127 (~10^+-38)
Double    64      52+1                2^-53 (~10^-16)       11               2^-1022 to 2^1023 (~10^+-308)
Extended  >=80    >=64                <=2^-64 (~10^-19)     >=15             2^-16382 to 2^16383 (~10^+-4932)

(Extended is 80 bits on all Intel machines)
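The table's double-precision entries can be checked directly, since Python floats are IEEE 754 doubles:

```python
# Double-precision macheps = 2^-53: the relative rounding error per
# operation. Note sys.float_info.epsilon is 2^-52, the spacing of
# doubles just above 1.0 (one bit coarser than macheps).
import sys

u = 2.0 ** -53                    # macheps for double, per the table
print(sys.float_info.epsilon)     # 2.220446049250313e-16 = 2^-52
print(1.0 + u == 1.0)             # True: 1 + 2^-53 rounds back to 1.0
print(1.0 + 2.0 ** -52 > 1.0)     # True: 2^-52 above 1.0 is representable
```

The second print shows why macheps is the right bound on relative error: adding anything smaller than one unit in the last place of 1.0 can be rounded away entirely.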
34
IEEE Floating Point Arithmetic Standard 754 - "Denorms"
- Denormalized numbers: +- 0.d…d x 2^min_exp
  » sign bit, nonzero significand, minimum exponent
  » fills in the gap between UN and 0
- Underflow exception
  » occurs when the exact nonzero result is less than the underflow threshold UN
  » ex: UN/3
  » return a denorm, or zero
- Why bother? Necessary so that the following code never divides by zero:
  if (a != b) then x = a/(a-b)
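The a/(a-b) guarantee can be seen with two adjacent tiny doubles, whose difference is itself a denormal rather than flushing to zero:

```python
# With denormals (gradual underflow), two distinct doubles always have
# a nonzero difference, so the a/(a-b) pattern above is safe.
a = 2.0 ** -1022           # UN, the smallest normalized double
b = a + 2.0 ** -1074       # the next representable double up
print(a != b)              # True
print(b - a)               # 5e-324: a denormal, not zero
x = a / (a - b)            # safe: the divisor is a tiny denormal, not 0
print(x)                   # -4503599627370496.0 = -2^52
```

Without gradual underflow, b - a would flush to zero here and the division would trap, which is exactly the failure mode the slide's `if (a != b)` test is meant to rule out.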
35
IEEE Floating Point Arithmetic Standard 754 - +- Infinity
- +- Infinity: sign bit, zero significand, maximum exponent
- Overflow exception
  » occurs when the exact finite result is too large to represent accurately
  » ex: 2*OV
  » return +- infinity
- Divide-by-zero exception
  » return +- infinity = 1/+-0
  » sign of zero important!
- Also return +- infinity for 3+infinity, 2*infinity, infinity*infinity
  » result is exact, not an exception!
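Most of these rules are visible from Python doubles (Python raises ZeroDivisionError on 1.0/0.0 rather than returning infinity, so the divide-by-zero case is shown through math.inf directly):

```python
# Overflow and infinity arithmetic with IEEE doubles.
import math
import sys

print(2.0 * sys.float_info.max)   # inf: overflow returns +infinity
print(math.inf + 3.0)             # inf: exact result, not an exception
print(math.inf * math.inf)        # inf
print(1.0 / math.inf)             # 0.0
print(math.copysign(1.0, -0.0))   # -1.0: -0 keeps its sign bit
```

The last line shows why the sign of zero matters: 1/+0 and 1/-0 are +infinity and -infinity respectively in IEEE arithmetic, and copysign is how the sign of -0.0 can be observed.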
36
IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)
- NAN: sign bit, nonzero significand, maximum exponent
- Invalid exception: occurs when the exact result is not a well-defined real number
  » 0/0
  » sqrt(-1)
  » infinity - infinity, infinity/infinity, 0*infinity
  » NAN + 3
  » NAN > 3?
  » return a NAN in all these cases
- Two kinds of NANs
  » quiet: propagates without raising an exception
  » signaling: generates an exception when touched (good for detecting uninitialized data)
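NaN generation and propagation can be demonstrated with Python doubles (which give quiet NaNs):

```python
# Invalid operations produce NaN, and NaN propagates through arithmetic
# and poisons comparisons.
import math

nan = math.inf - math.inf       # invalid operation: returns a NaN
print(math.isnan(nan))          # True
print(math.isnan(nan + 3.0))    # True: NaN + 3 is still NaN
print(nan > 3.0)                # False: ordered comparisons with NaN are false
print(nan == nan)               # False: a NaN is not equal even to itself
```

The `nan == nan` result being False is the standard's answer to "NAN > 3?": any ordered comparison involving a NaN is false, which is also the usual portable test for NaN in languages without isnan.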
37
Error Analysis
- Basic error formula: fl(a op b) = (a op b)*(1 + d), where
  » op is one of +, -, *, /
  » |d| <= macheps
  » assuming no overflow, underflow, or divide by zero
- Example: adding 4 numbers
  fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
                  = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
                      + x3*(1+d2)*(1+d3) + x4*(1+d3)
                  = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
  where each |ei| <~ 3*macheps
- We get the exact sum of slightly changed summands xi*(1+ei)
- Backward error analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
- Numerical stability is an algorithm design goal
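One immediate consequence of the fl(a op b) = (a op b)(1+d) model is that floating point addition is not associative, since each grouping picks up different rounding terms d_i:

```python
# Different summation orders accumulate different rounding errors.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)    # False
print(left)             # 0.6000000000000001
print(right)            # 0.6
```

Both results are the exact sum of slightly perturbed inputs, so both summation orders are backward stable; they just correspond to different nearby problems.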
38
Backward Error
- An approximate solution is the exact solution to a modified problem.
- How large a modification to the original problem is required to give the result actually obtained?
- How much data error in the initial input would be required to explain all the error in the computed results?
- An approximate solution is good if it is the exact solution to a "nearby" problem.
[Diagram: f maps input x to f(x); the computed result f'(x) equals f(x') for some nearby input x'; the distance from f(x) to f'(x) is the forward error, and the distance from x to x' is the backward error]
39
Sensitivity and Conditioning
- A problem is insensitive, or well-conditioned, if a relative change in the input causes a commensurate relative change in the solution.
- A problem is sensitive, or ill-conditioned, if the relative change in the solution can be much larger than that in the input data.
- Cond = |relative change in solution| / |relative change in input data|
       = |[f(x') - f(x)]/f(x)| / |(x' - x)/x|
- A problem is sensitive, or ill-conditioned, if cond >> 1.
- When the function f is evaluated for approximate input x' = x + h instead of the true input x:
  absolute error = f(x + h) - f(x) ≈ h f'(x)
  relative error = [f(x + h) - f(x)] / f(x) ≈ h f'(x) / f(x)
40
Sensitivity: 2 Examples
cos(π/2) and 2-d System of Equations
- Consider the problem of computing the cosine function for arguments near π/2.
- Let x ≈ π/2 and let h be a small perturbation to x. Then
  absolute error = cos(x+h) - cos(x) ≈ -h sin(x) ≈ -h
  relative error ≈ -h tan(x) → ∞
- So a small change in x near π/2 causes a large relative change in cos(x), regardless of the method used.
  cos(1.57079) = 0.63267949 x 10^-5
  cos(1.57078) = 1.64267949 x 10^-5
- The relative change in the output is a quarter million times greater than the relative change in the input.
42
Example: Polynomial Evaluation Using Horner's Rule
- Horner's rule to evaluate p = sum_{k=0}^{n} c_k * x^k:
  p = c_n; for k = n-1 down to 0, p = x*p + c_k
- Numerically stable
- Apply to (x-2)^9 = x^9 - 18*x^8 + … - 512, written in the nested form c_0 + x*(c_1 + x*(c_2 + …))
- Evaluated around x = 2
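Horner's rule on the slide's example polynomial, with the coefficients of (x-2)^9 generated from the binomial theorem (ascending order; the helper name is ours):

```python
# Horner's rule for p(x) = c[0] + c[1]*x + ... + c[n]*x^n.
from math import comb

def horner(coeffs, x):
    """Evaluate the polynomial by nested multiplication, highest term first."""
    p = 0.0
    for c in reversed(coeffs):
        p = x * p + c
    return p

# Coefficients of (x-2)^9: c[k] = C(9,k) * (-2)^(9-k).
c = [comb(9, k) * (-2) ** (9 - k) for k in range(10)]
print(c[0], c[8], c[9])   # -512 -18 1: matches x^9 - 18*x^8 + ... - 512
print(horner(c, 2.0))     # 0.0: every intermediate here is an exact integer
```

At x = 2.0 every intermediate value is a modest integer, so the evaluation is exact; the interesting rounding behavior appears for x slightly away from 2, where the expanded coefficients cause massive cancellation, which is what the next slide's error bound quantifies.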
43
Example: polynomial evaluation (continued)
- (x-2)^9 = x^9 - 18*x^8 + … - 512
- We can compute error bounds using fl(a op b) = (a op b)*(1+d)
44
What happens when the “exact value” is not a real number, or is too small or too large to represent accurately? You get an “exception”.
45
Exception Handling
- What happens when the "exact value" is not a real number, or too small or too large to represent accurately?
- 5 exceptions:
  » Overflow: exact result > OV, too large to represent
  » Underflow: exact result nonzero and < UN, too small to represent
  » Divide-by-zero: nonzero/0
  » Invalid: 0/0, sqrt(-1), …
  » Inexact: you made a rounding error (very common!)
- Possible responses:
  » stop with an error message (unfriendly, not the default)
  » keep computing (the default, but how?)
46
Summary of Values Representable in IEEE FP
- +- Zero
- Normalized nonzero numbers
- Denormalized numbers
- +- Infinity
- NANs
  » signaling and quiet
  » many systems have only quiet

sign   exponent           significand    value
+-     0…0                0…………0         +- 0
+-     0…0                nonzero        denormalized
+-     not 0 or all 1s    anything       normalized
+-     1…1                0…………0         +- infinity
+-     1…1                nonzero        NAN
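The table's bit-pattern rules can be checked by unpacking the fields of a double with struct (11 exponent bits, 52 significand bits; the helper name is ours):

```python
# Decode the sign/exponent/significand fields of an IEEE double.
import math
import struct

def fields(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 0x7FF = all 11 exponent bits set
    significand = bits & ((1 << 52) - 1)
    return sign, exponent, significand

print(fields(0.0))           # (0, 0, 0): +zero
print(fields(5e-324))        # (0, 0, 1): denormalized (exponent all zeros)
print(fields(math.inf))      # (0, 2047, 0): infinity (exponent all ones)
s, e, f = fields(math.nan)
print(e == 2047 and f != 0)  # True: NaN has all-ones exponent, nonzero significand
```

Each printed tuple corresponds to one row of the table above.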