8/2/2019 Day01 Hpc Wrkshp Compiler Opt
1/61
Copyright C-DAC 2004, October 5-9, 2004
HPC Workshop 2004: Optimisation Techniques
High Performance Computing Workshop
Day 1 : October 5, 2004
Uni-Processor Optimization: Code Restructuring and Loop Optimization Techniques
Lecture Outline

- Why you need to optimize sequential code
- The memory hierarchy and how a code's performance depends on it
- Optimization techniques
- Loop optimization techniques: Collapsing, Fission, Fusion, Unrolling, Interchange, Invariant Code Extraction, De-factorization, overheads of if-while-goto, Neighbor Data Dependency
- Arithmetic optimization
- Compiler optimizations; use of tuned math libraries
- Performance of selective applications and benchmarks
- Conclusions
Improving Single Processor Performance
How much sustained performance can one achieve for a given program on a machine?

It is the programmer's job to take as much advantage as possible of the CPU's hardware/software characteristics to boost the performance of the program!

Quite often, just a few simple changes to one's code improve performance by a factor of 2, 3, or better!

Also, simply compiling with some of the optimization flags (-O3, -fast, ...) can improve performance dramatically!
The Memory Sub-system: Access Time is Important

A lot of time is spent accessing/storing data from/to memory. It is important to keep in mind the relative access times for each memory type.

Approximate access times:
- CPU registers: 0 cycles (that's where the work is done!)
- L1 Cache: 1 cycle (separate Data and Instruction caches); repeated access to the cache takes only 1 cycle
- L2 Cache (static RAM): 3-5 cycles
- Memory (DRAM): 10 cycles on a cache miss; 30-60 cycles for a Translation Lookaside Buffer (TLB) update
- Disk: about 100,000 cycles!
- Connecting to other nodes: depends on network latency

[Figure: CPU registers, I-cache/D-cache, L2, RAM, disk]
Hierarchical Memory

A four-level memory hierarchy for a large computer system:

Level 0: Registers and internal caches in the CPU
Level 1 (M1): External cache (SRAMs)
Level 2 (M2): Main memory (DRAMs)
Level 3 (M3): Disk storage (magnetic)
Level 4 (M4): Tape units (magnetic)

Going down the hierarchy, capacity and access time increase; going up, cost per bit increases.
Loop Optimization Techniques
- Classical optimization techniques (the compiler does these)
- Memory reference optimization (the compiler does this to some extent)
- Loop optimizations (the compiler does these to some extent):
  Loop Fission and Loop Fusion, Loop Interchange, Loop Alignment, Loop Collapsing, Loop Unrolling
Loop Collapsing

It attempts to create one (larger) loop out of two or more small ones. This may be profitable if the size of each of the two loops is too small for efficient vectorization, but the resulting single loop can be profitably vectorized.

Before:
      REAL A(5,5), B(5,5)
      DO 10 J = 1, 5
      DO 10 I = 1, 5
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:
      REAL A(25), B(25)
      DO 10 JI = 1, 25
        A(JI) = B(JI) + 2.0
10    CONTINUE

Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.
Loop Collapsing (Contd)

Using this technique, the code may be transformed into a single loop, regardless of the size of M and N. This may require some additional statements to restart the code properly.

Before:
      DO 10 J = 1, N
      DO 10 I = 1, M
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:
      DO 10 L = 1, N*M
        J = (L-1)/M + 1
        I = MOD(L-1, M) + 1
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

General versions of this technique are useful for computing systems which support only a single (not nested) DOALL statement.
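The same collapse, including the arithmetic index recovery, can be sketched in C (0-based, row-major); the array sizes and function name here are illustrative, not from the slides:

```c
#include <assert.h>

#define M 4   /* illustrative inner trip count */
#define N 3   /* illustrative outer trip count */

/* Collapsed form of the nested loop: one loop of N*M iterations,
 * recovering the two indices arithmetically from a single counter. */
void add_two_collapsed(double a[N][M], double b[N][M]) {
    for (int l = 0; l < N * M; l++) {
        int j = l / M;   /* outer index, 0..N-1 */
        int i = l % M;   /* inner index, 0..M-1 */
        a[j][i] = b[j][i] + 2.0;
    }
}
```

The division and modulo add work per iteration, which is the trade-off the slide mentions: one loop's overhead instead of two, at the price of extra index arithmetic.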
Loop Collapsing (Contd)
Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Assume the declaration a[50][80][4].

Uncollapsed loop:
for(I = 0; I
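A hedged sketch of what the uncollapsed and collapsed loops for this declaration look like (the element value and function names are assumptions, not the slide's code):

```c
#include <assert.h>

/* Uncollapsed: three nested loops over the declared array, paying
 * loop overhead at every level of the nest. */
void fill_uncollapsed(double a[50][80][4], double v) {
    for (int i = 0; i < 50; i++)
        for (int j = 0; j < 80; j++)
            for (int k = 0; k < 4; k++)
                a[i][j][k] = v;
}

/* Collapsed: a single loop over all 50*80*4 elements, relying on the
 * contiguous row-major layout of a C array; one loop's worth of
 * overhead instead of three. */
void fill_collapsed(double a[50][80][4], double v) {
    double *p = &a[0][0][0];
    for (int n = 0; n < 50 * 80 * 4; n++)
        p[n] = v;
}
```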
Loop Fission and Loop Fusion

Loop Fusion: transforms two adjacent loops into one, on the basis of information obtained from data-dependence analysis. Two statements will be placed into the same loop if there is at least one variable or array which is referred to by both.

Loop Fission: attempts to break a single loop into several loops in order to optimize data transfer (behavior of main memory, cache and registers). The primary objective of this optimization is data transfer.

Remark: Loop fission and loop fusion are techniques related to strip mining and loop collapsing.
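The slides show code for fusion but not for fission; a minimal sketch of fission (array names and trip count are assumptions) is:

```c
#include <assert.h>

#define N 1000   /* illustrative trip count */

/* Before fission: one loop streams through four arrays at once. */
void update_combined(double *a, double *b, double *c, double *d) {
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + b[i];
        c[i] = c[i] + d[i];
    }
}

/* After fission: each loop streams through only two arrays, so fewer
 * memory streams compete for the cache at any one time. */
void update_fissioned(double *a, double *b, double *c, double *d) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
    for (int i = 0; i < N; i++)
        c[i] = c[i] + d[i];
}
```

Whether fission or fusion wins depends on how many arrays the loop touches relative to the cache, which is exactly the trade-off discussed under fusion's disadvantages below.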
Loop Fusion (Contd)

It is the merging of several loops into a single loop.

Example (untuned):
for(i=0; i < 100000; i++)
    x = x * a[i] + b[i];
for(i=0; i < 100000; i++)
    y = y * a[i] + c[i];

Example (tuned):
for(i=0; i < 100000; i++) {
    x = x * a[i] + b[i];
    y = y * a[i] + c[i];
}

The tuned code runs at least 10 times faster on UltraSPARC (both compiled with the -O3 flag).
Loop Fusion (Contd)

Advantages:
- The loop overhead is reduced by a factor of two in the above case.
- Allows for better instruction overlap in loops with dependencies.
- Cache misses can be decreased if both loops reference the same array.

Disadvantages:
- Has the potential to increase cache misses if the fused loops contain references to more than four arrays and the starting elements of those arrays map to the same cache line.
  e.g.: x = x * a[i] + b[i] * c[i] + d[i] / e[i]
Loop Optimizations : Basic Loop Unrolling
Loop unrolling is performing multiple loop iterations per pass. It is one of the most important optimizations that can be done on a pipelined machine: it helps performance because it fattens up a loop with calculations that can be done in parallel.

Remark: Never unroll an inner loop.
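A minimal sketch of unrolling by four (array names and trip count are illustrative; the cleanup loop plays the role of the preconditioning loop discussed later):

```c
#include <assert.h>

#define N 1003   /* deliberately not a multiple of 4 */

/* Rolled loop: one multiply and one branch per iteration. */
void mult_rolled(double *a, double *b, double *c) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by 4: four independent multiplies per branch, giving the
 * pipeline more parallel work between branches. */
void mult_unrolled(double *a, double *b, double *c) {
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }
    for (; i < N; i++)   /* cleanup for leftover iterations */
        a[i] = b[i] * c[i];
}
```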
Outer and Inner Loop Unrolling
Loop nest: loops created within other loops.

Remark: the loop or loops in the center are called the inner loops, and the surrounding loops are called the outer loops.
for (i=0; i
Outer and Inner Loop Unrolling
Reasons for applying outer loop unrolling are:
To expose more computations
To improve memory reference patterns
for(I =0; i
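One possible sketch of outer loop unrolling in C (the sizes, names, and loop body are assumptions):

```c
#include <assert.h>

#define ROWS 8    /* illustrative; ROWS is even, so no cleanup needed */
#define COLS 16

/* Outer loop unrolled by 2: each pass of the inner loop loads b[j]
 * once and uses it for two rows of a, exposing more computation per
 * memory reference (both reasons given above). */
void scale_rows(double a[ROWS][COLS], double b[COLS]) {
    for (int i = 0; i < ROWS; i += 2)
        for (int j = 0; j < COLS; j++) {
            a[i][j]     = a[i][j]     * b[j];
            a[i + 1][j] = a[i + 1][j] * b[j];
        }
}
```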
Loop Unrolling and Sum Reduction

Loop unrolling can be used to reduce data dependency; different variables can be used to eliminate the data dependency.

a=0.0;
for (i=0; i
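The truncated fragment above can be completed along these lines: a sum reduction unrolled with four independent partial sums (names and trip count are assumptions):

```c
#include <assert.h>

#define N 100000

/* Four independent accumulators break the serial dependence on a
 * single sum variable. Note: floating-point addition is not
 * associative, so the result can differ in the last bits from the
 * serial sum. */
double sum4(double *x) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < N; i++)       /* cleanup for N not divisible by 4 */
        s += x[i];
    return s;
}
```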
Qualifying Candidates for Loop Unrolling
The previous example is an ideal candidate for loop unrolling.
Study categories of loops that are generally not prime candidates for unrolling:
Loops with low trip counts
Fat loops
Loops containing branches
Recursive loops
Vector reductions
Qualifying Candidates for Loop Unrolling

To be effective, loop unrolling requires that there be a fairly large number of iterations in the original loop. When the trip count of a loop is low, the preconditioning loop is doing a proportionally large amount of work.

Loops containing procedure calls

Loops containing subroutine or function calls generally are not good candidates for unrolling:
- First: they often contain a fair number of instructions already, and the function call can cancel many more instructions.
- Second: when the calling routine and the subroutine are compiled separately, it is impossible for the compiler to intermix instructions.
- Last: function call overhead is expensive. Registers have to be saved and argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than the loop overhead.
Qualifying Candidates for Loop Unrolling (Contd)

A loop containing procedure calls is not suitable for unrolling:

      DO 10 I = 1, N
        CALL SHORT(A(I), B(I), C)
10    CONTINUE

      SUBROUTINE SHORT(A, B, C)
      A = A + B + C
      RETURN
      END

Unrolled version with a preconditioning loop:

      II = MOD(N, 4)
      DO 9 I = 1, II
        CALL SHORT(A(I), B(I), C)
9     CONTINUE
      DO 10 I = 1+II, N, 4
        CALL SHORT(A(I), B(I), C)
        CALL SHORT(A(I+1), B(I+1), C)
        CALL SHORT(A(I+2), B(I+2), C)
        CALL SHORT(A(I+3), B(I+3), C)
10    CONTINUE
Qualifying Candidates for Loop Unrolling: Fat Loops

If a particular loop is already fat, then unrolling is not going to help much; the loop overhead is already spread over a fair number of instructions. A good rule of thumb is to look elsewhere for performance when the loop innards exceed three or four statements. Such code indicates that inlining is feasible.
Qualifying Candidates for Loop Unrolling: Recursive Loops (Contd)

Original:
      DO 10 I = 2, N
        A(I) = A(I) + A(I-1) * B
10    CONTINUE

Simply unrolling keeps the dependency; each statement needs the result of the one before it:

      A(I)   = A(I)   + A(I-1) * B
      A(I+1) = A(I+1) + A(I)   * B
      A(I+2) = A(I+2) + A(I+1) * B
      A(I+3) = A(I+3) + A(I+2) * B

The dependency can be reduced by deriving a new set of recursive equations, decreasing the dependencies at the expense of creating more work:

Modified:
      DO 10 I = 2, N, 2
        A(I+1) = A(I+1) + A(I) * B + A(I-1) * B * B
        A(I)   = A(I)   + A(I-1) * B
10    CONTINUE

This is an example of vector recursion. A good compiler can make the rolled-up version go faster by recognizing the dependency as an opportunity to save memory traffic.
Negatives of Loop Unrolling

Loop unrolling always adds some run time to the program. If you unroll a loop and see the performance dip a little, you can assume that either:
- The loop wasn't a good candidate for unrolling in the first place, or
- A secondary effect absorbed your performance increase.

Other possible reasons:
- Unrolling by the wrong factor
- Register spilling
- Instruction cache misses
- Other hardware delays
- Outer loop unrolling
Loop Interchange

Loop interchange is a technique for rearranging a loop nest so that the right stuff is at the center; what the "right stuff" is depends upon what you are trying to accomplish. Use loop interchange to move computations to the center of the loop nest.

It is also good for improving memory access patterns: iterating on the wrong subscript can cause a large stride and hurt your performance. By inverting the loops, so that the iteration variables causing the smaller strides are in the center, you can get a performance win.
Loop Interchange (Contd)

Loop interchange to move computations to the center:

Before:
      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 I = 1, IDIM
      DO 20 J = 1, JDIM
      DO 30 K = 1, KDIM
        D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30    CONTINUE
20    CONTINUE
10    CONTINUE

After:
      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 K = 1, KDIM
      DO 20 J = 1, JDIM
      DO 30 I = 1, IDIM
        D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30    CONTINUE
20    CONTINUE
10    CONTINUE

Frequently, the interchange of nested loops permits a significant increase in the amount of parallelism. The example is straightforward: it is easy to see that there are no inter-iteration dependencies.
float a[2][40][2000]
for(i=0; i
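For the declaration above, the two loop orders can be sketched as follows (the loop bodies and function names are illustrative): in C, the last subscript varies fastest in memory, so it belongs in the innermost loop.

```c
#include <assert.h>

float a[2][40][2000];   /* declaration from the slide above */

/* Large stride: the innermost loop varies the FIRST subscript, so
 * consecutive accesses are 40*2000 floats apart in memory. */
void init_large_stride(float v) {
    for (int k = 0; k < 2000; k++)
        for (int j = 0; j < 40; j++)
            for (int i = 0; i < 2; i++)
                a[i][j][k] = v;
}

/* Unit stride: the innermost loop varies the LAST subscript, so
 * consecutive accesses are adjacent in memory. */
void init_unit_stride(float v) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 40; j++)
            for (int k = 0; k < 2000; k++)
                a[i][j][k] = v;
}
```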
Statements that do not change within an inner loop can be moved outside of the loop (compiler optimizations can usually detect these).
for(i=0 ; i
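A minimal sketch of the transformation in C (names and bounds are assumptions):

```c
#include <assert.h>

#define N 1000

/* Untuned: the loop-invariant product c*d is written inside the loop
 * (most compilers hoist it automatically at -O2 and above). */
void add_untuned(double *a, double *b, double c, double d) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c * d;
}

/* Tuned: the invariant value is computed once, outside the loop. */
void add_tuned(double *a, double *b, double c, double d) {
    double cd = c * d;   /* hoisted loop-invariant expression */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + cd;
}
```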
Loop de-factorization consists of removing common multiplicative factors outside of inner loops:
for(i=0; i
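A sketch of de-factorization in C (names and trip count are assumptions):

```c
#include <assert.h>

#define N 100

/* Untuned: N multiplications by the common factor c inside the loop. */
double weighted_sum_untuned(double *a, double c) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += c * a[i];
    return s;
}

/* De-factorized: the common multiplicative factor is pulled outside,
 * leaving a single multiplication. (Floating-point results can differ
 * in the last bits, since the order of operations changes.) */
double weighted_sum_tuned(double *a, double c) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return c * s;
}
```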
Loop Optimization: IF, WHILE, and DO Loops

Avoid IF/GOTO loops and WHILE loops. They inhibit compiler optimizations and they introduce unnecessary overheads.

Untuned loop (IFs and GOTOs):
      I = 0
10    I = I + 1
      IF (I .GT. 100000) GOTO 30
      A(I) = A(I) + B(I)*C(I)
      GOTO 10
30    CONTINUE

Tuned loop:
      I = 0
10    I = I + 1
      A(I) = A(I) + B(I)*C(I)
      IF (I .LE. 100000) GOTO 10

Another untuned loop (WHILE loop):
      I = 0
      DO WHILE (I .LT. 100000)
        I = I + 1
        A(I) = A(I) + B(I)*C(I)
      END DO

Tuned loop:
      DO I = 1, 100000
        A(I) = A(I) + B(I)*C(I)
      END DO
Example: data wrap-around, untuned version:
jwrap = ARRAY_SIZE - 1;
for(i=0; i
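A hedged reconstruction of the wrap-around idiom in C, following the names in the fragment above (the loop body is an assumption): the untuned version carries the wrap logic through every iteration, while the tuned version peels the wrap-around element out of the loop.

```c
#include <assert.h>

#define ARRAY_SIZE 1000

/* Untuned: jwrap trails i and is updated every iteration just to
 * handle the single wrap-around case at i == 0. */
void diff_untuned(double *b, double *a) {
    int jwrap = ARRAY_SIZE - 1;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        b[i] = a[i] - a[jwrap];
        jwrap = i;   /* previous index, wrapping around at i == 0 */
    }
}

/* Tuned: handle the wrap-around element separately, then run a
 * simple loop over the rest. */
void diff_tuned(double *b, double *a) {
    b[0] = a[0] - a[ARRAY_SIZE - 1];
    for (int i = 1; i < ARRAY_SIZE; i++)
        b[i] = a[i] - a[i - 1];
}
```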
Programming Techniques: Managing the Cache

Original code:
      DO 10, J = 1, N
      DO 10, I = 1, N
      DO 10, K = 1, N
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

We can modify the code to better use the cache.

Modified code:
      DO 10, JB = 1, N, NB
      DO 10, IB = 1, N, NB
      DO 10, KB = 1, N, NB
      DO 10, J = JB, JB + NB - 1
      DO 10, I = IB, IB + NB - 1
      DO 10, K = KB, KB + NB - 1
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

This is most useful as a simple example of cache blocking. Most compilers will automatically cache-block the original code as part of ordinary optimization.
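The same blocking can be sketched in C (sizes and names are illustrative; NB is chosen to divide N so the sketch needs no cleanup loops):

```c
#include <assert.h>

#define N 32
#define NB 8   /* block size; NB must divide N in this simple sketch */

/* Cache-blocked matrix multiply: work proceeds in NB x NB tiles so
 * the tiles of a, b, and c being combined stay cache-resident.
 * c must be zero-initialized by the caller. */
void matmul_blocked(double c[N][N], double a[N][N], double b[N][N]) {
    for (int jb = 0; jb < N; jb += NB)
        for (int ib = 0; ib < N; ib += NB)
            for (int kb = 0; kb < N; kb += NB)
                for (int j = jb; j < jb + NB; j++)
                    for (int i = ib; i < ib + NB; i++)
                        for (int k = kb; k < kb + NB; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

The payoff grows with N: once a full row or column no longer fits in cache, the unblocked loop re-fetches data from memory on every pass, while the blocked loop reuses each tile before moving on.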
Loop Optimizations: Advantages

Loop optimizations accomplish three things:
- Reduce loop overhead
- Increase parallelism
- Improve memory access patterns

Understanding your tools and how they work is critical for using them with peak effectiveness. For performance, a compiler is your best friend.
Recap of Arithmetic Optimization

- Replace frequent divisions by inverse multiplications.
- Multiplications/divisions by integer powers of 2 can be replaced by bit shifts to the left/right (compilers can usually do this).
- Small integer exponentials such as a**4 should be replaced by repeated multiplications a*a*a*a (compilers will usually do this).
- Reorganize (or eliminate) repeated (or useless) operations.
- Use Horner's rule to evaluate polynomials.

Example:
Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F can be written as
((((Ax + B)*x + C)*x + D)*x + E)*x + F

This saves more time in C (speed increases by a factor greater than 10) than in Fortran (improvement of only about 30%) due to the way the C language handles (poorly) the function pow(x,5).
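Both evaluation orders can be written out in C (the function names are illustrative); Horner's form needs only 5 multiplications and 5 additions and no calls to pow():

```c
#include <assert.h>

/* Term-by-term evaluation of Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F. */
double poly_naive(double x, double A, double B, double C,
                  double D, double E, double F) {
    return A*x*x*x*x*x + B*x*x*x*x + C*x*x*x + D*x*x + E*x + F;
}

/* Horner's rule: the same polynomial as a chain of multiply-adds. */
double poly_horner(double x, double A, double B, double C,
                   double D, double E, double F) {
    return ((((A*x + B)*x + C)*x + D)*x + E)*x + F;
}
```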
Compiler Optimizations
Compiler optimization (from Wikipedia, the free encyclopedia):

- Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler.
- It allows programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution.
- It may or may not result in executables that are perfectly "optimal" by any measure.
Ref: http://en.wikipedia.org/wiki/Compiler_optimization
Sun Workshop Compiler 6.2

-O : Set the optimization level
-fast : Select a set of flags likely to improve speed
-stackvar : Put local variables on the stack
-xlibmopt : Link optimized libraries
-xarch : Specify the instruction set architecture
-xchip : Specify the target processor for use by the optimizer
-native : Compile for best performance on the local host
-xprofile : Collect data for a profile, or use a profile to optimize
-fns : Turn on the SPARC nonstandard floating-point mode
-xunroll n : Unroll loops n times
Basic Compiler Techniques: Optimizations

-O : Optimize at the level most likely to give close to the maximum performance for many realistic applications (currently -O3).
-O1 : Do only the basic local optimizations (peephole).
-O2 : Do basic local and global optimization. This level usually gives minimum code size.
-O3 : Adds global optimizations at the function level. In general, this level, and -O4, usually result in the minimum code size when used with the -xspace option.
-O4 : Adds automatic inlining of functions in the same file. -g suppresses automatic inlining.
-O5 : Does the highest level of optimization, suitable only for the small fraction of a program that uses the largest fraction of compute time. Uses optimization algorithms that take more compilation time or that do not have as high a certainty of improving execution time. Optimization at this level is more likely to improve performance if it is done with profile feedback. See -xprofile=collect|use.
Basic Compiler Techniques: Local Variables on the Stack

-stackvar
Tells the compiler to put most variables on the stack rather than statically allocate them. -stackvar is almost always a good idea, and it is crucial when parallelizing. You can control stack versus static allocation for each variable: variables that appear in DATA, COMMON, SAVE, or EQUIVALENCE statements will be static regardless of whether you specify -stackvar.
Basic Compiler Techniques

-xchip
Specifies the target chip. Specifying the chip lets the compiler know certain implementation details, such as specific instruction timings and the number of functional units.

-xarch
Specifies the target architecture. A target architecture includes the instruction set but may not include implementation details such as instruction timing. -xarch=v8plus on Sun produces an executable file that will take full advantage of some UltraSPARC features.

-native
Directs the compiler to produce the best executable (performance) that it can for the system on which the program is being compiled.
Basic Compiler Techniques

-fast
Runs the program with a reasonable level of optimization; it strikes a balance between speed, portability, and safety. -fast is often a good way to get a first-cut approximation of how fast your program can run with a reasonable level of optimization, but it should not be used to build production code: the meaning of -fast will often change from one release to another, and, as with -native, -fast may change its meaning on different machines.
Basic Compiler Techniques

-fsimple (simple floating-point model)
Tells the compiler to use a simplified floating-point model that includes only ordinary numbers.

-xvector
Vectorization enables the compiler to transform vectorizable loops from scalar to vector form. It is generally faster for long vectors but slower for short ones.

-xlibmil
Tells the compiler to inline certain mathematical operations such as floor, ceiling, and complex absolute value.

-xlibmopt
Tells the linker to use an optimized math library. This may produce slightly different answers than the regular math library; these libraries may get their speed by sacrificing accuracy.
Advanced Compiler Techniques

-xcrossfile
Enables the compiler to optimize and inline source code across different files. It may compile code to be optimal for the files that are compiled together, producing a very fast executable.

-xpad
Directs the compiler to insert padding (unused space) between adjacent variables in common blocks and local variables, to try to improve cache performance.
Using Your Compiler Effectively - Classical Optimizations
The compiler performs the classical optimizations, plus a number of architecture-specific optimizations:
Copy propagation
Constant Folding
Dead Code removal
Strength reduction
Induction Variable Elimination
Common Sub-expression Elimination
Classical Optimizations (Contd)

- Loop invariant code motion
- Induction variable simplification
- Register variable detection
- Inlining
- Loop Fusion
- Loop Unrolling
Classical Optimizations

Copy Propagation
Copy propagation is an optimization that occurs both locally and globally. Given

      x = y
      z = 1.0 + x

the compiler may be able to perform copy propagation across the flow graph:

      x = y
      z = 1.0 + y

Constant Folding
A clever compiler can find constants throughout your program:

      PROGRAM MAIN
      INTEGER I, K
      PARAMETER (I=200)
      K = 200
      J = I + K
      END
Classical Optimizations (Contd)

Dead Code Removal
Dead code comes in two types:
- Instructions that are unreachable.
- Instructions that produce results which are never used.

      PROGRAM MAIN
      I = 2
      WRITE (*,*) I
      STOP
      I = 4
      WRITE (*,*) I
      END

Strength Reduction
Operations or expressions have various time costs associated with them. There are many opportunities for compiler-generated strength reductions:

      Y = X**2    becomes    Y = X*X
      J = K*2     becomes    J = K+K
Classical Optimizations (Contd)

Variable Renaming
Example: observe the variable x in the following fragment of code:

      x = y * z
      q = r + x + x
      x = a + b

Renamed:

      xx = y * z
      q = r + xx + xx
      x = a + b

Variable renaming is an important technique because it clarifies that calculations are independent of each other, which increases the number of things that can be done in parallel.

Common Subexpression Elimination

      D = C * (A + B)
      E = (A + B)/2

becomes

      temp = A + B
      D = C * temp
      E = temp/2

Different compilers go to different lengths to find common subexpressions.
Classical Optimizations (Contd)

Loop Invariant Code Motion
The compiler will look for every opportunity to move calculations out of a loop and into the surrounding code. Loop invariant code motion is simply the act of moving the repeated, unchanging calculations to the outside:

      DO 10 I=1,N
        A(I) = B(I) + C*D
        E = G(K)
10    CONTINUE

becomes

      temp = C*D
      DO 10 I=1,N
        A(I) = B(I) + temp
10    CONTINUE
      E = G(K)

Induction Variable Simplification
Loops can contain what are called induction variables:

      DO 10 I=1,N
        K = I*4 + M
10    CONTINUE

becomes

      K = M
      DO 10 I=1,N
        K = K + 4
10    CONTINUE
Classical Optimizations (Contd): Associative Transformations and Reductions

Example: dot product of two vectors:

      SUM = 0.0
      DO 10 I=1, N
        SUM = SUM + A(I)*B(I)
10    CONTINUE

The loop is recursive on the single variable SUM; every iteration needs the result of the previous iteration. Since the assignment is being made to a scalar, unrolling isn't as straightforward as before. The obvious way is to calculate several iterations at a time:

      SUM = 0.0
      DO 10 I=1, N, 4
        SUM = SUM + A(I)*B(I) + A(I+1)*B(I+1) + A(I+2)*B(I+2) + A(I+3)*B(I+3)
10    CONTINUE
Classical Optimizations (Contd)

Dependency analysis is a technique whereby the syntactic constructs of a program are analyzed with the aim of determining whether certain values may depend on other previously computed values. The real objective of dependence analysis is to determine whether two statements are independent of each other.

Example:
      S1: A = C - A
      S2: A = B + C
      S3: B = A + C

DOALL transformations: this transformation converts every iteration of a loop into a process that is independent of all others. It assumes that there are no loop-carried dependencies. The DOALL transformation is very efficient if it can be applied; however, many loops carry dependencies.
Classical Optimizations (Contd)

Register Variable Detection
On many CISC processors there were few general-purpose registers. On RISC designs there are many more registers to choose from, and everything has to be brought into a register anyway, so most variables will be register resident at some point. The new challenge is to determine which variables should live the greater portion of their lives in registers.
Classical Optimizations (Contd)

Inlining
Inlining is the substitution of the body of a subprogram for the call of that subprogram. This eliminates function call overhead. To enable inlining by the Sun compilers, use -fast or -xO4:
      f77 -fast a.f

Loop Fusion
Loop fusion is the process of fusing two adjacent loops with the same loop bounds, which is usually a good thing.

Induction Values
Induction values are values that can be computed as a function of the loop count variable and possibly other values.
Parallel Programming: Compilation Switches
Automatic and directive-based parallelization

-xautopar, -xexplicitpar, and -xparallel tell the compiler to parallelize your program:
-xautopar : tells the compiler to do only those parallelizations that it can do automatically.
-xexplicitpar : tells the compiler to do only those parallelizations that you have directed it to do with pragmas in the source.
-xparallel : tells the compiler to parallelize both automatically and under pragma control.
-xreduction : tells the compiler that it may parallelize reduction loops. A reduction loop is a loop that produces output with smaller dimension than the input.
Parallel Programming: Compiler Switches

Remarks:
- In some cases, parallelizing a reduction loop can give different answers depending on the number of processors on which the loop is run.
- Compiler directives can usually overcome artificial barriers to parallelization.
- Compiler directives can also overcome legitimate barriers to parallelization, which introduces errors.
- The efficiency and effectiveness of automatic compiler parallelization can be significantly improved by supplying the right switches.
Use of Math Libraries

BLAS, IMSL, NAG, LINPACK, LAPACK, ScaLAPACK, etc.
- Calls to these math libraries can often simplify coding.
- They are portable across different platforms.
- They are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are sent to them.

Example: Sun performance libraries (-xlic_lib=sunperf), IBM ESSL, ESSLSMP
Performance of Selective Application: CFD

Optimization of unsteady-state 3D compressible Navier-Stokes equations by the finite difference method.

Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server, operating at 300 MHz)

Grid Size   Iterations  Options                                        Time (seconds)
192*16*16   1000        No compiler options                            4930
192*16*16   1000        With compiler optimization                     2620
192*16*16   1000        Code restructuring and compiler optimization   680

Conclusions: restructuring the code and use of proper compiler optimizations reduces the execution time by a factor of about 8.0.
4-way SMP
- POWER4, 1.0 GHz
- 8 GB main memory (16 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1,2,3, BLACS, ESSL

32-way SMP
- POWER4, 1.1 GHz
- 64 GB main memory (256 GB max)
- AIX 5.1 and PPC Linux
- XL F77, F90, C, C++
- Performance libraries: BLAS 1,2,3, BLACS, ESSL
LLCBench: Performance on IBM p630
LLCBench: Performance on IBM p690
Conclusions

- Reducing memory overheads is important for the performance of sequential and parallel programs.
- Minimization of memory traffic is the single most important goal.
- For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
- Data reuse in the memory sub-system will increase performance.
- Basic and advanced compiler optimization flags can be used for performance.
- Write code so that a compiler finds it easy to locate optimizations.
- The compiler performs classical optimization techniques and some loop optimization techniques.
References

1. Ernst L. Leiss, Parallel and Vector Computing: A Practical Introduction, McGraw-Hill Series on Computer Engineering, New York (1995).
2. Albert Y. H. Zomaya, Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computer Engineering, New York (1996).
3. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, Redwood City, CA (1994).
4. William Gropp, Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh (1996).
5. Ian T. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley Publishing Company (1995).
6. Kai Hwang, Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill, New York (1997).
7. David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc. (1999).