Vector Processing

Vector Processors

• Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations are:

– processing one or more vectors to produce a scalar result,

– combining two vectors to produce a third vector,

– combining a scalar and a vector to produce a vector, and

– a combination of the above.

Vector Processor Models

• Little can be gained from a vector processor when dealing with scalar operations, but vector (non-scalar) operations can take advantage of it. For example:

$$C_i := A_i + B_i, \quad 1 \le i \le N$$

On a SISD system we would code this as

for (i = 1; i <= N; i++)

    C[i] = A[i] + B[i];

• Assuming two machine instructions for loop control and four machine instructions to implement the assignment statement (Read A, Read B, Add, Write C), the execution time is

$$6 \times N \times T$$

where T is the average instruction cycle time.

Vector Processor Models

• If memory could be accessed directly, without requiring loop control, there could be just one instruction (add).

• The figure shows a four-stage add pipeline that delivers one add per cycle.

Vector Processor Model

• The pipeline execution time is $(4 + N - 1)T$.

• Therefore the speedup is

$$S = \frac{6NT}{(4 + N - 1)T} = \frac{6N}{N + 3}$$
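The two timing models above can be compared numerically. The following is a minimal sketch, assuming the 6-instruction scalar loop and a k-stage add pipeline as described; the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Scalar loop: 6 instructions per element (2 for loop control,
   4 for the assignment), each taking t time units. */
double scalar_time(int n, double t) {
    assert(n > 0 && t > 0.0);
    return 6.0 * n * t;
}

/* Pipelined add: `stages` cycles until the first result appears,
   then one result per cycle for the remaining n - 1 elements. */
double pipeline_time(int stages, int n, double t) {
    assert(stages > 0 && n > 0 && t > 0.0);
    return (stages + n - 1) * t;
}

/* Speedup of the pipelined model over the scalar loop. */
double pipeline_speedup(int stages, int n, double t) {
    return scalar_time(n, t) / pipeline_time(stages, n, t);
}
```

With a four-stage pipeline, a single element (N = 1) gives a speedup of only 1.5, while the speedup approaches 6 as N grows, since the fill time of the pipeline is amortized over more elements.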

Vector Processor Models

• We can generalize the previous vector model as follows:

Vector Processor Models

• Further improvements can be made:

Vector Processors

• Vector processors are supercomputers optimized for fast execution (the main criterion for design and implementation) of vectorizable scientific code that operates on large data sets.

• Vector processors are extensively pipelined to operate on array-oriented data. The CPU is highly pipelined and has a large set of registers. Memory is also pipelined and interleaved to match CPU demands.

Memory Design

• Note that if the pipe provides a result every d cycles (i.e., at a rate of w = 1/d results per cycle), then memory must supply a pair of operands ($a_i$ and $b_i$) every d cycles.

• Note that we need to fetch (read) two values and write a result simultaneously (within d cycles).

Memory Design

• If d = 1, then the memory system must have a bandwidth at least 3 times that of a conventional memory. To meet memory bandwidth requirements, two approaches have been implemented in commercial machines:

1. Use of multiple independent memory modules.

2. Use of intermediate high-speed memory to:

– shorten the access cycle,

– use data several times between the CPU and intermediate memory,

– provide for certain desirable patterns of data access (i.e., rows, columns, diagonals, etc.).

Memory Design

• Multiple memory modules: 3-port memory modules used with an arithmetic pipeline.

• Only one port per module is active at one time but all 3 streams can be active simultaneously.

Memory Design

• Care must be taken when laying out data in memory modules; otherwise simultaneous access is denied, as is seen here.
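The layout problem can be sketched in code. This is a simplified illustration, assuming low-order interleaving (word address modulo the number of modules selects the module) and ignoring the time offset between the two reads and the write; the function names and base addresses are hypothetical.

```c
#include <assert.h>

/* Assumed low-order interleaving: consecutive word addresses
   are placed in consecutive memory modules. */
int module_of(int address, int modules) {
    assert(address >= 0 && modules > 0);
    return address % modules;
}

/* Returns 1 if element i of any two of the three streams (A, B, C)
   falls in the same module, so the accesses cannot proceed together;
   returns 0 if all three land in distinct modules. */
int streams_conflict(int base_a, int base_b, int base_c,
                     int i, int modules) {
    int ma = module_of(base_a + i, modules);
    int mb = module_of(base_b + i, modules);
    int mc = module_of(base_c + i, modules);
    return ma == mb || ma == mc || mb == mc;
}
```

Under these assumptions, with 4 modules and arrays based at addresses 0, 8 and 16, every element index maps all three streams to the same module. Shifting the bases to 0, 9 and 18 staggers the streams so that each access goes to a different module.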

Memory Design

• The following reservation table (RT) shows the effect of 2-cycle memory access timing. Note the output conflict and resultant delays.

Performance Evaluation

• Major characteristics affecting supercomputer performance:

– Clock speed

– Instruction issue rate

– Memory size

– Number of concurrent paths to memory

– Ability to fetch/store vectors efficiently

– Number of duplicate arithmetic functional units

– Chaining

– Indirect addressing capabilities

– Handling conditional blocks of code

Performance Evaluation

• High performance of vector architectures can be attributed to the following characteristics:

1. Pipelined functional units

2. Multiple functional units operating in parallel

3. Chaining of functional units

4. Large number of programmable registers

5. Block load/store capabilities with buffer registers

6. Multiprocessors operating in parallel in a coarse-grained parallel mode

7. Instruction buffers

Performance Evaluation

• Sustained computation rates (as opposed to peak computation rates obtained under ideal circumstances) depend on factors such as:

1. Level of vectorization (fraction of the code that is vectorizable)

2. Average vector length

3. Possibility of vector chaining

4. Possible overlap of scalar, vector, and memory load/store operations

5. Mechanisms to resolve memory contention

Performance Evaluation

• What is Amdahl’s Law?

Performance Evaluation

• Amdahl’s Law

• Given that the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s.

Performance Evaluation

• Ideally speedup is

$$S = \frac{\text{Serial Execution Time } (E_s)}{\text{Parallel Execution Time } (E_p)}$$

• Ideally parallel execution time is

$$E_p = \frac{\text{Serial Execution Time } (E_s)}{\text{Number of Processors } (P)}$$

• Speedup is then ideally P

Performance Evaluation

• Amdahl’s Law changes this speedup analysis to include the serial component that cannot be parallelized.

Performance Evaluation

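• Let P denote an application program and $T_{\text{scalar}}$ the time to execute P in scalar mode (serial execution).

• s is the maximum speedup (here s denotes a speedup, not the serial fraction used in the statement of the law above).

• Ideally the time to execute P on the vector computer is $T_{\text{scalar}}/s$.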

Performance Evaluation

• The problem Amdahl pointed out is that there is always some fraction of P, f, that can be executed in parallel and some fraction, 1 - f, that cannot.

• Therefore the actual parallel execution time is $T_{\text{actual}} = (1-f)\,T_{\text{scalar}} + f \cdot T_{\text{scalar}}/s$

Performance Evaluation

• The speedup now becomes

$$S = \frac{T_{\text{scalar}}}{T_{\text{actual}}} = \frac{T_{\text{scalar}}}{(1-f)\,T_{\text{scalar}} + f \cdot T_{\text{scalar}}/s} = \frac{1}{(1-f) + f/s}$$

Performance Evaluation

• So if f = 1 speedup is s, the ideal speedup, and for f = 0 speedup is 1.

$$S = \frac{1}{(1-f) + f/s}$$
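The speedup formula is straightforward to evaluate. A minimal sketch, where f is the vectorizable (parallel) fraction and s the maximum speedup as defined above; the function name is hypothetical.

```c
#include <assert.h>

/* Amdahl's Law: a fraction f of the work is sped up by a factor s,
   while the remaining fraction (1 - f) runs at scalar speed. */
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

For s = 10, a program that is 90% vectorizable (f = 0.9) reaches a speedup of only about 5.26, which shows how strongly the serial remainder limits the overall gain.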

Performance Evaluation

• For number of processors = 10

Performance Evaluation

• Time to execute loops can be used to estimate peak and sustained performance. Let

Performance Evaluation

• Then;

Programming Vector Processors

• The hardware structure that makes vector processors powerful also makes the assembler code difficult to write.

Programming Vector Processors

• Programming tools:

– Languages: to express parallelism inherent in the algorithm

– Compilers: to recognize vectorizable code

– Combination of the above optimizes parallelism

Programming Vector Processors

• Vector pipelining is obviously one benefit that is exploited when executing a program.

Programming Vector Processors

• Chaining is another important characteristic of some vector processors.

• Chaining is the ability to activate additional independent functional units as soon as intermediate results are known.
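The benefit of chaining can be estimated with the same fill-plus-stream timing model used for the add pipeline earlier. This is a simplified sketch, assuming two dependent vector operations (for example a multiply feeding an add) on vectors of length n, with hypothetical pipeline depths and function names, and ignoring memory effects.

```c
#include <assert.h>

/* Without chaining: the second unit starts only after the first
   has produced its entire result vector. */
long unchained_cycles(int depth1, int depth2, int n) {
    assert(depth1 > 0 && depth2 > 0 && n > 0);
    return (long)(depth1 + n - 1) + (long)(depth2 + n - 1);
}

/* With chaining: each element enters the second unit as soon as it
   leaves the first, so the two pipelines behave like one pipeline
   of depth depth1 + depth2. */
long chained_cycles(int depth1, int depth2, int n) {
    assert(depth1 > 0 && depth2 > 0 && n > 0);
    return (long)depth1 + (long)depth2 + (long)(n - 1);
}
```

Under this model, with depths 4 and 6 and vectors of 64 elements, the unchained sequence takes 136 cycles and the chained one 73, so for long vectors chaining nearly halves the time of a dependent pair of operations.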

Chaining

• Consider the following

Scalar Renaming

• How might this be improved?

Scalar Renaming

• This becomes this

• This renaming makes the code segments independent, allowing for better vectorization.
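As a hypothetical illustration of scalar renaming (the loop body, array names and function names below are assumptions chosen for illustration), consider a loop in which one scalar is reused for two unrelated values. Giving each use its own name removes the false dependence between the two halves of the loop body.

```c
#include <assert.h>

/* Before: x is reused, so the second pair of statements appears
   to depend on the first through the shared scalar. */
void before_renaming(int n, const double *a, const double *b,
                     const double *c, const double *d,
                     double *y, double *z) {
    assert(n >= 0);
    double x;
    for (int i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
        x = c[i] / d[i];
        z[i] = x + 2.0;
    }
}

/* After: x1 and x2 are distinct names, so the two statement pairs
   are independent and can be vectorized separately. */
void after_renaming(int n, const double *a, const double *b,
                    const double *c, const double *d,
                    double *y, double *z) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double x1 = a[i] + b[i];
        y[i] = 2.0 * x1;
        double x2 = c[i] / d[i];
        z[i] = x2 + 2.0;
    }
}
```

Both versions compute the same results; only the naming of the temporary changes.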

Scalar Expansion

• How might this be improved?

Scalar Expansion

• If scalar x is expanded into a vector, the two statements become independent.
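A minimal sketch of scalar expansion, using a hypothetical loop (the statements, array names and function names are assumptions for illustration). Replacing the scalar x with an array removes the reuse of a single storage location across iterations, so each statement can be executed as its own vector operation.

```c
#include <assert.h>

/* Before: every iteration writes and then reads the same scalar x. */
void before_expansion(int n, const double *a, const double *b,
                      double *y) {
    assert(n >= 0);
    double x;
    for (int i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
    }
}

/* After: x is expanded into the array x[0..n-1]. Each loop now
   corresponds to a single vector operation. */
void after_expansion(int n, const double *a, const double *b,
                     double *x, double *y) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x[i] = a[i] + b[i];
    for (int i = 0; i < n; i++)
        y[i] = 2.0 * x[i];
}
```

The cost of the transformation is the extra storage for the expanded vector.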

Loop Unrolling

• The loop becomes this
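As a hypothetical illustration of unrolling (the loop, the unroll factor of 4 and the function names are assumptions chosen for illustration), the loop body is replicated so that fewer loop-control instructions are executed per element, with a short clean-up loop for the leftover iterations.

```c
#include <assert.h>

/* Rolled form: one element per iteration. */
void scale_rolled(int n, double k, const double *a, double *c) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = k * a[i];
}

/* Unrolled by 4: four elements per iteration, so loop control
   (increment, compare, branch) runs a quarter as often. */
void scale_unrolled4(int n, double k, const double *a, double *c) {
    assert(n >= 0);
    int i = 0;
    for (; i + 3 < n; i += 4) {
        c[i]     = k * a[i];
        c[i + 1] = k * a[i + 1];
        c[i + 2] = k * a[i + 2];
        c[i + 3] = k * a[i + 3];
    }
    /* Clean-up loop for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        c[i] = k * a[i];
}
```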

• What about this?

Loop fusion

• Note that each loop would be equivalent to a vector instruction. X is stored back into memory by the first instruction and then retrieved by the second. If these loops are fused as follows, then memory traffic is reduced:
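A minimal sketch of the fusion described here, assuming two hypothetical loops in which the second consumes the vector X produced by the first (the operations, array names and function names are assumptions for illustration).

```c
#include <assert.h>

/* Unfused: the first loop stores all of x to memory, and the
   second loop loads it back again. */
void unfused(int n, const double *y, const double *z,
             const double *v, double *x, double *w) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        x[i] = y[i] + z[i];
    for (int i = 0; i < n; i++)
        w[i] = x[i] * v[i];
}

/* Fused: each x[i] is still stored, but the value is reused from
   a local (register) temporary instead of being reloaded. */
void fused(int n, const double *y, const double *z,
           const double *v, double *x, double *w) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double t = y[i] + z[i];
        x[i] = t;
        w[i] = t * v[i];
    }
}
```

The fused form saves one load of X per element while producing the same X and W.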

• What else might be done to improve this?

Loop fusion

• Note that this is possible if there are enough registers available to retain X. If chaining is supported then the loop can be reduced to:
