Vector Processing

Vector Processors

• Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations are:

– processing one or more vectors to produce a scalar result,

– combining two vectors to produce a third vector,

– combining a scalar and a vector to produce a vector, and

– a combination of the above.
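The operation types listed above can be sketched in plain C as follows. This is only an illustration: the function names `vec_dot`, `vec_add`, and `vec_scale` are hypothetical and do not come from the slides, and on a real vector processor each loop would correspond to one or a few vector instructions rather than a scalar loop.

```c
#include <assert.h>
#include <stddef.h>

/* Vectors -> scalar: dot product of a and b (n elements each). */
double vec_dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Two vectors -> vector: c[i] = a[i] + b[i]. */
void vec_add(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Scalar and vector -> vector: c[i] = k * a[i]. */
void vec_scale(double k, const double *a, double *c, size_t n)
{
    assert(a != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = k * a[i];
}
```

A combination of the above would be, for example, calling `vec_scale` and then `vec_add` to compute k·a + b.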

Vector Processor Models

• Scalar operations, such as those shown on the previous slide, gain little from a vector processor, but vector (non-scalar) operations can take advantage of it. Consider:

$C_i := A_i + B_i, \quad 1 \le i \le N$

On a SISD system we would code this as

for (i = 1; i <= N; i++)
    C[i] = A[i] + B[i];

• Assuming two machine instructions for loop control and four machine instructions to implement the assignment statement (Read A, Read B, Add, Write C), the execution time is

$6 \times N \times T$

where T is the average instruction cycle time.

• If memory could be accessed directly, without requiring loop control, the whole operation could be a single instruction (add).

• The figure shows a four-stage add pipeline that delivers one add per cycle.

Vector Processor Model

• The pipeline execution time is $(4 + N - 1)T$.

• Therefore the speedup is:

$$\frac{6NT}{(4 + N - 1)T} = \frac{6N}{N + 3}$$

which approaches 6 for large N.
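The two execution-time expressions from the slides, 6·N·T for the scalar loop and (4 + N − 1)·T for the four-stage pipeline, can be compared with a small C sketch. The function names are hypothetical, and the model ignores memory stalls and startup overhead, as the slides do.

```c
#include <assert.h>

/* Scalar (SISD) loop: 6 instructions per element, each taking time t. */
double scalar_time(unsigned n, double t)
{
    return 6.0 * n * t;
}

/* Four-stage add pipeline: the first result appears after 4 cycles,
   the remaining n - 1 results follow at one per cycle. */
double pipeline_time(unsigned n, double t)
{
    assert(n > 0);
    return (4.0 + n - 1.0) * t;
}

/* Ratio of scalar time to pipelined time; t cancels out. */
double speedup(unsigned n, double t)
{
    assert(t > 0.0);
    return scalar_time(n, t) / pipeline_time(n, t);
}
```

For a single element the ratio is only 1.5, while for long vectors it approaches 6, which is why average vector length matters for sustained performance.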

• We can generalize the previous vector model as follows;

• Further Improvements can be made;

Vector Processors

• Vector processors are supercomputers optimized for fast execution (the main criterion for design and implementation) of vectorizable scientific code that operates on large data sets.

• Vector processors are extensively pipelined to operate on array-oriented data. The CPU is highly pipelined and has a large set of registers. Memory is also pipelined and interleaved to match CPU demands.

Memory Design

• Note that if the pipe provides a result every d cycles (i.e., w = 1/d), then memory must supply a pair of operands ($a_i$ and $b_i$) every d cycles.

• Note that we need to fetch (read) two values and write a result simultaneously (within d cycles).

Memory Design

• If d = 1, then the memory system must have a bandwidth at least 3 times that of a conventional memory. To meet memory bandwidth requirements, two approaches have been implemented in commercial machines:

1. Use of multiple independent memory modules.

2. Use of intermediate high-speed memory to:

– shorten the access cycle.

– use data several times between CPU and intermediate memory.

– provide for certain desirable patterns of data access (i.e., rows, columns, diagonals, etc.).

Memory Design

• Multiple memory modules: 3-port memory modules used with a pipelined arithmetic unit.

• Only one port per module is active at one time but all 3 streams can be active simultaneously.

Memory Design

• Care must be taken when laying out data in memory modules; otherwise simultaneous access is denied, as is seen here.
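The layout problem can be illustrated with a minimal C sketch. It assumes low-order interleaving (word address modulo the number of modules), which the slides do not specify, and it simplifies the pipeline by treating the two operand reads and the result write as issued together. The names `module_of` and `streams_conflict` are hypothetical.

```c
#include <assert.h>

/* Assumed low-order interleaving: consecutive words go to consecutive modules. */
unsigned module_of(unsigned word_addr, unsigned num_modules)
{
    assert(num_modules > 0);
    return word_addr % num_modules;
}

/* Returns 1 if any two of the three accesses issued together hit the
   same module (only one port per module can be active at a time). */
int streams_conflict(unsigned addr_a, unsigned addr_b, unsigned addr_c,
                     unsigned num_modules)
{
    unsigned ma = module_of(addr_a, num_modules);
    unsigned mb = module_of(addr_b, num_modules);
    unsigned mc = module_of(addr_c, num_modules);
    return ma == mb || ma == mc || mb == mc;
}
```

Under these assumptions, with 4 modules, vectors starting at word addresses 0, 4, and 8 always collide, whereas starting addresses 0, 1, and 2 keep the three streams in different modules at every step.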

Memory Design

• The following reservation table (RT) shows the effect of 2-cycle memory access timing. Note the output conflict and the resulting delays.

Memory Design

Performance Evaluation

• Major characteristics affecting supercomputer performance:

– Clock speed

– Instruction issue rate

– Memory size

– Number of concurrent paths to memory

– Ability to fetch/store vectors efficiently

– Number of duplicate arithmetic functional units

– Chaining

– Indirect addressing capabilities

– Handling conditional blocks of code

• High performance of vector architectures can be attributed to the following characteristics:

1. Pipelined functional units

2. Multiple functional units operating in parallel

3. Chaining of functional units

4. Large number of programmable registers

5. Block load/store capabilities with buffer registers

6. Multiprocessors operating in parallel in a coarse-grained parallel mode

7. Instruction buffers

• Sustained computation rates (as opposed to peak computation rates obtained under ideal circumstances) depend on factors such as:

1. Level of vectorization (fraction of the code that is vectorizable)

2. Average vector length

3. Possibility of vector chaining

4. Possible overlap of scalar, vector, and memory load/store operations

5. Mechanisms to resolve memory contention

• What is Amdahl’s Law?

• Amdahl’s Law

• Even when the fraction of serial work in a given problem is small, say s, the maximum speedup obtainable from even an infinite number of parallel processors is only 1/s (here s denotes the serial fraction).

Performance Evaluation

• Ideally speedup is

$$\text{Speedup} = \frac{\text{Serial Execution Time } (E_s)}{\text{Parallel Execution Time } (E_p)}$$

• Ideally parallel execution time is

$$E_p = \frac{\text{Serial Execution Time } (E_s)}{\text{Number of Processors } (P)}$$

• Speedup is then ideally P

Performance Evaluation

• Amdahl’s Law changes this speedup analysis to include the serial component that cannot be parallelized.

• Let P denote an application program and $T_{scalar}$ the time to execute P in scalar mode (serial execution).

• Let s be the maximum speedup.

• Ideally the time to execute P on the vector computer is $T_{scalar}/s$.

• The problem Amdahl pointed out is that only some fraction f of P can be executed in parallel; the remaining fraction (1 − f) cannot.

• Therefore the actual parallel execution time is $T_{actual} = (1 - f)\,T_{scalar} + f \cdot T_{scalar}/s$

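• The speedup now becomes

$$\text{Speedup} = \frac{T_{scalar}}{T_{actual}} = \frac{T_{scalar}}{(1 - f)\,T_{scalar} + f \cdot T_{scalar}/s} = \frac{1}{(1 - f) + f/s}$$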

• So if f = 1 speedup is s, the ideal speedup, and for f = 0 speedup is 1.

• For number of processors = 10
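The speedup formula can be evaluated with a short C sketch. It assumes that "number of processors = 10" corresponds to a maximum speedup s = 10 in the formula; the function name `amdahl_speedup` is hypothetical.

```c
#include <assert.h>

/* Speedup = 1 / ((1 - f) + f / s)
   f: fraction of the program that can be vectorized/parallelized (0..1)
   s: maximum speedup of that fraction */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With s = 10, a program that is 50% vectorizable speeds up by only about 1.8, and even at 90% the speedup is about 5.3, well short of 10.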

• Time to execute loops can be used to estimate peak and sustained performance. Let

• Then;

Programming Vector Processors

• The hardware structure that makes vector processors powerful also makes the assembly code difficult to write.

• Programming tools:

– Languages: to express parallelism inherent in the algorithm

– Compilers: to recognize vectorizable code

– Combination of the above optimizes parallelism

• Vector pipelining is obviously one benefit that is exploited when executing a program.

• Chaining is another important characteristic of some vector processors.

• Chaining is the ability to activate additional independent functional units as soon as intermediate results are known.
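The benefit of chaining can be estimated with a simplified timing model. The sketch below is an assumption, not taken from the slides: it considers two dependent vector operations on n elements, executed in pipelined units with s1 and s2 stages, and ignores memory access and startup overhead. The function names are hypothetical.

```c
#include <assert.h>

/* Without chaining: the second operation starts only after the first
   has produced all n results. */
unsigned long unchained_cycles(unsigned long s1, unsigned long s2,
                               unsigned long n)
{
    assert(n > 0);
    return (s1 + n - 1) + (s2 + n - 1);
}

/* With chaining: each result of the first unit is forwarded to the
   second unit as soon as it is available. */
unsigned long chained_cycles(unsigned long s1, unsigned long s2,
                             unsigned long n)
{
    assert(n > 0);
    return s1 + s2 + n - 1;
}
```

Under this model, a 4-stage unit feeding a 6-stage unit on 64-element vectors takes 136 cycles unchained and 73 cycles chained.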

Chaining

• Consider the following

Chaining

Scalar Renaming

• How might this be improved?

Scalar Renaming

• This becomes this

• This renaming makes the code segments independent allowing for better vectorization
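The code from the original slides is not preserved here, so the following C sketch is a hypothetical example of scalar renaming, not the slide's own code. In the first version the scalar `x` is reused for two unrelated values; in the second, the later use is renamed to `xs`, so the two statement pairs no longer share a name.

```c
#include <assert.h>
#include <stddef.h>

/* Before: x is reused, creating a false dependence between the two
   statement pairs. */
void reuse_scalar(const double *a, const double *b, const double *c,
                  const double *d, double *y, double *z, size_t n)
{
    assert(y != z);
    double x;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
        x = c[i] * d[i];
        z[i] = x + 1.0;
    }
}

/* After: the second value gets its own name (xs), so the y and z
   computations are independent of each other. */
void renamed_scalar(const double *a, const double *b, const double *c,
                    const double *d, double *y, double *z, size_t n)
{
    assert(y != z);
    for (size_t i = 0; i < n; i++) {
        double x = a[i] + b[i];
        y[i] = 2.0 * x;
        double xs = c[i] * d[i];
        z[i] = xs + 1.0;
    }
}
```

Both versions compute the same results; only the dependence structure seen by a vectorizing compiler changes.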

Scalar Expansion

• How might this be improved?

Scalar Expansion

• If scalar x is expanded into a vector the two statements become independent
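As a hypothetical illustration (the slide's own code is not preserved), the C sketch below shows scalar expansion. The temporary `x` becomes the caller-supplied array `xv`, an assumed name, so that each statement can run as its own vector operation over all elements.

```c
#include <assert.h>
#include <stddef.h>

/* Before: every iteration writes and then reads the same scalar x. */
void with_scalar(const double *a, const double *b, double *y, size_t n)
{
    assert(a != NULL && b != NULL && y != NULL);
    double x;
    for (size_t i = 0; i < n; i++) {
        x = a[i] + b[i];
        y[i] = 2.0 * x;
    }
}

/* After: x is expanded into the vector xv (n elements, caller-supplied).
   Each loop now corresponds to a single vector operation. */
void with_expanded_scalar(const double *a, const double *b,
                          double *xv, double *y, size_t n)
{
    assert(a != NULL && b != NULL && xv != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        xv[i] = a[i] + b[i];      /* vector add */
    for (size_t i = 0; i < n; i++)
        y[i] = 2.0 * xv[i];       /* scalar-vector multiply */
}
```

The cost of the transformation is the extra storage for `xv`, which a vector machine would ideally keep in a vector register.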

Loop Unrolling

• The loop becomes this
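Since the slide's code is not preserved, here is a hypothetical C sketch of loop unrolling applied to the element-wise add used earlier. The unroll factor of 4 and the function name `add_unrolled4` are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], unrolled by 4 to reduce loop-control overhead. */
void add_unrolled4(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    /* Main body: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Remainder: up to three leftover elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The loop-control instructions are now executed once per four elements instead of once per element.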

• What about this?

Loop fusion

• Note that each loop would be equivalent to a vector instruction. X is stored back into memory by the first instruction and then retrieved by the second. If these loops are fused as follows, then memory traffic is reduced:
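The loops from the original slide are not preserved, so the following C sketch uses hypothetical loops of the kind described: the first loop stores `x`, and the second loop reads it back. In the fused version each `x[i]` is consumed immediately after it is produced.

```c
#include <assert.h>
#include <stddef.h>

/* Separate loops: x is written to memory by the first loop and
   reloaded by the second. */
void separate_loops(const double *y, const double *z, const double *p,
                    double *x, double *m, size_t n)
{
    assert(x != m);
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] * z[i];
    for (size_t i = 0; i < n; i++)
        m[i] = p[i] + x[i];
}

/* Fused loop: x[i] is used right after it is computed, so it can stay
   in a register between the two statements. */
void fused_loop(const double *y, const double *z, const double *p,
                double *x, double *m, size_t n)
{
    assert(x != m);
    for (size_t i = 0; i < n; i++) {
        x[i] = y[i] * z[i];
        m[i] = p[i] + x[i];
    }
}
```

Both functions produce identical `x` and `m`; the fused form simply avoids one pass over `x` in memory.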

• What else might be done to improve this?

Loop fusion

• Note that this is possible if there are enough registers available to retain X. If chaining is supported then the loop can be reduced to:
