SIMD Computers
ECE/CS 757 Spring 2007
J. E. Smith
Copyright (C) 2007 by James E. Smith (unless noted otherwise)
All rights reserved. Except for use in ECE/CS 757, no part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission from the author.
04/07 ECE/CS 757; copyright J. E. Smith, 2007 2
Outline

Automatic Parallelization
Vector Architectures
• Cray-1 case study
Data Parallel Programming
• CM-2 case study
CUDA Overview (separate slides)
Readings
• W. Daniel Hillis and Guy L. Steele, Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183.
• S. Ryoo, et al., Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA, Proceedings of PPoPP, Feb. 2008.
Automatic Parallelization

Start with a sequential programming model
Let the compiler attempt to find parallelism
• It can be done…
• We will look at one of the success stories
Commonly used for SIMD computing – vectorization
• Useful for MIMD systems, also – concurrentization
Often done with FORTRAN
• But, some success can be achieved with C
  (Compiler address disambiguation is more difficult with C)
Automatic Parallelization

Consider operations on arrays of data

do I=1,N
   A(I,J) = B(I,J) + C(I,J)
end do

• Operations along one dimension involve vectors
Loop-level parallelism
• Do all – all loop iterations are independent
  Completely parallel
• Do across – some dependence across loop iterations
  Partly parallel

   A(I,J) = A(I-1,J) + C(I,J) * B(I,J)
Data Dependence

Independence => parallelism
Or: dependence inhibits parallelism

S1: A=B+C
S2: D=A+2
S3: A=E+F

True dependence (RAW): S1 δ S2
Antidependence (WAR): S2 δ̄ S3
Output dependence (WAW): S1 δ° S3
Data Dependence Applied to Loops

Similar relationships for loops
• But consider iterations

do I=1,2
S1:  A(I)=B(I)+C(I)
S2:  D(I)=A(I)
end do

S1 δ= S2
• Dependence involving A, but on same loop iteration
Data Dependence Applied to Loops

do I=1,2
S1:  A(I)=B(I)+C(I)
S2:  D(I)=A(I-1)
end do

S1 δ< S2
• Dependence involving A, but read occurs on next loop iteration
• Loop-carried dependence

do I=1,2
S1:  A(I)=B(I)+C(I)
S2:  D(I)=A(I+1)
end do

S2 δ̄< S1
• Antidependence involving A; write occurs on next loop iteration
Loop Carried Dependence: Definition

do I = 1, N
S1:  X(f(i)) = F(...)
S2:  A = X(g(i)) ...
end do

S1 δ S2 is loop-carried if there exist i1, i2 where
   1 <= i1 < i2 <= N and f(i1) = g(i2)

If f and g can be arbitrary functions, the problem is essentially unsolvable.
However, if (for example)
   f(i) = c*i + j and g(i) = d*i + k
there are methods for detecting dependence.
Loop Carried Dependences: GCD Test

do I = 1, N
S1:  X(c*I + j) = F(...)
S2:  A = X(d*I + k) ...
end do

f(i1) = g(i2) requires c*i1 + j = d*i2 + k
This has a solution iff gcd(c,d) divides k - j

Example
   A(2*I) = ...
   ... = A(2*I + 1)
   gcd(2,2) = 2 does not divide 1 - 0, so the references are independent

The GCD test is of limited use because it is very conservative; often gcd(c,d) = 1
   X(4*i+1) = F(X(5*i+2))
Other, more complex tests have been developed
• e.g. Banerjee's Inequality
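The test above can be sketched in a few lines of Python (the helper name gcd_test is mine; a real compiler would run this for each pair of array references):

```python
from math import gcd

def gcd_test(c, j, d, k):
    # Accesses X(c*i1 + j) and X(d*i2 + k) can touch the same element only
    # if c*i1 + j = d*i2 + k has an integer solution, which requires
    # gcd(c, d) to divide k - j. Returning False proves independence;
    # True only means a dependence could not be ruled out.
    return (k - j) % gcd(c, d) == 0

# A(2*I) vs A(2*I + 1): gcd(2,2) = 2 does not divide 1, so independent
print(gcd_test(2, 0, 2, 1))   # False
# X(4*i+1) vs X(5*i+2): gcd(4,5) = 1 divides everything -- inconclusive
print(gcd_test(4, 1, 5, 2))   # True
```

The second call shows why the test is conservative: gcd = 1 divides every difference, so the test can never prove independence there.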
Vector Code Generation

In a vector architecture, a vector instruction performs identical operations on vectors of data
Generally, the vector operations are independent
• A common exception is reductions
In general, to vectorize:
• There should be no cycles in the dependence graph
• Dependence flows should be downward
• Some rearranging of code may be needed
Vector Code Generation: Example

do I = 1, N
S1:  A(I) = B(I)
S2:  C(I) = A(I) + B(I)
S3:  E(I) = C(I+1)
end do

Construct dependence graph:
   S1 δ S2   (true dependence on A)
   S3 δ̄ S2   (antidependence on C)

Vectorizes (after re-ordering S2 and S3 due to the antidependence):
S1: A(1:N) = B(1:N)
S3: E(1:N) = C(2:N+1)
S2: C(1:N) = A(1:N) + B(1:N)
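As a check, a small Python model (function names mine; 0-based indexing stands in for Fortran's 1-based) shows the re-ordered vector form computes the same values as the sequential loop, because S3 reads the old values of C in both versions:

```python
def scalar_version(B, C0):
    # Sequential loop; C0 holds the initial contents of C (length N+1,
    # since S3 reads C(I+1)).
    N = len(B)
    A = [0] * N; C = list(C0); E = [0] * N
    for i in range(N):
        A[i] = B[i]            # S1
        C[i] = A[i] + B[i]     # S2
        E[i] = C[i + 1]        # S3: C(i+1) has not yet been overwritten here
    return A, C, E

def vector_version(B, C0):
    # Re-ordered vector form: S1, then S3 (reads old C(2:N+1)), then S2
    N = len(B)
    A = B[:]                                        # S1: A(1:N) = B(1:N)
    E = C0[1:N + 1]                                 # S3: E(1:N) = C(2:N+1)
    C = [a + b for a, b in zip(A, B)] + [C0[N]]     # S2: C(1:N) = A + B
    return A, C, E

B = [1, 2, 3, 4]; C0 = [9, 8, 7, 6, 5]
print(scalar_version(B, C0) == vector_version(B, C0))   # True
```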
Multiple Processors (Concurrentization)

Often used on outer loops
Example

do I = 1, N
 do J = 2, N
S1:  A(I,J) = B(I,J) + C(I,J)
S2:  C(I,J) = D(I,J)/2
S3:  E(I,J) = A(I,J-1)**2 + E(I,J-1)
 end do
end do

Data Dependences & Directions
   S1 (=, <) S3
   S1 (=, =) S2
   S3 (=, <) S3

Observations
• All dependence directions for the I loop are =
• Iterations of the I loop can be scheduled in parallel
Scheduling

Data Parallel Programming Model
• SPMD (single program, multiple data)
Compiler can pre-schedule:
• Processor 1 executes 1st N/P iterations,
• Processor 2 executes next N/P iterations
• Processor P executes last N/P iterations
• Pre-scheduling is effective if execution time is nearly identical for each iteration
Self-scheduling is often used:
• If each iteration is large
• Time varies from iteration to iteration
  - iterations are placed in a "work queue"
  - a processor that is idle, or becomes idle, takes the next block of work from the queue (critical section)
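A minimal self-scheduling sketch in Python (the function name and block size are mine; the Queue plays the role of the work queue, and its internal lock is the critical section):

```python
import threading, queue

def self_schedule(n_iters, work_fn, num_workers=4, block=8):
    # Enqueue blocks of iterations; idle workers grab the next block.
    q = queue.Queue()
    for start in range(0, n_iters, block):
        q.put(range(start, min(start + block, n_iters)))
    results = [None] * n_iters
    def worker():
        while True:
            try:
                blk = q.get_nowait()   # one block per grab (critical section)
            except queue.Empty:
                return                 # queue drained: this worker is done
            for i in blk:
                results[i] = work_fn(i)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

print(self_schedule(10, lambda i: i * i))   # [0, 1, 4, ..., 81]
```

Smaller blocks balance load better when iteration times vary; larger blocks amortize the cost of the critical section.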
Code Generation with Dependences

do I = 2, N
S1:  A(I) = B(I) + C(I)
S2:  C(I) = D(I) * 2
S3:  E(I) = C(I) + A(I-1)
end do

Data Dependences & Directions
   S1 δ̄= S2
   S1 δ< S3
   S2 δ= S3

Parallel Code on N-1 Processors
S1: A(I) = B(I) + C(I)
    signal(I)
S2: C(I) = D(I) * 2
    if (I > 2) wait(I-1)
S3: E(I) = C(I) + A(I-1)

Observation
• Weak data-dependence tests may add unnecessary synchronization
• Good dependence testing is crucial for high performance
Reducing Synchronization

do I = 1, N
S1:  A(I) = B(I) + C(I)
S2:  D(I) = A(I) * 2
S3:  SUM = SUM + A(I)
end do

Parallel Code: Version 1
do I = p, N, P
S1:  A(I) = B(I) + C(I)
S2:  D(I) = A(I) * 2
     if (I > 1) wait(I-1)
S3:  SUM = SUM + A(I)
     signal(I)
end do
Reducing Synchronization, contd.

Parallel Code: Version 2
SUMX(p) = 0
do I = p, N, P
S1:  A(I) = B(I) + C(I)
S2:  D(I) = A(I) * 2
S3:  SUMX(p) = SUMX(p) + A(I)
end do
barrier synchronize
add partial sums
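Version 2 can be sketched in Python with threads (names mine; the join stands in for the barrier):

```python
import threading

def parallel_sum(a, P=4):
    # Each of P threads accumulates a private partial sum over a strided
    # slice (do I = p, N, P), so no synchronization is needed in the loop.
    partial = [0] * P
    def worker(p):
        s = 0
        for i in range(p, len(a), P):
            s += a[i]
        partial[p] = s
    threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
    for t in threads: t.start()
    for t in threads: t.join()     # barrier synchronize
    return sum(partial)            # add partial sums

print(parallel_sum(list(range(1, 101))))   # 5050
```

The per-iteration wait/signal pair of Version 1 is replaced by one barrier and a final P-element reduction.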
Vectorization vs Concurrentization

When a system is a vector MP, when should vector/concurrent code be generated?

do J = 1,N
 do I = 1,N
S1:  A(I,J+1) = B(I,J) + C(I,J)
S2:  D(I,J) = A(I,J) * 2
 end do
end do

Parallel & Vector Code: Version 1
doacross J = 1,N
S1:  A(1:N,J+1) = B(1:N,J) + C(1:N,J)
     signal(J)
     if (J > 1) wait(J-1)
S2:  D(1:N,J) = A(1:N,J) * 2
end do
Vectorization vs Concurrentization

Parallel & Vector Code: Version 2
Vectorize on J, but non-unit stride memory access
(assuming Fortran column-major storage order)

doall I = 1,N
S1:  A(I,2:N+1) = B(I,1:N) + C(I,1:N)
S2:  D(I,1:N) = A(I,1:N) * 2
end do
Summary

Vectorizing compilers have been a success
Dependence analysis is critical to any auto-parallelizing scheme
• Software (static) disambiguation
• C pointers are especially difficult
Can also be used for improving performance of sequential programs
• Loop interchange
• Fusion
• Etc. (see add'l slides at end of lecture)
Cray-1 Architecture

Circa 1976
80 MHz clock
• When high-performance mainframes were 20 MHz
Scalar instruction set
• 16/32-bit instruction sizes
• Otherwise conventional RISC
• 8 S registers (64 bits)
• 8 A registers (24 bits)
In-order pipeline
• Issue in order
• Can complete out of order (no precise traps)
Cray-1 Vector ISA

8 vector registers
• 64 elements
• 64 bits per element (word length)
• Vector length (VL) register
RISC format
• Vi <- Vj OP Vk
• Vi <- mem(Aj, disp)
Conditionals via vector mask (VM) register
• VM <- Vi pred Vj
• Vi <- Vj conditional on VM
Vector Example

Do 10 i=1,looplength
   a(i) = b(i) * x + c(i)
10 continue

      A1 <- looplength   .initial values:
      A2 <- address(a)   .for the arrays
      A3 <- address(b)   .
      A4 <- address(c)   .
      A5 <- 0            .index value
      A6 <- 64           .max hardware VL
      S1 <- x            .scalar x in register S1
      VL <- A1           .set VL - performs mod function
      BrC done, A1<=0    .branch if nothing to do
more: V3 <- A4,A5        .load c indexed by A5 - addr mode not in Cray-1
      V1 <- A3,A5        .load b indexed by A5
      V2 <- V1 * S1      .vector times scalar
      V4 <- V2 + V3      .add in c
      A2,A5 <- V4        .store to a indexed by A5
      A7 <- VL           .read actual VL
      A1 <- A1 - A7      .remaining iteration count
      A5 <- A5 + A7      .increment index value
      VL <- A6           .set VL for next iteration
      BrC more, A1>0     .branch if more work
done:
Compare with Scalar

Do 10 i=1,looplength
   a(i) = b(i) * x + c(i)
10 continue

2 loads
1 store
2 FP
1 branch
1 index increment (at least)
1 loop count increment
total -- 8 instructions per iteration

4-wide superscalar => up to 1 FP op per cycle
vector, with chaining => up to 2 FP ops per cycle (assuming mem b/w)

Also, in a CMOS microprocessor, vectors would save a lot of energy.
Vector Conditional Loop

do 80 i = 1,looplen
   if (a(i).eq.b(i)) then
      c(i) = a(i) + e(i)
   endif
80 continue

V1 <- A1           .load a(i)
V2 <- A2           .load b(i)
VM <- V1 == V2     .compare a and b; result to VM
V3 <- A3; VM       .load e(i) under mask
V4 <- V1 + V3; VM  .add under mask
A4 <- V4; VM       .store to c(i) under mask
Vector Conditional Loop

Gather/Scatter Method (used in later Cray machines)

do 80 i = 1,looplen
   if (a(i).eq.b(i)) then
      c(i) = a(i) + e(i)
   endif
80 continue

V1 <- A1          .load a(i)
V2 <- A2          .load b(i)
VM <- V1 == V2    .compare a and b; result to VM
V5 <- IOTA(VM)    .form index set
VL <- pop(VM)     .find new VL (population count)
V6 <- A1, V5      .gather a(i) values
V3 <- A3, V5      .gather e(i) values
V4 <- V6 + V3     .add a and e
A4,V5 <- V4       .scatter sum into c(i)
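The gather/scatter sequence can be modeled in Python (function name mine; lists stand in for vector registers):

```python
def conditional_add(a, b, c, e):
    # VM <- a == b; V5 <- IOTA(VM); gather, add densely, scatter back
    vm = [x == y for x, y in zip(a, b)]            # vector mask
    idx = [i for i, m in enumerate(vm) if m]       # IOTA(VM): index set
    ga = [a[i] for i in idx]                       # gather a(i)
    ge = [e[i] for i in idx]                       # gather e(i)
    sums = [x + y for x, y in zip(ga, ge)]         # add at full density
    for i, s in zip(idx, sums):                    # scatter into c(i)
        c[i] = s
    return c

print(conditional_add([1, 2, 3, 4], [1, 0, 3, 0], [0, 0, 0, 0], [10, 20, 30, 40]))
# positions 0 and 2 match, so c becomes [11, 0, 33, 0]
```

Unlike the masked version, the add runs at full density over only the selected elements, which pays off when the mask is sparse.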
Thinking Machines CM-1/CM-2

Fine-grain parallelism
Looks like intelligent RAM to host (front-end)
Front-end dispatches "macro" instructions to sequencer
Macro instructions decoded by sequencer and broadcast to bit-serial parallel processors
[Figure: an array of bit-serial processors (P), each with its own memory (M); the host sends instructions to a sequencer, which broadcasts decoded operations to all processors]
CM Basics, contd.

All instructions are executed by all processors, subject to context flag
Context flags
• Processor is selected if context flag = 1
• Saving and restoring of context is unconditional
• AND, OR, NOT operations can be done on context flag
Operations
• Can do logical, integer, floating point as a series of bit-serial operations
CM Basics, contd.

Front-end can broadcast data
• (e.g. immediate values)
SEND instruction does communication
• Within each processor, pointers can be computed, stored and re-used
Virtual processor abstraction
• Time multiplexing of processors
Data Parallel Programming Model

"Parallel operations across large sets of data"
SIMD is an example, but can also be driven by multiple (identical) threads
• Thinking Machines CM-2 used SIMD
• Thinking Machines CM-5 used multiple threads
Connection Machine Architecture

Nexus: 4x4, 32 bits wide
• Cross-bar interconnect for host communications
16K processors per sequencer
Memory
• 4K mem per processor (CM-1)
• 64K mem per processor (CM-2)
CM-1 Processor
• 16 processors on processor chip
Instruction Processing

HLLs: C* and FORTRAN 8X
Paris virtual machine instruction set
Virtual processors
• Allows some hardware independence
• Time-share real processors
• V virtual processors per real processor => 1/V as much memory per virtual processor
Nexus contains sequencer
• AMD 2900 bit-sliced micro-sequencer
• 16K of 96-bit horizontal microcode
Instruction processing:
• 32-bit virtual machine insts (host)
  -> 96-bit microcode (nexus sequencer)
  -> nanocode (to processors)
CM-2

Re-designed sequencer; 4x microcode memory
New processor chip
FP accelerator (1 per 32 processors)
16x memory capacity (4K -> 64K)
SEC/DED on RAM
I/O subsystem
Data vault
Graphics system
Improved router
Performance

Computation
• 4000 MIPS 32-bit integer
• 20 GFLOPS 32-bit FP
• 4K x 4K matrix mult: 5 GFLOPS
Communication
• 2-d grid: 3 microseconds per bit
  96 microseconds per 32 bits
  20 billion bits/sec
• General router: 600 microseconds per 32 bits
  3 billion bits/sec
Compare with CRAY Y-MP (8 procs.)
• 2.4 GFLOPS, but could come much closer to peak than CM-2
• 246 billion bits/sec to/from shared memory
Outline

Automatic Parallelization
Vector Architectures
• Cray-1 case study
Data Parallel Programming
• CM-1/2 case study
CUDA Overview (separate slides)
Readings
• W. Daniel Hillis and Guy L. Steele, Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183.
• S. Ryoo, et al., Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA, Proceedings of PPoPP, Feb. 2008.
Improving Parallelism

Loop Interchange

do J = 1,N
 do I = 2,N
S1:  A(I,J) = A(I-1,J) + B(I)
 end do
end do

do I = 2,N
 do J = 1,N
S1:  A(I,J) = A(I-1,J) + B(I)
 end do
end do
Improving Parallelism

Induction Variable Recognition
• Successive values form an arithmetic progression

INC = N
do I = 1,N
   I2 = 2*I-1
   X(INC) = Y(I) + Z(I2)
   INC = INC - 1
end do

do I = 1,N
   X(N-I+1) = Y(I) + Z(2*I - 1)
end do

X(N:1:-1) = Y(1:N) + Z(1:2*N-1:2)
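A Python model (function names mine; 0-based indexing) confirms the rewritten subscripts compute the same result as the original induction variables:

```python
def induction_original(Y, Z):
    # INC counts down from N; I2 = 2*I-1 steps by 2
    N = len(Y)
    X = [0] * N
    inc = N
    for i in range(1, N + 1):
        i2 = 2 * i - 1
        X[inc - 1] = Y[i - 1] + Z[i2 - 1]
        inc -= 1
    return X

def induction_rewritten(Y, Z):
    # After recognition: X(N-I+1) = Y(I) + Z(2*I-1); no loop-carried scalars,
    # so each element can be computed independently
    N = len(Y)
    return [Y[i - 1] + Z[2 * i - 2] for i in range(1, N + 1)][::-1]

Y = [1, 2, 3]; Z = [10, 20, 30, 40, 50]
print(induction_original(Y, Z) == induction_rewritten(Y, Z))   # True
```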
Improving Parallelism

Wraparound Variable Recognition

J = N
do I = 1,N
   B(I) = (A(J) + A(I))/2
   J = I      (Is J an induction variable?)
end do

Peel first iteration:
if (N >= 1) then
   B(1) = (A(N) + A(1))/2
   do I = 2,N
      B(I) = (A(I-1) + A(I))/2
   end do
end if

if (N >= 1) then
   B(1) = (A(N) + A(1))/2
   B(2:N) = (A(1:N-1) + A(2:N))/2
end if
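A quick Python check (function names mine) that peeling the first iteration preserves the result:

```python
def wraparound_loop(A):
    # J wraps around: N on the first iteration, then trails I by one
    N = len(A)
    B = [0.0] * N
    j = N
    for i in range(1, N + 1):
        B[i - 1] = (A[j - 1] + A[i - 1]) / 2
        j = i
    return B

def peeled_loop(A):
    # First iteration peeled off; the rest is a clean, vectorizable slice
    N = len(A)
    B = [0.0] * N
    if N >= 1:
        B[0] = (A[N - 1] + A[0]) / 2
        for i in range(2, N + 1):
            B[i - 1] = (A[i - 2] + A[i - 1]) / 2
    return B

print(wraparound_loop([2, 4, 6, 8]) == peeled_loop([2, 4, 6, 8]))   # True
```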
Improving Parallelism

Symbolic Dependence Testing

do I = 1,N
S1:  A(LOW+I-1) = B(I)
S2:  B(I+N) = A(LOW+I)
end do

Global Forward Substitution

NP1 = N+1
NP2 = N+2
...
do I = 1,N
S1:  B(I) = A(NP1) + C(I)
S2:  A(I) = A(I) - 1
   do J = 2,N
S3:   D(J,NP1) = D(J-1,NP2)*C(J) + 1
   end do
end do

Observations
• Useful for symbolic dependence testing
• Constant propagation is a special case
Improving Parallelism

Semantic Analysis

do I = LOW,HIGH
S1:  A(I) = B(I) + A(I+M)      (M is unknown)
end do

if (M >= 0) then
   do I = LOW,HIGH
S1:   A(I) = B(I) + A(I+M)
   end do
else
   do I = LOW,HIGH
S1:   A(I) = B(I) + A(I+M)
   end do
end if

The loop in the top branch (M >= 0) is vectorizable
Improving Parallelism

Interprocedural Dependence Analysis
• If a procedure call is present in a loop, can we generate parallel code?
Procedure in-lining
• May reduce overhead and lead to exact dependence analysis
• May increase compilation time
Improving Parallelism

Removal of Output Dependences and Antidependences
Eliminate them by renaming
• But may require extra vector storage

do I = 1,N
S1:  A(I) = B(I) + C(I)
S2:  D(I) = (A(I) + A(I+1))/2
end do

do I = 1,N
S3:  ATEMP(I) = A(I+1)
S1:  A(I) = B(I) + C(I)
S2:  D(I) = (A(I) + ATEMP(I))/2
end do
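In Python (function names mine), copying the old A(I+1) values into ATEMP first removes the loop-carried antidependence while preserving results:

```python
def with_antidependence(B, C, A0):
    # A0 is the initial A (length N+1); A(I+1) is read before being overwritten
    N = len(B)
    A = list(A0); D = [0.0] * N
    for i in range(N):
        A[i] = B[i] + C[i]               # S1
        D[i] = (A[i] + A[i + 1]) / 2     # S2: A(i+1) still holds its old value
    return A, D

def renamed_temp(B, C, A0):
    N = len(B)
    A = list(A0)
    ATEMP = A[1:N + 1]                   # S3: capture all old A(I+1) values
    for i in range(N):                   # S1: now independent of S2
        A[i] = B[i] + C[i]
    D = [(A[i] + ATEMP[i]) / 2 for i in range(N)]   # S2
    return A, D

print(with_antidependence([1, 2], [3, 4], [0, 10, 20]) ==
      renamed_temp([1, 2], [3, 4], [0, 10, 20]))   # True
```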
Improving Parallelism

Scalar Expansion

do I = 1,N
S1:  X = A(I) + B(I)
S2:  C(I) = X ** 2
end do

allocate (XTEMP(1:N))
do I = 1,N
S1:  XTEMP(I) = A(I) + B(I)
S2:  C(I) = XTEMP(I) ** 2
end do
X = XTEMP(N)
free (XTEMP)
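A Python sketch of the transformation (function names mine); note the live-out value of X must be preserved from the last iteration:

```python
def scalar_x_loop(A, B):
    N = len(A)
    C = [0] * N
    x = 0
    for i in range(N):
        x = A[i] + B[i]      # S1: scalar X reused every iteration
        C[i] = x ** 2        # S2
    return C, x              # x is live-out with its final value

def expanded_loop(A, B):
    N = len(A)
    XTEMP = [A[i] + B[i] for i in range(N)]   # S1: each iteration owns a slot
    C = [XTEMP[i] ** 2 for i in range(N)]     # S2: now fully parallel
    x = XTEMP[N - 1]                          # X = XTEMP(N)
    return C, x

print(scalar_x_loop([1, 2], [3, 4]) == expanded_loop([1, 2], [3, 4]))   # True
```

Reusing a single X serializes the loop through a false dependence; the expanded form trades N extra words of storage for full parallelism.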
Improving Parallelism

Fission by Name
• Break a loop into several adjacent loops to improve performance of memory

Loop Fusion

do I = 2,N
S1:  A(I) = B(I) + C(I)
end do
do I = 2,N
S2:  D(I) = A(I-1)
end do

do I = 2,N
S1:  A(I) = B(I) + C(I)
S2:  D(I) = A(I-1)
end do

Observations
• Fusion reduces start-up costs for loops
• When does fusion yield benefits over fission?
Improving Parallelism

Strip Mining

do I = 1,N
   A(I) = B(I) + 1
   D(I) = B(I) - 1
end do

do J = 1,N,32      (vector reg. len. = 32)
   do I = J,MIN(J+31,N)
      A(I) = B(I) + 1
      D(I) = B(I) - 1
   end do
end do
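Strip mining in Python (the function name and vlen parameter are mine; each inner strip would map onto one vector operation of at most vlen elements):

```python
def strip_mined(B, vlen=32):
    N = len(B)
    A = [0] * N; D = [0] * N
    for j in range(0, N, vlen):       # do J = 1,N,32
        hi = min(j + vlen, N)         # MIN(J+31,N): last strip may be short
        for i in range(j, hi):        # one strip = one vector instruction
            A[i] = B[i] + 1
            D[i] = B[i] - 1
    return A, D

A, D = strip_mined(list(range(10)), vlen=4)
print(A, D)
```

This is exactly what the Cray-1 example earlier does in software with the VL register: full 64-element strips, then one short strip for the remainder.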
Loop Collapsing

real A(5,5), B(5,5)
do I = 1,5
   do J = 1,5
      A(I,J) = B(I,J) + 2
   end do
end do

real A(25), B(25)
do IJ = 1,25
   A(IJ) = B(IJ) + 2
end do
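In Python (function names mine), collapsing replaces the nested traversal with a single flat index; since the operation is elementwise, the flattening order (row- vs column-major) does not change the values:

```python
def nested(B):
    # Original 5x5 doubly nested loop
    A = [[0] * 5 for _ in range(5)]
    for i in range(5):
        for j in range(5):
            A[i][j] = B[i][j] + 2
    return A

def collapsed(B_flat):
    # Collapsed: 25 elements, one loop, one index
    return [b + 2 for b in B_flat]

B = [[5 * i + j for j in range(5)] for i in range(5)]
flat = [x for row in B for x in row]
print([x for row in nested(B) for x in row] == collapsed(flat))   # True
```

Collapsing yields one long vector (length 25) instead of five short ones, amortizing vector start-up cost.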
Conditional Statements

do I = 1,N
   IF (A(I) .LE. 0) GOTO 100
   A(I+1) = B(I) + 3
end do

do I = 1,N      (Is this vectorizable?)
   BR1 = A(I) .LE. 0
   IF (.NOT. BR1) A(I+1) = B(I) + 3
end do

do I = 1,N      (Is this vectorizable?)
   BR1 = A(I) .LE. 0
   IF (.NOT. BR1) A(I) = B(I) + 3
end do
Conditional Statements, contd.

do I = 1,N
   BR1(I) = A(I) .LE. 0
   IF (.NOT. BR1(I)) A(I) = B(I) + 3
end do

BR1(1:N) = A(1:N) .LE. 0                       (scalar expansion)
WHERE (.NOT. BR1(1:N)) A(1:N) = B(1:N) + 3
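The WHERE form can be modeled in Python (function name mine): compute the mask vector once, then update only the unmasked elements:

```python
def masked_update(A, B):
    br1 = [a <= 0 for a in A]             # BR1(1:N) = A(1:N) .LE. 0
    return [b + 3 if not m else a         # WHERE (.NOT. BR1) A = B + 3
            for a, b, m in zip(A, B, br1)]

print(masked_update([-1, 2, 0, 5], [10, 20, 30, 40]))   # [-1, 23, 0, 43]
```

This is the software analogue of the Cray-1 vector mask: every element is processed, but stores take effect only where the mask permits.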
Example: Reduction

Each processor (k) unconditionally tests its context flag: it contributes its value if the flag is set, else 0
Then perform an unconditional summation of integers
• Special case: count processors
Example: Compute Partial Sums

Similar to reduction, except use a sum-prefix computation
• Special case: enumerate - give each active processor a sequence number
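A sequential Python sketch of the semantics (function names mine; the CM would compute this as a log-depth parallel scan over the active processors):

```python
def prefix_sums(values, active):
    # Active position k receives the sum of active values at positions <= k
    out = [0] * len(values)
    running = 0
    for k, (v, a) in enumerate(zip(values, active)):
        if a:                 # context flag set
            running += v
            out[k] = running
    return out

def enumerate_active(active):
    # Special case: sum-prefix over all-ones numbers the active processors
    return prefix_sums([1] * len(active), active)

print(prefix_sums([3, 1, 4, 1], [True, False, True, True]))   # [3, 0, 7, 8]
print(enumerate_active([True, False, True, True]))            # [1, 0, 2, 3]
```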
Example: Find End of Linked List

[figure-only slide]
Burroughs BSP

Developed during the "supercomputer wars" of the late '70s and early '80s
Taken to prototype stage, but never shipped
Draws a distinct division between vector and scalar processing
• Control and parallel processors have totally different memories (for both insts. and data)
Burroughs BSP

Control (scalar) processor
• Processes all instructions from control memory
• 80 ns clock => up to 1.5 MFLOPS
Parallel (vector) processor
• 16 processors
• 160 ns clock
• 2 cp latency for major FP operations
• Pipelined at a high level
BSP FAPAS Pipeline

Stages: Fetch, Align, Process, Align, Store (FAPAS)

[Figure: operands are fetched from Memory, pass through an input Alignment Net to the Processors, and results return through an output Alignment Net to be stored back to Memory]
Example

Example: D = A * B + C     (column positions approximate; each FP op takes 2 cp)

cycle:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Fetch:     A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4
Align:        A1 B1 C1 A2 B2 C2 A3 B3 C3 A4
Process:         *  *  +  +  *  *  +  +  *  *  +
Align:                       D1          D2
Store:                          D1          D2

Note that memory bandwidth and FP bandwidth are both fully committed for a triad.
FAPAS Pipeline, contd.

Throughput: 1 FP op per 320 ns x 16 processors in parallel => 50 MFLOPS peak
=> big difference between performance in scalar and vector modes
Also, many scalar arithmetic operations were done in the vector unit because of the partitioning of memory
Architecture

Memory-to-memory
Unlimited vector lengths
• (stripmining is automatic in hardware)
Up to 5 input operands per instruction (pentads)
• Takes into account all evaluation trees
• Refer to table 1 for list of forms
• Any fixed stride
Supports two levels of loop nesting
Example:

VFORM TRIAD, op1, op2
   OBV (operand bit vector)
   RBV (result bit vector)
VOPERAND A
VOPERAND B
VOPERAND C
VRESULT Z

=> RBV,Z = A op1 B op2 C, OBV
FORTRAN Example

Livermore Fortran Kernel 1, Hydro Excerpt
• inner loop: x(k) = u(k)*(r*z(k+10) + t*z(k+11))

VLEN = "one level of nesting, length 100"
VFORM PENTAD2, *, +, *, *, no bit vectors
   no OBV
   no RBV
VOPERAND r (broadcast)
VOPERAND z+10 (stride 1)
VOPERAND t (broadcast)
VOPERAND z+11 (stride 1)
VOPERAND u (stride 1)
VRESULT x (stride 1)
Architecture, contd.

Strided loads/stores
• Implemented with a prime memory system to avoid conflicts
Sparse matrices
• compress
• expand
• random fetch
• random store
IF statements
• bit vectors embedded in vector insts.
Recurrences/reductions
• special instructions
Chaining
• built into instructions via multi-operands
• saves loads and stores in a mem-to-mem architecture
• 10 temp registers per arithmetic unit
Template Processing

[Figure: the Vector Input and Validation Unit assembles a sequence of instructions from control memory into a global description of the operation (a template) and checks memory hazards; instruction fields are filled by the control processor from 16 GPRs if needed. 120-bit entries pass through a 16-entry Vector Data Buffer and a 16-entry Template Descriptor Memory before being shipped to the vector unit, where the Template Control Unit issues read/write control to the F-A-P-A-S stages]