Mark Hampton and Krste Asanović April 9, 2008

Compiling for Vector-Thread Architectures

MIT Computer Science and Artificial Intelligence Laboratory

University of California at Berkeley

Vector-thread (VT) architectures efficiently encode parallelism in a variety of applications

• A VT architecture unifies the vector and multithreaded execution models

• The Scale VT architecture exploits DLP, TLP, and ILP (with clustering) simultaneously

• Previous work [Krashinsky04] has shown the ability of Scale to take advantage of the parallelism available in several different types of loops
– However, that evaluation relied on mapping code to Scale using handwritten assembly

This work presents a back end code generator for the Scale architecture

• Compiler infrastructure is relatively immature, as much of the work to this point consisted of getting all the pieces to run together

• We prioritized taking advantage of Scale’s unique features to enable support for “difficult” types of loops rather than focusing on optimizations
– Compiler can parallelize loops with internal control flow, outer loops, and loops with cross-iteration dependences
– However, compiler does not currently handle while loops

• Despite lack of optimizations, compiler is still able to produce some significant speedups

Talk outline

• Vector-thread architecture background

• Compiler overview
– Emphasis on how code is mapped to Scale

• Performance evaluation

• Conclusions

Vector-thread architectures use a virtual processor (VP) abstraction

• VPs contain registers and ALUs

• VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)

• AIBs must be explicitly fetched
– Can either be fetched for a group of VPs (vector-fetch) or for a single VP (thread-fetch)

• Fetches can be predicated to allow conditional branching

• A VP stops after it executes an AIB that does not issue a fetch instruction
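The fetch rules above can be sketched as a small sequential model in C. This is only an illustration under stated assumptions: the `VP` and `AIB` types, the two example blocks, and the function names are hypothetical and not part of Scale, and the VPs are run one after another rather than in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one virtual processor: just a few registers. */
typedef struct { int regs[4]; } VP;

/* An AIB runs on one VP and returns the AIB it fetches next,
   or NULL when it issues no fetch (the VP then stops). */
typedef struct AIB AIB;
struct AIB { const AIB *(*run)(VP *vp); };

const AIB *run_double(VP *vp);
const AIB *run_check(VP *vp);
const AIB AIB_DOUBLE = { run_double };
const AIB AIB_CHECK  = { run_check };

/* Doubles r0, then unconditionally thread-fetches the check block. */
const AIB *run_double(VP *vp) { vp->regs[0] *= 2; return &AIB_CHECK; }

/* Predicated fetch: fetch the doubling block only while r0 < 100. */
const AIB *run_check(VP *vp) { return vp->regs[0] < 100 ? &AIB_DOUBLE : NULL; }

/* Runs one VP from a starting AIB until some AIB issues no fetch.
   Returns the number of AIBs executed. */
int vp_run(VP *vp, const AIB *start) {
    int executed = 0;
    const AIB *aib = start;
    while (aib != NULL) {
        aib = aib->run(vp);
        executed++;
    }
    return executed;
}

/* Vector-fetch: the same starting AIB is issued to every VP in the group. */
void vector_fetch(VP *vps, size_t n, const AIB *start) {
    assert(vps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) vp_run(&vps[i], start);
}
```

Each VP follows its own chain of thread-fetches after the common vector-fetch, so VPs in the same group can execute different numbers of AIBs.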

[Figure: a single VP (registers, ALUs) executing a vector-fetched AIB followed by thread-fetched AIBs; labels: vector-fetch, thread-fetch, (p) fetch, fetch instruction]

[Figure: control processor, vector memory unit, and memory connected to VP0 through VPN, each VP with registers and ALUs; labels: vector-fetch, thread-fetch, vector-load, vector-store, cross-VP queue]

A control processor interacts with a vector of virtual processors

[Figure: memory, control processor, vector memory unit, and a vector-thread unit with four lanes. Lane 0 holds VP0, VP4, VP8, VP12; lane 1 holds VP1, VP5, VP9, VP13; lane 2 holds VP2, VP6, VP10, VP14; lane 3 holds VP3, VP7, VP11, VP15. Each lane has ALUs, shared registers, cr0 and cr1, and private registers per VP; labels: vector-load, vector-store, vector-fetch, cross-VP queue]

The Scale processor prototype implements the vector-thread architectural paradigm

Scale is a high-performance, energy-efficient embedded design [Krashinsky07]
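The lane diagram for the prototype places VP0, VP4, VP8, and VP12 on lane 0, VP1, VP5, VP9, and VP13 on lane 1, and so on. A minimal sketch of that striping, assuming four lanes and a plain modulo assignment (the function names are hypothetical):

```c
#include <assert.h>

enum { NUM_LANES = 4 };  /* lane count taken from the diagram */

/* Lane that hosts a given VP when VPs are striped across lanes. */
int vp_lane(int vp) {
    assert(vp >= 0);
    return vp % NUM_LANES;
}

/* Position of the VP among the VPs sharing its lane (0 for VP0..VP3). */
int vp_slot(int vp) {
    assert(vp >= 0);
    return vp / NUM_LANES;
}
```

Under this mapping, consecutive VPs (and therefore consecutive loop iterations) land on different lanes.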

Scale excels at exploiting loop-level parallelism

• Typical programming model is to have control processor launch a group of VPs, with each VP executing a single iteration

• The ability of VPs to direct their own control flow and to use the cross-VP network enables support for a wider variety of loop types than traditional vector designs

• Ability to support vector execution in data-parallel code sections enables a higher degree of performance and energy efficiency than a traditional multithreaded design

The compiler for Scale ties together three existing infrastructures

[Figure: compiler flow from C Source Code to Binary Executable. Stages shown (in extraction order, which may differ from the order on the slide): SUIF Front End, Memory Dependence Analysis, SUIF-to-Trimaran Conversion, Classical Optimizations, Scalar-to-VP Code Transformation, Prepass Instruction Scheduling, Register Allocation, Postpass Instruction Scheduling, Cluster Assignment, AIB Formation, Chain Register Insertion, Assembly Code Generation, GCC Cross Compilation. Infrastructure labels: SUIF, Trimaran, GCC]

The compiler conducts a dependence analysis to select which loop to parallelize

• SUIF’s dependence library is used to annotate memory operations with direction vectors

• The restrict keyword is required to indicate there is no aliasing
– This is the extent of manual programmer intervention

• Trimaran uses the results of the SUIF analysis to detect whether a particular loop in a nest has any cross-iteration dependences
– Priority is given to parallelizing innermost DOALL loops
– If a loop nest contains no DOALL loops, the compiler tries to parallelize a DOACROSS loop
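The selection priority described above can be sketched as follows. The `LoopInfo` record and `select_loop` are hypothetical names, not Trimaran or SUIF APIs. The slides do not say which DOACROSS loop is chosen when several exist, so taking the first one listed is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LOOP_DOALL, LOOP_DOACROSS, LOOP_OTHER } LoopKind;

/* Hypothetical summary of one loop in a nest; depth 0 is the outermost loop. */
typedef struct { int depth; LoopKind kind; } LoopInfo;

/* Returns the index of the loop to parallelize, or -1 if there is none.
   The deepest DOALL loop wins; otherwise fall back to a DOACROSS loop
   (assumption: the first one in the array). */
int select_loop(const LoopInfo *nest, size_t n) {
    assert(nest != NULL || n == 0);
    int best_doall = -1, first_doacross = -1;
    for (size_t i = 0; i < n; i++) {
        if (nest[i].kind == LOOP_DOALL &&
            (best_doall < 0 || nest[i].depth > nest[best_doall].depth))
            best_doall = (int)i;
        if (nest[i].kind == LOOP_DOACROSS && first_doacross < 0)
            first_doacross = (int)i;
    }
    return best_doall >= 0 ? best_doall : first_doacross;
}
```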

[Figure: structure of a mapped loop, from Loop Entry to Loop Exit]

Header Block: Vector-fetched code, VTU commands, scalar instructions

Internal Loop Blocks: Thread-fetched code

Back Edge/Exit Block: Vector-fetched code, VTU commands, scalar instructions

Once a loop is selected, it is mapped to the VTU without any restructuring

Any established front end loop transformation can also be used, but that doesn’t change the back end code generation strategy

Simple DOALL loops are handled similarly to traditional vectorization

for (i = 0; i < len; i++)
    out[i] = COEFF*in1[i] + in2[i];

      li   r0, COEFF
loop: lw   r1, in1
      mult r2, r0, r1
      lw   r3, in2
      add  r4, r2, r3
      sw   r4, out
      add  in1, 4
      add  in2, 4
      add  out, 4
      sub  len, 1
      bnez len, loop

Compiler tasks:
• Add a command to configure the VTU
• Strip mine the loop
• Map scalar code to VTU code
• Propagate loop-invariant values to shared registers

      vcfgvl r5, 128, ...
      vwrsh  s0, COEFF
loop: setvl  r6, len
      vlw    v0, in1
      vmult  v1, v0, s0
      vlw    v2, in2
      vadd   sd0, v1, v2
      vsw    sd0, out
      sll    r7, r6, 2
      add    in1, r7
      add    in2, r7
      add    out, r7
      sub    len, r6
      bnez   len, loop
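Strip mining can be illustrated in plain C. This is a sketch, not the compiler's output: `VLMAX` and `setvl` are stand-ins that model the vector length returned by the setvl instruction, the value 128 is assumed to be a maximum vector length, and the inner loop stands for work that the VPs would perform in parallel.

```c
#include <assert.h>
#include <stddef.h>

enum { VLMAX = 128 };  /* assumed maximum vector length */

/* Models setvl: the number of elements handled by this strip. */
static size_t setvl(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

/* out[i] = coeff * in1[i] + in2[i], processed one strip at a time. */
void doall_strip_mined(int *out, const int *in1, const int *in2,
                       size_t len, int coeff) {
    assert(len == 0 || (out != NULL && in1 != NULL && in2 != NULL));
    while (len > 0) {
        size_t vl = setvl(len);
        /* One iteration per VP; coeff plays the role of shared register s0. */
        for (size_t vp = 0; vp < vl; vp++)
            out[vp] = coeff * in1[vp] + in2[vp];
        /* Scalar bookkeeping done by the control processor. */
        in1 += vl;
        in2 += vl;
        out += vl;
        len -= vl;
    }
}
```

Unlike the assembly listing, which tests `len` at the bottom of the loop, this sketch also tolerates `len == 0`.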

Internal control flow can be handled by allowing VPs to fetch their own code

for (i = 0; i < len; i++) {
    if (in[i] < 4) temp = in[i] * 4;
    else temp = in[i] * 2;
    out[i] = temp;
}

loop: lw   r0, in
      slt  r1, r0, 4
      bnez r1, b3

b2:   sll  r2, r0, 1
      j    b4

b3:   sll  r2, r0, 2

b4:   sw   r2, out
      # bookkeeping code ...
      bnez len, loop

Additional compiler tasks beyond simple DOALL case:
• Map branches and fall-through paths to VP fetches
– Place AIB addresses in shared regs as optimization
• Compute induction variable values used in internal loop blocks (not required for this example)


      vcfgvl r3, 128, ...
      vwrsh  s0, b2
      vwrsh  s1, b3

loop: setvl  r4, len
      vlw    v0, in
      vslt   p, v0, 4
      psel.fetch s1, s0

b2:   vsll   sd0, v0, 1

b3:   vsll   sd0, v0, 2

b4:   vsw    sd0, out
      # bookkeeping code ...
      bnez   len, loop


• Although example is simple, it illustrates how compiler is able to map complex control flow to VPs

• No need to execute both sides of a branch and throw away one set of results

• However, it is possible to perform if-conversion (although that is not currently implemented)
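The per-VP fetch can be mimicked in C with function pointers. This is a rough sketch under stated assumptions: `Block`, `block_b2`, `block_b3`, and `map_with_branch` are hypothetical names, strip mining is omitted, and the VPs are simulated sequentially. The point is that each element runs only the block its predicate selects.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*Block)(int x);

/* Stand-ins for the two thread-fetched blocks of the example. */
static int block_b2(int x) { return x * 2; }  /* else path: in[i] * 2 */
static int block_b3(int x) { return x * 4; }  /* then path: in[i] * 4 */

void map_with_branch(int *out, const int *in, size_t len) {
    assert(len == 0 || (out != NULL && in != NULL));
    /* Block addresses held once, like the AIB addresses in shared registers. */
    const Block s0 = block_b2, s1 = block_b3;
    for (size_t vp = 0; vp < len; vp++) {
        int v0 = in[vp];
        int p = v0 < 4;              /* per-VP predicate */
        Block target = p ? s1 : s0;  /* select which block this VP fetches */
        out[vp] = target(v0);        /* only the chosen path executes */
    }
}
```

An if-converted version would instead evaluate both `block_b2` and `block_b3` for every element and keep one result.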

The ability of VPs to direct their control flow allows outer loop parallelization

for (i = 0; i < len; i++) {
    sum = 0;
    for (j = 0; j < len-i; j++)
        sum += in[j] * in[j+i];
    out[i] = sum;
}

loop1: li   r0, 0
       sub  r1, len, i
       move r2, in
       sll  r3, i, 2
       add  r4, r3, in

loop2: lw   r5, r2
       lw   r6, r4
       mult r7, r5, r6
       add  sum, r7
       # bookkeeping code...
       bnez r1, loop2

       sw   sum, out
       # bookkeeping code...
       bnez len, loop1

Compiler has same tasks as in previous case:
• New aspect illustrated by this example is need to compute induction variables in internal loop blocks

No need to perform loop interchange or unrolling

       vcfgvl r8, 128, ...
       vwrsh  s0, len
       vwrsh  s1, in
       la     r9, vp_numbers
       vlb    v0, r9

loop1: setvl  r10, len
       vwrsh  s2, i
       vadd   v1, s2, v0
       ...

loop2: vplw...
       ...

       ...
       bnez   len, loop1
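A C sketch of the outer-loop mapping, assuming each VP takes one outer iteration and derives its own induction variable from the strip base plus its VP number (which appears to be the role of the vadd in the listing). The function name and the `vlmax` parameter are hypothetical, and the VPs are simulated sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* out[i] = sum over j of in[j] * in[j + i], one outer iteration per VP. */
void autocorr_outer(int *out, const int *in, size_t len, size_t vlmax) {
    assert(vlmax > 0);
    assert(len == 0 || (out != NULL && in != NULL));
    for (size_t base = 0; base < len; base += vlmax) {
        size_t vl = (len - base < vlmax) ? len - base : vlmax;  /* setvl */
        for (size_t vp = 0; vp < vl; vp++) {
            size_t i = base + vp;  /* induction variable computed per VP */
            int sum = 0;
            /* Inner loop runs inside the VP; its trip count differs per VP. */
            for (size_t j = 0; j < len - i; j++)
                sum += in[j] * in[j + i];
            out[i] = sum;
        }
    }
}
```

Because every VP runs its own inner loop to completion, the differing trip counts (`len - i`) need no loop interchange or unrolling.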

Loop-carried dependences can be mapped to the cross-VP network

for (i = 1; i < len; i++)
    out[i] = in[i] * out[i-1];

      sub  len, 1
      lw   r0, -4(out)
loop: lw   r1, in
      mult r0, r1
      sw   r0, out
      add  in, 4
      add  out, 4
      sub  len, 1
      bnez len, loop

Additional compiler tasks beyond simple DOALL case:
• Insert commands to push initial value into cross-VP network and to pop final value
• Map loop-carried values to prevVP/nextVP queues in VP code
• Copy any cross-VP queue values that have more than one reader to registers

      sub     len, 1
      lw      r0, -4(out)
      vcfgvl  r2, 128, ...
      xvppush r3, x0
loop: setvl   r4, len
      vlw     v0, in
      vmult   v1, v0, prevVP
      vmove   sd0, v1
      vmove   nextVP, v1
      vsw     sd0, out
      # bookkeeping code ...
      bnez    len, loop
      xvppop  r5, x0
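The cross-VP hand-off can be modelled in C with a single variable standing in for the queue. This is a simplified sketch: the function name and `vlmax` parameter are hypothetical, the VPs run sequentially, and the sketch assumes the value written by the last VP of one strip is the value read by the first VP of the next strip.

```c
#include <assert.h>
#include <stddef.h>

/* Computes out[i] = in[i] * out[i-1] for i = 1..len-1; out[0] must be set.
   Returns the value left in the queue after the last VP finishes. */
int cross_vp_product(int *out, const int *in, size_t len, size_t vlmax) {
    assert(out != NULL && in != NULL && len >= 1 && vlmax >= 1);
    int queue = out[0];                /* push the initial loop-carried value */
    for (size_t base = 1; base < len; ) {
        size_t vl = (len - base < vlmax) ? len - base : vlmax;  /* setvl */
        for (size_t vp = 0; vp < vl; vp++) {
            int prev = queue;              /* read prevVP */
            int v1 = in[base + vp] * prev;
            queue = v1;                    /* write nextVP */
            out[base + vp] = v1;           /* second reader uses the register copy */
        }
        base += vl;
    }
    return queue;                      /* pop the final value */
}
```

The value `v1` is copied to a local before use because it has two readers, the store and the next VP, which matches the last compiler task listed above.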


The compiler focuses on improving throughput rather than reducing single-thread latency

• Various phases are aimed at minimizing physical register usage
– Cluster assignment attempts to balance work even at the expense of inter-cluster moves
– Instruction scheduling tries to pack dependence chains together
– Chain register insertion is designed to avoid using the register file for short-lived values

• Additional details in paper

Evaluation methodology

• Scale simulator uses detailed models for VTU and cache, but a single-instruction-per-cycle latency for control processor
– Reduces magnitude of parallelized code speedups

• Performance is evaluated across a limited number of EEMBC benchmarks
– EEMBC benchmarks are difficult to automatically parallelize
– Continued improvements to the compiler infrastructure (e.g. if-conversion, front end loop transformations) would enable broader benchmark coverage

The speedups of (relatively) unoptimized code reflect Scale’s advantages

• Speedups exceed or are comparable to those observed in a limit study [Islam07] performed for an idealized 16-core multiprocessor supporting thread-level speculation; same is true for an infinite number of cores

• Results point to benefits of exploiting parallelism within a single core

More accurately ~11x

There is a variety of related work

• TRIPS also exploits multiple forms of parallelism, but the compiler’s focus is on forming blocks of useful instructions and mapping instructions to ALUs

• Stream processing compilers share some similarities with our approach, but also have somewhat different priorities, such as managing the utilization of the Stream Register File

• IBM’s Cell compiler has to deal with issues such as alignment and branch hints, which are not present for Scale

• GPGPU designs (Nvidia’s CUDA, AMD’s Stream Computing) also have similarities with Scale, but the differences in the programming models result in different focuses in the compilers

Concluding remarks

• Vector-thread architectures exploit multiple forms of parallelism

• This work presented a compiler for the Scale vector-thread architecture

• The compiler can parallelize a variety of loop types

• Significant performance gains were achieved over a single-issue scalar processor

A comparison to handwritten code shows there is still significant room for improvement

[Figure: bar chart of speedup over scalar code (y-axis 0 to 160) for benchmarks autocor_olv, autocor_xvp, fir, hpg, rgbcmy, rgbyiq; series: Compiled, Hand Coded]

There are several optimizations that can be employed to narrow the performance gap
