Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary, Canada smithmr @ ucalgary.ca


04/18/23

ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061

Copyright [email protected] 2 / 44 -- two days

To be tackled today
  What’s the problem?
  Standard code development of “C” code
  Process for “code with parallel instructions”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort

ADSP-2106x -- Parallelism opportunities

[Figure: ADSP-2106x core block diagram. Blocks shown: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), instruction cache memory (32 x 48), program sequencer, PMA bus (24), DMA bus (32), PMD bus (48), DMD bus (40), bus connect, register file (16 x 40), floating and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU, timer, flags, JTAG test and emulation.]

Ability for parallel memory operations
  One each on pm, dm and instruction cache busses
Memory pointer operations
  Post modify 2 index registers
  Automatic circular buffer operations
  Automatic bit reverse addressing
Many parallel operations and register to register bus transfers
  Rn = Rx + Ry or Rn = Rx * Ry
  Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
Zero overhead loops
Instruction pipeline issues
Key issue -- Only 48? bits available in OPCODE to describe
  16 data registers in 3 destinations and 6 sources = 135 bits
  2 * (8 index + 8 modify + 16 data) = 64 bits
  Condition code selection, 32 bit constants etc.

Compiler is only -- somewhat useful

See article in course notes from Embedded System Design Sept./October 2000
Need to get a systematic process to provide parallelism without pain
Need to know what to worry about and what not to
Lab 3 -- Implement FIR filter in parallel -- Help provided
Lab. Library version of FFT, custom version of Burg Algorithm (AR modeling)


Basic code development -- any system

Write the “C” code for the function

void Convert(float *temperature, int N)

which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)

Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices


Parallel Instruction Code Development

Write the 21k assembly code for the function
    void Convert(float *temperature, int N)
which etc...

Determine the instruction flow through the architecture using a resource usage diagram

Theoretically optimize the code -- a 2 minute counting process

Compare and contrast the amount of time to perform the subroutine before and after customization.


Standard “C” code

void Convert(float *temperature, int N) {
    int count;

    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}

Standard Warning -- What does an optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
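The warning above can be checked on any host C compiler. The following is a minimal sketch, not part of the original slides; the three function names are hypothetical and chosen only for this illustration. It shows that the answer depends on how the expression is written: a parenthesized (9 / 5) is integer division, while the left-to-right form on the slide is carried out in floating point.

```c
/* (9 / 5) is evaluated in integer arithmetic and truncates to 1. */
float scale_integer_constants(float celsius) {
    return celsius * (9 / 5) + 32;
}

/* Left-to-right evaluation: (celsius * 9) is already a float,
   so the following "/ 5" is a floating-point divide. */
float scale_left_to_right(float celsius) {
    return celsius * 9 / 5 + 32;
}

/* Explicit float constant: one multiply and one add, no run-time divide. */
float scale_float_constant(float celsius) {
    return celsius * 1.8f + 32.0f;
}
```

The third form is the one that maps most directly onto a single multiply followed by a single add, which is what the assembly versions later in these slides use.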

Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

21061-style load/store “C” code

void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;

    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f);   // float constants; (9 / 5) is integer division and gives 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}

Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop

Assembly code

PROLOGUE
  Appropriate defines to make easy reading of code
  Saving of non-volatile registers
BODY
  Try to plan ahead for parallel operations
  Know which 21k “multi-function” instructions are valid
EPILOGUE
  Recover non-volatile registers

Straight conversion -- PROLOGUE

// void Convert(reg float *temperature, reg int N) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:
// register int count = GARBAGE;
#define countR1 scratchR1
// register float *pt = temperature;
#define pt scratchDMpt
    pt = INPAR1;
// float scratch = GARBAGE;
#define scratchF2 F2
// For the CURRENT code -- no volatile
// registers are needed -- may not remain true

Straight conversion of code

// for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
// scratch = *pt;
    scratchF2 = dm(0, pt);                  // Not ++ as pt re-used
// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;                       // No division, use register constant
    scratchF2 = scratchF2 * constantF4;
// scratch = scratch + 32;
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
// *pt = scratch; pt++;
    dm(pt, 1) = scratchF2;
LOOP_END:
    5 magic lines of code

// NOT F0 = 32 -- that gives F0 = 1 * 10^-45
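The remark about F0 = 32 can be illustrated on a host machine. The sketch below is an illustration only, assuming the host uses 32-bit IEEE-754 single-precision floats; the helper name bits_as_float is hypothetical. It reinterprets the integer bit pattern for 32 as a float, which yields a denormal of roughly 4.5 * 10^-44, the same tiny order of magnitude the slide warns about, rather than 32.0.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit integer bit pattern as a single-precision float.
   memcpy is used so the conversion is well defined in C. */
float bits_as_float(uint32_t bits) {
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For comparison, bits_as_float(0x42000000u) gives 32.0f, since 0x42000000 is the IEEE-754 encoding of 32.0. How the SHARC itself treats such a denormal value in arithmetic is not covered by this sketch.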

Avoid this error

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
LOOP_END: dm(pt, 1) = scratchF2;            INTENDED LAST LINE OF LOOP

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code                  STILL LAST LINE OF LOOP

First line of “rest of code” has now become part of loop

Process to avoid the error

This particular error is going to be very easy to make, as the “Rest of the code” is going to look very similar to the “loop internals” once we have taken account of the ALU/FPU pipeline to maximize parallelism

SUGGESTED APPROACH TO AVOID THIS TIME WASTING ERROR

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code

This was a process adopted from the compiler output -- the concept of a label was beyond most people in ENCM415

Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach.
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (attempt to)
Compare and contrast time -- setup and loop

Speed rules for memory access

    scratch = dm(0, pt);
    scratch = dm(pt, 0);                    // Not ++ as to be re-used
    dm(pt, 1) = scratch;

Can’t use PREMODIFY, PERIOD
Can’t use POST MODIFY OPERATIONS with CONSTANTS

Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
Must use Modify registers to store these constants.
Several useful constants are placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)

    scratch = dm(pt, zeroDM);               // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;

Speed rules IF you want adds and multiplies to occur on the same line

    F1 = F2 * F3, F4 = F5 + F6;

Want to do as a single instruction
Not enough bits in the opcode
  Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
  Plus bits for describing math operations, conditions and memory ops?

    Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
    Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)

Must rearrange register usage with program code for this to be possible
  Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) -- other bits “understood”
Inconvenient rather than limiting, e.g.

    F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted: F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
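The register rule above can be turned into a small checker. This is a sketch only, encoding just the restrictions stated on the slide (multiplier sources from F0-F3 and F4-F7, adder sources from F8-F11 and F12-F15, any destination); all function names are hypothetical and it is not a substitute for the assembler's own checks.

```c
#include <stdbool.h>

/* Multiplier sources: first from F0-F3, second from F4-F7 (per the slide). */
bool mult_sources_ok(int x, int y) {
    return x >= 0 && x <= 3 && y >= 4 && y <= 7;
}

/* Adder sources: first from F8-F11, second from F12-F15 (per the slide). */
bool alu_sources_ok(int x, int y) {
    return x >= 8 && x <= 11 && y >= 12 && y <= 15;
}

/* Check "F[mul_dst] = F[mx] * F[my], F[add_dst] = F[ax] + F[ay]".
   Destinations may be any of F0-F15. */
bool multifunction_ok(int mul_dst, int mx, int my,
                      int add_dst, int ax, int ay) {
    bool dests_ok = mul_dst >= 0 && mul_dst <= 15 &&
                    add_dst >= 0 && add_dst <= 15;
    return dests_ok && mult_sources_ok(mx, my) && alu_sources_ok(ax, ay);
}

/* Register-field bits: 4 per unrestricted register, 2 per restricted source. */
int register_field_bits(bool restricted_sources) {
    int dest_bits = 4;
    int src_bits = restricted_sources ? 2 : 4;
    return 2 * (dest_bits + 2 * src_bits);
}
```

With these definitions, multifunction_ok(6, 0, 4, 7, 8, 12) accepts the first example on the slide, while swapping the multiplier sources to (4, 0) is rejected. The bit counts come out as 24 for the unrestricted form and 16 for the restricted form, matching the 4 + 4 + 4 + 4 + 4 + 4 and 4 + 2 + 2 + 4 + 2 + 2 sums.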

When should we worry about the register assignment?

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, 0);                  // Not ++ as to be re-used
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END:

Check on required register use

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
        Are there special requirements here on F2 -- becomes source later??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;

Register re-assignment -- Step 1

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2                        -- OKAY

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float -- OKAY
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;     -- SOURCES okay here
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;                           -- WRONG to use F0 here -- ADDITION
    scratchF2 = scratchF2 + F0_32;          -- WRONG to use F2 as DEST early
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;            -- OKAY

Register re-assignment -- Step 2

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF8 = scratchF2 * constantF4;
        answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                          // INPAR3 is available
    F12_32 = 32.0;
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;

Fix poor coding practice -- “C” or assembly

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;                       MOVE OUTSIDE LOOP
    scratchF8 = scratchF2 * constantF4;
        answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                          // INPAR3 is available
    F12_32 = 32.0;                          MOVE OUTSIDE LOOP
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;

Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (attempt to)
Compare and contrast time -- setup and loop

Resource Management -- Chart 1 -- Basic code

_Convert:
    pt = INPAR1;
    F12_32 = 32.0;                          // bring constants outside the loop
    F4_1_8 = 1.8;
    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;  (callout on this line: LOOP_END - 1 UNTIL LCE)

ADDER            | MULTIPLIER       | DM ACCESS                      | PM ACCESS
                 |                  | F2 = dm(pt, ZERODM)            |
                 | F8 = F2 * F4_1_8 |                                |
F2 = F8 + F12_32 |                  |                                |
                 |                  | LOOP_END: dm(pt, PLUS1DM) = F2 |

    5 magic lines of “C”

Time = 4 + N * 4 + 5 + 5 to do the call

In theory -- if we could find out how to do *, + and dm in parallel
  DATA-BUS is the limiting resource -- dm is used twice per pass, so a 2 cycle loop is possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
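The timing formula on this slide can be written out as a quick estimate. The sketch below is illustrative only: basic_cycles copies the slide's formula, while two_cycle_loop_cycles assumes, hypothetically, that the fixed overhead stays the same and only the per-element cost drops to 2 cycles. Both function names are invented for this example.

```c
/* Cycle estimate for the basic routine, copied from the slide:
   Time = 4 + N * 4 + 5 + 5. */
long basic_cycles(long n) {
    return 4 + 4 * n + 5 + 5;
}

/* Hypothetical best case if the DM bus is the only limit:
   one read and one write per element gives a 2-cycle loop.
   ASSUMPTION: the fixed overhead terms are unchanged. */
long two_cycle_loop_cycles(long n) {
    return 4 + 2 * n + 5 + 5;
}
```

For N = 2048 this gives 8206 cycles against 4110, so the best case under these assumptions approaches a factor of two for long arrays. For very short arrays the fixed terms dominate, which is one way to approach the question of whether a 2 cycle loop is needed.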

Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
  Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop

Un-roll the loop

For various methods on “unrolling the loop” see papers by Jeanne Anne Booth

Final Exam question -- What are relative advantages of the various techniques (with examples)?

Resource 2 -- unroll the loop -- 5 times here

ADDER            | MULTIPLIER       | DM ACCESS            | Step
                 |                  | F2 = dm(pt, ZERODM)  | R1
                 | F8 = F2 * F4_1_8 |                      | M1
F2 = F8 + F12_32 |                  |                      | A1
                 |                  | dm(pt, PLUS1DM) = F2 | W1
                 |                  | F2 = dm(pt, ZERODM)  | R2
                 | F8 = F2 * F4_1_8 |                      | M2
F2 = F8 + F12_32 |                  |                      | A2
                 |                  | dm(pt, PLUS1DM) = F2 | W2
                 |                  | F2 = dm(pt, ZERODM)  | R3
                 | F8 = F2 * F4_1_8 |                      | M3
F2 = F8 + F12_32 |                  |                      | A3
                 |                  | dm(pt, PLUS1DM) = F2 | W3
                 |                  | F2 = dm(pt, ZERODM)  | R4
                 | F8 = F2 * F4_1_8 |                      | M4
F2 = F8 + F12_32 |                  |                      | A4
                 |                  | dm(pt, PLUS1DM) = F2 | W4
                 |                  | F2 = dm(pt, ZERODM)  | R5
                 | F8 = F2 * F4_1_8 |                      | M5
F2 = F8 + F12_32 |                  |                      | A5
                 |                  | dm(pt, PLUS1DM) = F2 | W5

Each pass through the loop involves Read, Multiply, Add, Write
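The unrolling step above can also be expressed at the “C” level. This is a minimal sketch under stated assumptions, not the author's code: the function name convert_unrolled5 is hypothetical, the factor of 5 simply mirrors this slide, and a cleanup loop is added for array lengths that are not a multiple of 5.

```c
#include <stddef.h>

/* Unroll the Celsius-to-Fahrenheit loop 5 times.
   Each statement is one Read, Multiply, Add, Write pass. */
void convert_unrolled5(float *temperature, size_t n) {
    size_t i = 0;

    /* Main part: 5 elements per pass through the loop. */
    for (; i + 5 <= n; i += 5) {
        temperature[i]     = temperature[i]     * 1.8f + 32.0f;
        temperature[i + 1] = temperature[i + 1] * 1.8f + 32.0f;
        temperature[i + 2] = temperature[i + 2] * 1.8f + 32.0f;
        temperature[i + 3] = temperature[i + 3] * 1.8f + 32.0f;
        temperature[i + 4] = temperature[i + 4] * 1.8f + 32.0f;
    }

    /* Cleanup: the 0 to 4 elements left over. */
    for (; i < n; i++) {
        temperature[i] = temperature[i] * 1.8f + 32.0f;
    }
}
```

Unrolling on its own does not make anything run in parallel. It only lays the operations out side by side so that independent reads, multiplies, adds and writes from different passes can be seen and then overlapped.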

Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction

ADDER            | MULTIPLIER       | DM ACCESS            | Phases
                 |                  | F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                      | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 |                  |                      | Decode(F8,F12), Writeback(F2)
                 |                  | dm(pt, PLUS1DM) = F2 | Decode(F2), Writeback(Mem)
                 |                  | F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                      | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 |                  |                      | Decode(F8,F12), Writeback(F2)
                 |                  | dm(pt, PLUS1DM) = F2 | Decode(F2), Writeback(Mem)

Model used -- depends on where operands are relative to equals sign
  ‘Reading’ -- fetching things for ALU/FPU -- like 68K decode phase
  ‘Writeback’ -- storing results from ALU/FPU

THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’

Resource Management 4 -- Check what can be moved in parallel with other instructions

ADDER            | MULTIPLIER       | DM ACCESS               | Phases
                 |                  | F2 = dm(pt, ZERODM)     | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                         | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 |                  |                         | Decode(F8,F12), Writeback(F2)
                 |                  | NO dm(pt, PLUS1DM) = F2 | Decode(F2), Writeback(Mem)
                 |                  | NO F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                         | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 |                  |                         | Decode(F8,F12), Writeback(F2)
                 |                  | dm(pt, PLUS1DM) = F2    | Decode(F2), Writeback(Mem)

Callouts on the chart:
  OKAY TO MOVE -- F2 src freed up before F2 dest occurs
  OKAY TO MOVE -- Empty spot if can move * and + instructs which this instruction MUST follow
  NO !!! or just possible NO? Why a problem? F2 =

Memory resource availability

Move up F2 = dm(pt, ZERODM) from second loop into first loop

However, now we have a possible conflict about which F2 should be used for the

    dm(pt, plus1DM) = F2

instruction if we further optimize by trying to fill the other empty delay slots -- see next slide

Resource management -- Overlapping two parts of the loop

(Same chart as Resource Management 4: two passes of Read, Multiply, Add, Write, with NO marked against the first dm(pt, PLUS1DM) = F2 and the second F2 = dm(pt, ZERODM).)

Resource Management 5 -- What’s up, Doc? Attempting to fill all unused resource availability

ADDER            | MULTIPLIER       | DM ACCESS               | Phases
                 |                  | F2 = dm(pt, ZERODM)     | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 | F2 =                    | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 | F8 =             | F2 =                    | Decode(F8,F12), Writeback(F2)
F2 =             | F8 =             | NO dm(pt, PLUS1DM) = F2 | Decode(F2), Writeback(Mem)
                 |                  | NO F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                         | Decode(F2,F4), Writeback(F8)
F2 = F8 + F12_32 |                  |                         | Decode(F8,F12), Writeback(F2)
                 |                  | dm(pt, PLUS1DM) = F2    | Decode(F2), Writeback(Mem)

(“F2 =” and “F8 =” mark slots where instructions from later passes would overwrite F2 or F8.)

Why spend time on simulating the algorithm to see if the problem really exists when there is a simple solution -- use different registers

Problem may/may not exist with this simple example but is very likely to exist in a more complex algorithm

Resource 6 -- Solution -- Save and then use F9

ADDER            | MULTIPLIER       | DM ACCESS            | Phases
                 |                  | F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                      | Decode(F2,F4), Writeback(F8)
F9 = F8 + F12_32 |                  |                      | Decode(F8,F12), Writeback(F9)
                 |                  | dm(pt, PLUS1DM) = F9 | Decode(F9), Writeback(Mem)
                 |                  | F2 = dm(pt, ZERODM)  | Decode(Mem), Writeback(F2)
                 | F8 = F2 * F4_1_8 |                      | Decode(F2,F4), Writeback(F8)
F9 = F8 + F12_32 |                  |                      | Decode(F8,F12), Writeback(F9)
                 |                  | dm(pt, PLUS1DM) = F9 | Decode(F9), Writeback(Mem)

Resource Management 7 -- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop comps.

ADDER            | MULTIPLIER       | DM ACCESS            | Step
                 |                  | F2 = dm(pt, ZERODM)  | R1
                 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M1, R2
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 |                      | A1, M2
F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W1, A2
                 |                  | dm(pt, PLUS1DM) = F9 | W2
                 |                  | F2 = dm(pt, ZERODM)  | R3
                 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M3, R4
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | STALL                | A3, M4
F9 = F8 + F12_32 | STALL            | dm(pt, PLUS1DM) = F9 | W3, A4
STALL            | STALL            | dm(pt, PLUS1DM) = F9 | W4
STALL            | STALL            | F2 = dm(pt, ZERODM)  | R5
STALL            | F8 = F2 * F4_1_8 |                      | M5
F9 = F8 + F12_32 |                  |                      | A5
                 |                  | dm(pt, PLUS1DM) = F9 | W5

Problem 1 -- No resource in maximum usage -- code inefficient

Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size is 2048 or more?

WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD
as we are no longer indexing correctly through the data.

(Same chart and the same Problem 1 and Problem 2 notes as Resource Management 7.)

Need 1 resource to be maxed out
  Otherwise algorithm is inefficient

Have to try a lot of different approaches

Here is my code

Resource Management 8 -- Unroll the loop a bit more -- 9 loop components

ADDER            | MULTIPLIER       | DM ACCESS            | Step
                 |                  | F2 = dm(pt, ZERODM)  | R1
                 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M1, R2
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 |                      | A1, M2
F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W1, A2
                 |                  | dm(pt, PLUS1DM) = F9 | W2
                 |                  | F2 = dm(pt, ZERODM)  | R3
                 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M3, R4
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | A3, M4, R5
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | dm(pt, PLUS1DM) = F9 | W3, A4, M5
F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W4, A5
                 |                  | dm(pt, PLUS1DM) = F9 | W5
                 |                  | F2 = dm(pt, ZERODM)  | R6
                 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | M6, R7
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | F2 = dm(pt, ZERODM)  | A6, M7, R8
F9 = F8 + F12_32 | F8 = F2 * F4_1_8 | dm(pt, PLUS1DM) = F9 | W6, A7, M8
F9 = F8 + F12_32 |                  | dm(pt, PLUS1DM) = F9 | W7, A8
                 |                  | dm(pt, PLUS1DM) = F9 | W9

DM BUS USAGE NOW MAXED OUT (after a while)

CODE PATTERN APPEARING

Page 39: Systematic development of programs with parallel instructions SHARC ADSP2106X processor M. Smith, Electrical and Computer Engineering, University of Calgary,

04/18/23

ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061

Copyright [email protected] 39 / 44 -- two days

Now to “reroll the loop”

The loop is currently just straight-line coded.

It must be put back into “loop format” for coding efficiency, maintainability and seg_pmco (program-memory code segment) limitations.

Three components of the “rerolled loop” for a loop of the form “count = 0, count < N”:

Fill the ALU/FPU pipeline (typically 1 stage from loop)

Overlap N - 2 stages

Empty the ALU/FPU pipeline (typically 1 stage)
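The fill / overlap / empty structure can be sketched in Python for the 1.8 * x + 32 conversion used in these slides. This is a data-flow model only, under the assumption of a three-stage read / multiply / add-and-write pipeline; it ignores the one-access-per-cycle limit of the DM bus, and the function names are invented for the illustration.

```python
SCALE = 1.8
OFFSET = 32.0


def convert_straight(values):
    """Reference version: one element fully processed per loop pass."""
    return [v * SCALE + OFFSET for v in values]


def convert_rerolled(values):
    """Same arithmetic, arranged as fill / overlap / empty."""
    n = len(values)
    if n < 2:
        # The pipelined form below assumes at least two elements.
        return convert_straight(values)
    out = list(values)

    # Fill the pipeline: read and multiply the first element, read the second.
    f2 = out[0]
    f8 = f2 * SCALE
    f2 = out[1]

    # Overlap: N - 2 passes, each doing an add, a multiply and a read
    # that belong to three different elements.
    for i in range(2, n):
        f9 = f8 + OFFSET   # add for element i - 2
        f8 = f2 * SCALE    # multiply for element i - 1
        f2 = out[i]        # read element i
        out[i - 2] = f9    # write back element i - 2

    # Empty the pipeline: finish the last two elements.
    f9 = f8 + OFFSET
    f8 = f2 * SCALE
    out[n - 2] = f9
    out[n - 1] = f8 + OFFSET
    return out


sample = [-40.0, 0.0, 37.0, 100.0, 21.5]
print(convert_rerolled(sample) == convert_straight(sample))
```

The overlap section runs N - 2 times, matching the count on the slide; the fill and empty sections carry the partial work at either end.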


Resource Management 9
Identify the loop components

ADDER             MULTIPLIER        DM ACCESS
-                 -                 F2 = dm(pt, ZERODM)      R1
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32  F8 = F2 * F4_1_8  -                        A1, M2
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W1, A2
-                 -                 dm(pt, PLUS1DM) = F9     W2
-                 -                 F2 = dm(pt, ZERODM)      R3
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32  F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      A3, M4, R5
F9 = F8 + F12_32  F8 = F2 * F4_1_8  dm(pt, PLUS1DM) = F9     W3, A4, M5
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W4, A5
-                 -                 dm(pt, PLUS1DM) = F9     W5
-                 -                 F2 = dm(pt, ZERODM)      R6
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M6, R7
F9 = F8 + F12_32  F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      A6, M7, R8
F9 = F8 + F12_32  F8 = F2 * F4_1_8  dm(pt, PLUS1DM) = F9     W6, A7, M8
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W7, A8
-                 -                 dm(pt, PLUS1DM) = F9     W9

Chart regions marked on the slide: FILL ALU/FPU PIPE, LOOP BODY, EMPTY ALU/FPU PIPELINE


Resource 9 -- Final code version

_Convert:
    Modify(CTOPofSTACK, -1);
    dm(FP, -2) = R9;
    pt = INPAR1;
    F12_32 = 32.0          // bring constants outside the loop
    F4_1_8 = 1.8

ADDER             MULTIPLIER        DM ACCESS
-                 -                 F2 = dm(pt, ZERODM)      R1
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32  F8 = F2 * F4_1_8  -                        A1, M2
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W1, A2
-                 -                 dm(pt, PLUS1DM) = F9     W2

    LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE;

-                 -                 F2 = dm(pt, ZERODM)      R3
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32  F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      A3, M4, R5
F9 = F8 + F12_32  F8 = F2 * F4_1_8  dm(pt, PLUS1DM) = F9     W3, A4, M5
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W4, A5
LOOP_END:
-                 -                 dm(pt, PLUS1DM) = F9     W5

    R9 = dm(FP, -2);
    5 magic lines of C

Overlay text on the slide: “-1 UNTIL LCE”, “LOOPEND:”

Chart regions marked on the slide: FILL, USE, EMPTY ALU/FPU PIPE


Speed improvements

BEFORE
START LOOP EXIT ENTRY

4 + N * 4 + 5 + 5 = 14 + 4 * N

NOW with 2-fold loop unfolding
START LOOP EXIT ENTRY

4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N

NOW with 3-fold loop unfolding
START LOOP EXIT ENTRY

4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N

For large N: a factor of 4 / 2.5 = 1.6 with a little effort, a factor of 4 / 2 = 2 with more effort
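To see how quickly the fixed overhead stops mattering, the three cycle-count formulas above can be tabulated for a few array lengths. This is a small Python sketch; the function names are invented here, and the chosen N values are ones for which both (N - 2) / 2 and (N - 2) / 3 are whole numbers.

```python
def cycles_before(n):
    """Original loop: 14 + 4 * N cycles."""
    return 14 + 4 * n


def cycles_unrolled_2(n):
    """2-fold unrolled loop: 24 + 2.5 * N cycles."""
    return 24 + 2.5 * n


def cycles_unrolled_3(n):
    """3-fold unrolled loop: 16 + 2 * N cycles."""
    return 16 + 2 * n


for n in (8, 20, 200):
    before = cycles_before(n)
    two = cycles_unrolled_2(n)
    three = cycles_unrolled_3(n)
    print(n, before, two, three,
          round(before / two, 2), round(before / three, 2))
```

For short arrays the 2-fold version is no better than the original (both take 46 cycles at N = 8 with these formulas), so the "check if worth the effort" step depends on the expected N.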


Question to Ask

We now know the final code

Should we have made the substitution F2 to F9?

Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!

No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!

Will the code work?


Resource 9 -- Final code version

_Convert:
    Modify(CTOPofSTACK, -1);
    dm(FP, -2) = R9;
    pt = INPAR1;
    F12_32 = 32.0          // bring constants outside the loop
    F4_1_8 = 1.8

ADDER             MULTIPLIER        DM ACCESS
-                 -                 F2 = dm(pt, ZERODM)      R1
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32  F8 = F2 * F4_1_8  -                        A1, M2
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W1, A2
-                 -                 dm(pt, PLUS1DM) = F9     W2

    LCNTR = (N-2)/3, DO LOOP_END UNTIL LCE;

-                 -                 F2 = dm(pt, ZERODM)      R3
-                 F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32  F8 = F2 * F4_1_8  F2 = dm(pt, ZERODM)      A3, M4, R5
F9 = F8 + F12_32  F8 = F2 * F4_1_8  dm(pt, PLUS1DM) = F9     W3, A4, M5
F9 = F8 + F12_32  -                 dm(pt, PLUS1DM) = F9     W4, A5
LOOP_END:
-                 -                 dm(pt, PLUS1DM) = F9     W5

    R9 = dm(FP, -2);
    5 magic lines of C

Overlay text on the slide: “-1 UNTIL LCE”, “LOOPEND:”

N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

Only works if (N - 2) / 3 is an integer.
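Which of the listed N values meet that condition can be checked directly. The Python sketch below uses made-up helper names; it also assumes N must be at least 2, since the fill section before the loop already reads and writes two elements, and the "leftover" count assumes the loop counter would be truncated to a whole number.

```python
def loop_count_is_whole(n):
    """True when the fill stage (2 elements) plus whole 3-element passes cover N."""
    return n >= 2 and (n - 2) % 3 == 0


def leftover_elements(n):
    """Elements not reached if the loop count were truncated to (n - 2) // 3.

    Assumption for illustration only; defined for n >= 2.
    """
    return (n - 2) % 3


valid = [n for n in range(15) if loop_count_is_whole(n)]
print(valid)

for n in range(2, 15):
    print(n, leftover_elements(n))
```

For every other N in the list, one or two elements would be left over (or, for N below 2, the fill section would touch elements that do not exist), so the routine as written needs either a restricted N or extra handling code.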


Tackled today

What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort

To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism