Systematic development of programs with parallel instructions
SHARC ADSP2106X processor

M. Smith, Electrical and Computer Engineering, University of Calgary, Canada

smithmr @ ucalgary.ca

04/18/23

ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061

Copyright [email protected] 2 / 44 -- two days

To be tackled today

What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort

ADSP-2106x -- Parallelism opportunities

[Figure: ADSP-2106x core block diagram. Blocks: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), program sequencer, instruction cache memory (32 x 48), timer, flags, JTAG test & emulation, bus connect, register file (16 x 40), floating & fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point & fixed-point ALU. Buses: PMA bus (24 bits), DMA bus (32 bits), PMD bus (48 bits), DMD bus (40 bits).]

Ability for parallel memory operations
  One each on pm, dm and instruction cache busses
Memory pointer operations
  Post modify 2 index registers
  Automatic circular buffer operations
  Automatic bit reverse addressing
Many parallel operations and register to register bus transfers
  Rn = Rx + Ry or Rn = Rx * Ry
  Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
Zero overhead loops
Instruction pipeline issues
Key issue -- Only 48? bits available in OPCODE to describe
  16 data registers in 3 destinations and 6 sources = 135 bits
  2 * (8 index + 8 modify + 16 data) = 64 bits
  Condition code selection, 32 bit constants etc.



Compiler is only -- somewhat useful

See article in course notes from Embedded System Design, Sept./October 2000
Need to get a systematic process to provide “Parallelism without pain”
  Need to know what to worry about and what not to
Lab 3 -- Implement FIR filter in parallel -- help provided
Lab. Library version of FFT, custom version of Burg Algorithm (AR modeling)



Basic code development -- any system

Write the “C” code for the function

void Convert(float *temperature, int N)

which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)

Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices




Parallel Instruction Code Development

Write the 21k assembly code for the function

    void Convert(float *temperature, int N)

which etc.

Determine the instruction flow through the architecture using a resource usage diagram

Theoretically optimize the code -- a 2 minute counting process

Compare and contrast the amount of time to perform the subroutine before and after customization.




Standard “C” code

void Convert(float *temperature, int N) {
    int count;

    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}

Standard Warning -- What does an optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
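
The warning about 9 / 5 can be checked with a small experiment. The sketch below is not from the slides; the three function names are hypothetical and exist only to contrast the ways the expression can be written in C.

```c
/* Integer constants in brackets: 9 / 5 is an integer divide and truncates to 1. */
float c_to_f_int_ratio(float celsius) {
    return celsius * (9 / 5) + 32;
}

/* Left-to-right evaluation: (celsius * 9) is already a float,
   so the following / 5 is a float divide and nothing is lost. */
float c_to_f_left_to_right(float celsius) {
    return celsius * 9 / 5 + 32;
}

/* Float constants in brackets: the ratio is 1.8 and needs no divide per element. */
float c_to_f_float_ratio(float celsius) {
    return celsius * (9.0f / 5.0f) + 32.0f;
}
```

For 100 degrees Celsius the first form returns 132 while the other two return 212 (to within float rounding), so the answer to "1 or 1.8?" depends on where the brackets and the decimal points are.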



Process for developing parallel code

Rewrite the “C” code using “LOAD/STORE” techniques
  Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop



21061-style load/store “C” code

void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;

    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f);   // float constants -- integer 9 / 5 would give 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}





Write the assembly code using a hardware loop
  Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop



Assembly code

PROLOGUE
  Appropriate defines to make easy reading of code
  Saving of non-volatile registers
BODY
  Try to plan ahead for parallel operations
  Know which 21k “multi-function” instructions are valid
EPILOGUE
  Recover non-volatile registers



Straight conversion -- PROLOGUE

// void Convert(reg float *temperature, reg int N) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:

// register int count = GARBAGE;
#define countR1 scratchR1

// register float *pt = temperature;
#define pt scratchDMpt
    pt = INPAR1;

// float scratch = GARBAGE;
#define scratchF2 F2

// For the CURRENT code -- no volatile
// registers are needed -- may not remain true



Straight conversion of code

// for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;

// scratch = *pt;
    scratchF2 = dm(0, pt);                 // Not ++ as pt re-used

// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4                      // Must be float
    constantF4 = 1.8;                      // No division, use register constant
    scratchF2 = scratchF2 * constantF4;

// scratch = scratch + 32;
#define F0_32 F0                           // Must be float
    F0_32 = 32.0;                          // NOT F0 = 32 -- integer bit pattern read as a float is about 10^-44
    scratchF2 = scratchF2 + F0_32;

// *pt = scratch; pt++;
    dm(pt, 1) = scratchF2;

LOOP_END:
    5 magic lines of code



Avoid this error

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
LOOP_END: dm(pt, 1) = scratchF2;           INTENDED LAST LINE OF LOOP

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code                 STILL LAST LINE OF LOOP

First line of “rest of code” has now become part of loop



Process to avoid the error

This particular error is going to be very easy to make, as the “Rest of the code” is going to look very similar to the “loop internals” once we have taken account of the ALU/FPU pipeline to maximize parallelism.

SUGGESTED APPROACH TO AVOID THIS TIME WASTING ERROR

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code

This was a process adopted from the compiler output -- the concept of a label was beyond most people in ENCM415
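
The off-by-one nature of the label error can be seen by counting. The sketch below is a hypothetical model, not processor code: it assumes instructions sit at consecutive addresses and that the hardware loop repeats everything up to and including the address named in the DO instruction, which is the behaviour described on the slide.

```c
#include <assert.h>

/* Number of instructions repeated by "DO end_addr UNTIL LCE" when the
   first instruction of the loop body sits at first_addr.
   Assumption: one instruction per address, end address is inclusive. */
int loop_body_length(int first_addr, int end_addr) {
    assert(end_addr >= first_addr);
    return end_addr - first_addr + 1;
}
```

With a five-instruction body at addresses 100 to 104 and the label LOOP_END placed on the rest of the code at address 105, DO LOOP_END gives loop_body_length(100, 105) = 6 instructions, while DO LOOP_END - 1 gives loop_body_length(100, 104) = 5, the intended body.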






Rewrite the assembly code using instructions that could be used in parallel, if you could find the correct optimization approach.
  Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point.
Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop



Speed rules for memory access

    scratch = dm(0, pt);

    scratch = dm(pt, 0);            // Not ++ as to be re-used
    dm(pt, 1) = scratch;

Can’t use PREMODIFY, PERIOD
Can’t use POST MODIFY OPERATIONS with CONSTANTS

Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant

Must use Modify registers to store these constants.
Several useful constants are placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)

    scratch = dm(pt, zeroDM);       // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;



Speed rules IF you want adds and multiplies to occur on the same line

F1 = F2 * F3, F4 = F5 + F6;
  Want to do as a single instruction
  Not enough bits in the opcode
    Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
    Plus bits for describing math operations, conditions and memory ops?

Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
  Must rearrange register usage within program code for this to be possible
  Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) -- other bits “understood”

Inconvenient rather than limiting, e.g.
    F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted    F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
    Not accepted    F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
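
The source-register rule and the bit counts on this slide can be written down as a small checker. This is an illustrative sketch only; the function names are hypothetical, and it encodes just the rule quoted above (multiplier sources from F0-F3 and F4-F7, adder sources from F8-F11 and F12-F15), not the full instruction set.

```c
#include <stdbool.h>

/* True when register number reg lies in the four-register group starting at base. */
static bool in_group(int reg, int base) {
    return reg >= base && reg < base + 4;
}

/* Rule quoted on the slide for a parallel multiply and add:
   Fn = F(0..3) * F(4..7),  Fm = F(8..11) + F(12..15). */
bool parallel_sources_ok(int mul_x, int mul_y, int add_x, int add_y) {
    return in_group(mul_x, 0) && in_group(mul_y, 4) &&
           in_group(add_x, 8) && in_group(add_y, 12);
}

/* Opcode bits spent on register fields for 2 destinations and 4 sources. */
int register_field_bits(int dest_bits, int src_bits) {
    return 2 * dest_bits + 4 * src_bits;
}
```

Here parallel_sources_ok(0, 4, 8, 12) accepts F0 * F4 with F8 + F12, while parallel_sources_ok(4, 0, 8, 12) rejects the swapped form F4 * F0. The field cost drops from register_field_bits(4, 4) = 24 bits to register_field_bits(4, 2) = 16 bits, which is the saving behind the restriction.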



When should we worry about the register assignment?

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2

    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, 0);                 // Not ++ as to be re-used

// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                      // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;

#define F0_32 F0                           // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END:



Check on required register use


    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
        Are there special requirements here on F2 -- becomes source later??

// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                      // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)

#define F0_32 F0                           // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)

    dm(pt, plus1DM) = scratchF2;



Register re-assignment -- Step 1

#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2                       -- OKAY

// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                      // Must be float -- OKAY
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;    -- SOURCES okay here
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)

#define F0_32 F0                           // Must be float
    F0_32 = 32.0;                          -- WRONG to use F0 here -- ADDITION
    scratchF2 = scratchF2 + F0_32;         -- WRONG to use F2 as DEST early
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)

    dm(pt, plus1DM) = scratchF2;           -- OKAY



Register re-assignment -- Step 2




    constantF4 = 1.8;
    scratchF8 = scratchF2 * constantF4;    // answer must be in F(8, 9, 10 or 11)

#define F12_32 F12                         // INPAR3 is available
    F12_32 = 32.0;
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)

    dm(pt, plus1DM) = scratchF2;



Fix poor coding practice -- “C” or assembly




    constantF4 = 1.8;                      MOVE OUTSIDE LOOP
    scratchF8 = scratchF2 * constantF4;    // answer must be in F(8, 9, 10 or 11)

#define F12_32 F12                         // INPAR3 is available
    F12_32 = 32.0;                         MOVE OUTSIDE LOOP
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)

    dm(pt, plus1DM) = scratchF2;
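
For readers more comfortable in “C”, the loop after register re-assignment and with the constants moved outside can be pictured as below. This is a sketch rather than the author's code: the function name convert_hoisted is hypothetical, and the local variable names simply mirror the registers used on the slides.

```c
#include <assert.h>
#include <stddef.h>

/* C picture of the re-assigned assembly loop: both constants are loaded
   once before the loop, and the multiply result goes to "f8" so that it
   can later feed the adder. Names mirror the SHARC registers. */
void convert_hoisted(float *temperature, int n) {
    assert(temperature != NULL || n <= 0);

    const float f4_1_8 = 1.8f;    /* constantF4: multiplier source group F4-F7 */
    const float f12_32 = 32.0f;   /* F12_32: adder source group F12-F15        */
    float *pt = temperature;

    for (int count = 0; count < n; count++) {
        float f2 = *pt;           /* scratchF2 = dm(pt, zeroDM);         */
        float f8 = f2 * f4_1_8;   /* scratchF8 = scratchF2 * constantF4; */
        f2 = f8 + f12_32;         /* scratchF2 = scratchF8 + F12_32;     */
        *pt++ = f2;               /* dm(pt, plus1DM) = scratchF2;        */
    }
}
```

The loop body is now the four operations Read, Multiply, Add, Write, with nothing else repeated per element.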








Move algorithm to “Resource Usage Chart”
Optimize using techniques (Attempt to)
Compare and contrast time -- setup and loop



Resource Management -- Chart 1 -- Basic code

ADDER               MULTIPLIER          DM ACCESS               PM ACCESS

_Convert:
    pt = INPAR1;
    F12_32 = 32.0;                      // bring constants outside the loop
    F4_1_8 = 1.8;
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
                                        F2 = dm(pt, ZERODM)
                    F8 = F2 * F4_1_8
F2 = F8 + F12_32
                                        dm(pt, PLUS1DM) = F2
LOOP_END:
    5 magic lines of “C”

Time = 4 + N * 4 + 5 + 5 to do the call

In theory -- if we could find out how to do *, + and dm in parallel
  DATA-BUS is limiting resource
  2 dm accesses per pass -- 2 cycle loop possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
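
The "2 minute counting process" amounts to counting how often each resource is used in one pass of the loop and taking the largest count. A minimal sketch, assuming each resource can be used at most once per cycle; the enum and the function name are hypothetical.

```c
#include <assert.h>

/* The four columns of the resource usage chart. */
enum { ADDER, MULTIPLIER, DM_ACCESS, PM_ACCESS, NUM_RESOURCES };

/* Lower bound on cycles per loop pass: the busiest resource sets the pace,
   assuming each resource can be used at most once per cycle. */
int min_cycles_per_pass(const int uses[NUM_RESOURCES]) {
    int bound = 0;
    for (int r = 0; r < NUM_RESOURCES; r++) {
        assert(uses[r] >= 0);
        if (uses[r] > bound) {
            bound = uses[r];
        }
    }
    return bound;
}
```

For the Convert loop the counts are one add, one multiply, two dm accesses (read and write) and no pm access, so the bound is 2 cycles per element with the dm bus as the limiting resource.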








Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
  Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop



Un-roll the loop

For various methods on “unrolling the loop” see papers by Jeanne Anne Booth

Final Exam question -- What are relative advantages of the various techniques (with examples)?



Resource 2 -- unroll the loop -- 5 times here

ADDER               MULTIPLIER          DM ACCESS
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8                            M1
F2 = F8 + F12_32                                                A1
                                        dm(pt, PLUS1DM) = F2    W1
                                        F2 = dm(pt, ZERODM)     R2
                    F8 = F2 * F4_1_8                            M2
F2 = F8 + F12_32                                                A2
                                        dm(pt, PLUS1DM) = F2    W2
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8                            M3
F2 = F8 + F12_32                                                A3
                                        dm(pt, PLUS1DM) = F2    W3
                                        F2 = dm(pt, ZERODM)     R4
                    F8 = F2 * F4_1_8                            M4
F2 = F8 + F12_32                                                A4
                                        dm(pt, PLUS1DM) = F2    W4
                                        F2 = dm(pt, ZERODM)     R5
                    F8 = F2 * F4_1_8                            M5
F2 = F8 + F12_32                                                A5
                                        dm(pt, PLUS1DM) = F2    W5

Each pass through the loop involves Read, Multiply, Add, Write



Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction

RESOURCE      INSTRUCTION              READING           WRITEBACK
DM ACCESS     F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(F2)
MULTIPLIER    F8 = F2 * F4_1_8         Decode(F2,F4)     Writeback(F8)
ADDER         F2 = F8 + F12_32         Decode(F8,F12)    Writeback(F2)
DM ACCESS     dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(Mem)
DM ACCESS     F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(F2)
...
                                                         Writeback(Mem)

Model used -- depends on where operands are relative to equals sign
  ‘Reading’ -- fetching things for ALU/FPU -- Like 68K decode phase
  ‘Writeback’ -- storing results from ALU/FPU

THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’



Resource Management 4 -- Check what can be moved in parallel with other instructions

MOVE?   INSTRUCTION              READING           WRITEBACK
...                                                Writeback(F2)
NO      dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(Mem)
NO      F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(F2)
        F8 = F2 * F4_1_8         Decode(F2,F4)     Writeback(F8)
        F2 = F8 + F12_32         Decode(F8,F12)    Writeback(F2)
        dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(Mem)

Callouts on the chart:
  OKAY TO MOVE -- F2 src freed up before F2 dest occurs
  OKAY TO MOVE -- Empty spot if can move * and + instructions which this instruction MUST follow
  NO !!! or just possible NO? Why a problem? F2 =



Memory resource availability

Move up F2 = dm(pt, ZERODM) from second loop into first loop

However now we have a possible conflict about which F2 should be used for the

    dm(pt, plus1DM) = F2

instruction if we further optimize by trying to fill the other empty delay slots -- see next slide



Resource management -- Overlapping two parts of the loop

[Resource chart; only this row survives]
NO      dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(Mem)



Resource Management 5 -- What’s up, Doc?
Attempting to fill all unused resource availability

MOVE?   INSTRUCTION              READING           WRITEBACK         OVERLAPPED WRITES
        F8 = F2 * F4_1_8         Decode(F2,F4)     Writeback(F8)     F2 =
        F2 = F8 + F12_32         Decode(F8,F12)    Writeback(F2)     F8 =, F2 =
NO      dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(Mem)    F2 =, F8 =

Why spend time on simulating the algorithm to see if the problem really exists when there is a simple solution -- use different registers

Problem may/may not exist with this simple example but is very likely to exist in a more complex algorithm



Resource 6 -- Solution -- Save and then use F9

[Resource chart; surviving rows only]
        F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(F2)
        ...                                        Writeback(Mem)



Resource Management 7 -- Some parallelism possible
with Read, Mult, Add and Write mixed across 5 loop comps.

ADDER               MULTIPLIER          DM ACCESS
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32    F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W1, A2
                                        dm(pt, PLUS1DM) = F9    W2
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32    F8 = F2 * F4_1_8    STALL                   A3, M4
F9 = F8 + F12_32    STALL               dm(pt, PLUS1DM) = F9    W3, A4
STALL               STALL               dm(pt, PLUS1DM) = F9    W4
STALL               STALL               F2 = dm(pt, ZERODM)     R5
STALL               F8 = F2 * F4_1_8                            M5
F9 = F8 + F12_32                                                A5
                                        dm(pt, PLUS1DM) = F9    W5

Problem 1 -- No resource in maximum usage -- code inefficient

Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size 2048 or more?



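WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD
as we are no longer indexing correctly through the data. A read F2 = dm(pt, ZERODM) that has been moved ahead of the previous write dm(pt, PLUS1DM) = F9 uses pt before that write has advanced it, so it fetches the same element again.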







Need 1 resource to be maxed out
Otherwise algorithm is inefficient

Have to try a lot of different approaches

Here is my code



Resource Management 8 -- Unroll the loop a bit more -- 9 loop components

ADDER               MULTIPLIER          DM ACCESS
...
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W4, A5
...
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M6, R7
F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A6, M7, R8
F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W6, A7, M8
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W7, A8
...

DM BUS USAGE NOW MAXed OUT (after a while)

CODE PATTERN APPEARING



Now to “reroll the loop”

The loop is currently just straight line coded.
Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations.
Three components of “rerolled loop” for loop of form “count = 0, count < N”
  Fill the ALU/FPU pipeline (typically 1 stage from loop)
  Overlap N - 2 stages
  Empty the ALU/FPU pipeline (typically 1 stage)
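
The fill / overlap / empty structure can be tried out in “C” before committing to assembly. The sketch below is an illustration under stated assumptions, not the author's final code: the function name is hypothetical, it uses separate read and write pointers (rd and wr) instead of the single pt, it overlaps three elements rather than the slide's N - 2 arrangement, and it ignores the one-dm-access-per-cycle limit. Each source line stands for one "cycle"; the statements on a line run right-to-left through the data flow (Write, Add, Multiply, Read) so that every operation sees the value produced on the previous line.

```c
#include <assert.h>
#include <stddef.h>

void convert_rerolled(float *temperature, int n) {
    assert(temperature != NULL || n <= 0);

    const float k = 1.8f, c = 32.0f;   /* constants kept outside the loop      */
    const float *rd = temperature;     /* read pointer (assumption: not pt)    */
    float *wr = temperature;           /* write pointer, trails rd by 3        */
    float f2, f8, f9;                  /* names mirror registers F2, F8, F9    */

    if (n < 3) {                       /* too short to fill the pipeline       */
        for (int i = 0; i < n; i++) {
            temperature[i] = temperature[i] * k + c;
        }
        return;
    }

    /* FILL the pipeline */
    f2 = *rd++;                                        /* R1                  */
    f8 = f2 * k;  f2 = *rd++;                          /* M1, R2              */
    f9 = f8 + c;  f8 = f2 * k;  f2 = *rd++;            /* A1, M2, R3          */

    /* LOOP BODY: four elements in flight on every pass */
    for (int i = 3; i < n; i++) {
        *wr++ = f9;  f9 = f8 + c;  f8 = f2 * k;  f2 = *rd++;
    }

    /* EMPTY the pipeline */
    *wr++ = f9;  f9 = f8 + c;  f8 = f2 * k;
    *wr++ = f9;  f9 = f8 + c;
    *wr++ = f9;
}
```

Using F9 for the add result is what lets the write of one element share a line with the add of the next; with F2 as both load target and add result, the overlapped operations would overwrite each other.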



Resource Management 9 -- Identify the loop components

[Same chart as Resource Management 8, with three regions marked: FILL ALU/FPU PIPE, LOOP BODY, EMPTY ALU/FPU PIPELINE]



Resource 9 -- Final code version

ADDER               MULTIPLIER          DM ACCESS

_Convert:
    Modify(CTOPofSTACK, -1);
    dm(FP, -2) = R9;
    pt = INPAR1;
    F12_32 = 32.0;                      // bring constants outside the loop
    F4_1_8 = 1.8;
                                        F2 = dm(pt, ZERODM)     R1
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M1, R2
F9 = F8 + F12_32    F8 = F2 * F4_1_8                            A1, M2
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W1, A2
                                        dm(pt, PLUS1DM) = F9    W2
    LCNTR = (N-2)/3, DO LOOP_END - 1 UNTIL LCE;
                                        F2 = dm(pt, ZERODM)     R3
                    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     M3, R4
F9 = F8 + F12_32    F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)     A3, M4, R5
F9 = F8 + F12_32    F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9    W3, A4, M5
F9 = F8 + F12_32                        dm(pt, PLUS1DM) = F9    W4, A5
                                        dm(pt, PLUS1DM) = F9    W5
LOOP_END:
    R9 = dm(FP, -2);
    5 magic lines of C

Regions marked on the chart: FILL, USE, EMPTY ALU/FPU PIPE



Speed improvements

BEFORE
  START  LOOP  EXIT  ENTRY
  4 + N * 4 + 5 + 5 = 14 + 4 * N

NOW with 2-fold loop unfolding
  START  LOOP  EXIT  ENTRY
  4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N

NOW with 3-fold loop unfolding
  START  LOOP  EXIT  ENTRY
  4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N

Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
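
To see what these formulas mean for a realistic array size, they can be evaluated directly. The sketch below only transcribes the three cycle counts from this slide; the function names are hypothetical, and the asserts record the assumption that (N - 2) divides evenly by the unfolding factor.

```c
#include <assert.h>

/* Basic loop: 4 cycles per element plus fixed overhead. */
int cycles_basic(int n) {
    assert(n >= 0);
    return 4 + n * 4 + 5 + 5;
}

/* 2-fold unfolding: 5 cycles per 2 elements inside the loop. */
int cycles_unfold2(int n) {
    assert(n >= 2 && (n - 2) % 2 == 0);
    return 4 + 7 + (n - 2) * 5 / 2 + 5 + 8 + 5;
}

/* 3-fold unfolding: 6 cycles per 3 elements inside the loop. */
int cycles_unfold3(int n) {
    assert(n >= 2 && (n - 2) % 3 == 0);
    return 4 + 5 + (n - 2) * 6 / 3 + 5 + 1 + 5;
}
```

For N = 2048 these give 8206, 5144 and 4112 cycles respectively, so the fixed overhead is negligible and the ratios approach the quoted 4 / 2.5 and 4 / 2.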



Question to Ask

We now know the final code
  Should we have made the substitution F2 to F9?
  Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
  No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!

Will the code work?



Resource 9 -- Final code version


N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14

Only works if (N - 2) / 3 is an integer.

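
The restriction on N can be tested before calling the rerolled routine. A minimal sketch with hypothetical helper names; it assumes that elements which do not fit the 2 + 3k pattern would be handled by a plain one-element-at-a-time loop, which the slides do not show. Whether a loop count of zero (N = 2) is safe on the hardware is not addressed here and would need a separate check.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the loop count (N - 2) / 3 of the rerolled code is a whole number. */
bool rerolled_count_is_exact(int n) {
    return n >= 2 && (n - 2) % 3 == 0;
}

/* Hypothetical helper: elements left for a plain one-at-a-time loop
   when N does not fit the 2 + 3k pattern. */
int leftover_elements(int n) {
    assert(n >= 0);
    return n < 2 ? n : (n - 2) % 3;
}
```

Of the values N = 0 to 14 listed above, only 2, 5, 8, 11 and 14 pass the test.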



Tackled today

What’s the problem?
Standard Code Development of “C”-code
Process for “Code with parallel instruction”
  Rewrite with specialized resources
  Move to “resource chart”
  Unroll the loop
  Adjust code
  Reroll the loop
  Check if worth the effort
To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism
