Systematic development of programs with parallel instructions
SHARC ADSP2106X processor
M. Smith, Electrical and Computer Engineering, University of Calgary, Canada
smithmr @ ucalgary.ca
04/18/23
ENCM515 -- Systematic development of parallel instructions on SHARC ADSP21061
Copyright [email protected] 2 / 44 -- two days
To be tackled today
What’s the problem?
Standard code development of “C” code
Process for “code with parallel instructions”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort
ADSP-2106x -- Parallelism opportunities
[Block diagram of the ADSP-2106x core: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), program sequencer, cache memory (32 x 48), PM address bus (24 bits), DM address bus (32 bits), PM data bus (48 bits), DM data bus (40 bits), bus connect, register file (16 x 40), floating and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU, timer, flags, JTAG test and emulation]
Ability for parallel memory operations
    One each on the pm, dm and instruction cache busses
Memory pointer operations
    Post modify 2 index registers
    Automatic circular buffer operations
    Automatic bit reverse addressing
Many parallel operations and register to register bus transfers
    Rn = Rx + Ry or Rn = Rx * Ry
    Rm = Rx + Ry, Rn = Rx - Ry with/without Rp = Rq * Rr
Zero overhead loops
Instruction pipeline issues
Key issue -- Only 48? bits available in OPCODE to describe
16 data registers in 3 destinations and 6 sources = 135 bits
2 * (8 index + 8 modify + 16 data) = 64 bits
Condition code selection, 32 bit constants etc.
Compiler is only somewhat useful
    See article in course notes from Embedded System Design, Sept./October 2000
Need a systematic process to provide parallelism without pain
Need to know what to worry about and what not to
Lab 3 -- Implement FIR filter in parallel -- help provided
Lab. Library version of FFT, custom version of Burg algorithm (AR modeling)
Basic code development -- any system
Write the “C” code for the function
void Convert(float *temperature, int N)
which converts an array of temperatures measured in “Celsius” (Canadian Market) to Fahrenheit (American Market)
Convert the code to ADSP 21061/68K etc. assembly code, following the standard coding and documentation practices
Parallel Instruction Code Development
Write the 21k assembly code for the function
    void Convert(float *temperature, int N)
which etc...
Determine the instruction flow through the architecture using a resource usage diagram
Theoretically optimize the code -- a 2 minute counting process
Compare and contrast the amount of time to perform the subroutine before and after customization.
Standard “C” code
void Convert(float *temperature, int N) {
    int count;
    for (count = 0; count < N; count++) {
        *temperature = (*temperature) * 9 / 5 + 32;
        temperature++;
    }
}
Standard warning -- what does an optimizing compiler do with 9 / 5? Does it become 1 or 1.8?
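The warning can be checked directly in C. The following is a minimal sketch; the function names are hypothetical and not part of the course code. As written on the slide, the multiplication by 9 happens first, so the division is done in floating point. If the constant is parenthesised as (9 / 5), C performs integer division and the factor becomes 1.

```c
#include <assert.h>

/* Slide order: (c * 9) is already a float, so the division is a float division. */
float c_to_f_slide_order(float c) { return c * 9 / 5 + 32; }

/* Parenthesised integer constant: 9 / 5 is integer division and evaluates to 1. */
float c_to_f_int_constant(float c) { return c * (9 / 5) + 32; }

/* Float constant: 9.0f / 5.0f is about 1.8. */
float c_to_f_float_constant(float c) { return c * (9.0f / 5.0f) + 32.0f; }

/* Self-check at 100 degrees Celsius (212 degrees Fahrenheit). */
void check_division_pitfall(void) {
    float ok = c_to_f_float_constant(100.0f);
    assert(c_to_f_slide_order(100.0f) == 212.0f);
    assert(c_to_f_int_constant(100.0f) == 132.0f);   /* silently wrong */
    assert(ok > 211.9f && ok < 212.1f);
}
```

Calling check_division_pitfall() from a test program exercises all three variants.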
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop
21061-style load/store “C” code
void Convert(register float *temperature, register int N) {
    register int count;
    register float *pt = temperature;
    register float scratch;
    for (count = 0; count < N; count++) {
        scratch = *pt;
        scratch = scratch * (9.0f / 5.0f);   // (9 / 5) would be integer division, giving 1
        scratch = scratch + 32;
        *pt = scratch;
        pt++;
    }
}
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
Move algorithm to “Resource Usage Chart”
Optimize using techniques
Compare and contrast time -- setup and loop
Assembly code
PROLOGUE
    Appropriate defines to make easy reading of code
    Saving of non-volatile registers
BODY
    Try to plan ahead for parallel operations
    Know which 21k “multi-function” instructions are valid
EPILOGUE
    Recover non-volatile registers
Straight conversion -- PROLOGUE
// void Convert(reg float *temperature, reg int N) {
.segment/pm seg_pmco;
.global _Convert;
_Convert:
// register int count = GARBAGE;
#define countR1 scratchR1
// register float *pt = temperature;
#define pt scratchDMpt
    pt = INPAR1;
// float scratch = GARBAGE;
#define scratchF2 F2
// For the CURRENT code -- no non-volatile
// registers are needed -- may not remain true
Straight conversion of code
// for (count = 0; count < N; count++) {
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
// scratch = *pt;
    scratchF2 = dm(0, pt);                  // Not ++ as pt re-used
// scratch = scratch * (9 / 5);
// INPAR1 (R4) is dead -- can reuse as F4
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;                       // No division, use register constant
    scratchF2 = scratchF2 * constantF4;
// scratch = scratch + 32;
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;                           // NOT F0 = 32, which gives a tiny denormal, about 4.5 * 10^-44
    scratchF2 = scratchF2 + F0_32;
// *pt = scratch; pt++;
    dm(pt, 1) = scratchF2;
LOOP_END: 5 magic lines of code
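Why an integer literal is harmful in a float register can be reproduced on a host machine. This is a sketch under two assumptions: the host float is IEEE 754 single precision, and the assembler stores the integer 32 unconverted, as the comment on the slide implies. How the SHARC itself treats such a denormal operand is not modelled. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit integer bit pattern as a single-precision float. */
float bits_as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Compare "F0 = 32" (integer bit pattern) with "F0 = 32.0" (float value). */
void check_integer_in_float_register(void) {
    float wrong = bits_as_float(32u);   /* bit pattern 0x00000020 */
    float right = 32.0f;
    assert(sizeof(float) == sizeof(uint32_t));
    assert(wrong != right);
    assert(wrong < 1.0e-30f);           /* nowhere near 32 */
}
```

Adding such a value to the scaled temperature would leave the result effectively unchanged, which is why the constant must be written as 32.0.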
Avoid this error
    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
LOOP_END: dm(pt, 1) = scratchF2;            INTENDED LAST LINE OF LOOP

    LCNTR = INPAR2, DO LOOP_END UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code                  STILL LAST LINE OF LOOP
First line of “rest of code” has now become part of loop
Process to avoid the error
This particular error is going to be very easy to make, as the “Rest of the code” is going to look very similar to the “loop internals” once we have taken account of the ALU/FPU pipeline to maximize parallelism.
SUGGESTED APPROACH TO AVOID THIS TIME WASTING ERROR
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(0, pt);
    scratchF2 = scratchF2 * constantF4;
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END: Rest of the code
This was a process adopted from the compiler output -- the concept of a label was beyond most people in ENCM415
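The effect of the label position can be counted with a toy model. This is a sketch only, assuming (as the slides describe) that the operand of DO ... UNTIL names the address of the last instruction executed inside the loop. The addresses and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Instructions executed per pass when the DO ... UNTIL operand
   resolves to the address of the LAST instruction in the loop. */
size_t loop_body_length(size_t first_addr, size_t end_operand_addr) {
    return end_operand_addr - first_addr + 1;
}

/* Five loop instructions at addresses 0..4; LOOP_END labels address 5,
   the first instruction of the "rest of the code". */
void check_loop_end_label(void) {
    const size_t first = 0;
    const size_t loop_end_label = 5;
    /* DO LOOP_END: the first line of the rest of the code joins the loop. */
    assert(loop_body_length(first, loop_end_label) == 6);
    /* DO LOOP_END - 1: only the five intended instructions repeat. */
    assert(loop_body_length(first, loop_end_label - 1) == 5);
}
```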
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
Move algorithm to “Resource Usage Chart”
Optimize using techniques (attempt to)
Compare and contrast time -- setup and loop
Speed rules for memory access
    scratch = dm(0, pt);
    scratch = dm(pt, 0);          // Not ++ as to be re-used
    dm(pt, 1) = scratch;
Use of constants as modifiers is not allowed -- not enough bits in the opcode -- need 32 bits for each constant
Must use modify registers to store these constants
Several useful constants are placed in modify registers (DAG1 and DAG2) during “C-code” initialization (if linked in)
    scratch = dm(pt, zeroDM);     // Not ++ as to be re-used
    dm(pt, plus1DM) = scratch;
Can’t use PREMODIFY, PERIOD
Can’t use POST MODIFY OPERATIONS with CONSTANTS
Speed rules IF you want adds and multiplies to occur on the same line
F1 = F2 * F3, F4 = F5 + F6;
    Want to do as a single instruction
    Not enough bits in the opcode
        Register description 4 + 4 + 4 + 4 + 4 + 4 (bits)
        Plus bits for describing math operations, conditions and memory ops?
Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    Must rearrange register usage within program code for this to be possible
    Register description 4 + 2 + 2 + 4 + 2 + 2 (bits) -- other bits “understood”
    Inconvenient rather than limiting
e.g.          F6 = F0 * F4, F7 = F8 + F12, F9 = F8 - F12;
Not accepted  F6 = F4 * F0, F7 = F8 + F12, F9 = F8 - F12;
Not accepted  F7 = F8 + F12, F9 = F8 - F12, F6 = F0 * F4;
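The source-register rule above is easy to encode as a check. This is a sketch of the rule exactly as stated on the slide, not a model of the assembler; the function names are hypothetical. Destinations keep a full 4-bit field on the slide, so they are not restricted here.

```c
#include <assert.h>
#include <stdbool.h>

/* True if register number r lies in the group of four starting at lo. */
static bool in_group(int r, int lo) { return r >= lo && r <= lo + 3; }

/* Multiplier sources: first from F0-F3, second from F4-F7 (order matters). */
bool mul_sources_ok(int x, int y) { return in_group(x, 0) && in_group(y, 4); }

/* ALU sources: first from F8-F11, second from F12-F15. */
bool alu_sources_ok(int x, int y) { return in_group(x, 8) && in_group(y, 12); }

/* A multiply and an add fit in one instruction only if both pairs fit. */
bool mul_add_pair_ok(int mx, int my, int ax, int ay) {
    return mul_sources_ok(mx, my) && alu_sources_ok(ax, ay);
}

void check_register_groups(void) {
    assert(mul_add_pair_ok(0, 4, 8, 12));   /* F6 = F0 * F4, F7 = F8 + F12 */
    assert(!mul_sources_ok(4, 0));          /* F6 = F4 * F0 is not accepted */
    assert(!mul_add_pair_ok(2, 3, 5, 6));   /* F1 = F2 * F3, F4 = F5 + F6 */
    assert(mul_sources_ok(2, 4));           /* scratchF2 * constantF4 */
    assert(!alu_sources_ok(2, 0));          /* scratchF2 + F0_32 */
}
```

The last two checks correspond to the straight-conversion code: its multiply already satisfies the rule, while its add does not.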
When should we worry about the register assignment?
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, 0);                  // Not ++ as to be re-used
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
    dm(pt, 1) = scratchF2;
LOOP_END:
Check on required register use
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
        Are there special requirements here on F2 -- becomes source later??
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
#define F0_32 F0                            // Must be float
    F0_32 = 32.0;
    scratchF2 = scratchF2 + F0_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;
Register re-assignment -- Step 1
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2                        -- OKAY
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4  // Must be float     -- OKAY
    constantF4 = 1.8;
    scratchF2 = scratchF2 * constantF4;     -- SOURCES okay here
        Fn = F(0, 1, 2 or 3) * F(4, 5, 6 or 7)
#define F0_32 F0       // Must be float
    F0_32 = 32.0;                           -- WRONG to use F0 here -- ADDITION
    scratchF2 = scratchF2 + F0_32;          -- WRONG to use F2 as DEST early
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;            -- OKAY
Register re-assignment -- Step 2
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;
    scratchF8 = scratchF2 * constantF4;
        answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                          // INPAR3 is available
    F12_32 = 32.0;
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;
Fix poor coding practice -- “C” or assembly
#define count scratchR1
#define pt scratchDMpt
#define scratchF2 F2
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
    scratchF2 = dm(pt, zeroDM);
// INPAR1 (R4) is dead -- can reuse
#define constantF4 F4                       // Must be float
    constantF4 = 1.8;                       MOVE OUTSIDE LOOP
    scratchF8 = scratchF2 * constantF4;
        answer must be in F(8, 9, 10 or 11)
#define F12_32 F12                          // INPAR3 is available
    F12_32 = 32.0;                          MOVE OUTSIDE LOOP
    scratchF2 = scratchF8 + F12_32;
        Fm = F(8, 9, 10 or 11) + F(12, 13, 14 or 15)
    dm(pt, plus1DM) = scratchF2;
Resource Management -- Chart1 -- Basic code
ADDER              MULTIPLIER          DM ACCESS                PM ACCESS
_Convert:
    pt = INPAR1;
    F12_32 = 32.0;                          // bring constants outside the loop
    F4_1_8 = 1.8;
    LCNTR = INPAR2, DO LOOP_END - 1 UNTIL LCE;
                                       F2 = dm(pt, ZERODM)
                   F8 = F2 * F4_1_8
F2 = F8 + F12_32
                                       dm(pt, PLUS1DM) = F2
LOOP_END:
    5 magic lines of “C”
Time = 4 + N * 4 + 5 + 5 to do the call
In theory -- if we could find out how to do *, + and dm in parallel
    DATA-BUS is the limiting resource: two dm accesses (one read, one write) per pass
    2 cycle loop possible
Before proceeding -- Is 2 cycle loop needed? Is 2 cycle loop enough?
Process for developing parallel code
Rewrite the “C” code using “LOAD/STORE” techniques
    Accounts for the SHARC super scalar RISC DSP architecture
Write the assembly code using a hardware loop
    Check that end of loop label is in the correct place
Rewrite the assembly code using instructions that could be used in parallel if you could find the correct optimization approach
    Means -- place values in appropriate registers to permit parallelism BUT don’t actually write the parallel operations at this point
Move algorithm to “Resource Usage Chart”
Optimize parallelism using techniques
    Attempt to -- watch out for special situations where code will fail
Compare and contrast time -- setup and loop
Un-roll the loop
For various methods of “unrolling the loop” see papers by Jeanne Anne Booth
Final Exam question -- What are relative advantages of the various techniques (with examples)?
Resource 2 -- unroll the loop -- 5 times here
ADDER              MULTIPLIER          DM ACCESS
                                       F2 = dm(pt, ZERODM)      R1
                   F8 = F2 * F4_1_8                             M1
F2 = F8 + F12_32                                                A1
                                       dm(pt, PLUS1DM) = F2     W1
                                       F2 = dm(pt, ZERODM)      R2
                   F8 = F2 * F4_1_8                             M2
F2 = F8 + F12_32                                                A2
                                       dm(pt, PLUS1DM) = F2     W2
                                       F2 = dm(pt, ZERODM)      R3
                   F8 = F2 * F4_1_8                             M3
F2 = F8 + F12_32                                                A3
                                       dm(pt, PLUS1DM) = F2     W3
                                       F2 = dm(pt, ZERODM)      R4
                   F8 = F2 * F4_1_8                             M4
F2 = F8 + F12_32                                                A4
                                       dm(pt, PLUS1DM) = F2     W4
                                       F2 = dm(pt, ZERODM)      R5
                   F8 = F2 * F4_1_8                             M5
F2 = F8 + F12_32                                                A5
                                       dm(pt, PLUS1DM) = F2     W5
Each pass through the loop involves Read, Multiply, Add, Write
Resource Management 3 -- identify resource usage during decode and writeback stages of each instruction
ADDER              MULTIPLIER          DM ACCESS                Reading           Writing
                                       F2 = dm(pt, ZERODM)      Decode(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)
                                       F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)
                                                                                  Writeback(Mem)
Model used -- depends on where operands are relative to equals sign
    ‘Reading’ -- fetching things for ALU/FPU -- like 68K decode phase
    ‘Writeback’ -- storing results from ALU/FPU
THESE PHASES ARE ‘CONCEPTS’ RATHER THAN ‘IMPLEMENTED’
Resource Management 4 -- Check what can be moved in parallel with other instructions
ADDER              MULTIPLIER          DM ACCESS                Reading           Writing          Move?
                                       F2 = dm(pt, ZERODM)      Decode(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)    NO
                                       F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(Mem)   NO
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)
                                                                                  Writeback(Mem)
Callouts on the slide:
    OKAY TO MOVE -- F2 src freed up before F2 dest occurs
    OKAY TO MOVE -- empty spot, if can move the * and + instructions which this instruction MUST follow
    NO !!! or just possible NO? Why a problem? F2 =
Memory resource availability
Move up F2 = dm(pt, ZERODM) from second loop into first loop
However, now we have a possible conflict about which F2 should be used for the
    dm(pt, plus1DM) = F2
instruction if we further optimize by trying to fill the other empty delay slots -- see next slide
Resource management -- Overlapping two parts of the loop
ADDER              MULTIPLIER          DM ACCESS                Reading           Writing          Move?
                                       F2 = dm(pt, ZERODM)      Decode(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)    NO
                                       F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(Mem)   NO
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)
                                                                                  Writeback(Mem)
Resource Management 5 -- What’s up, Doc? Attempting to fill all unused resource availability
ADDER              MULTIPLIER          DM ACCESS                Reading           Writing          Move?
                                       F2 = dm(pt, ZERODM)      Decode(Mem)
                   F8 = F2 * F4_1_8    F2 =                     Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32   F8 =                F2 =                     Decode(F8,F12)    Writeback(F8)
F2 =               F8 =                dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)    NO
                                       F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(Mem)   NO
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F2 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F2     Decode(F2)        Writeback(F2)
                                                                                  Writeback(Mem)
(“F2 =” and “F8 =” mark instructions from later passes moved up into free slots)
Why spend time on simulating the algorithm to see if the problem really exists when there is a simple solution -- use different registers
Problem may/may not exist with this simple example but is very likely to exist in a more complex algorithm
Resource 6 -- Solution -- Save and then use F9
ADDER              MULTIPLIER          DM ACCESS                Reading           Writing
                                       F2 = dm(pt, ZERODM)      Decode(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F9 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F9     Decode(F9)        Writeback(F9)
                                       F2 = dm(pt, ZERODM)      Decode(Mem)       Writeback(Mem)
                   F8 = F2 * F4_1_8                             Decode(F2,F4)     Writeback(F2)
F9 = F8 + F12_32                                                Decode(F8,F12)    Writeback(F8)
                                       dm(pt, PLUS1DM) = F9     Decode(F9)        Writeback(F9)
                                                                                  Writeback(Mem)
Resource Management 7 -- Some parallelism possible with Read, Mult, Add and Write mixed across 5 loop comps.
ADDER              MULTIPLIER          DM ACCESS
                                       F2 = dm(pt, ZERODM)      R1
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32   F8 = F2 * F4_1_8                             A1, M2
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W1, A2
                                       dm(pt, PLUS1DM) = F9     W2
                                       F2 = dm(pt, ZERODM)      R3
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32   F8 = F2 * F4_1_8    STALL                    A3, M4
F9 = F8 + F12_32   STALL               dm(pt, PLUS1DM) = F9     W3, A4
STALL              STALL               dm(pt, PLUS1DM) = F9     W4
STALL              STALL               F2 = dm(pt, ZERODM)      R5
STALL              F8 = F2 * F4_1_8                             M5
F9 = F8 + F12_32                                                A5
                                       dm(pt, PLUS1DM) = F9     W5
Problem 1 -- No resource in maximum usage -- code inefficient
Problem 2 -- Worth about 50% on an exam question on parallelism. We have answered “Optimize the straight line code for a loop of the form ‘for count = 0, count < 5’” -- What if loop size 2048 or more?
WRONG -- CONCEPT GOOD, IMPLEMENTATION BAD, as we are no longer indexing correctly through the data.
(Same chart as Resource Management 7.) The reads use dm(pt, ZERODM) and only the writes step pt, so R2 is issued before W1 has moved pt and fetches the same element as R1.
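The indexing fault can be reproduced in C for the first two elements of the overlapped schedule. This is a sketch: array indices stand in for the index register, and within each line the statements are ordered so that every source is read before it is overwritten, mimicking the parallel instruction. The second function shows one possible repair with separate read and write indices. That repair is an assumption for illustration and is not taken from the slides.

```c
#include <assert.h>

/* Chart schedule with ONE shared pointer: reads do not step pt, writes do. */
void convert2_shared_pointer(float *t) {
    int pt = 0;
    float f2, f8, f9;
    f2 = t[pt];                               /* R1 */
    f8 = f2 * 1.8f;  f2 = t[pt];              /* M1, R2: pt has not moved yet */
    f9 = f8 + 32.0f; f8 = f2 * 1.8f;          /* A1, M2 */
    t[pt++] = f9;    f9 = f8 + 32.0f;         /* W1, A2 */
    t[pt++] = f9;                             /* W2 */
}

/* Same schedule with separate read and write indices (assumed repair). */
void convert2_separate_pointers(float *t) {
    int rd = 0, wr = 0;
    float f2, f8, f9;
    f2 = t[rd++];                             /* R1 */
    f8 = f2 * 1.8f;  f2 = t[rd++];            /* M1, R2 */
    f9 = f8 + 32.0f; f8 = f2 * 1.8f;          /* A1, M2 */
    t[wr++] = f9;    f9 = f8 + 32.0f;         /* W1, A2 */
    t[wr++] = f9;                             /* W2 */
}

void check_indexing(void) {
    float a[2] = {0.0f, 100.0f};
    float b[2] = {0.0f, 100.0f};
    convert2_shared_pointer(a);
    convert2_separate_pointers(b);
    assert(a[0] == 32.0f && a[1] == 32.0f);   /* element 1 was never read */
    assert(b[0] == 32.0f && b[1] > 211.9f && b[1] < 212.1f);
}
```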
Need 1 resource to be maxed out
Otherwise algorithm is inefficient
Have to try a lot of different approaches
Here is my code
Resource Management 8 -- Unroll the loop a bit more -- 9 loop components
ADDER              MULTIPLIER          DM ACCESS
                                       F2 = dm(pt, ZERODM)      R1
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32   F8 = F2 * F4_1_8                             A1, M2
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W1, A2
                                       dm(pt, PLUS1DM) = F9     W2
                                       F2 = dm(pt, ZERODM)      R3
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      A3, M4, R5
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9     W3, A4, M5
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W4, A5
                                       dm(pt, PLUS1DM) = F9     W5
                                       F2 = dm(pt, ZERODM)      R6
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M6, R7
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      A6, M7, R8
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9     W6, A7, M8
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W7, A8
                                       dm(pt, PLUS1DM) = F9     W8
DM BUS USAGE NOW MAXed OUT (after a while)
CODE PATTERN APPEARING
Now to “reroll the loop”
The loop is currently just straight line coded.
Must put back into the “loop format” for coding efficiency, maintainability and seg_pmco limitations.
Three components of “rerolled loop” for a loop of form “count = 0, count < N”
    Fill the ALU/FPU pipeline (typically 1 stage from loop)
    Overlap N - 2 stages
    Empty the ALU/FPU pipeline (typically 1 stage)
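The fill, overlap and empty structure can be sketched in C. This is an illustration under stated assumptions, not the course solution: it models a four-stage Read, Multiply, Add, Write pipeline, so the overlap section runs N - 3 times rather than the slide's N - 2. It uses separate read and write indices, and each overlap pass contains both a read and a write, which on the hardware would need two DM bus cycles. The function names are hypothetical.

```c
#include <assert.h>

/* Reference: one element per pass, Read -> Multiply -> Add -> Write. */
void convert_simple(float *t, int n) {
    for (int i = 0; i < n; i++) t[i] = t[i] * 1.8f + 32.0f;
}

/* Rerolled version: statements in each step are ordered so that every
   register is read before it is overwritten, as in a parallel instruction. */
void convert_rerolled(float *t, int n) {
    if (n < 3) { convert_simple(t, n); return; }   /* too short to pipeline */
    int rd = 0, wr = 0;
    float f2, f8, f9;
    /* FILL the pipeline */
    f2 = t[rd++];
    f8 = f2 * 1.8f;  f2 = t[rd++];
    f9 = f8 + 32.0f; f8 = f2 * 1.8f; f2 = t[rd++];
    /* OVERLAP: write, add, multiply and read for four different elements */
    for (int i = 0; i < n - 3; i++) {
        t[wr++] = f9;
        f9 = f8 + 32.0f;
        f8 = f2 * 1.8f;
        f2 = t[rd++];
    }
    /* EMPTY the pipeline */
    t[wr++] = f9; f9 = f8 + 32.0f; f8 = f2 * 1.8f;
    t[wr++] = f9; f9 = f8 + 32.0f;
    t[wr++] = f9;
}

static float absdiff(float a, float b) { return a > b ? a - b : b - a; }

void check_rerolled(void) {
    float a[7] = {-40.0f, 0.0f, 10.0f, 25.0f, 37.0f, 100.0f, 200.0f};
    float b[7] = {-40.0f, 0.0f, 10.0f, 25.0f, 37.0f, 100.0f, 200.0f};
    convert_simple(a, 7);
    convert_rerolled(b, 7);
    for (int i = 0; i < 7; i++) assert(absdiff(a[i], b[i]) < 1.0e-3f);
}
```

The fallback for short arrays reflects the same concern raised later about loop counts that do not fit the rerolled pattern.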
Resource Management 9 -- Identify the loop components
(Same chart as Resource Management 8, with three regions marked on it.)
    FILL ALU/FPU PIPE
    LOOP BODY
    EMPTY ALU/FPU PIPELINE
Resource 9 -- Final code version
ADDER              MULTIPLIER          DM ACCESS
_Convert:
    Modify(CTOPofSTACK, -1);
    dm(FP, -2) = R9;
    pt = INPAR1;
    F12_32 = 32.0;                          // bring constants outside the loop
    F4_1_8 = 1.8;
                                       F2 = dm(pt, ZERODM)      R1
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M1, R2
F9 = F8 + F12_32   F8 = F2 * F4_1_8                             A1, M2
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W1, A2
                                       dm(pt, PLUS1DM) = F9     W2
    LCNTR = (N-2)/3, DO LOOP_END - 1 UNTIL LCE;
                                       F2 = dm(pt, ZERODM)      R3
                   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      M3, R4
F9 = F8 + F12_32   F8 = F2 * F4_1_8    F2 = dm(pt, ZERODM)      A3, M4, R5
F9 = F8 + F12_32   F8 = F2 * F4_1_8    dm(pt, PLUS1DM) = F9     W3, A4, M5
F9 = F8 + F12_32                       dm(pt, PLUS1DM) = F9     W4, A5
                                       dm(pt, PLUS1DM) = F9     W5
LOOP_END:
    R9 = dm(FP, -2);
    5 magic lines of C
Regions marked on the slide: FILL, USE and EMPTY the ALU/FPU pipe
Speed improvements
BEFORE
    START LOOP EXIT ENTRY
    4 + N * 4 + 5 + 5 = 14 + 4 * N
NOW with 2-fold loop unfolding
    START LOOP EXIT ENTRY
    4 + 7 + (N - 2) * 5 / 2 + 5 + 8 + 5 = 24 + 2.5 * N
NOW with 3-fold loop unfolding
    START LOOP EXIT ENTRY
    4 + 5 + (N - 2) * 6 / 3 + 5 + 1 + 5 = 16 + 2 * N
Factor of 4 / 2.5 with a little effort -- Factor of 4 / 2 with more effort
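The three cycle counts can be compared for a given N with a few lines of C. This sketch simply transcribes the formulas from the slide; the function names are hypothetical. It shows that the quoted factors are approached only for large N, and that for very short loops the 2-fold version is slower than the basic code.

```c
#include <assert.h>

/* Cycle counts transcribed from the slide. */
double cycles_basic(int n) { return 4 + 4.0 * n + 5 + 5; }
double cycles_2fold(int n) { return 4 + 7 + (n - 2) * 5 / 2.0 + 5 + 8 + 5; }
double cycles_3fold(int n) { return 4 + 5 + (n - 2) * 6 / 3.0 + 5 + 1 + 5; }

void check_cycle_counts(void) {
    /* Simplified forms from the slide: 14 + 4N, 24 + 2.5N, 16 + 2N. */
    assert(cycles_basic(2048) == 14 + 4.0 * 2048);
    assert(cycles_2fold(2048) == 24 + 2.5 * 2048);
    assert(cycles_3fold(2048) == 16 + 2.0 * 2048);
    /* Short loops: 2-fold unfolding loses at N = 5 and wins at N = 8. */
    assert(cycles_2fold(5) > cycles_basic(5));
    assert(cycles_2fold(8) < cycles_basic(8));
}
```

For N = 2048 the ratio cycles_basic / cycles_3fold is just under 2, which matches the "factor of 4 / 2" claim.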
Question to Ask
We now know the final code
Should we have made the substitution F2 to F9?
    Who cares -- do it anyway as more likely to be necessary rather than unnecessary in most algorithms!
    No real disadvantage since we can probably overlap the save and recovery of the non-volatile R9 with other instructions!
Will the code work?
Resource 9 -- Final code version
(Code as on the previous “Resource 9 -- Final code version” slide.)
N = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
Only works if (N - 2) / 3 is an integer.
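The restriction on N can be written as a guard. This is a sketch with a hypothetical function name. It also treats N = 2 as unsupported, on the assumption that a loop count of zero is not a safe way to skip a hardware loop. That assumption is not stated on the slides.

```c
#include <assert.h>

/* Loop count for the rerolled code, or -1 if this N is not handled:
   two elements are processed before the loop and three per loop pass. */
int rerolled_loop_count(int n) {
    if (n < 5) return -1;                 /* needs at least one loop pass */
    if ((n - 2) % 3 != 0) return -1;      /* (N - 2) / 3 must be an integer */
    return (n - 2) / 3;
}

/* Of N = 0..14, only 5, 8, 11 and 14 are handled. */
void check_supported_n(void) {
    int supported = 0;
    for (int n = 0; n <= 14; n++) {
        if (rerolled_loop_count(n) >= 0) supported++;
    }
    assert(supported == 4);
    assert(rerolled_loop_count(5) == 1);
    assert(rerolled_loop_count(14) == 4);
    assert(rerolled_loop_count(6) == -1);
}
```

A caller could use the -1 result to fall back to the basic one-element-per-pass loop.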
Tackled today
What’s the problem?
Standard code development of “C” code
Process for “code with parallel instructions”
    Rewrite with specialized resources
    Move to “resource chart”
    Unroll the loop
    Adjust code
    Reroll the loop
    Check if worth the effort
To come -- Tutorial practice of parallel coding
To come -- Optimum FIR filter with parallelism