Page 1: Understanding the TigerSHARC ALU pipeline

Understanding the TigerSHARC ALU pipeline

Determining the speed of one stage of IIR filter – Part 2: Understanding the pipeline

Page 2: Understanding the TigerSHARC ALU pipeline

Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada, 04/19/23

Understanding the TigerSHARC ALU pipeline

- TigerSHARC has many pipelines
- If these pipelines stall, then the processor speed goes down
- Need to understand how the ALU pipeline works
  - Learn to use the pipeline viewer
  - Understanding what the pipeline viewer tells in detail
  - Avoiding having to use the pipeline viewer
  - Improving code efficiency
- Excel and Project (Gantt charts) are useful tools

Page 3: Understanding the TigerSHARC ALU pipeline

Register File and COMPUTE Units

Page 4: Understanding the TigerSHARC ALU pipeline

Simple Example: IIR -- Biquad

For (Stages = 0 to 3) Do
    S0 = Xin * H5 + S2 * H3 + S1 * H4
    Yout = S0 * H0 + S1 * H1 + S2 * H2
    S2 = S1
    S1 = S0

(Figure labels: S0, S1, S2)

Horrible IIR code example as can’t re-use in a loop

Works as a simple example for understanding TigerSHARC pipeline
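The biquad equations above can be sketched as a small reference model, which is useful later when checking that an assembly version still gives the right answer. The function name `iir_biquad_stage`, the coefficient list `H` and the two-element state list `S` are assumptions made for this sketch, not the course code.

```python
def iir_biquad_stage(x_in, H, S):
    """One biquad stage, following the slide's equations.

    H is [H0, H1, H2, H3, H4, H5]; S is [S1, S2] and is updated in place.
    (This data layout is an assumption made for this sketch.)
    """
    s1, s2 = S
    s0 = x_in * H[5] + s2 * H[3] + s1 * H[4]
    y_out = s0 * H[0] + s1 * H[1] + s2 * H[2]
    S[0], S[1] = s0, s1      # S2 = S1, then S1 = S0
    return y_out


state = [1.0, 2.0]
print(iir_biquad_stage(1.0, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], state))  # 27.0
print(state)                                                         # [19.0, 1.0]
```

With these made-up coefficients S0 = 1*6 + 2*4 + 1*5 = 19 and Yout = 19*1 + 1*2 + 2*3 = 27, which is easy to confirm by hand.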

Page 5: Understanding the TigerSHARC ALU pipeline

Code returns float when using XR8 register. NOTE: NOT XFR8

Page 6: Understanding the TigerSHARC ALU pipeline

Step 2 – Using C++ code as comments, set up the coefficients

    XFR0 = 0.0;;    Does not exist
    XR0 = 0.0;;     DOES EXIST

Bit-patterns require integer registers

Leave what you wanted to do behind as comments
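The remark that bit-patterns require integer registers can be illustrated by looking at a float constant as the 32-bit pattern that is actually loaded. This is a minimal sketch assuming the standard IEEE-754 single-precision format; the helper name `float_bits` is made up for illustration.

```python
import struct


def float_bits(value):
    """Return the 32-bit IEEE-754 single-precision pattern of value as an int."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


for v in (0.0, 1.0, 5.5):
    print(f"{v} -> 0x{float_bits(v):08X}")
# 0.0 -> 0x00000000
# 1.0 -> 0x3F800000
# 5.5 -> 0x40B00000
```

Seen this way, a line such as XR0 = 1.0 is a plain 32-bit register load of a pattern that later float instructions happen to interpret as 1.0.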

Page 7: Understanding the TigerSHARC ALU pipeline

Expect to take 8 cycles to execute

Page 8: Understanding the TigerSHARC ALU pipeline

PIPELINE STAGES
See page 8-34 of Processor manual

10 pipeline stages, but may be completely desynchronized (happen semi-independently)

- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU -- PreDecode, Decode, Integer, Access
- Compute Block -- EX1 and EX2

Page 9: Understanding the TigerSHARC ALU pipeline

Pipeline Viewer Result

XR0 = 1.0 enters PD stage at cycle 39825, enters EX2 stage at cycle 39830, is stored into XR0 at cycle 39831 -- 7 cycles execution time

Page 10: Understanding the TigerSHARC ALU pipeline

Pipeline Viewer Result

XR6 = 5.5 enters PD stage at cycle 39832, enters EX2 stage at cycle 39837, is stored into XR6 at cycle 39838 -- 7 cycles execution time

Each instruction takes 7 cycles, but one new result each cycle

Result – ONCE pipeline filled, 8 cycles = 8 register transfer operations

Key – don’t break pipeline with any jumps
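The arithmetic behind "each instruction takes 7 cycles but one new result each cycle" can be sketched as follows. The function names and the assumption of exactly one instruction line issued per cycle are illustrative only.

```python
def cycles_to_finish(num_instructions, latency=7, stalls=0):
    """Cycles from the first instruction entering the pipe to the last result.

    Assumes one instruction line issues per cycle when nothing stalls.
    """
    return latency + (num_instructions - 1) + stalls


def steady_state_cycles(num_instructions, stalls=0):
    """Cycles charged to the code once the pipeline is already full."""
    return num_instructions + stalls


print(cycles_to_finish(1))        # 7  -> a single instruction seen on its own
print(cycles_to_finish(8))        # 14 -> eight back-to-back register loads
print(steady_state_cycles(8))     # 8  -> the cost that matters once the pipe is full
```

A jump empties the pipe, so the 7-cycle fill cost is paid again, which is why the key point is not to break the pipeline.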

Page 11: Understanding the TigerSHARC ALU pipeline

Doing filter operations – generates different results

XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored 39839 – 7 cycles
XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored 39840 – 7 cycles
XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored 39842 – 8 cycles

WHY? – FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL

Page 12: Understanding the TigerSHARC ALU pipeline

Instruction 0x17e XFR8 = R8 + R23 is STALLED (waiting) for instruction 0x17d XFR23 = R8 * R4 to complete

Bubble B means that the pipeline is doing “nothing”

Meaning that the instruction shown is “place holder” (garbage)

Page 13: Understanding the TigerSHARC ALU pipeline

Information on Window Event Icons

Page 14: Understanding the TigerSHARC ALU pipeline

Result of Analysis

- Can’t use Float result immediately after calculation
- Writing

      XFR23 = R8 * R4;;
      XFR8 = R8 + R23;;    // MUST WAIT FOR XFR23 calculation to be completed

- Is the same as coding

      XFR23 = R8 * R4;;
      NOP;;                // Note DOUBLE ;; -- extra cycle because of stall
      XFR8 = R8 + R23;;

- Proof – write the code with the stalls shown in it
  - Writing this way means we don’t have to use the pipeline viewer all the time
  - Pipeline viewer is only available with (slow) simulator
  - #define SHOW_ALU_STALL nop
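The stall rule found with the pipeline viewer can be turned into a small cycle counter, so that stalls are predicted on paper instead of in the simulator. This is a sketch under an assumed latency table (a float result is not usable by the very next instruction line, a register move is); the names `LATENCY` and `count_cycles` are hypothetical.

```python
# Assumed result latencies, in instruction lines, for this sketch only.
LATENCY = {"move": 1, "fmul": 2, "fadd": 2}


def count_cycles(program):
    """Return (total_cycles, stalls) for a list of (kind, dest, sources)."""
    ready = {}      # register -> first cycle at which it may be read
    cycle = 0       # cycle at which the next instruction would issue
    stalls = 0
    for kind, dest, sources in program:
        earliest = max([ready.get(r, 0) for r in sources] + [cycle])
        stalls += earliest - cycle
        if dest is not None:
            ready[dest] = earliest + LATENCY[kind]
        cycle = earliest + 1
    return cycle, stalls


without_nop = [
    ("move", "R8", ["R6"]),           # XR8 = XR6;;
    ("fmul", "R23", ["R8", "R4"]),    # XFR23 = R8 * R4;;
    ("fadd", "R8", ["R8", "R23"]),    # XFR8 = R8 + R23;;  must wait for R23
]
with_nop = without_nop[:2] + [("nop", None, [])] + without_nop[2:]

print(count_cycles(without_nop))   # (4, 1)
print(count_cycles(with_nop))      # (4, 0)
```

Both versions take 4 cycles: the explicit nop occupies the slot the hardware would otherwise fill with a bubble.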

Page 15: Understanding the TigerSHARC ALU pipeline

Code with stalls shown

8 code lines, 5 expected stalls

Expect 13 cycles to complete if theory is correct

Page 16: Understanding the TigerSHARC ALU pipeline

Analysis approach IS correct

Same speed with and without nops

Page 17: Understanding the TigerSHARC ALU pipeline

Process for coding for improved speed – code re-organization

- Make a copy of the code so can test iirASM( ) and iirASM_Optimized( ) to make sure get correct result
- Make a table of code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all “temp operations” use different register
- Move instructions “forward” to fill delay slots, BUT don’t break data dependencies
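The "identify data dependencies" step can be sketched as a scan that links every register read to the most recent earlier write. The instruction table below is partly hypothetical (lines 2 and 3 are invented to give the scan something to find), and `find_dependencies` is an illustrative name.

```python
def find_dependencies(program):
    """List (producer_index, consumer_index, register) read-after-write pairs.

    program is a list of (dest, sources); each consumer is linked to the
    most recent earlier instruction that wrote the register it reads.
    """
    last_writer = {}
    deps = []
    for i, (dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


program = [
    ("R23", ["R8", "R4"]),     # 0: XFR23 = R8 * R4
    ("R8", ["R8", "R23"]),     # 1: XFR8  = R8 + R23
    ("R23", ["R10", "R3"]),    # 2: XFR23 = R10 * R3   (hypothetical line)
    ("R8", ["R8", "R23"]),     # 3: XFR8  = R8 + R23   (hypothetical line)
]
print(find_dependencies(program))
# [(0, 1, 'R23'), (1, 3, 'R8'), (2, 3, 'R23')]
```

Each tuple is one arrow to draw in the Excel table or Gantt chart.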

Page 18: Understanding the TigerSHARC ALU pipeline

Copy and paste to make IIRASM_Optimized( )

Page 19: Understanding the TigerSHARC ALU pipeline

Need to re-order instructions to fill delay slots with useful instructions

After refactoring code to fill delay slots, must run tests to ensure that still have the correct result

Change – and “retest”

NOT EASY TO DO

MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION

I USE EXCEL
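The "change and retest" habit can be sketched as a comparison of the original routine against each re-ordered version over the same inputs. The three small functions below are stand-ins for iirASM( ) and iirASM_Optimized( ), invented for illustration; they are not the course code.

```python
def same_results(reference, candidate, inputs, tolerance=0.0):
    """Return True if both callables agree on every input within tolerance."""
    return all(abs(reference(x) - candidate(x)) <= tolerance for x in inputs)


def original(x):
    t1 = x * 0.5
    t2 = x * 0.25
    return t1 + t2


def reordered(x):
    t2 = x * 0.25        # independent multiply moved up: still correct
    t1 = x * 0.5
    return t1 + t2


def broken(x):
    t1 = x * 0.5
    return t1 + t1       # a data dependency was broken during the move


tests = [0.0, 1.0, -2.0, 5.5]
print(same_results(original, reordered, tests))   # True
print(same_results(original, broken, tests))      # False
```

Running this after every single move pins a failure on the most recent change.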

Page 20: Understanding the TigerSHARC ALU pipeline

Show resource usage and data dependencies

All temporary register usage involves the SAME XFR23 register

This typically stalls out the processor

Page 21: Understanding the TigerSHARC ALU pipeline

Change all temporary registers to use different register names

Then check code produces correct answer

All temporary register usage involves a DIFFERENT register

ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING
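Giving every temporary its own register can be sketched as a renaming pass over the instruction table. This assumes the programmer supplies a list of registers known to be free (R24 here is a hypothetical choice); `rename_temporaries` is an illustrative name.

```python
def rename_temporaries(program, temp, fresh_names):
    """Give each write to the shared register `temp` its own register.

    program is a list of (dest, sources). Reads of `temp` are pointed at
    the most recent renamed write. fresh_names must hold enough free
    registers; which registers are free is up to the programmer.
    """
    fresh = iter(fresh_names)
    current = temp
    renamed = []
    for dest, sources in program:
        sources = [current if r == temp else r for r in sources]
        if dest == temp:
            current = next(fresh)
            dest = current
        renamed.append((dest, sources))
    return renamed


program = [
    ("R23", ["R8", "R4"]),     # XFR23 = R8 * R4
    ("R8", ["R8", "R23"]),     # XFR8  = R8 + R23
    ("R23", ["R10", "R3"]),    # XFR23 = R10 * R3   (hypothetical line)
    ("R8", ["R8", "R23"]),     # XFR8  = R8 + R23   (hypothetical line)
]
for line in rename_temporaries(program, "R23", ["R23", "R24"]):
    print(line)
# ('R23', ['R8', 'R4'])
# ('R8', ['R8', 'R23'])
# ('R24', ['R10', 'R3'])
# ('R8', ['R8', 'R24'])
```

After renaming, the second multiply no longer has to wait for the first add to finish reading R23, so it becomes free to move.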

Page 22: Understanding the TigerSHARC ALU pipeline

Move instructions forward, without breaking data dependencies

What appears possible!

DO one thing at a time and then check that code still works
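Whether an instruction may be moved forward without breaking a data dependency can be checked mechanically, one adjacent swap at a time. This is a minimal sketch over (dest, sources) pairs; `can_swap` and `move_up` are hypothetical helper names, and the R24 line is an invented example.

```python
def can_swap(earlier, later):
    """True if two adjacent instructions (dest, sources) may trade places."""
    e_dest, e_src = earlier
    l_dest, l_src = later
    if e_dest in l_src:      # later reads what earlier writes
        return False
    if l_dest in e_src:      # later overwrites a value earlier still needs
        return False
    if l_dest == e_dest:     # both write the same register
        return False
    return True


def move_up(program, index):
    """Move program[index] as far up as dependencies allow; return a new list."""
    program = list(program)
    while index > 0 and can_swap(program[index - 1], program[index]):
        program[index - 1], program[index] = program[index], program[index - 1]
        index -= 1
    return program


shared = [("R23", ["R8", "R4"]), ("R8", ["R8", "R23"]), ("R23", ["R10", "R3"])]
renamed = [("R23", ["R8", "R4"]), ("R8", ["R8", "R23"]), ("R24", ["R10", "R3"])]

print(move_up(shared, 2) == shared)   # True: the shared R23 blocks the move
print(move_up(renamed, 2)[0])         # ('R24', ['R10', 'R3']): moved to the top
```

The check is deliberately conservative: it refuses any swap that touches a shared register, which matches the advice to do one thing at a time and retest.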

Page 23: Understanding the TigerSHARC ALU pipeline

Check that code still operates

1 cycle saved

Have put “our” marker stall instruction in parallel with moved instruction using ; rather than ;;

Move this instruction up in code sequence to fill delay slot

Check that code still runs after this optimization stage

Page 24: Understanding the TigerSHARC ALU pipeline

Move next multiplication up. NOTE certain stalls remain, although reason for STALL changes from why they were inserted before

Page 25: Understanding the TigerSHARC ALU pipeline

Move up the R10 and R9 assignment operations -- check

4 cycle improvement?

Page 26: Understanding the TigerSHARC ALU pipeline

CHECK THE PIPELINE AFTER TESTING

Page 27: Understanding the TigerSHARC ALU pipeline

Are there still more improvements possible? (I can see 4 more moves)

Page 28: Understanding the TigerSHARC ALU pipeline

Problems with approach

- Identifying all the data dependencies
- Keep track of how the data dependencies change as you move the code around
- Handling all of this “automatically”
- I started the following design tool as something that might work, but it actually turned out very useful.

M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don’t say “CAN’T do it - Say “Gantt it”! The irony of organizing microprocessors with a big business tool" Circuit Cellar magazine, Vol. 184, pp 26 - 35, November 2005.

Page 29: Understanding the TigerSHARC ALU pipeline

Using Microsoft Project – Step 1

Page 30: Understanding the TigerSHARC ALU pipeline

Add dependencies and resource usage – then activate level
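What a leveling pass does with dependencies and resource usage can be roughly imitated by a greedy scheduler: start each task as soon as its predecessors have finished and the shared resource is free. This is a sketch only, not a description of how Microsoft Project works internally; the task names, the 2-cycle latencies and the one-start-per-cycle resource are assumptions, and the task table must not contain missing or circular dependencies.

```python
def level(tasks):
    """Greedy schedule: name -> start cycle.

    tasks maps name -> (latency, [names it depends on]). One task may
    start per cycle (a single shared resource).
    """
    start = {}
    cycle = 0
    remaining = dict(tasks)
    while remaining:
        ready = [
            name for name, (_, deps) in remaining.items()
            if all(d in start and start[d] + tasks[d][0] <= cycle for d in deps)
        ]
        if ready:
            name = ready[0]          # first listed ready task wins
            start[name] = cycle
            del remaining[name]
        cycle += 1                   # a cycle with nothing ready is a stall
    return start


tasks = {
    "mul1": (2, []),
    "add1": (2, ["mul1"]),
    "mul2": (2, []),
    "add2": (2, ["add1", "mul2"]),
}
print(level(tasks))
# {'mul1': 0, 'mul2': 1, 'add1': 2, 'add2': 4}
```

The second multiply is pulled into the slot that would otherwise be a stall after mul1, which is the same move made by hand in the Excel table.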

Page 31: Understanding the TigerSHARC ALU pipeline

Microsoft Project as a microprocessor design tool

Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays

Page 32: Understanding the TigerSHARC ALU pipeline

Understanding the TigerSHARC ALU pipeline

- TigerSHARC has many pipelines
- If these pipelines stall, then the processor speed goes down
- Need to understand how the ALU pipeline works
  - Learn to use the pipeline viewer
  - Understanding what the pipeline viewer tells in detail
  - Avoiding having to use the pipeline viewer
  - Improving code efficiency
- Excel and Project (Gantt charts) are useful tools