DSP Lecture 02

Chapter 2: TMS320C6000 Architectural Overview

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2004

Learning Objectives

- Describe C6000 CPU architecture.
- Introduce some basic instructions.
- Describe the C6000 memory map.
- Provide an overview of the peripherals.
- Performance
- Software Development


General DSP System Block Diagram

[Figure: block diagram showing the Central Processing Unit, Internal Memory, Internal Buses, Peripherals and External Memory.]


Implementation of Sum of Products (SOP)

It has been shown in Chapter 1 that SOP is the key element for most DSP algorithms. So let's write the code for this algorithm and at the same time discover the C6000 architecture.


Implementation of Sum of Products (SOP)

Y = a1 * x1 + a2 * x2 + ... + aN * xN

Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition

Therefore two basic instructions are required. So let's implement the SOP algorithm!

The implementation in this module will be done in assembly.


Multiply (MPY)

The multiplication of a1 by x1 is done in assembly by the following instruction:

    MPY a1, x1, Y

This instruction is performed by a multiplier unit that is called .M


Multiply (.M unit)

The .M unit performs multiplications in hardware.

    MPY .M a1, x1, Y

Note: a 16-bit by 16-bit multiplier provides a 32-bit result; a 32-bit by 32-bit multiplier provides a 64-bit result.


Addition (.?)

    MPY .M a1, x1, prod
    ADD .? Y, prod, Y


Add (.L unit)

    MPY .M a1, x1, prod
    ADD .L Y, prod, Y

RISC processors such as the C6000 use registers to hold the operands, so let's change this code.


Register File - A

    MPY .M a1, x1, prod
    ADD .L Y, prod, Y

Let us correct this by replacing a, x, prod and Y by the registers of Register File A.


Specifying Register Names

    MPY .M A0, A1, A3
    ADD .L A4, A3, A4

The registers A0, A1, A3 and A4 contain the values to be used by the instructions. Register File A contains 16 registers (A0 - A15), which are 32 bits wide.


Data loading

Q: How do we load the operands into the registers?

Load Unit .D

A: The operands are loaded into the registers by loading them from the memory using the .D unit.

It is worth noting at this stage that the only way to access memory is through the .D unit.


Load Instructions (LDB, LDH, LDW, LDDW)

Q: Which instruction(s) can be used for loading operands from the memory to the registers?

A: The load instructions.


Using the Load Instructions

Before using the load unit you have to be aware that this processor is byte addressable, which means that each byte is represented by a unique address. Also, the addresses are 32 bits wide.

[Figure: data memory drawn 16 bits wide, with addresses 00000000, 00000002, 00000004, 00000006, 00000008, ... up to FFFFFFFF.]


Using the Load Instructions

The syntax for the load instruction is:

    LD *Rn, Rm

Where:
Rn is a register that contains the address of the operand to be loaded and
Rm is the destination register.

The question now is how many bytes are going to be loaded into the destination register?

The answer is that it depends on the instruction you choose:
- LDB: loads one byte (8-bit)
- LDH: loads a half word (16-bit)
- LDW: loads a word (32-bit)
- LDDW: loads a double word (64-bit)

Note: LD on its own does not exist.


Using the Load Instructions

[Figure: 16-bit wide data memory (addresses 00000000 to FFFFFFFF); the bytes 0x1 to 0x8 are the data read in the example below.]

Example: If we assume that A5 = 0x4 then:
(1) LDB *A5, A7       ; gives A7 = 0x00000001
(2) LDH *A5, A7       ; gives A7 = 0x00000201
(3) LDW *A5, A7       ; gives A7 = 0x04030201
(4) LDDW *A5, A7:A6   ; gives A7:A6 = 0x0807060504030201
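The four results in the example can be reproduced with a small host-side model. This is only a sketch under stated assumptions: the `mem` array and the function names `ldb`, `ldh`, `ldw` and `lddw` are hypothetical, the bytes 0x01 to 0x08 are assumed to sit at addresses 0x4 to 0xB in little-endian order (which is what the slide's results imply), and the values at addresses 0x0 to 0x3 are a guess at the figure.

```c
#include <stdint.h>

/* Hypothetical memory image: bytes 0x01..0x08 at addresses 0x4..0xB. */
static const uint8_t mem[16] = {
    0x0A, 0x0B, 0x0C, 0x0D,   /* addresses 0x0..0x3 (assumed from the figure) */
    0x01, 0x02, 0x03, 0x04,   /* addresses 0x4..0x7 */
    0x05, 0x06, 0x07, 0x08,   /* addresses 0x8..0xB */
    0x00, 0x00, 0x00, 0x00    /* addresses 0xC..0xF */
};

/* Read n bytes starting at addr, least significant byte first. */
static uint64_t load_le(uint32_t addr, int n)
{
    uint64_t v = 0;
    for (int i = 0; i < n; i++) {
        v |= (uint64_t)mem[addr + i] << (8 * i);
    }
    return v;
}

/* LDB and LDH sign-extend the loaded value to 32 bits. */
int32_t ldb(uint32_t addr)  { return (int8_t)(uint8_t)load_le(addr, 1); }
int32_t ldh(uint32_t addr)  { return (int16_t)(uint16_t)load_le(addr, 2); }
int32_t ldw(uint32_t addr)  { return (int32_t)(uint32_t)load_le(addr, 4); }
uint64_t lddw(uint32_t addr) { return load_le(addr, 8); }
```

With the address 0x4 (the value held in A5), the four functions return the same values as lines (1) to (4) of the example.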


Using the Load Instructions

Question: If data can only be accessed by the load instruction and the .D unit, how can we load the register pointer Rn in the first place?


Loading the Pointer Rn

The instruction MVKL will allow a move of a 16-bit constant into a register as shown below:

    MVKL .? a, A5    (a is a constant or label)

How many bits represent a full address? 32 bits.

So why does the instruction not allow a 32-bit move? All instructions are 32 bits wide (see instruction opcode).


Loading the Pointer Rn

To solve this problem another instruction is available: MVKH

    e.g. MVKH .? a, A5    (a is a constant or label)

Finally, to move the 32-bit address to a register we can use:

    MVKL a, A5
    MVKH a, A5


Loading the Pointer Rn

Always use MVKL then MVKH. Look at the following examples, both starting with A5 = 0x87654321:

Example 1
    MVKL 0x1234FABC, A5    ; A5 = 0xFFFFFABC (sign extension)
    MVKH 0x1234FABC, A5    ; A5 = 0x1234FABC ; OK

Example 2
    MVKH 0x1234FABC, A5    ; A5 = 0x12344321
    MVKL 0x1234FABC, A5    ; A5 = 0xFFFFFABC ; Wrong
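The ordering rule can be checked with a minimal model of the two instructions. The function names `mvkl` and `mvkh` below are hypothetical helpers, not TI APIs; the sketch assumes, as the examples show, that MVKL sign-extends the low 16 bits of the constant into the whole register and that MVKH overwrites only the upper 16 bits.

```c
#include <stdint.h>

/* MVKL: take the low 16 bits of the constant and sign-extend them to 32 bits. */
uint32_t mvkl(uint32_t constant)
{
    return (uint32_t)(int32_t)(int16_t)(uint16_t)(constant & 0xFFFFu);
}

/* MVKH: replace the upper 16 bits of the register, keep its lower 16 bits. */
uint32_t mvkh(uint32_t constant, uint32_t reg)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}
```

Applying `mvkl` and then `mvkh` yields the full constant, whereas applying `mvkh` first and `mvkl` second leaves the sign-extended low half in the register, because MVKL overwrites all 32 bits.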


LDH, MVKL and MVKH

[Figure: Register File A (A0 to A15, 32 bits wide) holding a, x, prod and Y, connected to the .M, .L and .D units and to the data memory.]

pt1 and pt2 point to some locations in the data memory.


Creating a loop

    MVKL pt1, A5
    MVKH pt1, A5
    MVKL pt2, A6
    MVKH pt2, A6

    LDH .D *A5, A0
    LDH .D *A6, A1
    MPY .M A0, A1, A3
    ADD .L A4, A3, A4

So far we have implemented the SOP for one tap only, i.e.

Y = a1 * x1

So let's create a loop so that we can implement the SOP for N taps.


Creating a loop

With the C6000 processors there are no dedicated instructions such as block repeat. The loop is created using the B instruction.


What are the steps for creating a loop?

1. Create a label to branch to.
2. Add a branch instruction, B.
3. Create a loop counter.
4. Add an instruction to decrement the loop counter.
5. Make the branch conditional based on the value in the loop counter.


1. Create a label to branch to


2. Add a branch instruction, B.


Which unit is used by the B instruction?

The .S unit.

[Figure: the .S unit added alongside the .M, .L and .D units, Register File A (A0 to A15, 32 bits wide) and the data memory.]

3. Create a loop counter.

B registers will be introduced later.

4. Decrement the loop counter.



5. Make the branch conditional based on the value in the loop counter

What is the syntax for making an instruction conditional?

    [condition] Instruction Label

    e.g. [B1] B loop

(1) The condition can be one of the following registers: A1, A2, B0, B1, B2.
(2) Any instruction can be conditional.

The condition can be inverted by adding the exclamation symbol ! as follows:

    [!condition] Instruction Label

    e.g. [!B0] B loop    ; branch if B0 = 0
         [B0]  B loop    ; branch if B0 != 0
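The conditional syntax can be sketched in C as a predicate on the condition register. The names `executes` and `loop_iterations` are hypothetical, and the model ignores pipeline effects such as branch delay slots; it only shows which register values let a guarded instruction run, and how decrementing a counter before a `[cnt] B loop` bounds the number of passes.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when an instruction guarded by [reg] (inverted == false)
 * or [!reg] (inverted == true) is executed. */
bool executes(int32_t cond_reg, bool inverted)
{
    bool nonzero = (cond_reg != 0);
    return inverted ? !nonzero : nonzero;
}

/* Number of passes through a loop body that ends with
 * "SUB cnt, 1, cnt" followed by "[cnt] B loop".
 * Assumes count >= 1 on entry. */
int loop_iterations(int32_t count)
{
    int body_runs = 0;
    do {
        body_runs++;                   /* loop body */
        count--;                       /* SUB cnt, 1, cnt */
    } while (executes(count, false));  /* [cnt] B loop */
    return body_runs;
}
```

For a starting count of 40 the body runs 40 times, since the branch is taken only while the decremented counter is still non-zero.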


5. Make the branch conditional



More on the Branch Instruction (1)

Case 1: B .S1 label
- Relative branch.
- Label limited to +/- 2^20 offset.

With this processor all the instructions are encoded in 32 bits. Therefore the label must have a dynamic range of less than 32 bits, as the instruction B itself has to be coded.


More on the Branch Instruction (2)

Case 2: B .S2 register
- Absolute branch.
- Operates on .S2 ONLY!

By specifying a register as an operand instead of a label, it is possible to have an absolute branch. This will allow a dynamic range of 2^32.


Testing the code

This code performs the following operations:

a0*x0 + a0*x0 + a0*x0 + ... + a0*x0

However, we would like to perform:

a0*x0 + a1*x1 + a2*x2 + ... + aN*xN


Modifying the pointers

The solution is to modify the pointers A5 and A6.


Indexing Pointers

Description    | Syntax      | Pointer Modified
Pointer        | *R          | No
+ Pre-offset   | *+R[disp]   | No
- Pre-offset   | *-R[disp]   | No
Pre-increment  | *++R[disp]  | Yes
Pre-decrement  | *--R[disp]  | Yes
Post-increment | *R++[disp]  | Yes
Post-decrement | *R--[disp]  | Yes

- Pointer (*R): the pointer is used but not modified.
- Pre-offset: the pointer is modified BEFORE being used and RESTORED to its previous value.
- Pre-increment / pre-decrement: the pointer is modified BEFORE being used and NOT RESTORED to its previous value.
- Post-increment / post-decrement: the pointer is modified AFTER being used and NOT RESTORED to its previous value.

[disp] specifies the number of elements; the element size is DW (64-bit), W (32-bit), H (16-bit), or B (8-bit).
disp = R or 5-bit constant.
R can be any register.
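The addressing modes in the table map onto familiar C pointer expressions. The sketch below models them for half-word (16-bit) elements, so the displacement counts elements as it does for LDH; the function names are hypothetical, and the decrementing variants are obtained by passing a negative `disp`.

```c
#include <stdint.h>

/* *R : use the pointer, leave it unchanged. */
int16_t ld_plain(const int16_t **r)
{
    return **r;
}

/* *+R[disp] : offset applied for this access only, pointer unchanged. */
int16_t ld_pre_offset(const int16_t **r, int disp)
{
    return *(*r + disp);
}

/* *++R[disp] : pointer updated first, then used, and stays updated. */
int16_t ld_pre_inc(const int16_t **r, int disp)
{
    *r += disp;
    return **r;
}

/* *R++[disp] : pointer used first, then updated. */
int16_t ld_post_inc(const int16_t **r, int disp)
{
    int16_t value = **r;
    *r += disp;
    return value;
}
```

The SOP loop needs the post-increment form, so that each pass reads the current element and leaves the pointer on the next one.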


Modifying and testing the code

This code now performs the following operations:

a0*x0 + a1*x1 + a2*x2 + ... + aN*xN


Store the final result

The pointer A7 has not been initialized.


Store the final result

The pointer A7 is now initialized.


What is the initial value of A4?

A4 is used as an accumulator, so it needs to be reset to zero.


How can we add more processing power to this processor?

Increasing the processing power!
(1) Increase the clock frequency.
(2) Increase the number of processing units.

[Figure: side A of the CPU: units .S1, .M1, .L1 and .D1, Register File A (A0 to A15, 32 bits wide) and the data memory.]


To increase the processing power, this processor has two sides (A and B, or 1 and 2).


Can the two sides exchange operands in order to increase performance?


The answer is YES, but there are limitations.

To exchange operands between the two sides, some cross paths or links are required.

What is a cross path?
A cross path links one side of the CPU to the other. There are two types of cross paths:
- Data cross paths.
- Address cross paths.


Data Cross Paths

Data cross paths can also be referred to as register file cross paths. These cross paths allow operands from one side to be used by the other side. There are only two cross paths:
- one path which conveys data from side B to side A, 1X.
- one path which conveys data from side A to side B, 2X.


TMS320C67x Data-Path


Data Cross Paths

Data cross paths only apply to the .L, .S and .M units. The data cross paths are very useful; however, there are some limitations in their use.


Data Cross Path Limitations

(1) The destination register must be on the same side as the unit.
(2) Source registers: up to one cross path per execute packet per side.

Execute packet: group of instructions that execute simultaneously.

e.g.:
       ADD .L1x A0,A1,B2
       MPY .M1x A0,B6,A9
       SUB .S1x A8,B2,A8
    || ADD .L1x A0,B0,A2

|| means that the SUB and ADD belong to the same execute packet and therefore execute simultaneously.

NOT VALID! The first ADD writes to B2, a destination on the opposite side from the .L1 unit (rule 1), and the parallel SUB and ADD both need the single B-to-A cross path in the same execute packet (rule 2).


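Data Cross Paths for both sides

[Figure: cross path 1X feeds the .L1, .M1 and .S1 units from register file B; cross path 2X feeds the .L2, .M2 and .S2 units from register file A.]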


Address cross paths

       LDW .D1T1 *A0,A5
       STW .D1T1 A5,*A0

(1) The pointer must be on the same side as the unit.


Load or store to either side

       LDW .D1T1 *A0,A5
       LDW .D1T2 *A0,B5

DA1 = T1, DA2 = T2


Standard Parallel Loads

       LDW .D1T1 *A0,A5
    || LDW .D2T2 *B0,B5


Parallel Load/Store using address cross paths

       LDW .D1T2 *A0,B5
    || STW .D2T1 A5,*B0


Fill the blanks ... Does this work?

       LDW .D1__ *A0,B5
    || STW .D2__ B6,*B0


Not Allowed! Parallel accesses: both cross or neither cross.


Conditions Don't Use Cross Paths

If a conditional register comes from the opposite side, it does NOT use a data or address cross-path.

Examples:
    [B2] ADD .L1 A2,A0,A4
    [A1] LDW .D2 *B0,B5


C62x Data-Path Summary


C67x Data-Path Summary


Cross Paths - Summary

Data
- Destination register on same side as unit.
- Source registers: up to one cross path per execute packet per side.
- Use "x" to indicate cross-path.

Address
- Pointer must be on same side as unit.
- Data can be transferred to/from either side.
- Parallel accesses: both cross or neither cross.

Conditionals Don't Use Cross Paths.


Code Review (using side A only)

            MVK .S1 40, A2        ; A2 = 40, loop count
    loop:   LDH .D1 *A5++, A0     ; A0 = a(n)
            LDH .D1 *A6++, A1     ; A1 = x(n)
            MPY .M1 A0, A1, A3    ; A3 = a(n) * x(n)
            ADD .L1 A3, A4, A4    ; Y = Y + A3
            SUB .L1 A2, 1, A2     ; decrement loop count
     [A2]   B .S1 loop            ; if A2 != 0, branch
            STH .D1 A4, *A7       ; *A7 = Y

Note: Assume that A4 was previously cleared and the pointers are initialised.
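For readers more at home in C, the loop above can be mirrored statement by statement. This is a functional sketch only: the function name `sop` is hypothetical, it assumes `count` is at least 1 (the assembly decrements before testing), and it ignores timing details such as load and branch latencies.

```c
#include <stdint.h>

/* Sum of products over `count` 16-bit taps; count must be >= 1. */
int32_t sop(const int16_t *a, const int16_t *x, int32_t count)
{
    int32_t y = 0;                           /* A4 cleared beforehand */
    do {
        int16_t a_n = *a++;                  /* LDH *A5++, A0 */
        int16_t x_n = *x++;                  /* LDH *A6++, A1 */
        int32_t prod = (int32_t)a_n * x_n;   /* MPY A0, A1, A3 */
        y += prod;                           /* ADD A3, A4, A4 */
        count--;                             /* SUB A2, 1, A2 */
    } while (count != 0);                    /* [A2] B loop */
    return y;
}
```

The returned value corresponds to the accumulator A4; the final STH in the listing stores only its lower 16 bits to memory.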


Let us have a look at the final details concerning the functional units.

Consider first the case of the .L and .S units.


Operands - 32/40-bit Register, 5-bit Constant

Operands can be:
- 5-bit constants (or 16-bit for MVKL and MVKH).
- 32-bit registers.
- 40-bit registers.

However, we have seen that registers are only 32 bits wide. So where do the 40-bit registers come from?


Operands - 32/40-bit Register, 5-bit Constant

A 40-bit register can be obtained by concatenating two registers. However, there are 3 conditions that need to be respected:
- The registers must be from the same side.
- The first register must be even and the second odd.
- The registers must be consecutive.
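The three conditions can be expressed as a small check. The sketch below uses hypothetical names (`is_valid_pair`, `pair_value`) and follows the odd:even notation used in the deck's examples (for instance A5:A4). It also assumes, and this is not stated in the slides, that the even register holds the 32 least significant bits and the odd register supplies the upper 8 bits of the 40-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check a pair written odd:even, e.g. A5:A4 -> ('A', 5, 'A', 4). */
bool is_valid_pair(char odd_side, int odd_idx, char even_side, int even_idx)
{
    if (odd_side != even_side) {
        return false;                    /* must be on the same side */
    }
    if (even_idx < 0 || even_idx % 2 != 0) {
        return false;                    /* low register must be even */
    }
    return odd_idx == even_idx + 1;      /* must be consecutive */
}

/* Assemble the 40-bit value (assumed layout: odd[7:0] : even[31:0]). */
uint64_t pair_value(uint32_t odd_reg, uint32_t even_reg)
{
    return ((uint64_t)(odd_reg & 0xFFu) << 32) | even_reg;
}
```

Under these rules A5:A4 and B9:B8 are accepted, while A4:A3 (odd low register) and B5:A4 (mixed sides) are rejected.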


Operands - 32/40-bit Register, 5-bit Constant

All combinations of 40-bit registers are shown below:

[Figure: table of the valid register pairs.]

General form for the .L or .S units, with 32-bit or 40-bit register operands:

    instr .unit <src>, <src>, <dst>

Examples:
    OR  .L1 A0, A1, A2
    ADD .L2 -5, B3, B4
    ADD .L1 A2, A3, A5:A4
    SUB .L1 A2, A5:A4, A5:A4
    ADD .L2 3, B9:B8, B9:B8


Register to register data transfer

To move the content of a register (A or B) to another register (B or A) use the move (MV) instruction, e.g.:

    MV A0, B0
    MV B6, B7

To move the content of a control register to another register (A or B) or vice versa use the MVC instruction, e.g.:

    MVC IFR, A0
    MVC A0, IRP


What Problem Are We Trying To Solve?

[Figure: signal chain x -> ADC -> DSP -> DAC -> Y.]

Most DSP algorithms can be expressed with MAC:

    for (i = 0; i < count; i++) {
        sum += c[i] * x[i];
    }

How has the C6000 been designed to handle this? Let's explore the CPU.


The Core of DSP: Sum of Products

The C6000: designed to handle DSP's math-intensive calculations.

Note: You don't have to specify functional units (.M or .L).

Where are the variables stored?


Working Variables: The Register File

y = SUM (n = 1 to 40) of cn * xn

    MPY .M c, x, prod
    ADD .L y, prod, y

[Figure: Register File A (16 or 32 registers, 32 bits wide) holding c, x, prod and y, feeding the .M and .L units.]

How can we loop our MAC?


Making Loops

1. Program flow: the branch instruction
       B loop
2. Initialization: setting the loop count
       MVK 40, cnt
3. Decrement: subtract 1 from the loop counter
       SUB cnt, 1, cnt


.S Unit: Branch and Shift Instructions

            MVK .S 40, cnt
    loop:   MPY .M c, x, prod
            ADD .L y, prod, y
            SUB .L cnt, 1, cnt
            B   .S loop

How is the loop terminated?

Note: C64x could use BDEC in place of SUB and Branch.


Conditional Instruction Execution

To minimize branching, all instructions are conditional:

    [condition] B loop

Note: If the condition is false, execution is essentially replaced with a nop.


Loop Control via Conditional Branch

            MVK .S 40, cnt
    loop:   MPY .M c, x, prod
            ADD .L y, prod, y
            SUB .L cnt, 1, cnt
     [cnt]  B   .S loop

How are the c and x array values brought in from memory?


Memory Access via .D Unit

            MVK .S 40, cnt
    loop:   LDH .D *cp, c
            LDH .D *xp, x
            MPY .M c, x, prod
            ADD .L y, prod, y
            SUB .L cnt, 1, cnt
     [cnt]  B   .S loop

Data Memory: x(40), a(40), y

Note: No restrictions on which registers can be used for address or data!

What does the H in LDH signify?

Instr. | Description      | C Type | Size
LDB    | load byte        | char   | 8 bits
LDH    | load half-word   | short  | 16 bits
LDW    | load word        | int    | 32 bits
LDDW*  | load double-word | double | 64 bits

* Only available on the C64x and C67x

How do we increment through the arrays?


Auto-Increment of Pointers

            MVK .S 40, cnt
    loop:   LDH .D *cp++, c
            LDH .D *xp++, x
            MPY .M c, x, prod
            ADD .L y, prod, y
            SUB .L cnt, 1, cnt
     [cnt]  B   .S loop

How do we store results back to memory?


Storing Results Back to Memory

            MVK .S 40, cnt
    loop:   LDH .D *cp++, c
            LDH .D *xp++, x
            MPY .M c, x, prod
            ADD .L y, prod, y
            SUB .L cnt, 1, cnt
     [cnt]  B   .S loop
            STW .D y, *yp

But wait - that's only half the story...


Dual Resources: Twice as Nice

Our final view of the sum of products example...


Optional - Resource Specific Coding

            MVK .S1 40, A2
    loop:   LDH .D1 *A5++, A0
            LDH .D1 *A6++, A1
            MPY .M1 A0, A1, A3
            ADD .L1 A4, A3, A4
            SUB .S1 A2, 1, A2
     [A2]   B   .S1 loop
            STW .D1 A4, *A7

[Figure: Register File A (A0 to A15 or A31, 32 bits wide) holding cn, xn, cnt, prd, sum and the pointers *c, *x, *y, feeding units .M1, .L1, .S1 and .D1.]

It's easier to use symbols rather than register names, but you can use either method.


TMS320C6000 Instruction Set


'C6000 System Block Diagram

[Figure: CPU with Register Set A (units .D1, .M1, .L1, .S1) and Register Set B (units .D2, .M2, .L2, .S2), On-chip Memory, Internal Buses, Peripherals and External Memory.]

To summarize each unit's instructions ...


'C62x RISC-Like Instruction Set (by unit)


'C62x RISC-Like Instruction Set (by category)

Note: Refer to the 'C6000 CPU Reference Guide for more details.


'C67x: Superset of Fixed-Point (by units)


'C64x: Superset of 'C62x Instruction Set (units .L, .M, .D, .S)


Sample C62x Compiler Benchmarks

TI C62x Compiler Performance, Release 4.0: execution time in µs @ 300 MHz, versus hand-coded assembly based on cycle count.
Completely natural C code (non C6000 specific).
Code available at: http://www.ti.com/sc/c6000compiler

Algorithm | Used In | Asm Cycles | Assembly Time (µs) | C Cycles (Rel 4.0) | C Time (µs) | % Efficiency vs Hand Coded
Block Mean Square Error (MSE of a 20 column image matrix) | For motion compensation of image data | 348 | 1.16 | 402 | 1.34 | 87%
Codebook Search | CELP based voice coders | 977 | 3.26 | 961 | 3.20 | 100%
Vector Max (40 element input vector) | Search Algorithms | 61 | 0.20 | 59 | 0.20 | 100%
All-zero FIR Filter (40 samples, 10 coefficients) | VSELP based voice coders | 238 | 0.79 | 280 | 0.93 | 85%
Minimum Error Search (Table Size = 2304) | Search Algorithms | 1185 | 3.95 | 1318 | 4.39 | 90%
IIR Filter (16 coefficients) | Filter | 43 | 0.14 | 38 | 0.13 | 100%
IIR cascaded biquads (10 cascaded biquads, Direct Form II) | Filter | 70 | 0.23 | 75 | 0.25 | 93%
MAC (two 40 sample vectors) | VSELP based voice coders | 61 | 0.20 | 58 | 0.19 | 100%
Vector Sum (two 44 sample vectors) | | 51 | 0.17 | 47 | 0.16 | 100%
MSE (MSE between two 256 element vectors) | Mean Sq. Error Computation in Vector Quantizer | 279 | 0.93 | 274 | 0.91 | 100%


Sample Imaging & Telecom Benchmarks

DSP & Image Processing Kernels | C62x Cycle Count | C64x Cycle Count | Cycle Improvement C64:C62 | Performance, 720 MHz C64x vs 300 MHz C62x | Units
Reed Solomon Decode: Syndrome Accumulation, (204,188,8) Packet | 1680 | 470 | 3.5x | 8.4x | cycles/packet
Viterbi Decode (GSM) (16 states) | 38.25 | 14* | 2.7x | 6.5x | cycles/output
FFT - Radix 4 - Complex (size = N log (N)) (16-bit) | 12.7 | 6.0 | 2.1x | 5x | cycles/data
Polyphase Filter - Image Scaling (8-bit) | 0.77 | 0.33 | 2.3x | 5.5x | cycles/output/filter tap
Correlation - 3x3 (8-bit) | 4.5 | 1.28 | 3.5x | 8.4x | cycles/pixel
Median Filter - 3x3 (8-bit) | 9.0 | 2.1 | 4.3x | 10.3x | cycles/pixel
Motion Estimation - 8x8 MAD (8-bit) | 0.953 | 0.126 | 7.6x | 18.2x | cycles/pixel

* Includes traceback
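The time and efficiency columns follow from the cycle counts. The sketch below shows the arithmetic under two assumptions that are not stated on the slide: times are cycles divided by the 300 MHz clock (giving microseconds), and efficiency is the ratio of assembly cycles to C cycles, rounded to the nearest percent and capped at 100% when the compiler beats the hand-coded version. The function names are hypothetical.

```c
/* Execution time in microseconds for a cycle count at clock_mhz. */
double time_us(int cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* Compiler efficiency in percent versus hand-coded assembly,
 * rounded to nearest and capped at 100 (assumed convention). */
int efficiency_percent(int asm_cycles, int c_cycles)
{
    int pct = (100 * asm_cycles + c_cycles / 2) / c_cycles;
    return pct > 100 ? 100 : pct;
}
```

For the first row, 348 cycles at 300 MHz is 1.16 µs and 348/402 rounds to 87%, matching the table.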




TMS320C6000 Memory


Memory size per device


Internal Memory Summary

LINK: TMS320C6000 DSP Generation


Performance

Making use of Parallelism


Given this simple loop ...

    short mac(short *c, short *x, int count)
    {
        int i;
        short sum = 0;

        for (i = 0; i < count; i++) {
            sum += c[i] * x[i];
        }
        return sum;
    }

            MVK .S1 40, cnt
    loop:   LDH .D1 *cp++, c
            LDH .D1 *xp++, x
            MPY .M1 c, x, prod
            ADD .L1 y, prod, y
            SUB .L1 cnt, 1, cnt
     [cnt]  B   .S1 loop
            STW .D  y, *yp

How many of these instructions can we get in parallel?


C62x Intense Parallelism

Given this C code:

    short mac(short *c, short *x, int count)
    {
        int i;
        short sum = 0;

        for (i = 0; i < count; i++) {
            sum += c[i] * x[i];
        }
        return sum;
    }

The C62x compiler can achieve two sum-of-products per cycle:

    L2:     ; PIPED LOOP PROLOG
               LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

               LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

       [B0]    B    .S1 L3
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

       [B0]    B    .S1 L3
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

       [B0]    B    .S1 L3
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

               MPY  .M2 B7,A3,B4
    ||         MPYH .M1 B7,A3,A5
    || [B0]    B    .S1 L3
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7

               MPY  .M2 B7,A3,B4
    ||         MPYH .M1 B7,A3,A5
    || [B0]    B    .S1 L3
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7
    ;** -----------------------*
    L3:     ; PIPED LOOP KERNEL
               ADD  .L2 B4,B5,B5
    ||         ADD  .L1 A5,A0,A0
    ||         MPY  .M2 B7,A3,B4
    ||         MPYH .M1 B7,A3,A5
    || [B0]    B    .S1 L3
    || [B0]    SUB  .S2 B0,1,B0
    ||         LDW  .D1 *A4++,A3
    ||         LDW  .D2 *B6++,B7
    ;** -----------------------*

What about the C67x?


C67x MAC using Natural C

    float mac(float *c, float *x, int count)
    {
        int i;
        float sum = 0;

        for (i = 0; i < count; i++) {
            sum += c[i] * x[i];
        }
        return sum;
    }

    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
               LDDW  .D1  *A4++,A7:A6
    ||         LDDW  .D2  *B4++,B7:B6
    ||         MPYSP .M1X A6,B6,A5
    ||         MPYSP .M2X A7,B7,B5
    ||         ADDSP .L1  A5,A8,A8
    ||         ADDSP .L2  B5,B8,B8
    || [A1]    B     .S2  LOOP
    || [A1]    SUB   .S1  A1,1,A1
    ;** --------------------------------------------------*

[Figure: CPU with register files A0 to A15 and B0 to B15, units .M1, .L1, .D1, .S1 and .M2, .L2, .D2, .S2, and the controller/decoder.]

The C67x compiler gets two 32-bit floating-point sum-of-products per iteration.

Can the 'C64x do better?


C64x gets four MACs using DOTP2

    short mac(short *c, short *x, int count)
    {
        int i;
        short sum = 0;

        for (i = 0; i < count; i++) {
            sum += c[i] * x[i];
        }
        return sum;
    }

    ;** --------------------------------------------------*
    ; PIPED LOOP KERNEL
    LOOP:
               ADD   .L2  B8,B6,B6
    ||         ADD   .L1  A6,A7,A7
    ||         DOTP2 .M2X B4,A4,B8
    ||         DOTP2 .M1X B5,A5,A6
    || [B0]    B     .S1  LOOP
    || [B0]    SUB   .S2  B0,1,B0
    ||         LDDW  .D2T2 *B7++,B5:B4
    ||         LDDW  .D1T1 *A3++,A5:A4
    ;** --------------------------------------------------*

How many multiplies can the 'C6x perform?


MMACs

How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?

400 MMACs (two .M units x 200 MHz)


How Do We Get Such High Parallelism?

- Compiler and Assembly Optimizer use a technique called Software Pipelining.
- Software pipelining enables high performance (esp. on DSP-type loops).
- Key point: Tools do all the work!

What is software pipelining? Let's look at a simple example ...


Tools Use Software Pipelining

Here's a simple example to demonstrate ...

5 x 3 = 15

Our functional units could be used like ...


Without Software Pipelining

[Figure: cycle chart for cycles 1 to 7, with the mpy and add operations issued one after another.]

In seven cycles, we're almost half-way done ...


With Software Pipelining

[Figure: cycle chart for cycles 1 to 7, with the mpy and add operations of successive iterations overlapped.]

It takes 1/2 the time! How does this translate to code?


S/W Pipelining Translated to Code

    c1:    LDH
        || LDH

    c2:    MPY
        || LDH
        || LDH

    c3:    ADD
        || MPY
        || LDH
        || LDH
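The "1/2 the time" claim can be checked with a simple cycle count. The sketch below rests on an interpretation that the slides do not spell out: the "5 x 3 = 15" figure is read as five loop iterations of three single-cycle stages (load, multiply, add). The function names are hypothetical, and real C6000 instructions have multi-cycle latencies that this model ignores.

```c
/* Cycles when each iteration runs all its stages before the next starts. */
int cycles_unpipelined(int iterations, int stages)
{
    return iterations * stages;
}

/* Cycles when a new iteration starts every cycle (iterations >= 1):
 * the pipeline fills in (stages - 1) cycles, then one result per cycle. */
int cycles_pipelined(int iterations, int stages)
{
    return iterations + stages - 1;
}
```

Under that reading, the unpipelined loop needs 15 cycles and the pipelined one 7, which agrees with the slides: seven cycles is almost half of the unpipelined total, and the pipelined version finishes in about half the time.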


DSK

Code Composer Studio


C6416 DSK

Diagnostic Utility included with DSK ...


DSK's Diagnostic Utility

- Test/Diagnose DSK hardware
- Verify USB emulation link
- Use Advanced tests to facilitate debugging
- Reset DSK hardware

CCS Overview ...


Code Composer Studio

DSK's Code Composer Studio includes:
- Integrated Edit / Debug GUI

[Figure: Edit and Debug tools connected to the DSP board.]

CCS is Project centric ...


Code GenerationEditor


What is a Project?

The project menu ...


Project Menu

Access via pull-down menu or by right-clicking the .pjt file in the project explorer window.

Hint: Create and open projects from the Project menu, not the File menu.


Build Options

    -g -q -fr"c:\modem\Debug" -mv6700

Eight categories of compiler options.

The most common compiler options are ...


Linker Options

    -q -c -m".\Debug\lab1.map" -o".\Debug\lab1.out" -x

- By default linker options include the -o option (.\Debug\lab1.out).
- We recommend you add the -m option (.\Debug\lab1.map).
- ".\Debug\" indicates one subfolder level below the project's .pjt folder.
- Run-time Autoinitialization: Run-time Autoinit tells the compiler to initialize global/static variables before calling main().


Compiler's Build Options

Nearly one hundred compiler options are available to tune your code's performance, size, etc. The following table lists the most common options:

Options | Description
-mv6700 | Generate 'C67x code ('C62x is default)
-mv6400 | Generate 'C64x code
-fr     | Directory for object/output files
-fs     | Directory for assembly files
-q      | Quiet mode (display less info while compiling)
-g      | Enables src-level symbolic debugging
-s      | Interlist C statements into assembly listing

[Callout on the slide: debug options.]

In Chapter 4 we will examine the options which enable the compiler's optimizer.

And, the Config Tool ...


DSP/BIOS Configuration Tool

Simplifies system design by:
- Automatically including the appropriate runtime support libraries
- Automatically handling interrupt vectors and system reset
- Handling system memory configuration (builds CMD file)
- Generating 5 files when the CDB file is saved: C file, Asm file, 2 header files and a linker command (.cmd) file

More to be discussed later.


C6000 C Data Types

Type              | Size    | Representation
char, signed char | 8 bits  | ASCII
unsigned char     | 8 bits  | ASCII
short             | 16 bits | 2's complement
unsigned short    | 16 bits | binary
int, signed int   | 32 bits | 2's complement
unsigned int      | 32 bits | binary
long, signed long | 40 bits | 2's complement
unsigned long     | 40 bits | binary
enum              | 32 bits | 2's complement
float             | 32 bits | IEEE 32-bit
double            | 64 bits | IEEE 64-bit
long double       | 64 bits | IEEE 64-bit
pointers          | 32 bits | binary


Lab 1c: Using Printf


Command Window


GEL Scripting

GEL: General Extension Language
- C style syntax
- Large number of debugger commands as GEL functions
- Write your own functions
- Create GEL menu items


CCS Scripting

Debug using VB Script or Perl. Using CCS Scripting, a simple script can:
- Start CCS
- Load a file
- Read/write memory
- Set/clear breakpoints
- Run, and perform other basic testing functions


TCONF Scripting (CDB)

Tconf Script (.tcf): hello_dsk62cfg.tcf

    utils.loadPlatform("dsk6211");   /* load DSK6211 platform into TCOM */
    utils.getProgObjs(prog);         /* make all prog objects JavaScript global vars */
    LOG_system.bufLen = 128;         /* set buffer length of LOG_system to 128 */
    utils.importFile("hello");       /* import portable application script */
    prog.gen();                      /* generate cfg files (and CDB file) */

- A textual way to configure CDB files
- Runs on both PC and Unix
- Create #include type files (.tci)
- More flexible than Config Tool


Chapter 2: TMS320C6000 Architectural Overview

- End -


Over the next 20 slides, we want to provide an example to anchor the presentation and provide context. What better algorithm than the standard sum-of-products? The question lead-in is: "So, what problem are we trying to solve?" The basics of DSP involve first sampling an analog signal and converting it to digital. What do we do then? Some type of algorithm to shape, modify, etc. the signal. This is easily done in the digital realm. So, the time between samples is our limit to how fast we need to do the algorithm. What does a typical algorithm look like? This: a simple sum-of-products. Let's look at a typical DSP algorithm and see how the processor is designed to handle it.

Spend about 1 minute on this slide. If the group is VERY new to DSP, you might embellish slightly on any areas you feel comfortable with. But remember, the focus is not WHY DSP; it is: assuming you know why you'd want to use this algorithm, let's see how the processor is built to handle it.

The lead-in to the next slide is the Q shown on the slide. Also state that we plan to write the code for this algorithm and see how the architecture is designed to handle it efficiently.

OLD INFO

Here are the two basic operations (mpy, add) written in linear assembly. The architecture contains two functional units which perform these two basic operations: .M does multiply and .L does add. (Why .L? It actually stands for ALU, but since A is used for something else, they chose to use L; we'll describe this "A thing" next.)

Leading Question (to next foil): Where do you think the variables (c, x, Y, etc.) are stored?

Side Note: We've chosen to use linear asm because of its simple format. We'll expand on the other coding methods later. We suggest you don't get into these at this point of the presentation.

Where do we keep the variables? In A register file (ha, ha). Like all good RISC processors, the C6000 is a load/store machine. All data is loaded to and stored from registers. Then the execute units only operate on data found within the CPU.

Side Note: If they ask about how to do loads and stores, tell them we'll get to that in a minute (after creating a loop). It was too late to rearrange the presentation.

In keeping with the RISC concept, our loop is managed by three instructions:
1. Branch (conditional).
2. A counter set to the number of iterations we need.
3. A means to decrement the counter.

In DSPs, classically, hardware is dedicated to watching the counter. But that increases h/w complexity, and remember, h/w control is a barrier to high speeds (faster clock rates). Here we add in the counter and decrement it within each loop.

Leading Question to next foil: Is that all we need for the loop? Oh yeah, we need to make the Branch conditional. That is, when our counter reaches zero, we want to exit the loop. VelociTI, again, uses s/w rather than h/w in order to maintain h/w simplicity.

So, right, you guessed it. We have an instruction for this, too.

Every instruction can be conditional. This works great for the branch instruction, but it also works well for other instructions (i.e. conditional loads or stores, etc.).

When using a hardware pipelined processor to reach these speeds, branches cause latency in program execution. We created 100% conditional instructions in order to reduce the number of branches (or breaks in the program flow).

Side Note: This really is beneficial for pipelined architectures, which of course are required to get the speeds of the C6000 today (and even more so in the future).


Here we used LDH since we chose halfword (short) data types.

And look, we added a new functional unit, .D, for data moves.

On the next iteration through the loop, do we want to be pointing to the same data value (in my input arrays)? No, we want to increment the pointer to the next value.

How do you increment something in C? You use ++. We do the same thing here.
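The same idea in C, as a minimal sketch: each load reads a halfword (short) and then advances the pointer, in the spirit of an LDH with a post-incremented address register. The function name sop_ptr is an assumption made for illustration.

```c
#include <assert.h>

/* Sum of products using post-incremented pointers. */
int sop_ptr(const short *a, const short *x, int n)
{
    assert(n >= 0);

    int y = 0;
    while (n-- > 0) {
        short ai = *a++;   /* load a halfword, then point to the next one */
        short xi = *x++;   /* same for the second input array */
        y += ai * xi;      /* multiply and accumulate */
    }
    return y;
}
```

With a = {1, 2, 3, 4} and x = {1, 1, 1, 1}, each pass picks up the next pair of values, so no index variable is needed.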













For those of you who've seen the architecture before, you've already realized that this was only HALF the story: we actually have DOUBLE of everything!

With 8 functional units you can execute up to 8 instructions/cycle. At 4 ns (or 5 ns) per cycle, we're really zooming ...

Just think of what you can do with all this and how fast it'll go!

Our code is now complete. We've shown linear assembly thus far so you can see the C6000 VelociTI architecture at work. TI's Assembly Optimizer takes linear asm and creates standard asm code as shown here.
You can program the C6000 in three ways:
1. C
2. Linear asm
3. Standard asm

Most people choose C or linear asm because of their simple, symbolic methodology.

Leading question to next foil ... Looking closer at this diagram, where did the 1s come from in the unit specification (as in .S1, .M1, ...)?

We won't take the time to go through each one of these instructions. Notice, though, that some instructions can be performed on multiple functional units.

This foil shows the 56 base instructions of the C62x. A small, but powerful set!

Leading comment for next foil ... The C67x has all these same instructions, plus ...

HANDOUT: Instruction Set by category.

And the C6701 floating-point device uses a superset of this same instruction set.

This is further evidence of the similarity of the C62x and C67x devices.

Two things to point out with the C67x instructions:
1. The C67x is the first DSP processor (if not microprocessor) to provide double-precision floating-point hardware! The C67x has hardware that performs either single-precision (SP) or double-precision (DP) floating-point math.
2. The C67x also includes double-precision integer multiplication. That is, you can do a 32 x 32-bit multiplication and get a 64-bit result.

This foil doesn't need much introduction. The C67x includes all the C62x instructions along with these additional instructions.

Those that end in SP are for single-precision floating-point; those in DP are for double-precision floating-point. The C67x is the only DSP (if not the only microprocessor) with hardware support for double-precision.

This slide shows you a URL where you can find the benchmark results of 10 DSP functions coded entirely in natural C. The results show 85% and greater compiler efficiency as compared to hand-coded assembly. Not only are the benchmark results located on the web, but you can also download the code and try it on your own. The options used here were: -o3 -oi0 -op2 -k -mi -mh -qq -pm

SINGLE-CYCLE LOOP KERNEL: Here is the output of the compiler from properly optimized C. This is a product of coding and running the tools correctly for the C6000. At two taps (results) per cycle, that's all 40 results in about 20 cycles (20 iterations plus 8 cycles of setup).

Side Notes:
Due to time restrictions, do not get into intrinsics.
We removed the EPILOGUE ... since this is an option of the compiler, we were running out of room on the foil, and we wanted to keep it simple.
Output is 28 CYCLES (vs. 20) because we had to run extra cycles at the top to set up the algorithm ... and one extra ADD at the end to get the final result.
We're not ignoring all that needs to be done ... but if there is high interest, encourage attendance of the TTO course ...
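For readers who want to see where "two taps per cycle" and "28 cycles" can come from, here is a hedged C sketch. It is not the benchmark source from the URL; dotp2 and estimated_cycles are hypothetical names, and the cycle estimate is simple arithmetic under the assumption of a fixed setup cost, not a measurement.

```c
#include <assert.h>

/* Dot product written so each pass handles two taps.
 * n must be even; two accumulators keep the two products independent,
 * which is what lets both multiplies proceed side by side. */
int dotp2(const short *a, const short *x, int n)
{
    assert(n >= 0 && n % 2 == 0);

    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < n; i += 2) {
        sum0 += a[i]     * x[i];
        sum1 += a[i + 1] * x[i + 1];
    }
    return sum0 + sum1;   /* one extra add combines the partial sums */
}

/* Back-of-envelope estimate: kernel passes plus a fixed setup cost. */
int estimated_cycles(int taps, int taps_per_cycle, int setup_cycles)
{
    assert(taps_per_cycle > 0);
    return taps / taps_per_cycle + setup_cycles;
}
```

With 40 taps, 2 taps per cycle and 8 setup cycles, estimated_cycles gives 20 + 8 = 28, which matches the figure quoted in the notes.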

3. Software Pipelining
Finally, the 3rd key is making the best use of our execution hardware. To accomplish this, our optimizers (C and Asm) use software pipelining.

Let's look at this conceptually ...

If we performed this simple loop 5 times without s/w pipelining, how long would it take?

With software pipelining it's much quicker. Notice that when we perform the 1st multiply, the .D unit hardware is sitting idle. Why not make use of it to go ahead and get the next operand? And so on ...

Using software pipelining, our algorithm completes in less than half the cycles (7 vs. 15).
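The 7 vs. 15 figure can be reproduced with a small C sketch. It assumes an idealized loop of three one-cycle stages (load, multiply, add) and ignores the real load and multiply latencies of the device; the function names are illustrative only.

```c
#include <assert.h>

/* Without software pipelining each iteration runs all of its
 * stages before the next iteration starts. */
int cycles_sequential(int iterations, int stages)
{
    assert(iterations >= 0 && stages > 0);
    return iterations * stages;
}

/* With software pipelining a new iteration starts every cycle,
 * so after the pipe fills (stages - 1 cycles) one result
 * completes per cycle. */
int cycles_pipelined(int iterations, int stages)
{
    assert(iterations >= 0 && stages > 0);
    if (iterations == 0)
        return 0;
    return iterations + stages - 1;
}
```

For 5 iterations of 3 stages this gives 5 * 3 = 15 cycles without pipelining and 5 + 3 - 1 = 7 cycles with it.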

To demonstrate how this is coded by the tools (using the little code box that pops up on the foil), notice that instructions that execute together in one cycle become parallel instructions. Once again, when using C or linear asm, this optimization (and parallel coding) is done for you!

Side Note: Why not? Because we only need 5 loops.

Hidden slide, only used for the Student Guide.

1. int is 32 bits. Use int or unsigned int for loop counters to avoid unnecessary sign extension.
2. Use shorts for integer multiplication inputs, since this makes the most efficient use of the 16 x 16-bit integer multipliers.
3. On C67x devices, 32-bit float operations are performed in hardware. The C6000 supports IEEE 32-bit floating-point math.
4. long operands are 40 bits in length:
-- This implementation choice was made since the C6000 has hardware support for 40-bit arithmetic.
-- Ordinarily used to provide headroom in integer operations.
-- Provides 8 guard (carry) bits above a 32-bit result.
-- Avoid writing code that assumes longs and ints are the same size (a common practice on some processors).
5. double: the double-precision floating-point hardware supports IEEE 64-bit floating-point math.
6. Pointers can reach the entire length of the memory map, all 32 bits.
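A short C sketch of points 2 and 4. It is written with the fixed-width types from stdint.h so it behaves the same on a host PC, where long is usually not 40 bits; int64_t stands in for the C6000's 40-bit long, and fits_in_40_bits and sop40 are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* 16 x 16-bit multiply giving a 32-bit product (the short-input case). */
int32_t mpy16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* True if v is representable as a signed 40-bit value. */
int fits_in_40_bits(int64_t v)
{
    return v >= -(INT64_C(1) << 39) && v < (INT64_C(1) << 39);
}

/* Sum of products accumulated in a wide register.
 * int64_t is only a stand-in for the 40-bit long; the final check
 * confirms the result would still fit in 40 bits. */
int64_t sop40(const int16_t *a, const int16_t *x, int n)
{
    assert(n >= 0);

    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += mpy16(a[i], x[i]);

    assert(fits_in_40_bits(acc));
    return acc;
}
```

Three products of 32767 x 32767 already sum to more than a signed 32-bit int can hold, which is the headroom case the extra 8 bits are there for.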
