DSP-8 (DSP Processors) 1 of 8 Dr. Ravi Billa
Digital Signal Processing – 8 December 24, 2009
VIII. DSP Processors
2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator
(MAC), Modified bus structures and memory access schemes in DSPs, Multiple access memory,
Multiport memory, VLSI architecture, Pipelining, Special addressing modes, On-chip
peripherals.
Architecture of TMS320C5X – Introduction, Bus structure, Central Arithmetic Logic Unit,
Auxiliary registers, Index register, Auxiliary register compare register, Block move address
register, Parallel Logic Unit, Memory-mapped registers, Program controller, Some flags in the
status registers, On-chip registers, On-chip peripherals.
Contents:
8.1 DSP Processors – Market
8.2 DSP Processors – Features
8.3 Multiply-and-Accumulate
8.4 Interrupts – handling incoming signal values
8.5 Fixed- and Floating-point
8.6 Real-time FIR filter example
We shall use “DSP” to mean digital signal processor(s); such devices are also called DSP
processors or programmable digital signal processors (PDSPs). This includes
1. General purpose DSPs (such as the TMS320’s of TI and DSP563’s of
Motorola and others)
2. Special purpose DSPs tailored to specific applications like FFT
However, the so-called programmability of DSPs mentioned above pales in comparison
to that of general purpose CPUs. In what follows, for the most part, we contrast general purpose
CPUs (whose strength is in general purpose programmability) with DSPs (whose strength is in
high throughput, hardwired, number crunching). For a convergence of both kinds of processor
refer to the article on Intel’s Larrabee in IEEE Spectrum, January 2009.
www.jntuworld.com
8.1 DSP Processors – Market
1. Small, low-power, relatively weak DSPs – inexpensive, for mass-produced consumer
products (toys, automobiles)
2. More capable fixed-point processors – cell phones, digital answering machines, modems
3. The strongest DSPs, often floating-point – image and video processing, server applications
8.2 DSP Processors – Features
1. DSP-specific instructions (e.g., MAC)
2. Special address registers
3. Zero-overhead loops
4. Multiple memory buses and banks
5. Instruction pipelines
6. Fast interrupt servicing (fast context switch)
7. Specialized IO ports
8. Special addressing modes (e.g., bit reversal)
8.3 Multiply-and-Accumulate
DSP algorithms are characterized by intensive number crunching that may exceed the
capabilities of a general purpose CPU. Due to arithmetic instructions specifically tailored to DSP
needs, a DSP processor can be much faster on such tasks. The most common special-purpose
task in digital signal processing is the multiply-and-accumulate (MAC) operation, illustrated by
the FIR filter
y(n) = Σ (j = 0 to M) bj x(n–j) = Σ (j = 0 to M) bj xj
     = b0 x(n) + b1 x(n–1) + … + bM x(n–M)
where xj is shorthand for x(n–j).
Such a repeated MAC operation occurs in other situations as well. Further, the operands b and x
need not have the same index j. Letting b and x have independent indices j and k, the following
loop accomplishes the MAC operation
Loop:
update j, update k
a ← a + bj xk
[Aside The operation a + (b * c), using floating point numbers, may be done with two roundings
(once when b and c are multiplied and a second time when the product is added to a), or with just
one rounding (where the entire expression a + (b * c) is evaluated in one step). The latter is
called a fused multiply-add (FMA) or fused multiply-accumulate (FMAC) and is included in
IEEE Std 754 (since the 2008 revision).]
Loop overhead A general purpose CPU would implement the above sum of products operation
in a fixed length loop such as
For i = 0 to M
{statements}
that involves considerable overhead apart from the statements. This overhead consists of:
Loop overhead (general purpose CPU):
1. Provide a CPU register to store the loop index
2. Initialize the index register
3. After each pass, increment the loop index and check it for termination
4. (If a CPU register is not available: provide a memory location for the index, and retrieve,
increment, check, and store it back on every pass)
5. Except for the last pass, jump back to the top of the loop
DSP processors provide a zero-overhead hardware mechanism (a REPEAT or DO) that
can repeat an instruction or set of instructions a prescribed number of times. Due to hardware
support for this repetition structure no clock cycles are wasted on branching or incrementing and
checking the loop index. The number of loop iterations is necessarily limited, and where loop
nesting is allowed, not all of the nested loops may be zero-overhead.
Enhancing the CPU architecture
Inside the loop How would a general purpose CPU carry out the computations inside the loop?
Assume that {b} and {x} are stored as arrays in memory. Assume that the CPU has pointer
registers j and k that can be directly updated and used to retrieve data from memory, two
arithmetic registers b and x that can be used as operands of arithmetic operations, a double length
register p to receive the product and an accumulator a for summing the products. The instruction
sequence for one pass through the loop on a general purpose CPU looks like this:
Inside the Loop
1 update j
2 update k
3 b ← bj
4 x ← xk
5 fetch (multiply) instruction
6 decode (multiply) instruction
7 execute (multiply) instruction (p ← bj xk)
8 fetch (add) instruction
9 decode (add) instruction
10 execute (add) instruction (a ← a + p)
Assuming that each of the lines above takes one unit of time – call it an “instruction time” or
“clock cycle” – (multiplication easily takes several units of time but we assume it is the same as
the rest), the sequence takes 10 units of time to complete. We could add a “multiply and add”
(call it MAC) instruction to the instruction set of the CPU (that is, we augment CPU with the
appropriate hardware): this would merge the last 6 lines (lines 5 through 10) in the above
segment into just 3 lines, and there would then be 7 lines taking 7 units of time as shown below:
Inside the Loop
1 update j
2 update k
3 b ← bj
4 x ← xk
5 fetch MAC instruction
6 decode MAC instruction
7 execute MAC instruction (a ← a + bj xk)
A DSP can perform a MAC operation in a single unit of time. Many use this feature as
the definition of a DSP. We want to describe below how this is accomplished.
Update pointers simultaneously since they are independent. We add two address updating units
to the processor hardware. Since these two updates can be done in parallel we show them as one
line in the sequence, the sequence now taking 6 units of time:
Inside the Loop
1 update j AND update k
2 b ← bj
3 x ← xk
4 fetch MAC instruction
5 decode MAC instruction
6 execute MAC (a ← a + bj xk)
Memory Architecture
Load registers b and x simultaneously Since bj and xk are completely independent we can
make provision to read them simultaneously from memory into the appropriate registers. In the
standard CPU situation there is just one bus connection to the memory; and even connecting two
buses to the same (one) memory does not help; and the so-called “dual port memories” are
expensive and slow. In a radical departure from the memory architecture of the standard CPU,
the DSP can define multiple memory banks each served by its own bus. Now bj and xk can be
loaded simultaneously from memory into the registers b and x, shown in the sequence below by
listing the two operations on the same line; the sequence now takes 5 units of time:
Inside the Loop
1 update j AND update k
2 b ← bj AND x ← xk
3 fetch MAC instruction
4 decode MAC instruction
5 execute MAC (a ← a + bj xk)
We next turn to the last three lines (fetch, decode and execute) in the sequence.
Caches Standard CPUs use instruction caches to speed up the execution. Caching implies
different amounts of run-time (that is, unpredictability) depending on the state of the caches
when operation starts. However, DSPs are designed for real-time use where the prediction of
exact timing may be critical. Therefore caches are usually avoided in DSPs because caching
complicates the calculation of program execution time.
Harvard architecture Now we consider the fetching of one instruction while previous ones are
still being decoded or executed. There can now be a clash while fetching an instruction from
memory at the same time that data related to a prior instruction is being transferred to or from
memory. The solution is to use separate memory banks and separate buses. Previously we used
different memory banks for different categories of data but now we are talking about a memory
bank for instructions versus a memory bank for data. The memory banks have independent
address spaces and are called program memory and data memory – resulting in the Harvard
architecture. The CPU can fetch the next instruction and simultaneously do a load/store of a
memory word. Standard computers use the same memory space for program and data, this being
called the von Neumann architecture (Pennsylvania architecture or Princeton architecture?).
Most DSPs abide by the Harvard architecture in order to be able to overlap instruction fetches
with data transfers. The idea of overlapping brings us to pipelining.
[The availability of the modern cache system has substantially alleviated the von Neumann
bottleneck. Most modern computers labeled “Harvard architecture” allow accessing the contents
of the instruction memory as though it were data; these are called modified Harvard
architecture, used in niche applications like DSP (TI’s TMS320, Analog Devices’ Blackfin) and
microcontrollers (Atmel AVR, Zilog’s Z8 Encore!).]
To sum up, so far our efforts to enhance the DSP processor’s speed have introduced the
following concepts:
1. Special instruction (MAC) added to the instruction set – CPU augmentation
2. Address registers updated in parallel – CPU augmentation
3. Data registers loaded from memory in parallel – Memory banks
4. Instruction fetched in parallel with execution of previous instructions –
Harvard architecture (separate program and data memories) and Pipelining
Pipelining allows the parallel execution of any operations that logically can be performed in
parallel. These operations need not be the 5 lines listed above, so let us generalize the 5 lines and
call them A, B, C, D and E. Each one takes one clock cycle and together they make up one pass
through the loop described above. In a given clock cycle the pipeline contains 5 different passes
each one being in just one of the 5 states A through E. Thus any one pass would consist of these
5 operations and take 5 clock cycles to complete. The number of overlappable operations making
up one pass is called the depth of the pipeline. Here we have a depth-5 pipeline. Typical depths
are 4 or 5; some DSP processors have pipeline depths as high as 11.
The operation of a depth-5 pipeline is shown below. Time (clock cycles) runs from left to
right. The height corresponds to distinct hardware units (stages). There are a total of 6 product
terms being added to the accumulator, each term taking 5 clock cycles. In {A1 through E1} the
first product term is added to the accumulator and is completed in clock cycle #5. The addition of
the second product term, {A2 through E2}, is completed in clock cycle #6, etc. The complete
sum is available in 10 cycles. Without the pipeline the summation would take 6 * 5 = 30 cycles.
There is pipeline overhead: at the left there are 4 clock cycles during which the pipeline
is filling while at the right there are a further 4 cycles while the pipeline is emptying. (In this
specific example the pipeline is full for 2 cycles). For large enough loops the overhead is
negligible; thus the pipeline allows the DSP processor to perform one multiply-and-add per clock
cycle on the average. In general, asymptotically, the processor takes only a single clock cycle per
instruction.
Note that in this treatment we are dividing one pass through the loop into five overlappable
parts labeled A through E. In other contexts an instruction cycle is divided into perhaps five
overlappable parts. Note also that in the diagram time runs from left to right, while depth
corresponds to distinct hardware units (in this case five); one pass through the loop runs
diagonally down from left to right.
8.4 Interrupts – handling incoming signal values
A context switch is when a processor stops what it has been doing and starts doing something
else resulting in a need to save/change the contents of registers, pointers, counters, flags etc. A
hardware mechanism called an interrupt forces a context switch to a predefined routine called
the interrupt handler for the event in question.
One major difference between a DSP processor and other types of CPU is the speed of
the context switch. A general purpose CPU may have a latency (the time from when an interrupt
is requested to the point where the interrupt handler begins to execute) of several tens of clock
cycles to perform a context switch, while DSPs have the ability to do a low latency (perhaps
even zero-overhead) interrupt.
The most important reason for a fast context switch is the need to capture incoming
signal values (which are interrupt-based). These signal values are either processed immediately
or stored in a buffer for later processing. The DSP’s fast interrupt is usually accomplished by
saving only a small portion of the context, with hardware assistance for this procedure.
Signal values are input to and output from the DSP through ports. Serial ports are
typically used for low rate signals – bits are moved in or out one bit per clock cycle through an
internal shift register. Parallel ports (typically 8 or 16 bits) are faster but require more pins on the
DSP chip itself. Further speed up of data transfer is made possible through DMA (Direct
Memory Access) channels.

Operation of the depth-5 pipeline discussed in Section 8.3 (time in clock cycles runs left to
right; rows are the five hardware stages):

Clock cycles →  1   2   3   4   5   6   7   8   9   10
Depth 1 (A):    A1  A2  A3  A4  A5  A6
Depth 2 (B):        B1  B2  B3  B4  B5  B6
Depth 3 (C):            C1  C2  C3  C4  C5  C6
Depth 4 (D):                D1  D2  D3  D4  D5  D6
Depth 5 (E):                    E1  E2  E3  E4  E5  E6
8.5 Fixed- and Floating-point
DSP tasks involve intensive number crunching.
By integer data we shall mean that the decimal point is at the extreme right end and may
therefore be ignored. By real numbers we mean that the decimal point may be somewhere other
than at extreme right but is always fixed; this allows the data to contain a fractional part. We
shall use “fixed-point” to cover both integers and real numbers. Such data have a limited range.
By floating-point data we shall mean that the position of the decimal point is not fixed
(floating) and is adjusted to suit other data it must interact with and operations it goes through
and the results.
Early DSP processors offered integer-only mode of data and arithmetic. Even today such
fixed-point DSPs flourish (or, are forced on DSP developers) due to the realities of cost, speed,
size, and power consumption. The DSP community has developed intricate numeric methods to
use fixed-point devices. Today there are floating-point DSPs but these still tend to be much more
expensive, more power hungry and physically larger than their fixed-point counterparts.
Fixed-point                                     Floating-point
Lower cost                                      More expensive
Smaller size                                    Larger size
Lower power consumption                         Higher power consumption
Suited to embedding DSP in a small package,
or where power is limited, or where price
is critical
Good match for A/D and D/A converters
(typically unsigned or 2’s-complement
integer devices)
Fixed-point DSPs What is the consequence of having to prefer fixed-point DSPs over floating
point DSPs (in some applications) for reasons mentioned above? The price is extended
development time. After the required algorithms have been simulated on computers with floating
point capabilities, floating point operations must then be carefully converted to integer ones. This
involves
1. Rounding
2. Rescaling (due to limited dynamic range) at various points
3. Underflow and overflow handling
4. Placement of rescalings at optimal points (to ensure maximum
signal-to-quantization-noise ratio)
5. Matching of the precise details of the processor’s arithmetic with other systems
for interoperability or with a standard for conformity
These tasks may require extensive simulation.
The most common fixed-point representation is 16-bit two’s complement (24- or 32-bit
registers also exist). In fixed point DSPs this must accommodate both integers and real numbers.
A real number is represented by multiplying it by a large number and rounding; consider the
coefficient a of the IIR filter y(n) – a y(n–1) = x(n), 0 ≤ a ≤ 1. On an 8-bit fixed-point processor
with a scale factor of 2^8 = 256, the value 1 becomes 256, and the range for the stored a is
0 ≤ a ≤ 256.
With 16-bit operands, bit growth (or, increase of precision) means that their sum can
require 17 bits and their product 32 bits.
Handling the results of addition and multiplication:

Regular CPUs: the user must check (an overflow flag is set or an exception is triggered) and
decide what to do with the product.

Fixed-point DSPs: the user must explicitly handle the increase in precision, by (1) using an
accumulator longer than the largest product, or (2) scaling as part of the MAC instruction, built
into the pipeline, or (3) saturation arithmetic.

Floating-point DSPs: the hardware automatically discards the least significant bits.
With a DSP using fixed-point arithmetic, if the product is 32 bits the accumulator could
be 40 bits long; the 8 extra guard bits allow up to 2^8 = 256 products to be accumulated without
any fear of overflow. At the end of the loop a single check and possible discard can be done. A
second possibility is to provide an optional scaling operation as part of the MAC instruction
itself (basically a right-shift of the product before the addition, built into the pipeline). Thirdly,
saturation arithmetic is the last resort: whenever an overflow (or underflow) occurs the result is
replaced by the largest (or smallest) possible value of the appropriate sign. The error introduced
is smaller than that caused by straight overflow (i.e., roll-over of the register).
With fixed-point processors, the filter coefficients should not simply be rounded; rather
the best integer coefficients should be determined using an optimization procedure.
Floating-point DSPs avoid many of the above problems; there is an IEEE standard
(IEEE Std 754) for floating-point arithmetic. However, floating-point DSPs do not usually have
instructions for division, powers, square root, trig functions, etc.