Chapter 9: Algorithmic Strength Reduction in Filters and Transforms

Chapter 9: Algorithmic StrengthReduction in Filters and

Transforms

Keshab K. Parhi

Chapter 9 2

Outline

• Introduction• Parallel FIR Filters

– Formulation of Parallel FIR Filter Using PolyphaseDecomposition

– Fast FIR Filter Algorithms• Discrete Cosine Transform and Inverse DCT

– Algorithm-Architecture Transformation– Decimation-in-Frequency Fast DCT for 2M-point DCT

Chapter 9 3

Introduction

• Strength reduction leads to a reduction in hardware complexity byexploiting substructure sharing and leads to less silicon area or powerconsumption in a VLSI ASIC implementation or less iteration periodin a programmable DSP implementation

• Strength reduction enables design of parallel FIR filters with a less-than-linear increase in hardware

• DCT is widely used in video compression. Algorithm-architecturetransformations and the decimation-in-frequency approach are used todesign fast DCT architectures with significantly less number ofmultiplication operations

Chapter 9 4

Parallel FIR Filters

• An N-tap FIR filter can be expressed in time-domain as

– where {x(n)} is an infinite length input sequence and the sequence contains the FIR filter coefficients of length N

– In Z-domain, it can be written as

∞⋅⋅⋅=−=∗= ∑−

=

,,2,1,0,)()()()()(1

0

ninxihnxnhnyN

i

{ })(nh

⋅

=⋅= ∑∑

∞

=

−−

=

−

0

1

0

)()()()()(n

nN

n

n znxznhzXzHzY

Formulation of Parallel FIR Filters UsingPolyphase Decomposition

Chapter 9 5

• The Z-transform of the sequence x(n) can be expressed as:

– where X0(z2) and X1(z2), the two polyphase components, are the z-transforms of the even time series {x(2k)} and the odd time-series{x(2k+1)}, for {0≤k<∞}, respectively

• Similarly, the length-N filter coefficients H(z) can be decomposed as:

– where H0(z2) and H1(z2) are of length N/2 and are referred as even andodd sub-filters, respectively

• The even-numbered output sequence {y(2k)} and the odd-numberedoutput sequence {y(2k+1)} for {0≤k<∞} can be computed as

[ ] [ ])()(

)5()3()1()4()2()0(

)3()2()1()0()(

21

120

42142

321

zXzzX

zxzxxzzxzxx

zxzxzxxzX

−

−−−−−

−−−

+=

⋅⋅⋅++++⋅⋅⋅+++=

⋅⋅⋅++++=

)()()( 21

120 zHzzHzH −+=

(continued on the next page)

Chapter 9 6

• (cont’d)

– i.e.,

– where Y0(z2) and Y1(z2) correspond to y(2k) and y(2k+1) in time domain,respectively. This 2-parallel filter processes 2 inputs x(2k) and x(2k+1)and generates 2 outputs y(2k) and y(2k+1) every iteration. It can bewritten in matrix-form as:

( ) ( )[ ]

[ ])()(

)()()()()()(

)()()()(

)()()(

21

21

2

20

21

21

20

120

20

21

120

21

120

21

120

zHzXz

zHzXzHzXzzHzX

zHzzHzXzzX

zYzzYzY

−

−

−−

−

+

++=

+⋅+=

+=

)()()()()(

)()()()()(2

02

12

12

02

1

21

21

220

20

20

zHzXzHzXzY

zHzXzzHzXzY

+=

+= −

⋅

=

−

1

0

01

12

0

1

0

XX

HHHzH

YYXHY ⋅= or (9.1)

Chapter 9 7

– The following figure shows the traditional 2-parallel FIR filter structure,which requires 2N multiplications and 2(N-1) additions

• For 3-phase poly-phase decomposition, the input sequence X(z) andthe filter coefficients H(z) can be decomposed as follows

– where {X0(z3), X1(z3), X2(z3)} correspond to x(3k),x(3k+1) and x(3k+2)in time domain, respectively; and {H0(z3), H1(z3), H2(z3)} are the threesub-filters of H(z) with length N/3.

H0

H1

H0

H1 2−Z

y(2k+1)

y(2k)x(2k)

x(2k+1)

)()()()(

),()()()(3

223

113

0

32

231

130

zHzzHzzHzH

zXzzXzzXzX−−

−−

++=

++=

Chapter 9 8

– The output can be computed as:

– In every iteration, this 3-parallel FIR filter processes 3 input samples x(3k),x(3k+1) and x(3k+2), and generates 3 outputs y(3k), y(3k+1) and y(3k+2),and can be expressed in matrix form as:

( ) ( )( )[ ] [ ]

[ ]0211202

223

01101

12213

00

22

11

022

11

0

32

231

130 )()()()(

HXHXHXz

HXzHXHXzHXHXzHX

HzHzHXzXzX

zYzzYzzYzY

+++

+++++=

++⋅++=

++=

−

−−−

−−−−

−−

⋅

=

−

−−

2

1

0

012

23

01

13

23

0

2

1

0

XXX

HHHHzHHHzHzH

YYY

(9.2)

Chapter 9 9

– The following figure shows the traditional 3-parallel FIR filter structure,which requires 3N multiplications and 3(N-1) additions

H1x(3k)H0

H2

H1x(3k+1)

H0

H2

H1x(3k+2)

H0

H2

D

D

D

y(3k+2)

y(3k+1)

y(3k)

D3: −z

Chapter 9 10

• Generalization:

– The outputs of an L-Parallel FIR filter can be computed as:

– This can also be expressed in Matrix form as

∑

∑∑−

=−−−

=−

−

+=−+

−

=

−≤≤

+

=

1

011

0

1

1

20,

L

iiLiL

k

iiki

L

kiikLi

Lk

XHY

LkxHXHzY

⋅⋅⋅⋅

⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅

⋅⋅⋅⋅⋅⋅

=

⋅⋅⋅

−−−

−

−−

−

− 1

1

0

021

201

110

1

1

0

LLL

L

LL

L

L X

XX

HHH

HzHHHzHzH

Y

YY

XHY ⋅=

(9. 3)

(9. 4)

Note: H is a pseudo-circulant matrix

Chapter 9 11

Two-parallel and Three-parallel Low-Complexity FIR Filters

• Two-parallel Fast FIR Filter– The 2-parallel FIR filter can be rewritten as

– This 2-parallel fast FIR filter contains 3 sub-filters. The 2 sub-filters H0X0 and H1X1 are shared for the computation of Y0 and Y1

( ) ( ) 110010101

112

000

XHXHXXHHYXHzXHY

−−+⋅+=+= − (9. 5)

H0x(2k)

H0+H1

H1 Dx(2k+1)

y(2k)

y(2k+1)

-

-

Chapter 9 12

– This 2-parallel filter requires 3 distinct sub-filters of length N/2and 4 pre/post-processing additions. It requires 3N/2 = 1.5Nmultiplications and 3(N/2-1)+4=1.5N+1 additions. [The traditional2-parallel filter requires 2N multiplications and 2(N-1) additions]

– Example-1: when N=8 and , the 3 sub-filtersare

– The subfilter can be precomputed– The 2-parallel filter can also be written in matrix form as

{ }7610 ,,,, hhhhH ⋅⋅⋅=

{ }{ }

{ }7654321010

75311

64200

,,,

,,,,,,

hhhhhhhhHH

hhhhHhhhhH

++++=+

==

10 HH +

22222 XPHQY ⋅⋅⋅= (9.6)Q2 is a post-processing matrix which determines the manner in which the filter outputs are combined to correctly produce the parallel outputs and P2 is a pre-processing matrix which determines the manner in which the inputs should be combined

Chapter 9 13

– (matrix form)

– where diag(h*) represents an NXN diagonal matrix H2 with diagonalelements h*.

– Note: the application of FFA diagonalizes the original pseudo-circulant matrix H. The entries on the diagonal of H2 are the sub-filters required in this parallel FIR filter

– Many different equivalent parallel FIR filter structures can beobtained. For example, this 2-parallel filter can be implementedusing sub-filters {H0, H0 -H1, H1} which may be more attractive innarrow-band low-pass filters since the sub-filter H0 -H1 requiresfewer non-zero bits than H0 +H1. The parallel structure containingH0 +H1 is more attractive for narrow-band high-pass filters.

⋅

⋅

+⋅

−−=

−

1

0

1

10

02

1

0

101101

11101

XX

HHH

Hdiag

zYY

(9.7)

Chapter 9 14

• 3-Parallel Fast FIR Filter– A fast 3-parallel FIR algorithm can be derived by recursively

applying a 2-parallel fast FIR algorithm and is given by

(9.8)

( )( )[ ]( )( )[ ] [ ]( )( )[ ]

( )( )[ ]( )( )[ ]112121

111010

2102102

223

001110101

1121213

223

000

XHXXHHXHXXHH

XXXHHHYXHzXHXHXXHHY

XHXXHHzXHzXHY

−++−−++−

++++=−−−++=

−+++−=−

−−

– The 3-parallel FIR filter is constructed using 6 sub-filters of lengthN/3, including H0X0, H1X1, H2X2, ,

– With 3 pre-processing and 7 post-processing additions, this filterrequires 2N multiplications and 2N+4 additions which is 33% lessthan a traditional 3-parallel filter

( )( )1010 XXHH ++( )( )and 2121 XXHH ++ ( )( )210210 XXXHHH ++++

Chapter 9 15

– The 3-parallel filter can be expressed in matrix form as

33333 XPHQY ⋅⋅⋅=

−−

−

⋅

−−−=

=

−−

10000001001000101000001

11100011001

,

33

3

2

1

0

3

zz

QYYY

Y

=

=

++++

=

2

1

0

33

210

21

10

2

1

0

3 ,

111110011100010001

,XXX

XP

HHHHHHH

HHH

diagH

(9.9)

Chapter 9 16

– Reduced-complexity 3-parallel FIR filter structure

H0

x(3k+2)

H1

H2 D

x(3k+1)

y(3k)

y(3k+1)-

-

H0+H1

H1+H2

H0+H1+H2

D-

- -

-

x(3k)

y(3k+2)

Chapter 9 17

Parallel FIR Filters (cont’d)Parallel Filters by Transposition

• Any parallel FIR filter structure can be used to derive another parallelequivalent structure by transpose operation (or transposition).Generally, the transposed architecture has the same hardwarecomplexity, but different finite word-length performance

• Consider the L-parallel filter in matrix form Y=HX (9.4), where H isan LXL matrix. An equivalent realization of this parallel filter can begenerated by taking the transpose of the H matrix and flipping thevectors X and Y:

– where

FT

F XHY ⋅=[ ]

[ ]

⋅⋅⋅=

⋅⋅⋅=

−−

−−

TLLF

TLLF

YYYY

XXXX

021

021

(9.10)

Chapter 9 18

• Examples:– the 2-parallel FIR filter in (9.1) can be reformulated by using transposition

as follows:

– Transposition of the 2-parallel fast filter in (9.6) leads to anotherequivalent structure:

– The reduced-complexity 2-parallel FIR filter structure by transposition isshown on next page

⋅

=

−

0

1

012

10

0

1

XX

HHzHH

YY

22222 XPHQY ⋅⋅⋅=( )

F

FF

XQHP

XPHQYTTT

T

2222

22222

⋅⋅⋅=

⋅⋅⋅=

⋅

−

−⋅

+⋅

=

− 0

1

21

10

0

0

1

11011

110011

XX

zHHH

Hdiag

YY

(9.11)

Chapter 9 19

• Signal-flow graph of the 2-parallel FIR filter• Transposed signal-flow graph

x0

x1

y0

y1

H0

H1

H0+H1

-

-

z-2

y0

y1

x0

x1

H0

H1

H0+H1 -

- z-2

Fig. (a)

Fig. (b)

Chapter 9 20

(c) Block diagram of the transposed reduced-complexity 2-parallel FIR filter

D

H0

H0+H1

H1

x0

x1

y1

y0-

-

Fig. (c)

Chapter 9 21

Parallel FIR Filters (cont’d)Parallel Filter Algorithms from Linear Convolutions

• Any LXL convolution algorithm can be used to derive an L-parallelfast filter structure

• Example: the transpose of the matrix in a 2X2 linear convolutionalgorithm (9.12) can be used to obtain the 2-parallel filter (9.13):

(9.13)

⋅

=

0

1

0

10

1

0

1

2

0

0

xx

hhh

h

sss

⋅

=

−

1

0

12

01

01

1

0

00

XX

Xz

HHHH

YY

(9. 12)

Chapter 9 22

• Example: To generate a 2-parallel filter using 2X2 fast convolution, considerthe following optimal 2X2 linear convolution:

– Note: Flipping the samples in the sequences {s}, {h}, and {x} preservesthe convolution formulation (i.e., the same C and A matrices can be usedwith the flipped sequences)

– Taking the transpose of this algorithm, we can get the matrix form of thereduced-complexity 2-parallel filtering structure:

⋅

⋅

+⋅

−−=

⋅⋅⋅=

0

1

0

10

1

0

1

2

101101

100111

001

xx

hhh

hdiag

sss

XAHCs

( ) XPHQXAHCY T ⋅⋅⋅=⋅⋅⋅=

(9.14)

Chapter 9 23

– The matrix form of the reduced-complexity 2-parallel filtering structure

– The 2-parallel architecture resulting from the matrix form is shown asfollows

– Conclusion: this method leads to the same architecture that was obtainedusing the direct transposition of the 2-parallel FFA

⋅

−

−⋅

+⋅

=

−

1

0

12

0

10

1

1

0

110010011

110011

XX

Xz

HHH

Hdiag

YY

(9.15)

x(2k)

x(2k+1)

y(2k)

y(2k+1)H0

D

-

H0+H1

H1

-

Chapter 9 24

Parallel FIR Filters (cont’d)

Fast Parallel FIR Algorithms for Large Block Sizes

• Parallel FIR filters with long block sizes can be designed by cascadingsmaller length fast parallel filters

• Example: an m-parallel FFA can be cascaded with an n-parallel FFAto produce an -parallel filtering structure. The set of FIR filtersresulting from the application of the m-parallel FFA can be furtherdecomposed, one at a time, by the application of the n-parallel FFA.The resulting set of filters will be of length .

• When cascading the FFAs, it is important to keep track of both thenumber of multiplications and the number of additions required for thefiltering structure

( )nm×

( )nmN ×

Chapter 9 25

– The number of required multiplications for an L-parallel filter with is given by:

• where r is the number of levels of FFAs used, is the block size ofthe FFA at level-i, is the number of filters that result from theapplications of the i-th FFA and N is the length of the filter

– The number of required additions can be calculated as follows:

rLLLL ⋅⋅⋅= 21

∏∏ =

=

=r

iir

i i

ML

NM

11

iLiM

(9.16)

−

+

+=

∏∏

∑ ∏∏∏

==

=

−

=+==

11

1

2

1

112

r

i i

r

ii

r

i

i

kk

r

ijji

r

iii

L

NM

MLALAA

(9.17)

Chapter 9 26

• where is the number of pre/post-processing adders required bythe i-th FFA

– For example: consider the case of cascading two 2-parallel reduce-complexity FFAs, the resulting 4-parallel filtering structure wouldrequire a total of 9N/4 multiplications and 20+9(N/4-1) additions.Compared with the traditional 4-parallel filter which requires 4Nmultiplications. This results in a 44% hardware (area) savings

• Example: (Example 9.2.1, p.268) Calculating the hardware complexity– Calculate the number of multiplications and additions required to

implement a 24-tap filter with block size of L=6 for both the cases and :

• For the case :

iA

{ }3,2 21 == LL { }2,3 21 == LL{ }3,2 21 == LL

( ) ( ) ( ) ( ) ( ) ( ) 96132

246331034,7263

3224

=

−

××+×+×==××

×= AM

,10,6,4,3 2211 ==== AMAM

Chapter 9 27

• For the case :

• How are the FFAs cascaded?– Consider the design of a parallel FIR filter with a block size of 4,

using (9.3), we have

– The reduced-complexity 4-parallel filtering structure is obtained byfirst applying the 2-parallel FFA to (9.18), then applying the FFA asecond time to each of the filtering operations that result from thefirst application of the FFA

– From (9.18), we have (see the next page):

{ }2,3 21 == LL

( ) ( ) ( ) ( ) ( ) ( ) 98123

243664210,7236

2324

=

−

××+×+×==××

×= AM

,4,3,10,6 2211 ==== AMAM

( )( )3

32

21

10

33

22

11

0

33

22

11

0

HzHzHzH

XzXzXzX

YzYzYzYY

−−−

−−−

−−−

+++

⋅+++=

+++=

(9.18)

Chapter 9 28

– (cont’d)• where

– Application-1

• The 2-parallel FFA is then applied a second time to each of thefiltering operations of(9.19)

– Application-2• Filtering Operation

( ) ( )11

011

0 '''' HzHXzXY −− +⋅+=

+=+=

+=+=−−

−−

32

11,22

00

32

11,22

00

''

''

HzHHHzHH

XzXXXzXX

( ) ( )[ ] '1

'1

2'1

'1

'0

'0

'1

'0

'1

'0

1'0

'0 HXzHXHXHHXXzHXY −− +−−+⋅++= (9.19)

( ) ( ){ }10101100 '''','','' HHXXHXHX +⋅+

{ }00 '' HX

( )( )( ) ( )[ ] 22

422002020

200

22

022

0'0

'0

HXzHXHXHHXXzHX

HzHXzXHX−−

−−

+−−+⋅++=

++=

Chapter 9 29

• Filtering Operation

• Filtering Operation

– The second application of the 2-parallel FFA leads to the 4-parallelfiltering structure (shown on the next page), which requires 9filtering operations with length N/4

{ }11 '' HX( )( )

( ) ( )[ ] 334

331131312

11

32

132

1'1

'1

HXzHXHXHHXXzHX

HzHXzXHX−−

−−

+−−+⋅++=

++=

( )( ){ }1010 '''' HHXX ++( )( ) ( ) ( )[ ] ( ) ( )[ ]

( )( )[ ] ( )( )[ ]( )( )

( )( ) ( )( )

++−++−

+++++++

+++++=

+++⋅+++=++

−

−

−−

32321010

321032102

32324

1010

322

10322

101010 ''''

HHXXHHXXHHHHXXXX

z

HHXXzHHXX

HHzHHXXzXXHHXX

Chapter 9 30

Reduced-complexity 4-parallel FIR filter (cascaded 2 by 2)

Chapter 9 31

Discrete Cosine Transform andInverse DCT

• The discrete cosine transform (DCT) is a frequency transform used instill or moving video compression. We discuss the fastimplementations of DCT based on algorithm-architecturetransformations and the decimation-in-frequency approach

• Denote the DCT of the data sequence x(n), n=0, 1,…, N-1, by X(k),k=0, 1, …, N-1. The DCT and inverse DCT (IDCT) are described bythe following equations:– DCT:

– IDCT:

( )1,,1,0,

212

cos)()()(1

0

−⋅⋅⋅=

+

= ∑−

=

NkN

knnxkekX

N

n

π

( )1,,1,0,

212

cos)()(2

)(1

0

−⋅⋅⋅=

+

= ∑−

=

NnN

knkXke

Nnx

N

k

π

(9.20)

(9.21)

Chapter 9 32

• where

• Note: DCT is an orthogonal transform, i.e., the transformation matrixfor IDCT is a scaled version of the transpose of that for the DCT andvice versa. Therefore, the DCT architecture can be obtained by“transposing” the IDCT, i.e., reversing the direction of the arrows inthe flow graph of IDCT, and the IDCT can be obtained by“transposing” the DCT

• Direct implementation of DCT or IDCT requires N(N-1) multiplicationoperations, i.e., O(N2), which is hardware expensive.

• Strength reduction can reduce the multiplication complexity of a 8-point DCT from 56 to 13.

=

=otherwise

kke

,10,21

)(

Chapter 9 33

• Example (Example 9.3.1, p.277) Consider the 8-point DCT

– It can be written in matrix form as follows: (where )

⋅

=

)7()6()5()4()3()2()1()0(

)7()6()5()4()3()2()1()0(

9271331173217

26142221030186

1112313325155

28201242820124

137127211593

30262218141062

15131197531

44444444

xxxxxxxx

cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

XXXXXXXX

( )

==

⋅⋅⋅=

+

= ∑=

otherwisekkewhere

kkn

nxkekXn

,10,21)(

7,,1,0,16

12cos)()()(

7

0

π

16cos πici =

Chapter 9 34

– The algorithm-architecture mapping for the 8-point DCT can becarried out in three steps

• First Step: Using trigonometric properties, the 8-point DCT can berewritten as in next page

⋅

−−−−−−−−

−−−−−−−−

−−−−−−−−

−−−−

=

)7()6()5()4()3()2()1()0(

)7()6()5()4()3()2()1()0(

75311357

62266226

51733715

44444444

37155173

26622662

13577531

44444444

xxxxxxxx

cccccccccccccccccccccccc

cccccccccccccccc

cccccccccccccccc

cccccccc

XXXXXXXX

(9.22)

Chapter 9 35

– (continued)

– where

– The following figure (on the next page) shows the DCTarchitecture according to (9.23) and (9.24) with 22 multiplications.

410073123150

410013725130

21161033521170

61121053327110

)0(,)5()4(,)3(

)6(,)7()2(,)1(

cPXcMcMcMcMXcMXcMcMcMcMX

cMcMXcMcMcMcMXcMcMXcMcMcMcMX

⋅=+−+=⋅=−−−=

−=++−=+=+++=

,,,,

,,,,,,,,

3211101032111010

523612431700

523612431700

PPPPPPPPMPPM

xxPxxPxxPxxPxxMxxMxxMxxM

+=+=−=−=

+=+=+=+=−=−=−=−=

(9.23)

11101001110100 , PPPPPM +=−=

(9.24)

Chapter 9 36

Figure: The implementation of 8-point DCT structure in the first step (also see Fig. 9.10, p.279)

Chapter 9 37

• Second step, the DCT structure (see Fig. 9.10, p.279) is grouped intodifferent functional units represented by blocks and then the wholeDCT structure is transformed into a block diagram

– Two major blocks are defined as shown in the following figure

– The transformed block diagram for an 8-point DCT is shown inthe next page (also see Fig. 9.12 in p.280 of text book)

x(0)

x(1)

x(0)+x(1)

x(0)-x(1)-

x(0)

x(1)

ax(0)+bx(1)

bx(0)-ax(1)-

a

abb

X±

XC±

a b

Chapter 9 38

Figure: The implementation of 8-point DCT structure in the second step (also see Fig. 9.12, p.280)

Chapter 9 39

• Third step: Reduced-complexity implementations of various blocksare exploited (see Fig. 9.13, p.281)

– The block can be realized using 3 multiplications and 3additions instead of using 4 multiplications and 2 additions, asshown in follows

– Define the block with andreversed outputs as a rotator block that performs thefollowing computation:

XC±

x

y

ax+by

bx-ay-

x

y

ax+by

bx-ay-

a-b

a+b

a

a

bb

b

XC± { }θθ cos,sin == baθrot

⋅

−=

yx

yx

θθθθ

cossinsincos

''

Chapter 9 40

– Note: The angles of cascaded rotators can be simply added, asshown in the transformation block as follows:

– Note: Based on the fact that a rotator with is just likethe block , we modify it as the following structure:

XC±

a b

x

ybx-ay

ax+byθrotx

yx’y’

==

θθ

cossin

ba

for

1θrot 2θrot ( )21 θθ +rot

{ }4πθ =X±

X±x

y

c4

c4

( )4πrotxy

x’y’ )4cos(4 π=c

Chapter 9 41

– From the three steps, we obtain the final structure where only 13multiplications are required (also see Fig. 9.14, p.282)

X±x(0)x(7)

163π

rot

X±x(3)x(4)

X±x(1)x(6)

X±x(2)x(5)

16π

rot

X±

X±

83π

rot

X±

X±

X±

-

-

-c4

c4

c4

c4

X(1)

X(5)

X(3)

X(7)

X(6)

X(2)

X(0)

X(4)

Chapter 9 42

Decimation-in-Frequency Fast DCT for -Point DCT

• The fast -point DCT/IDCT structures can be derived by thedecimation-in-frequency approach, which is commonly used to derivethe FFT structure to compute the discrete-Fourier transform (DFT). Bypower-of-2 decomposition, this algorithm reduces the number ofmultiplications to about

• We only derive the fast IDCT computation (The fast DCT structure canbe obtained from IDCT by “transposition” according to theircomputation symmetry). For simplicity, the 2/N scaling factor in (9.21)is ignored in the derivation.

Discrete Cosine Transform and Inverse DCTm2

m2

( ){ }NN 2log2

Chapter 9 43

– Define and decompose x(n) into even andodd indexes of k as follows

– Notice

( ) ( ) ( )kXkekX ⋅=ˆ

( )

( ) ( ) ( ) ( ) ( ) ( )

( ) ( ) ( )( )[ ]

( ) ( ) ( ) ( )

+

++

+

•+

+

+

=

++

++

+

=

+

=

∑

∑

∑∑

∑

−

=

−

=

−

=

−

=

−

=

Nn

Nkn

kX

NnNkn

kX

Nkn

kXN

knkX

Nkn

kXnx

N

k

N

k

N

k

N

k

N

k

212

cos2

1212cos12ˆ2

212cos21

2212

cos2ˆ

21212

cos12ˆ2

212cos2ˆ

212

cos)(ˆ)(

12

0

12

0

12

0

12

0

1

0

ππ

ππ

ππ

π

( ) ( ) ( ) ( ) ( )

( )N

knN

knN

nN

kn

π

πππ

12cos

112cos

212

cos2

1212cos2

++

++=

+⋅

++

Chapter 9 44

– Therefore, (since )

– Substitute k’=k+1 into the first term, we obtain

• where

( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( )∑∑

∑∑

∑

−

=

−

=

−

=

−

=

−

=

+

++

++

+=

+

++

++

+=

+

++

+

12

0

22

0

12

0

12

0

12

0

12cos12ˆ112

cos12ˆ

12cos12ˆ112

cos12ˆ

212

cos2

1212cos12ˆ2

N

k

N

k

N

k

N

k

N

k

Nkn

kXN

knkX

Nkn

kXN

knkX

Nn

Nkn

kX

ππ

ππ

ππ( ) ( )[ ] 011212cos =+−+ NNn π

( ) ( ) ( )

( ) ( ) ( ) ( )∑∑

∑−

=

−

=

−

=

+

−=

+

−=

++

+

12

0

12

1'

22

0

'12cos1'2ˆ'12

cos1'2ˆ

112cos12ˆ

N

k

N

k

N

k

Nkn

kXN

knkX

Nkn

kX

ππ

π

( ) 01ˆ =−X

Chapter 9 45

– Then, the IDCT can be rewritten as

– Define

– and

– Clearly, G(k) & H(k) are the DCTs of g(n) & h(n), respectively.

( ) ( )( ) ( )[ ]

( ) ( )[ ] ( )( )∑

∑−

=

−

=

+−++

⋅+

+

+=

12

0

12

0

2212

cos12ˆ12ˆ

212cos21

2212

cos2ˆ)(

N

k

N

k

Nkn

kXkX

NnNkn

kXnx

π

ππ

12,,1,0),12(ˆ)12(ˆ)(

),2(ˆ)(−⋅⋅⋅=

−++≡

≡Nk

kXkXkH

kXkG(9.25)

( )( )

[ ] ( )( )

+−++≡

−⋅⋅⋅=

+≡

∑

∑−

=

−

=

2212

cos)12(ˆ)12(ˆ)(

12,,1,0,22

12cos)2(ˆ)(

12

0

12

0

Nkn

kXkXnh

NnN

knkXng

N

k

N

k

π

π

(9.26)

Chapter 9 46

– Since

– Finally, we can get

– Therefore, the N-point IDCT in (9.21) has been expressed in termsof two N/2-point IDCTs in (9.26). By repeating this process, theIDCT can be decomposed further until it can be expressed in termsof 2-point IDCTs. (The DCT algorithm can also be decomposedsimilarly. Alternatively, it can be obtained by transposing theIDCT)

( )( ) ( )

( )( ) ( )

+−=

+−−

+=

+−−

Nn

NnN

Nkn

NknN

ππ

ππ

12cos

112cos

12cos

112cos

( ) ( )[ ]

( ) ( )[ ]

−⋅⋅⋅=

+−=−−

++=

12,,1,0),(

212cos21

)()1(

),(212cos2

1)()(

Nnnh

NnngnNx

nhNn

ngnx

π

π

(9.27)

Chapter 9 47

• Example (see Example 9.3.2, p.284) Construct the 2-point IDCT butterflyarchitecture.– The 2-point IDCT can be computed as

– The 2-point IDCT can be computed using the following butterflyarchitecture

( )( )

−=+=

,4cos)1(ˆ)0(ˆ)1(,4cos)1(ˆ)0(ˆ)0(

ππ

XXxXXx

-1x(1)

x(0)

C4)1(X̂

)0(X̂

Chapter 9 48

• Example (Example 9.3.3, p.284) Construct the 8-point fast DCTarchitecture using 2-point IDCT butterfly architecture.– With N=8, the 8-point fast DCT algorithm can be rewritten as:

– and

3,2,1,0),12(ˆ)12(ˆ)(

),2(ˆ)(=

−++≡

≡k

kXkXkH

kXkG

( )

( )

+

=

−⋅⋅⋅=

+

=

∑

∑−

=

=

812

cos)()(

12,,1,0,812

cos)()(

13

0

3

0

knkHnh

Nnkn

kGng

k

k

π

π

( )[ ]

( )[ ]

+−=−−

++=

),(1612cos2

1)()1(

),(1612cos2

1)()(

nhn

ngnNx

nhn

ngnx

π

π

Chapter 9 49

– The 8-point fast IDCT is shown below (also see Fig.9.16, p.285), whereonly 13 multiplications are needed. This structure can be transposed to getthe fast 8-point DCT architecture as shown on the next page (also see Fig.9.17, p.286) (Note: for N=8, in bothfigures)

( )[ ] ( )4cos164cos214 ππ ==C

)0(X̂)0(X̂

)4(X̂

)2(X̂

)6(X̂

)1(X̂

)5(X̂

)3(X̂

)7(X̂

)4(X̂

)2(X̂

)6(X̂

)1(X̂

)5(X̂

)3(X̂

)7(X̂

)0(G

)2(G

)1(G

)3(G

)0(H

)2(H

)1(H

)3(H

4C

4C

4C

4C

1−

1−

1−

1−

2C

6C

2C

6C

1−

1−

1−

1−

3C

1C

7C

5C

1−

1−

1−

1−

)0(X

)1(X

)3(X

)2(X

)7(X

)6(X

)4(X

)5(X

Chapter 9 50

Fast 8-point DCT Architecture

)0(X̂

)4(X̂

)2(X̂

)6(X̂

)1(X̂

)5(X̂

)3(X̂

)7(X̂

1−

1−

1−

1−

3C

1C

7C

5C

)0(X

)2(X

)4(X

)6(X

)1(X

)3(X

)5(X

)7(X

1−

1−

1−

1−

6C

2C

2C

6C

1−

1−

1−

1−

4C

4C

4C

4C

4C

Documents

Chapter 9: Algorithmic Strength Reduction in Filters and Transforms