VLSI SYNTHESIS OF DSP KERNELS Algorithmic and Architectural Transformations

by

MAHESH MEHENDALE Texas Instruments (India), Ltd.

and

SUNIL D. SHERLEKAR Silicon Automation Systems Ltd.

Springer Science+Business Media, LLC

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-1-4419-4904-2 ISBN 978-1-4757-3355-6 (eBook) DOI 10.1007/978-1-4757-3355-6

Printed on acid-free paper

All Rights Reserved © 2001 Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers, Boston in 2001. Softcover reprint of the hardcover 1st edition 2001. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.

Contents

List of Figures
List of Tables
Foreword
Acknowledgments
Preface

1. INTRODUCTION
1.1 An Example
1.2 The Design Process: Constraints and Alternatives
1.3 Organization of the Book
1.4 For the Reader

2. PROGRAMMABLE DSP BASED IMPLEMENTATION
2.1 Power Dissipation - Sources and Measures
2.1.1 Components Contributing to Power Dissipation
2.1.2 Measures of Power Dissipation in Busses
2.1.3 Measures of Power Dissipation in the Multiplier
2.2 Low Power Realization of DSP Algorithms
2.2.1 Allocation of Program, Coefficient and Data Memory
2.2.2 Bus Coding
2.2.2.1 Gray Coded Addressing
2.2.2.2 T0 coding
2.2.2.3 Bus Invert Coding
2.2.3 Instruction Buffering
2.2.4 Memory Architectures for Low Power
2.2.5 Bus Bit Reordering
2.2.6 Generic Techniques for Power Reduction
2.3 Low Power Realization of Weighted-sum Computation
2.3.1 Selective Coefficient Negation
2.3.2 Coefficient Ordering
2.3.2.1 Coefficient Ordering Problem Formulation
2.3.2.2 Coefficient Ordering Algorithm
2.3.3 Adder Input Bit Swapping
2.3.4 Swapping Multiplier Inputs
2.3.5 Exploiting Coefficient Symmetry
2.4 Techniques for Low Power Realization of FIR Filters
2.4.1 Circular Buffer
2.4.2 Multirate Architectures
2.4.2.1 Computational Complexity of Multirate Architectures
2.4.2.2 Multirate Architecture on a Programmable DSP
2.4.3 Architecture to Support Transposed FIR Structure
2.4.4 Coefficient Scaling
2.4.5 Coefficient Optimization
2.4.5.1 Coefficient Optimization - Problem Definition
2.4.5.2 Coefficient Optimization - Problem Formulation
2.4.5.3 Coefficient Optimization Algorithm - Components
2.4.5.4 Coefficient Optimization Algorithm
2.4.5.5 Coefficient Optimization Using 0-1 Programming
2.5 Framework for Low Power Realization of FIR Filters on a Programmable DSP

3. IMPLEMENTATION USING HARDWARE MULTIPLIER(S) AND ADDER(S)
3.1 Architectural Transformations
3.2 Evaluating the Effectiveness of DFG Transformations
3.3 Low Energy vs Low Peak Power Tradeoff
3.4 Multirate Architectures
3.4.1 Computational Complexity of Multirate Architectures
3.4.1.1 Non-linear Phase FIR Filters
3.4.1.2 Linear Phase FIR Filters
3.5 Power Analysis of Multirate Architectures
3.5.1 Power Analysis for One Level Decimated Multirate Architectures
3.5.1.1 Power Analysis - an Example
3.5.1.2 Power Reduction Using Multirate Architectures

4. DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION
4.1 DA Structures for Area-Delay Tradeoff
4.1.1 DA Based Implementation of Linear Phase FIR Filters
4.1.2 1-Bit-At-A-Time vs 2-Bits-At-A-Time Access
4.1.3 Multiple Coefficient Memory Banks
4.1.4 Multiple Memory Bank Implementation with 2BAAT Access
4.1.5 DA Based Implementation of Multirate Architectures
4.1.6 Multirate Architecture with a Decimation Factor of Three
4.1.7 Multirate Architectures with Two Level Decimation
4.1.8 Coefficient Memory vs Number of Additions Tradeoff
4.2 Improving Area Efficiency of Two LUT Based DA Structures
4.2.1 Minimum Area Partitions for Two ROM Implementation
4.2.2 Minimum Area Partitions for Hardwired Logic
4.2.2.1 CF2: Estimating Area from the Actual Truth-Table
4.2.2.2 CF1: Estimating Area from the Coefficients in Each Partition
4.2.3 Evaluating the Effectiveness of the Coefficient Partitioning Technique
4.3 Techniques for Low Power Implementation of DA Based FIR Filters
4.3.1 Toggle Reduction Using Data Coding
4.3.1.1 Nega-binary Coding
4.3.1.2 2's Complement vs Nega-binary Representation
4.3.1.3 Deriving an Optimum Nega-binary Scheme for a Given Data Distribution
4.3.1.4 Incorporating a Nega-binary Scheme into the DA Based FIR Filter Implementation
4.3.1.5 A Few Observations
4.3.1.6 Additional Power Saving with Nega-binary Architecture
4.3.2 Toggle Reduction in Memory Based Implementations by Gray Sequencing and Sequence Reordering

5. MULTIPLIER-LESS IMPLEMENTATION
5.1 Minimizing Additions in the Weighted-sum Computation
5.1.1 Minimizing Additions - an Example
5.1.2 2 Bit Common Subexpressions
5.1.3 Problem Formulation
5.1.4 Common Subexpression Elimination
5.1.5 The Algorithm
5.2 Minimizing Additions in MCM Computation
5.2.1 Minimizing Additions - an Example
5.2.2 2 Bit Common Subexpressions
5.2.3 Problem Formulation
5.2.4 Common Subexpression Elimination
5.2.5 The Algorithm
5.2.6 An Upper Bound on the Number of Additions for MCM Computation
5.3 Transformations for Minimizing Number of Additions
5.3.1 Number Theoretic Transforms
5.3.1.1 2's Complement Representation
5.3.1.2 Uni-sign Representation
5.3.1.3 Canonical Signed Digit (CSD) Representation
5.3.2 Signal Flow Graph Transformations
5.3.3 Evaluating Effectiveness of the Transformations
5.3.4 Transformations for Optimal Initial Solution
5.3.4.1 Coefficient Optimization
5.3.4.2 Efficient Pre-Filter Structures
5.4 High Level Synthesis of Multiprecision DFGs
5.4.1 Precision Sensitive Register Allocation
5.4.2 Precision Sensitive Functional Unit Binding
5.4.3 Precision Sensitive Scheduling

6. IMPLEMENTATION OF MULTIPLICATION-FREE LINEAR TRANSFORMS
6.1 Optimum Code Generation for Register-rich Architectures
6.1.1 Generic Register-rich Architecture Model
6.1.2 Sources and Measures of Power Dissipation
6.1.3 Optimum Code Generation for 1-D Transforms
6.1.4 Minimizing Number of Operations in Two Dimensional Transforms
6.1.5 Low Power Code Generation
6.2 Optimum Code Generation for Single Register, Accumulator Based Architectures
6.2.1 Single Register, Accumulator Based Architecture Model
6.2.2 Code Generation Rules
6.2.3 Computation Scheduling Algorithm
6.2.4 Impact of DAG Structure on the Optimality of Generated Code
6.2.5 DAG Optimizing Transformations
6.2.5.1 Transformation I - Tree to Chain Conversion
6.2.5.2 Transformation II - Serializing a Butterfly
6.2.5.3 Transformation III - Fanout Reduction
6.2.5.4 Transformation IV - Merging
6.2.6 Synthesis of Spill-free DAGs
6.2.7 Sources and Measures of Power Dissipation
6.2.8 Low Power Code Generation

7. RESIDUE NUMBER SYSTEM BASED IMPLEMENTATION
7.1 Optimizing RNS based Implementation of the Weighted-sum Computation
7.1.1 Parallel Processing
7.1.2 Residue Encoding for Low Power
7.1.3 Coefficient Ordering
7.1.4 Exploiting Redundancy
7.1.5 Residue Encoding for Minimizing LUT Area
7.2 Optimizing RNS based Implementation of FIR Filters
7.2.1 Coefficient Scaling
7.2.2 Coefficient Optimization for Low Power
7.2.3 RNS based Implementation of Transposed FIR Filter Structure
7.2.4 Coefficient Optimization for Area Reduction
7.3 RNS as an Optimizing Transformation for High Precision Signal Processing

8. A FRAMEWORK FOR ALGORITHMIC AND ARCHITECTURAL TRANSFORMATIONS
8.1 Classification of Algorithmic and Architectural Transformations
8.2 A Snapshot of the Framework

9. SUMMARY

References
Topic Index
About the Authors

List of Figures

1.1 Digital Still Camera System
1.2 DSC Image Pipeline
1.3 Hardware-Software Codesign Methodology for a System-on-a-chip
1.4 Solution Space for Weighted-Sum Computation
2.1 Generic DSP Architecture
2.2 4x4 Array Multiplier
2.3 Toggle Count as a Function of Number of Ones in the Multiplier Inputs
2.4 Toggle Count as a Function of Hamming Distance between Successive Inputs
2.5 Address Bus Power Dissipation as a Function of Start Address
2.6 Binary to Gray Code Conversion
2.7 Memory Reorganization to Support Gray Coded Addressing
2.8 Programmable Binary to Gray Code Converter
2.9 T0 Coding Scheme
2.10 T0 Coding Scheme
2.11 Instruction Buffering
2.12 Decoded Instruction Buffering
2.13 Memory Partitioning for Low Power
2.14 Prefetch Buffer
2.15 Bus Reordering Scheme for Power Reduction in PD bus
2.16 % Reduction in the Number of Adjacent Signal Transitions in Opposite Directions as a Function of the Bus Reordering Span
2.17 Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter
2.18 Scheme for Reducing Power in the Adder Input Busses
2.19 Data Flow Graph of a Weighted-sum Computation with Coefficient Symmetry
2.20 Suitable Abstraction of TMS320C54x Architecture for Exploiting Coefficient Symmetry
2.21 Signal Flow Graph of a Direct Form FIR Filter
2.22 One Level Decimated Multirate Architecture
2.23 Normalized Power Dissipation as a Function of Number of Taps for the Multirate FIR Filters Implemented on TMS320C2x
2.24 Signal Flow Graph of the Transposed FIR Filter
2.25 Architecture to Support Efficient Implementation of Transposed FIR Filter
2.26 Frequency Domain Characteristics of a 24 Tap FIR Filter Before and After Optimization
2.27 Low Pass Filter Specifications
2.28 Framework for Low Power Realization of FIR Filters on a Programmable DSP
3.1 Direct Form Structure of a 4 Tap FIR Filter
3.2 Scheduled DFG Using One Multiplier and One Adder
3.3 Scheduled DFG Using One Pipelined Multiplier and One Adder
3.4 Loop Unrolled DFG Using 1 Pipelined Multiplier and 1 Adder
3.5 Retimed 4 Tap FIR Filter
3.6 MCM DFG Using One Pipelined Multiplier and One Adder
3.7 Direct Form DFG Using Two Pipelined Multipliers and One Adder
3.8 MCM DFG Using Two Pipelined Multipliers and Two Adders
3.9 Energy and Peak Power Dissipation as a Function of Degree of Parallelism
3.10 Lower Limit of VDD/VT for Reduced Peak Power Dissipation as a Function of Degree of Parallelism
3.11 One Level Decimated Multirate Architecture: Topology-I
3.12 One Level Decimated Multirate Architecture: Topology-II
3.13 Signal Flow Graph of a Direct Form FIR Structure with Non-linear Phase
3.14 Signal Flow Graph of a Direct Form FIR Structure with Linear Phase
3.15 Signal Flow Graph of a Two Level Decimated Multirate Architecture
3.16 Normalized Delay vs Supply Voltage Relationship
3.17 Normalized Power Dissipation vs Number of Taps
4.1 DA Based 4 Tap FIR Filter
4.2 4 Tap Linear Phase FIR Filter
4.3 2 Tap FIR Filter with 2BAAT
4.4 Using Multiple Memory Banks
4.5 Multirate Architecture
4.6 DA Based 4 Tap Multirate FIR Filter
4.7 Area-Delay Curves for FIR Filters
4.8 Two Bank Implementation - Simple Coefficient Split
4.9 Two Bank Implementation - Generic Coefficient Split
4.10 Area vs Normalized CF2 Plot for 25 Different Partitions of a 16 Tap Filter
4.11 Range of Represented Values for N=4, 2's Complement and N+1=5, Nega-binary
4.12 Typical Audio Data Distribution for 25000 Samples Extracted from an Audio File
4.13 Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: + - - + - + +
4.14 Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: - + + - + - +
4.15 Gaussian Distributed Data with N=6, Mean=22, SD=6
4.16 Gaussian Distributed Data with N=6, Mean=-22, SD=6
4.17 DA Based FIR Architecture Incorporating the Nega-binary Scheme
4.18 Saving vs SD Plot for N=8, Gaussian Distributed Data with Mean = max/2
4.19 Narrow (SD=8) Gaussian Distribution
4.20 Broad (SD=44) Gaussian Distribution
4.21 Shiftless Implementation of DA Based FIR with Fixed Gray Sequencing
4.22 Shiftless Implementation of DA Based FIR with Any Sequencing Possible
5.1 Data Flow Graph for a 4-term Weighted-sum Computation
5.2 Coefficient Subexpression Graph for the 4-term Weighted-sum Computation
5.3 Data Flow Graph for 4 term MCM Computation
5.4 SFG Transformation - Computing Y[n] in Terms of Y[n-1]
5.5 SFG Transformation - Computing Y[n] in Terms of Y[n-1]
5.6 Average Reduction Factor Using Common Subexpression Elimination
5.7 Best Reduction Factors Using Coefficient Transforms Without Common Sub-expression Elimination
5.8 Best Reduction Factors Using Coefficient Transforms with Common Sub-expression Elimination
5.9 Frequency of Various Coefficient Transforms Resulting in the Best Reduction Factor with Common Sub-expression Elimination
5.10 Precision Sensitive Register Allocation
5.11 Precision Sensitive Register Allocation
5.12 Precision Sensitive Scheduling
6.1 Generic Register-rich Architecture
6.2 3x3 Pixel Window Transform
6.3 Prewitt Window Transform
6.4 Transformed DAG with All SUB Nodes
6.5 Chain-type DAG for Prewitt Window Transform
6.6 Optimized Code for Prewitt Window Transform
6.7 Optimized DAG for 4x4 Haar Transform
6.8 Scheduled Instructions for 4x4 Haar Transform
6.9 Data Flow Graph and Variable Lifetimes for 4x4 Haar Transform
6.10 Register-Conflict Graph
6.11 Consecutive-Variables Graph
6.12 Register Assignment for Low Power
6.13 Code Optimized for Low Power
6.14 3x3 Window Transforms
6.15 Single Register, Accumulator Based Architecture
6.16 Example DAG
6.17 DAG for 4x4 Walsh-Hadamard Transform
6.18 Optimized DAG for 4x4 Walsh-Hadamard Transform
6.19 Transformation I - Tree to Chain Conversion
6.20 Transformation II - Serializing a Butterfly
6.21 Transformations III and IV
6.22 Optimizing DAG Using Transformations
6.23 Spill-free DAG Synthesis
6.24 DAGs for 8x8 Walsh-Hadamard Transform
6.25 Spill-free DAGs for 8x8 Walsh-Hadamard Transform
6.26 DAGs for 8x8 Haar Transform
7.1 RNS Based Implementation of FIR Filters
7.2 Modulo MAC using look-up-tables
7.3 Modulo MAC using a single LUT
7.4 RNS Based Implementation of FIR Filters with Parallel Processing Transformation
7.5 Minimizing Look Up Table Area by Exploiting Redundancy
7.6 Modulo MAC structure for Transposed Form FIR Filter
8.1 A Framework for Area-Power Tradeoff
8.2 A Framework for Area-Power Tradeoff - continued

List of Tables

2.1 Adjacent Signal Transitions in Opposite Direction as a Function of the Bus-reordering Span
2.2 Impact of Selective Coefficient Negation on Total Number of 1s in the Coefficients
2.3 Impact of Coefficient Ordering on Hamming Distance and Adjacent Toggles
2.4 Power Optimization Results Using Input Bit Swapping for 1000 Random Number Pairs
2.5 TMS320C2x Code for Direct Form Architecture
2.6 TMS320C2x Code for the Multirate Architecture
2.7 Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with No Linear Phase Constraint
2.8 Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with Linear Phase Constraint
2.9 Hamming Distance and Adjacent Signal Toggles for Steepest Descent and First Improvement Optimization with and without Linear Phase Constraint (with No Coefficient Scaling)
3.1 Computational Complexity of Multirate Architectures
3.2 Comparison with Direct Form and Block FIR Implementations
4.1 Coefficient Memory and Number of Additions for DA based Implementations
4.2 A Few Functions and Their Corresponding Correlations with Actual Area
4.3 ROM Areas as a % of Maximum Theoretical Area
4.4 ROM vs Hardwired Area (Equivalent NA210 NAND Gates) Comparison
4.5 Area (Equivalent NA210 NAND Gates) Statistics for All Possible Coefficient Partitions
4.6 Toggle and No-toggle Power Dissipation in Some D FFs
4.7 Best Nega-binary Schemes for Gaussian Data Distribution (mean = max/2; SD = 0.17 max)
4.8 Toggle Reduction in LUT (for 10,000 Samples; Gaussian Distributed Data)
4.9 Comparison of Weighted Toggle Data for Different Gray Sequences
4.10 Toggle Reduction as a Percentage of 2's Complement Case for Two Different Gaussian Distributions
4.11 Toggle Reduction with Gray Sequencing for N = 8 and Some Typical Distributions
5.1 Number of Additions+Subtractions (Initial and After Minimization)
5.2 Number of Additions+Subtractions for Computing MCM Intermediate Outputs
6.1 Total Hamming Distance Between Successive Instructions
6.2 Code Dependence on the Scheduling of DAG Nodes
6.3 Comparison of Code Generator with 'C5x C Compiler
6.4 Number of Nodes (Ns) and Cycles (Cs) for Various DAG Transforms
6.5 Hamming Distance Measure for Accumulator based Architectures
7.1 Area estimates for PLA based modulo adder implementation
7.2 Area estimates for PLA based modulo multiplier implementation
7.3 Area estimates for PLA based modulo MAC implementation
7.4 Distribution of Residues across the Moduli Set
7.5 Impact of Coefficient Optimization on the Area of Modulo Multiplier and Modulo MAC
7.6 RNS based FIR filter with 24-bit precision on C5x
7.7 Number of Operations for RNS based FIR filter with 24-bit precision on C5x

Foreword

Technology is a driving force in society. At times it seems to be driving us faster than we want to go. At the same time it seems to patiently wait for us to adapt to it and, finally, adopt it as our own. Let me give a few examples.

The answering machine is a good example of us adopting a technology. Twenty years ago, if you had called my home and an answering machine, rather than me, had responded to your call, you would have thought, "Oh, how rude of him! I don't want to talk to a machine, I want to talk to Gene". Today, twenty years later, if you call my home and do not get my answering machine (or me), you will think, "Oh, how rude of him! He should have an answering machine so that I can at least leave a message". We have actually gone far beyond the answering machine in this respect. We now have cellular phones with answering machines. Forget "Snail" mail, even Email is not fast enough; we have Instant Messaging. But although I have a videophone on my desk, no one else seems to have one. I guess we haven't adopted all of the technology that we are introduced to.

Another example of the advance of technology is seen in the definition of "portable". The term has changed over the last several decades as a result of advances in integrated circuit technology. Think of the Personal Computer. Not long ago, "portable" meant one (strong!) person could carry a Personal Computer on an airplane without putting it in the checked-in baggage. Now, "portable" means I can put my computer in my briefcase and still have room for other things. It is beginning to mean that I can put my computer in my pocket. In the future, it may very well be that each one of us wears multiple computers as a matter of daily life. We will have a communications computer, an entertainment computer, an information computer, a personal medical computer to name a few. They will all communicate with one another over a personal area network on our bodies. I like to call this personal area network the "last meter".

The definition of "portable" has also changed in the area of portable phones. We have graduated from car phones - where the electronics were hidden in the trunk of the car - to cellular phones so small that they can easily get lost in a shirt pocket.

There are many more examples of how the marriage of Digital Signal Processing to Integrated Circuit Technology has revolutionized our lives. But rather than continue in that direction, I would like to turn to a brief historical perspective of how successful this marriage has been. After looking at history, I would like to tie all of this to the value of this book.

Digital Signal Processing, depending on your view of history, has been around for only about forty years. It began as a university curiosity in the 1960s. This was about the same time that digital computers were becoming useful. In the 1970s, Digital Signal Processing became a military advantage for those nations who could afford it. It was in the late 1970s and early 1980s that Integrated Circuit Technology became mature enough to impact Digital Signal Processing with the introduction of a new device called "Digital Signal Processor". With this new device, Digital Signal Processing moved from the laboratory and military advantage to being a commercial success. Telecommunications was the earliest to adopt Digital Signal Processing with many others to follow. It was in the decade of the 1990s that Digital Signal Processing moved from being a commercial success to being a consumer success. This was a direct result of the advances in Integrated Circuit Technology. These advances yielded four significant benefits: 1) lower cost, 2) higher performance, 3) lower power and 4) more transistors per device. The industry began to think in terms of a System on a Chip (SoC). This led us to where we are now and will lead us to where we will go in the coming decades.

What I see in our future is the opportunity to take advantage of these four benefits of Integrated Circuit Technology as it is applied to Digital Signal Processing. SoC technology will either complicate or simplify our decisions on how best to implement Digital Signal Processing solutions on VLSI. We will need to optimize on the best combination of Performance, Power dissipation and Price. We will not only continue to change the definition of "portable" but will begin to change the definitions of "personal", "good enough" and "programmable".

This book focuses on this very marriage of Digital Signal Processing to Integrated Circuit Technology. It addresses implementation options as we try to create new products which will impact society. These new products will need to have good enough performance, low enough power dissipation and a low enough price. At the same time they will need to be quick to market.

So, read this book! It will give you insights and arm you with techniques to make the inevitable tradeoffs necessary to implement Digital Signal Processing on Integrated Circuits to create new products.

One last thought on the marriage of Digital Signal Processing to Integrated Circuit technology. Over the last several years, I have observed that every time the performance of Digital Signal Processors increases significantly, the rules of how we apply Digital Signal Processing theory change. Isn't this a great time we live in?

GENE FRANTZ

Senior Fellow, Digital Signal Processing Texas Instruments Inc.

Houston, Texas April 2001

Acknowledgments

First and foremost, we would like to express our sincere gratitude to Milind Sohoni, Vikram Gadre and Supratim Biswas (all of IIT Bombay), G. Venkatesh (with Sasken Communication Technologies Ltd., earlier with IIT Bombay) and Rubin Parekhji of Texas Instruments (India) for their insightful comments, critical remarks and feedback which enriched the quality of this book.

We are thankful to Bobby Mitra and Sham Banerjee of Texas Instruments (India) for their help, support and guidance.

We are grateful to Texas Instruments (India) for sponsoring the doctoral studies of the first author. We deeply appreciate the support and encouragement of IIT Bombay and Sasken Communication Technologies Ltd.

We are thankful to Amit Sinha, Somdipta Basu Roy, M.N. Mahesh, Satrajit Gupta, Anand Pande, Sunil Kashide and Vikas Agrawal (all with Texas Instruments (India) when the work was done) for their assistance in implementing some of the techniques discussed in this book.

Our warm thanks to our children - Aarohi Mehendale and Aparna & Nachiket Sherlekar for putting up with our long hours at work. Finally, thanks are due to our wives - Archana Mehendale and Gowri Sherlekar for being there with us at all times.

MAHESH MEHENDALE

SUNIL D. SHERLEKAR

Preface

D. E. Knuth in his seminal paper "Structured Programming with go to Statements" underlines the importance of optimizing the inner loop in a computer program. More than twenty-five years and a revolution in semiconductor technology have not diminished the importance of the inner loop.

This book is about synthesis of the 'inner loop' or the kernel of Digital Signal Processing (DSP) systems. These systems process - in real time - digital information in the form of text, data, speech, images, audio and video. The wide variety of these systems notwithstanding, their kernels or inner loops share a common class of computation. This is the weighted sum (Σ A[i]X[i]). It occurs in Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, in signal correlation and in computing signal transforms.
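
As a minimal sketch of the weighted sum and of its use as the kernel of an FIR filter, the following Python fragment may help. The function names `weighted_sum` and `fir_filter` are illustrative and not taken from the book, and the zero initial state of the delay line is an assumption.

```python
def weighted_sum(coeffs, samples):
    """Return sum(A[i] * X[i]) over paired coefficients and samples."""
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    return sum(a * x for a, x in zip(coeffs, samples))


def fir_filter(coeffs, signal):
    """Direct-form FIR: y[n] = sum_i A[i] * x[n - i], with x[k] = 0 for k < 0."""
    history = [0] * len(coeffs)        # history[i] holds x[n - i]
    output = []
    for x in signal:
        history = [x] + history[:-1]   # shift in the newest sample
        output.append(weighted_sum(coeffs, history))
    return output


# Feeding a unit impulse returns the coefficients themselves.
print(fir_filter([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Every output sample costs one weighted sum over the filter taps, which is why the rest of the book concentrates on this one computation.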

Unlike general purpose computation which asks for computation to be 'as fast as possible', DSP systems require performance that is characterized by the arrival rate of a data stream which, in turn, is determined by the Nyquist sampling rate of the signal to be processed. The performance of the system is therefore a constraint within which one must optimize the area (cost) and power (battery life). This is usually a matter of tradeoff.

The area-power tradeoff is complicated by additional requirements of flexibility. Flexibility is important to track evolving standards, to cater to multiplicity of standards (such as air interfaces in mobile communication) and fast-paced innovation in algorithms. Flexibility is achieved by implementation in software, but a completely soft implementation is likely to be ruinous for power. It is therefore imperative that the requirements of flexibility be carefully predicted and the system be partitioned into hardware and software components.

In this book, we present several algorithmic and architectural transformations to optimize weighted-sum based DSP kernels over the area-delay-power space. These transformations address implementation technologies that offer varying degrees of programmability (and therefore flexibility) ranging from software programmable processors to customized hardwired solutions using standard-cell or gate-array based ASICs. We consider both the multiplier-less and the hardware multiplier-based implementations of the weighted-sum computation.

To start with, we present a comprehensive framework that encapsulates techniques for low power implementation of DSP algorithms on programmable DSPs. These techniques complement one another and address power reduction in various components such as the program and data memory busses and the multiplier-accumulator datapath of a Harvard architecture based digital signal processor. The techniques are then specialized for weighted sum computations and then for FIR filters.
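
The contents list names Gray coded addressing among the bus coding techniques. A rough sketch of why such coding can help follows. It assumes that the number of bit toggles between successive bus words (their Hamming distance) is a reasonable proxy for bus power. The helper names are hypothetical and the book's own power measures are not reproduced here.

```python
def binary_to_gray(value):
    """Convert a non-negative integer to its reflected Gray code."""
    return value ^ (value >> 1)


def hamming_distance(a, b):
    """Number of bit positions in which two non-negative words differ."""
    return bin(a ^ b).count("1")


def total_toggles(words):
    """Sum of Hamming distances between successive words on a bus."""
    return sum(hamming_distance(p, q) for p, q in zip(words, words[1:]))


# Sixteen sequential addresses, as a simple program counter would produce.
addresses = list(range(16))
binary_toggles = total_toggles(addresses)
gray_toggles = total_toggles([binary_to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)  # 26 15
```

For strictly sequential accesses the Gray coded bus toggles exactly one line per step, whereas the binary coded bus toggles several lines whenever a carry ripples.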

Next we present architectural transforms for power optimization for hardwired implementation of FIR filters. Multirate architectures are presented as an important and interesting transform. A detailed analysis of the computational complexity of multirate architectures is presented with results that indicate significant power savings compared to other FIR filter structures.
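
One well-known way to lower the multiplication count of an FIR filter is the two-parallel fast FIR decomposition, in which three half-length subfilters operate on half-rate streams. Whether this is the same structure as the multirate architecture analyzed in the book is an assumption. The sketch below only illustrates the general idea, using hypothetical function names, batch (non-streaming) processing and even-length inputs.

```python
def convolve(h, x):
    """Plain linear convolution of two sequences."""
    y = [0] * (len(h) + len(x) - 1)
    for i, hv in enumerate(h):
        for j, xv in enumerate(x):
            y[i + j] += hv * xv
    return y


def fast_fir_two_parallel(h, x):
    """Convolve h and x using three half-length convolutions instead of four."""
    if len(h) % 2 or len(x) % 2:
        raise ValueError("this sketch assumes even-length h and x")
    h0, h1 = h[0::2], h[1::2]          # even and odd polyphase parts of h
    x0, x1 = x[0::2], x[1::2]          # even and odd polyphase parts of x
    p0 = convolve(h0, x0)
    p1 = convolve(h1, x1)
    p2 = convolve([a + b for a, b in zip(h0, h1)],
                  [a + b for a, b in zip(x0, x1)])
    n = len(p0)
    y = [0] * (2 * n + 1)
    for k in range(n):
        y[2 * k] += p0[k]                      # even outputs: H0*X0 ...
        y[2 * k + 2] += p1[k]                  # ... plus delayed H1*X1
        y[2 * k + 1] = p2[k] - p0[k] - p1[k]   # odd outputs: H0*X1 + H1*X0
    return y


h = [1, 2, 3, 4]
x = [5, 6, 7, 8, 9, 10]
print(fast_fir_two_parallel(h, x) == convolve(h, x))  # True
```

The three subfilter convolutions need three quarters of the multiplications of the direct form, at the cost of extra additions before and after the subfilters.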

Distributed Arithmetic (DA) has been presented in the literature as one of the approaches for multiplier-less implementation of weighted-sum computation. We present techniques for deriving multiple DA based structures that represent different data-points in the area-delay space. We look at improving area-efficiency of DA based implementations and specifically show how the flexibility in coefficient partitioning can be exploited to reduce the area of a DA structure using two look-up-tables. We also address the problem of reducing power dissipation in the input data shift-registers of DA based FIR filters. Our technique is based on a generic nega-binary representation scheme which is customized for a given distribution profile of input data values, so as to minimize toggles in the shift-registers.
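
The basic DA idea can be sketched in software as follows, assuming 2's complement input samples, a single look-up-table and one-bit-at-a-time access. The function names are hypothetical. A hardware implementation would typically shift the accumulator rather than the table output; explicit left shifts are used here only for readability.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of coeffs[k] for every bit k that is set in addr."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_weighted_sum(coeffs, samples, n_bits):
    """Weighted sum computed bit-serially from table look-ups, no multiplier."""
    low, high = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if any(not low <= x <= high for x in samples):
        raise ValueError("sample does not fit in n_bits 2's complement")
    lut = build_lut(coeffs)
    acc = 0
    for j in range(n_bits):
        # Gather bit j of every sample into one table address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k
        term = lut[addr] << j
        # The most significant bit carries negative weight in 2's complement.
        acc += -term if j == n_bits - 1 else term
    return acc


print(da_weighted_sum([3, -5, 7, 2], [1, -2, 3, -4], n_bits=4))  # 26
```

The table grows as 2^K for K coefficients, which is one reason a large weighted sum may be split across more than one table.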

For non-adaptive signal processing applications in which the weight values are constant and known at design time, an area-efficient realization can be achieved by implementing the weighted sum computation using shift and add operations. We present techniques for minimizing additions in such multiplier-less implementations. These techniques are also useful for efficient implementation of weighted-sum computations on programmable processors that do not support a hardware multiplier.
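
A small sketch of multiplication by a constant using only shifts, additions and subtractions is given below. It recodes the constant into Canonical Signed Digit (CSD) form, which the contents list names as one of the coefficient representations. The function names are illustrative, and the common subexpression techniques of the book are not attempted here.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def multiply_by_constant(x, coeff):
    """Compute x * coeff with shifts and add/subtract operations only."""
    result = 0
    for shift, digit in enumerate(to_csd(coeff)):
        if digit == 1:
            result += x << shift
        elif digit == -1:
            result -= x << shift
    return result


print(to_csd(7))                    # [-1, 0, 0, 1], i.e. 8 - 1
print(multiply_by_constant(5, 7))   # 35
```

Here 7 needs two non-zero CSD digits (one subtraction) instead of three binary ones (two additions); fewer non-zero digits means fewer adders.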

We address a special class of weighted-sum computation problem, where the weight-values are restricted to {0, 1, -1}. We present techniques for optimized code generation of one dimensional and two dimensional multiplication-free linear transforms. These are targeted to both register-rich and single-register, accumulator based architectures.
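
To illustrate such a multiplication-free transform, the sketch below evaluates a 4-point Walsh-Hadamard transform in two ways: row by row, and with shared intermediate sums. The natural (Hadamard) row ordering and the function names are assumptions. The fragment is not the code generation method of the book; it only shows that reusing partial results lowers the operation count.

```python
# 4-point Walsh-Hadamard matrix in natural (Hadamard) order; weights are +1 or -1.
H4 = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]


def transform_direct(matrix, x):
    """Evaluate each output row independently using only add and subtract."""
    out = []
    for row in matrix:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
        out.append(acc)
    return out


def wht4_shared(x):
    """Same transform with shared partial sums: 8 add/sub operations."""
    a, b, c, d = x
    s0, d0 = a + b, a - b
    s1, d1 = c + d, c - d
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


print(transform_direct(H4, [1, 2, 3, 4]))  # [10, -2, -4, 0]
print(wht4_shared([1, 2, 3, 4]))           # [10, -2, -4, 0]
```

Combining four inputs row by row takes three operations per output, twelve in total, while the shared form needs eight.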

Residue Number Systems (RNS) have been proposed for high-speed parallel implementation of addition, subtraction and multiplication operations. We explain how the power of RNS can be exploited for optimizing the implementation of weighted sum computations. In particular, RNS is proposed as a method to enhance the results of other techniques presented in this book. RNS is also proposed as a technique to enhance the precision of computations on a programmable DSP.
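
A minimal sketch of a weighted sum carried out in a residue number system follows. The moduli set (7, 8, 9) is a hypothetical choice giving a dynamic range of 504, only non-negative results are handled, and the function names are illustrative. It relies on `math.prod` and on the three-argument `pow` with a negative exponent, both available from Python 3.8.

```python
from math import prod

MODULI = (7, 8, 9)   # pairwise coprime; hypothetical choice, range 0..503


def to_residues(value, moduli=MODULI):
    """Represent a non-negative integer by its residues."""
    return tuple(value % m for m in moduli)


def rns_weighted_sum(coeffs, samples, moduli=MODULI):
    """Each modulus runs its own small, independent multiply-accumulate."""
    residues = []
    for m in moduli:
        acc = 0
        for a, x in zip(coeffs, samples):
            acc = (acc + (a % m) * (x % m)) % m
        residues.append(acc)
    return tuple(residues)


def from_residues(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction."""
    total = prod(moduli)
    value = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        value += r * partial * pow(partial, -1, m)   # modular inverse
    return value % total


result = rns_weighted_sum([3, 5, 2], [4, 6, 7])
print(result, from_residues(result))  # (0, 0, 2) 56
```

No carries pass between the residue channels, so the three small datapaths can work in parallel. The true result must stay within the dynamic range for the reconstruction to be exact.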

To tie up all these techniques, a methodology is presented for systematically identifying transformations that exploit the characteristics of a given DSP algorithm and of the implementation style, to achieve tradeoffs in the area-delay-power space.

This book is meant for practicing DSP system designers, who understand that optimal design can never be a push-button activity. We sincerely hope that they can benefit from the variety of techniques presented in this book. Each of the techniques has a potential benefit to offer. But actual benefit will accrue only from a proper selection from these techniques and their appropriate implementation: something that is in the realm of human expertise and judgement.

MAHESH MEHENDALE

SUNIL D. SHERLEKAR

Bangalore April 2001
