A LOWPOWER DESIGN METHODOLOGY FOR TURBO ENCODER AND DECODER



    A LOW POWER DESIGN METHODOLOGY FOR TURBO

    ENCODER AND DECODER

    RAJESHWARI M. BANAKAR

    DEPARTMENT OF ELECTRICAL ENGINEERING

    INDIAN INSTITUTE OF TECHNOLOGY, DELHI

    INDIA

    JULY 2004


    Intellectual Property Rights (IPR) notice

    Part of this thesis may be protected by one or more Indian Copyright Registrations (pending) and/or Indian Patents/Designs (pending) filed by the Dean, Industrial Research & Development (IRD) Unit, Indian Institute of Technology Delhi (IITD), New Delhi-110016, India. IITD permits the use, in any form, of the information contained in this thesis, in part or in full, ONLY on written permission of the competent authority: Dean, IRD, IIT Delhi, OR MD, FITT, IIT Delhi.


    A LOW POWER DESIGN METHODOLOGY FOR TURBO

    ENCODER AND DECODER

    by

    RAJESHWARI M. BANAKAR

    DEPARTMENT OF ELECTRICAL ENGINEERING

    Submitted

    in fulfillment of the requirements of the degree of

    Doctor of Philosophy

    to the

    INDIAN INSTITUTE OF TECHNOLOGY, DELHI

    INDIA

    July 2004


    Certificate

    This is to certify that the thesis titled A Low Power Design Methodology for Turbo

    Encoder and Decoder being submitted by Rajeshwari M. Banakar in the Department of

    Electrical Engineering, Indian Institute of Technology, Delhi, for the award of the degree

    of Doctor of Philosophy, is a record of bona fide research work carried out by her under

    our guidance and supervision. In our opinion, the thesis has reached the standards fulfilling

    the requirements of the regulations relating to the degree.

    The results contained in this thesis have not been submitted to any other university or

    institute for the award of any degree or diploma.

    Dr. M. Balakrishnan
    Head
    Department of Computer Science & Engineering
    Indian Institute of Technology
    New Delhi - 110 016

    Dr. Ranjan Bose
    Associate Professor
    Department of Electrical Engineering
    Indian Institute of Technology
    New Delhi - 110 016


    Acknowledgments

    I would like to express my profound gratitude to my Ph.D. advisors Prof M Balakrishnan and Prof Ranjan Bose for their invaluable guidance and continuous encouragement during the entire period of this research work. Their technical acumen, precise suggestions and timely discussions whenever I faced a problem are wholeheartedly appreciated.

    I wish to express my indebtedness to SRC/DRC members Prof Anshul Kumar, Prof

    Patney and Prof Shankar Prakriya for their helpful suggestions and lively discussions during the research work. I am grateful to Prof Preeti Panda and Prof G. S. Visweswaran for

    their useful discussions.

    I would like to thank Prof Surendra Prasad and Prof S. C. Dutta Roy, who were instrumental in my choosing this premier institute for my research.

    Part of this research work was carried out at the Universität Dortmund, Germany, under the DSD/DAAD project. My sincere thanks to Prof Peter Marwedel for his valuable guidance and warm hospitality during my stay there. Thanks to the research scholars Lars Wehmeyer and Stefan Steinke, and graduate student Bosik Lee, of the embedded systems design group for their help and support.

    I am thankful to the Director of Technical Education, Karnataka, and the management of B. V. Bhoomaraddi Engineering College, Hubli, Karnataka, for granting me leave and giving me this opportunity to pursue my research.

    My thanks are due to my friends B. G. Prasad, S. C. Jain, Manoj, Basant, Bhuvan,

    Satyakiran, Anup, Viswanath, Vishal, C. P. Joshi, Uma and the other enthusiastic embedded

    system team members at the Philips VLSI Design Lab for their lively discussions. It was an


    enjoyable period and great experience to work as a team member in the energetic embedded

    systems group at IIT Delhi.

    I would also like to thank all the other faculty members and the laboratory and administrative staff of the Department of Computer Science and Engineering and the Department of Electrical Engineering for their help. Ms Vandana Ahluwalia deserves a special mention for her helpful and pleasant attitude throughout the research period.

    Thanks to my brothers, sisters, friends and other family members for their support. I wish to accord my special thanks to my husband Dr Chidanand Mansur for his moral and emotional support throughout the deputation period of this research work. With highest and fond remembrance I recollect each day of this research work, which turned out to be an enjoyable, comfortable and cherished time thanks to my caring kids Namrata and Vidya. Their helping little hands, lovely painted greeting cards and smiling faces on the home front at IIT Delhi deserve special appreciation. This thesis is dedicated to my parents, late Shri Mahalingappa Banakar and late Smt Vinoda Banakar, who were a source of inspiration during their lifetime.

    New Delhi Rajeshwari Banakar


    Abstract

    The focus of this work is on developing an application specific design methodology for low power solutions. The methodology starts from high level models which can be used for a software solution and proceeds towards high performance hardware solutions. The turbo encoder/decoder, a key component of emerging 3G mobile communication systems, is used as our case study. The application performance measure, namely the bit-error rate (BER), is used as a design constraint while optimizing for power and/or area.

    The methodology starts at the algorithmic level, concentrating on functional correctness rather than on the implementation architecture. The effect on performance of variations in parameters like frame length, number of iterations, type of encoding scheme and type of interleaver in the presence of additive white Gaussian noise is studied with a floating point C model. In order to capture the effect of quantization and word length variation, a fixed point model of the application is also developed.

    First, we conducted a motivational study on some benchmarks from the DSP domain to evaluate the benefit of a custom memory architecture like the scratch pad memory (SPM). The results indicate that the SPM is an energy efficient solution compared to a conventional cache as on-chip memory. Motivated by this, we have developed a framework to study the benefits of adding a small SPM to the on-chip cache. To illustrate our methodology we have used the ARM7TDMI as the target processor. For this application, a cache of size 2k or 4k combined with an SPM of size 8k is shown to be optimal. Incorporating the SPM results in an energy improvement of 51.3%.

    Data access pattern analysis is performed to see whether specific storage modules like


    FIFO/LIFO can be used. We identify FIFO/LIFO to be an appropriate choice due to the regular access pattern. A SystemC model of the application is developed starting from the C model used for software. The functional correctness of the SystemC model is verified by generating a test bench using the same inputs and intermediate results used in the C model. One of the computation units, namely the backward metric computation unit, is modified to reduce the memory accesses.

    The HDL design of the 3GPP turbo encoder and decoder is used to perform bit-width optimization. The effect of bit-width optimization on power and area is observed. This is achieved without unduly compromising on BER. The overall area and power reductions achieved in one decoder due to bit-width optimization are 46% and 35.6% respectively.

    At the system level, power optimization can be done by shutting down unused modules. This is based on the timing details of the turbo decoder in the VHDL model. To achieve this, a power manager is proposed. The average power saving obtained is around 55.5% for the 3GPP turbo decoder.


    Contents

    1 Introduction 1

    1.1 The system design problem . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2 Turbo Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2 Turbo coding application 7

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2.1.1 Applications of turbo codes - A scenario in one decade (1993-2003) 9

    2.2 Turbo Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2.1 Four state turbo encoder . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2.2 Eight state turbo encoder - 3GPP version . . . . . . . . . . . . . . 14

    2.3 Turbo Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.3.1 Decoding Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.4 Experimental set up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    2.5.1 Effect of varying frame-length on code performance . . . . . . . . 24

    2.5.2 Effect of varying number of iterations on performance . . . . . . . 24

    2.5.3 4 state vs 8 state (3GPP) turbo decoder performance . . . . . . . . 26

    2.5.4 Interleaver design considerations . . . . . . . . . . . . . . . . . . . 26

    2.5.5 Effect of Quantization . . . . . . . . . . . . . . . . . . . . . . . . 31

    2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


    3 On-chip Memory Exploration 35

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    3.3 Scratch pad memory as a design alternative to cache memory . . . . . . . . 37

    3.3.1 Cache memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    3.3.2 Scratch pad memory . . . . . . . . . . . . . . . . . . . . . . . . . 40

    3.3.3 Estimation of Energy and Performance . . . . . . . . . . . . . . . 44

    3.3.4 Overview of the methodology . . . . . . . . . . . . . . . . . . . . 45

    3.3.5 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 47

    3.4 Turbo decoder : Exploring on-chip design space . . . . . . . . . . . . . . . 51

    3.4.1 Overview of the approach . . . . . . . . . . . . . . . . . . . . . . 51

    3.4.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    4 System Level Modeling 63

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.3 Work flow description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.4 Functionality of turbo codes . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.5 Design and Analysis using SystemC specification . . . . . . . . . . . . . . 67

    4.5.1 Turbo decoder : Data access analysis . . . . . . . . . . . . . . . . 68

    4.5.2 Decoder : Synchronization . . . . . . . . . . . . . . . . . . . . . . 70

    4.6 Choice of on-chip memory for the application . . . . . . . . . . . . . . . . 71

    4.7 On-chip memory evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 74

    4.7.1 Reducing the memory accesses in decoding process . . . . . . . . . 75

    4.8 Synthesis Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


    5 Architectural optimization 81

    5.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    5.3 Bit-width precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

    5.4 Design modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.4.1 Selector module . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

    5.4.2 Top level FSM controller unit . . . . . . . . . . . . . . . . . . . . 85

    5.5 Design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    5.5.1 Power estimation using XPower . . . . . . . . . . . . . . . . . . . 88

    5.6 Experimental Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 89

    5.6.1 Impact of bit optimal design on area estimates . . . . . . . . . . . . 89

    5.6.2 Power reduction due to bit-width optimization . . . . . . . . . . . 94

    5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

    6 System Level Power Optimization 101

    6.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

    6.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    6.3 General design issues - Power Shut-Down Technique . . . . . . . . . . . . 103

    6.4 Turbo Decoder : System level power optimization . . . . . . . . . . . . . . 104

    6.5 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

    6.5.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . 108

    6.5.2 Power optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 112

    6.5.3 Branch metric simplification . . . . . . . . . . . . . . . . . . . . . 116

    6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

    7 Conclusion 119

    7.1 Our design methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

    7.2 Results summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    7.3 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125


    List of Figures

    2.1 (a) The encoder block schematic (b) 4 state encoder details (Encoder 1

    and Encoder 2 are identical in nature) (c) State diagram representation (d)

    Trellis diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.2 The 8 state encoder 3GPP version . . . . . . . . . . . . . . . . . . . . . 15

    2.3 Decoder schematic diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.4 Impact on performance for different frame-lengths, simulated for six iterations while varying the SNR. . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.5 Impact on performance for varying number of iterations, N =1024, 8 state

    turbo code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.6 Comparing 4 state and the 8 state turbo code (floating point version), N =

    1024, RC symmetric interleaver, six iterations. . . . . . . . . . . . . . . . . 27

    2.7 Random interleaver design - Our work with Bruce's 4 state turbo code [61] 29

    2.8 Illustration of interleaving the data elements . . . . . . . . . . . . . . . . . 30

    2.9 Comparing our work with Mark Ho's [41] using a symmetric interleaver . . 32

    2.10 Impact of word length variation on bit error rate of the turbo codes . . . . . 33

    3.1 Cache Memory organization [75] . . . . . . . . . . . . . . . . . . . . . . . 38

    3.2 Scratch pad memory array . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    3.3 Scratch pad memory organization . . . . . . . . . . . . . . . . . . . . . . 42

    3.4 Flow diagram for on-chip memory evaluation . . . . . . . . . . . . . . . . 46

    3.5 Comparison of cache and scratch pad memory area . . . . . . . . . . . . . 48

    3.6 Energy consumed by the memory system . . . . . . . . . . . . . . . . . . 50


    3.7 Framework of the design space exploration . . . . . . . . . . . . . . . . . 52

    3.8 Energy estimates for the turbo decoder with N = 128 . . . . . . . . . . . . 57

    3.9 Energy estimates for the turbo decoder with N =256 . . . . . . . . . . . . . 58

    3.10 Performance estimates for Turbo decoder . . . . . . . . . . . . . . . . . . 60

    4.1 The turbo coding system . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    4.2 Input data with the index value . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.3 Code segment to compute llr . . . . . . . . . . . . . . . . . . . . . . . . . 69

    4.4 Synchronization signals illustrating the parallel events . . . . . . . . . . . . 71

    4.5 Data flow and request signals . . . . . . . . . . . . . . . . . . . . . . . . . 73

    4.6 (a) Unmodified β unit (b) Modified β unit . . . . . . . . . . . . . . . . . . 77

    4.7 Area distribution pie chart representation . . . . . . . . . . . . . . . . . . . 78

    5.1 State diagram representation of FSM used in the turbo decoder design . . . 86

    5.2 Design flow showing the various steps used in the bit-width analysis . . . . 87

    5.3 Depicting the area for 32 bit design vs bit optimal design . . . . . . . . . . 97

    5.4 Power estimates of turbo decoder design . . . . . . . . . . . . . . . . . . . 98

    5.5 Area and power reduction due to bit-width optimization in the turbo decoder design 99

    6.1 Illustrating the turbo decoder timing slots, which help to define the sleep time and the wake time of the on-chip memory modules ( = time slot) . . . 105

    6.2 Tasks in the decoder vs the time slot (Task graph). . . . . . . . . . . . . . . 107

    6.3 Block diagram schematic of the power manager unit . . . . . . . . . . . . . 115

    7.1 Turbo Decoder : Low power application specific design methodology . . . 121

    List of Tables

    6.4 Power estimates in the individual decoder, active time slot information with

    the associated power saving due to sleep and wake time analysis. . . . . . . 112

    6.5 Database to design the power manager for the 8 state turbo decoder (1 =

    Active, 0 = Doze) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

    6.6 Power and area saving due to the branch metric simplification . . . . . . . 116


    Chapter 1

    Introduction

    Application specific design is a process of alternative design evaluations and refined implementations at various abstraction levels to provide efficient and cost effective solutions. The key issues in an application specific design are cost and performance as per the application requirements, with power consumption being the other major issue. In hand held and portable applications like cellular phones, modems, video phones and laptops, batteries contribute a significant fraction of the total volume and weight [1, 2]. These applications demand low power solutions to increase the battery life.

    The growing concern for early time to market, coupled with the complexity of the systems, has led to the research and development of efficient design methodologies. Traditionally, manually prepared RTL design documents were handed over to the application developer for implementation. Since the introduction of hardware-software codesign concepts in this decade (1993 to 2003), suitable frameworks are being favored due to their overall operational advantages, as details of performance, energy and area costs are captured during the system level design stage.

    To define an application specific methodology, it is essential to specify the steps to arrive at a suitable design while ensuring the application functionality and features. As the design methodology progresses, detailed functional assessment continues to provide valuable input for analyzing initial design concepts and, in later stages, application constraints


    and the quality of implementation. The design methodology can be used to make significant changes during the early stages, which enables a wide design space exploration. A key concern is the design turnaround time. A common test bench across the various levels of design abstraction is one approach that is effective in reducing development time.

    1.1 The system design problem

    The design space, including the memory design issues, of power efficient architectures is different from the design space of general-purpose processors. In particular, when the application specific design space is considered, the application parameters which impact quality should be taken into account, and these are quite application specific. The system design problem is to identify the application parameters which will have an impact on the overall architecture. Increasingly, system design approaches today try to address these issues at a high level of design abstraction while providing a structured methodology for refinement of the design.

    Low power design of digital integrated circuits has emerged as a very active and rapidly developing field. Power estimation can be done at different stages of system design, as illustrated in [3 - 29]. These estimates can drive the transformations to choose low power solutions. Transformations done at the higher levels of abstraction, namely the system level, have a greater impact, whereas estimations are generally more accurate at the lower levels of abstraction like the transistor level and the circuit level [30].

    Klaus et al. [31] present a high level DSP design methodology that raises the design entry to the algorithmic and behavioral level instead of specifying a detailed architecture at the register transfer level. They concentrate mainly on the issues of functional correctness and verification. No optimization techniques are considered in their work.

    Another design methodology example is the work of Zervas et al. [32]. They propose a design methodology using communication applications for performance analysis. Power savings through the various levels of the design flow are considered and examined for their


    efficiency in the digital receiver context. The authors deal mainly with data path optimization issues for low power implementation.

    In our approach, architecture exploration using SystemC is performed, and the power savings from bit-width optimization and from the power shut-down technique, applied using the application's control flow, are quantified. Application parameters like the information length, and the performance parameter, namely the bit error rate, are correlated with the power consumption during bit-width optimization.

    Memory units represent a substantial portion of the power as well as the area, and form an important constituent of such applications. Typically, the power consumed by memory is distributed across on-chip as well as off-chip memory. A large memory implies a slower access time to a given memory element and also increases the power consumption. Memory dominates in terms of the three cost functions, viz. area, performance and power. On-chip caches using static RAM consume power in the range of 25% to 45% of the total chip power [33]. For example, in the StrongARM SA-110, the cache consumes 43% of the total power [34].

    Catthoor et al. [35, 36] describe custom memory design issues. Their research focuses on the development of methodologies and supporting tools to enable the modeling of future heterogeneous reconfigurable platform architectures and the mapping of applications onto these architectures. They analyse memory dominated media applications for data transfer and storage issues. Although they concentrate on a number of issues related to reducing memory power, a custom memory option like the scratch pad memory is not studied.

    When considering power efficient architectures, along with performance and cost, on-chip memory issues are also included as an important dimension of the design space to be explored. The application's algorithmic behavior, in terms of its data access pattern, is an additional feature to be included in the design space exploration when deciding on the type of on-chip memory.

    One of our research goals is to provide a suitable methodology for on-chip memory exploration in power efficient application specific architectures. In our solution, scratch pad memory is proposed as an attractive design alternative. At the system level, the energy and performance gains from adding a small scratch pad memory to the on-chip cache configuration are evaluated. Further, in a situation of distributed memory, shut-down of unused memory blocks in a systematic manner is proposed.
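    The essence of this cache-versus-scratch-pad evaluation can be captured in a simple per-access energy model. The C sketch below is only an illustration: the function names and the energy constants are placeholders of ours, not the figures used in Chapter 3, which are derived from detailed memory models.

    ```c
    /* Illustrative per-access energy figures in nJ (placeholders; real
     * values come from detailed memory models). */
    static const double E_CACHE_HIT  = 0.60;  /* on-chip cache hit            */
    static const double E_CACHE_MISS = 9.50;  /* miss, incl. off-chip refill  */
    static const double E_SPM        = 0.35;  /* scratch pad: no tag compare  */

    /* Total memory energy of a cache-only configuration. */
    double energy_cache_only(long hits, long misses)
    {
        return hits * E_CACHE_HIT + misses * E_CACHE_MISS;
    }

    /* Total memory energy when the hottest data is mapped onto a small
     * scratch pad: those accesses bypass the cache and can never miss. */
    double energy_cache_plus_spm(long spm_accesses, long hits, long misses)
    {
        return spm_accesses * E_SPM + hits * E_CACHE_HIT + misses * E_CACHE_MISS;
    }
    ```

    Mapping the hottest arrays onto the scratch pad removes both their tag-comparison energy and the associated misses from the total, which is the source of the savings quantified in Chapter 3.
    
    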

    1.2 Turbo Encoder and Decoder

    Turbo codes were presented in 1993 by Berrou et al. [37], and since then these codes have received a lot of interest from the research community, as they offer better performance than any of the other codes at very low signal to noise ratios. Turbo codes achieve near Shannon limit error correction performance with relatively simple component codes. A BER of 10⁻⁵ is reported for a signal to noise ratio of 0.7 dB [37].

    Advances in third generation (3G) systems and beyond will put tremendous pressure on the underlying implementation technologies. The main challenge is to implement these complex algorithms under strict power consumption constraints. Improvements in semiconductor fabrication and the growing popularity of hand held devices are resulting in computation intensive portable devices. The exploration of novel low power application specific architectures, and of methodologies which address power efficient design capable of supporting such computing requirements, is critical.

    In communication systems, specifically wireless mobile communication applications, power and size are dominant factors while meeting the performance requirements. Performance is measured not only by the data rate but also by the corresponding bit error rate (BER). In this one decade (1993 - 2003) turbo codes have been used in a number of applications like long-haul terrestrial wireless systems, 3G systems, WLAN, satellite communication, orthogonal frequency division multiplexing, etc.

    The algorithmic specifications and the performance characterization of the turbo codes form a prerequisite study for developing a low power solution. Functional analysis of the turbo decoder with simplified computation metrics to save power, as well as architectural features like bit-width, are closely related to the application parameter, namely the bit-error rate (BER).

    1.3 Thesis Organization

    The thesis is organized as follows:

    Chapter 2 deals with the algorithmic details of the specific application under consideration, namely the turbo coding system. The effect of varying parameters like the frame length, number of iterations, type of encoder, type of interleaver and quantization is presented. The main focus of this chapter is the development of the various C models of the application, which is the first step in our application specific design methodology.

    In Chapter 3, a motivational study for the evaluation of energy efficient on-chip memory is conducted. A design methodology is developed to show that scratch pad memory is an energy efficient on-chip memory option in embedded systems vis-a-vis cache. For the turbo decoder, which is memory intensive, we demonstrate that a suitable combination of cache and scratch pad memory proves beneficial in terms of energy consumption, area occupied and performance.

    In Chapter 4, a data access analysis is done to see if any other custom memory is suitable for the specific application; it is concluded that FIFO/LIFO on-chip memory is appropriate. The architecture is explored at the system level using SystemC for functional verification of the design, and the benefit of modifying one of the computational units to reduce the memory accesses is shown.

    In Chapter 5, an architectural optimization, namely bit-width optimization, is considered as a low power design issue. A VHDL model of the 3G turbo decoder is developed to estimate the power and area occupied by the design.

    In Chapter 6, a system level power optimization technique is studied. More specifically, the power shut-down method is considered. The sleep time and the wake-up time of the


    design modules are derived from the timing details available from the HDL model. As a system level power optimization option, the power shut-down technique is applied to the turbo codes to quantify the power savings. The design of a suitable power manager is proposed.

    In Chapter 7, the application specific design methodology is summarized with a focus on low power. The results of our work on the low power turbo encoder and decoder are also summarized, along with future directions in this area.


    Chapter 2

    Turbo coding application

    2.1 Introduction

An efficient methodology for application specific design reduces the time and effort spent during design space exploration. The turbo code application, from the area of wireless communications, is chosen as the key application for which an application specific design methodology is developed. The functionality and the specific characteristics of the application are needed to carry out the design space exploration. The application characteristics studied are the impact on the performance of the turbo codes of variations in the size of the input message (frame-length), the type of interleaver and the number of decoding iterations.

Turbo coding is a forward error correction (FEC) scheme. Iterative decoding is the key feature of turbo codes [37, 38]. Turbo codes consist of a concatenation of two convolutional codes, and they give better performance at low signal to noise ratios (SNRs) [39, 40]. Interestingly, the name Turbo was given to these codes because of the cyclic feedback mechanism (as in turbo machines) applied to the decoders in an iterative manner.

    The turbo encoder transmits the encoded bits which form inputs to the turbo decoder.

    The turbo decoder decodes the information iteratively. Turbo codes can be concatenated in

series, parallel or in a hybrid manner. Concatenated codes can be classified as parallel concatenated convolutional codes (PCCC) or serial concatenated convolutional codes (SCCC).


In PCCC, two encoders operate on the same information bits. In SCCC, one encoder encodes the output of another encoder. The hybrid concatenation scheme consists of the combination of both parallel and serial concatenated convolutional codes. The turbo decoder has two decoders that perform iterative decoding. Convolutional codes can be viewed in different ways; for example, they can be represented by state diagrams. The state of the convolutional encoder is determined by the contents of its memory elements.

Two versions of the turbo codes are studied in detail. One is the four state turbo code and the other is the 3GPP (third generation partnership project) version, which has eight states. The BER of the 3GPP turbo code shows significant improvement over the four state version, although the computational complexity of its decoder is higher than that of the four state turbo code system. The impact of quantization on the BER at the algorithmic level is also studied.

In this chapter, the impact of different frame lengths of the message on the bit error rate (BER) is analyzed, as is the impact of varying the number of iterations on the BER. The model developed is compared to Mark Ho's model [41, 42] and the results have been found to tally closely. A comparison of the performance of the random interleaver implementation and the symmetric interleaver is also done.

    The rest of the chapter is organized as follows. The various application areas of turbo

    codes are described in subsection 2.1.1. Section 2.2 describes the details of the encoder

    schematic. Section 2.3 discusses the turbo decoder and its performance issues. Section 2.4

    gives the experimental set up. Section 2.5 presents the results of various experiments. The

interleaver design considerations are discussed, followed by an explanation of the effect of quantization. Finally, section 2.6 summarizes the chapter.


2.1.1 Applications of turbo codes - A scenario in one decade (1993-2003)

    The following paragraphs briefly discuss several application areas where turbo codes are

    used.

    Mobile radio :

Few environments require more immunity to fading than those found in mobile communications. For applications where the delay versus performance trade-off is crucial, turbo codes offer a wide trade-off space at decoder complexities equal to or better than those of conventional concatenated codes. The major benefit is that such systems can work with smaller constraint-length convolutional encoders. The major drawback is decoder latency, and turbo codes with short delay are being heavily researched. Turbo codes generally outperform convolutional and block codes when interleavers exceed 200 bits in length [43].

    Digital video :

    Turbo codes are used as part of the Digital Video Broadcasting (DVB) standards.

    ComAtlas, a French subsidiary of VLSI Inc, has developed a single-chip turbo code

codec that operates at rates up to 40 Mbps. The device integrates a 32x32-bit interleaver

    and performance is reportedly at least as good as with conventional concatenated

    codes using a Reed-Solomon outer code and convolutional inner code [44].

    Long-haul terrestrial wireless :

Microwave towers are spread across long distances, and communication between them is subject to weather-induced fading. Since these links utilize high data rates, turbo codes with large interleavers effectively combat fading while adding only insignificant delay. Furthermore, the power savings offered by turbo codes can be important for towers in especially remote areas [45].


    Oceanographic Data-Link :

This data-link is based on the Remote Environmental Data-Link (REDL) design. ViaSat, a company offering satellite networking products, offers a modem with turbo encoding and decoding. It is reported that turbo coding increases the throughput and decreases the bandwidth requirement [50].

    Image processing :

    Embedded image codes are very sensitive to channel noise because a single bit er-

    ror can lead to irreversible loss of synchronization between the encoder and the de-

    coder. Turbo codes are used for forward error correction in robust image transmis-

sion. Turbo codes are suited to protecting visual signals, since these signals are typically represented by a large amount of data even after compression [51, 52, 53].

    WLAN (Wireless LAN) :

    Turbo codes can increase the performance of a WLAN over traditional convolutional

    coding techniques. Using turbo codes, an 802.11a system can be configured for high

    performance. The resulting benefits to the WLAN system are that it requires less

power, or it can transmit over a greater coverage area. A turbo code solution is used to reduce power and boost performance in the transmit portion of mobile devices in a wireless local area network (WLAN) [54].

    OFDM :

    The principles of orthogonal frequency division multiplexing (OFDM) modulation

    have been around for several decades. The FEC block of an OFDM system can be

realized by either a block-based coding scheme (Reed-Solomon) or turbo codes [55].

    xDSL modems :

    Turbo codes present a new and very powerful error control technique which allows

    communication very close to the channel capacity. Turbo codes have outperformed


all previously known coding schemes regardless of the targeted channel, and standards based on turbo codes have been defined [56].

    These applications indicate that turbo codes can play a vital role in many of the commu-

    nications systems which are being designed. Low power, memory efficient implementation

    of turbo codes is thus an important research area.

    2.2 Turbo Encoder

A turbo coding system consists of the turbo encoder and the turbo decoder. The four state and the eight state (3GPP) turbo encoders used in our design are described below.

    2.2.1 Four state turbo encoder

    The general structure of a turbo encoder architecture consists of two Recursive Systematic

Convolutional (RSC) encoders, Encoder 1 and Encoder 2. The constituent codes are RSCs

    because they combine the properties of non-systematic codes and systematic codes [40, 38].

    In the encoder architecture displayed in Figure 2.1 the two RSCs are identical. The N bit

    data block is first encoded by Encoder 1. The same data block is also interleaved and

    encoded by Encoder 2. The main purpose of the interleaver is to randomize burst error

patterns so that the data can be correctly decoded. It also helps to increase the minimum distance

    of the turbo code [57]. Input data blocks for a turbo encoder consist of the user data and

    possible extra data being appended to the user data before turbo encoding.

The encoder consists of a shift register and adders, as shown in Fig. 2.1 (b). The structure of the RSC encoder is fixed for the design, because allowing varying encoder structures would significantly increase the complexity of the decoder by requiring it to adapt to a new trellis structure and to compute different metrics in the individual decoders.

The input bits are fed into the left end of the register, and for each new input bit two output bits are transmitted over the channel. These bits depend not only on the present input bit, but also on the two previous input bits stored in the shift register.



Figure 2.1: (a) The encoder block schematic (b) 4 state encoder details (Encoder 1 and Encoder 2 are identical in nature) (c) State diagram representation (d) Trellis diagram


The input bit stream forms the systematic bit stream for the input message of frame length N. The parity bit stream depends on the state of the encoder. For a given information vector (input message) of frame-length N, the code word (encoded message) consists of the concatenation of two constituent convolutional code words, each of which is based on a different permutation of the order of the information bits. The working of the encoder can be understood using the state diagram given in Fig. 2.1 (c). In an encoder with two memory elements, there are four states S0, S1, S2 and S3. At each incoming bit, the encoder goes from one state to another.

    For the encoder shown in Fig 2.1 (b), there are two encoded bits corresponding to input bit

    0 and 1.

Another way to represent the state diagram is the trellis diagram [58]. A trellis is a graph whose nodes lie on a rectangular grid, semi-infinite to the right, as shown in Fig. 2.1 (d). The trellis diagram represents the time evolution of the coded sequences, which can be obtained from the state diagram. A trellis diagram is thus an extension of a state diagram that explicitly shows the passage of time [58]. The nodes in the trellis diagram correspond to the states of the encoder. From an initial state (S0), the trellis records the possible transitions to the next states for each possible input pattern. In the trellis diagram of Fig 2.1 (d), at stage t=1 there are two reachable states, and each state has two transitions, corresponding to the input bits 0 and 1. Hence the number of nodes is fixed in a trellis, decided by the number of memory elements in the encoder. In drawing the trellis diagram, the convention of showing the branches introduced by a 0 input bit as solid lines and the branches introduced by a 1 input as dashed lines is used.
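The encoder stepping described in this section can be sketched in C. This is a minimal sketch assuming the common four state RSC generator pair (1, 5/7), i.e. feedback polynomial 1 + D + D^2 and forward polynomial 1 + D^2, which is consistent with the 0/00 and 1/11 transitions out of the all-zero state in the trellis; the exact taps are those of Fig. 2.1 (b).

```c
/* One step of a four state RSC encoder: a sketch assuming generator
 * pair (1, 5/7) -- feedback 1 + D + D^2, forward 1 + D^2 (the exact
 * taps are those of Fig. 2.1 (b)).  state holds the two memory bits:
 * bit 0 = newer element D1, bit 1 = older element D2.  Returns the
 * parity bit; the systematic output is the input bit u itself. */
int rsc_step(int *state, int u)
{
    int d1 = *state & 1;
    int d2 = (*state >> 1) & 1;
    int fb = u ^ d1 ^ d2;            /* recursive feedback: 1 + D + D^2 */
    int parity = fb ^ d2;            /* forward taps: 1 + D^2 */
    *state = ((d1 << 1) | fb) & 3;   /* shift: D2 <- D1, D1 <- fb */
    return parity;
}
```

From the all-zero state, input 0 leaves the state at 00 and emits 0/00, while input 1 emits 1/11, matching the first stage of the trellis.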

    2.2.2 Eight state turbo encoder - 3GPP version

    In order to compare and study the characteristics in terms of the performance, both the four

    state and the eight state encoders are implemented. The difference essentially lies in the

number of memory elements that each encoder uses. In contrast to the two memory elements of the four state encoder, there are three memory elements in an eight state encoder. This is

    the encoder that is specified in the 3GPP standards [38].



    Figure 2.2: The 8 state encoder 3GPP version

    The schematic diagram of the 3GPP encoder is shown in Figure 2.2. It consists of two

    eight state constituent encoders. The data bit stream goes into the first constituent encoder

    that produces a parity for each input bit. The data bit stream is scrambled by the interleaver

and fed to encoder 2. For 3GPP, the interleaver can be anywhere from 40 bits to 5114 bits long. The data sent across the channel consists of the original bit stream, the parity bits of encoder 1 and the parity bits of encoder 2, so the entire turbo encoder is a rate 1/3 encoder.
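The assembly of the rate 1/3 channel stream can be sketched as follows. The per-bit interlacing of the systematic and parity streams is an assumed ordering (the three streams could equally be sent block-wise), and the tail bits that 3GPP appends are omitted.

```c
/* Assemble the rate 1/3 turbo encoder output: for each of the n input
 * bits, emit the systematic bit and the two parity bits.  The per-bit
 * interlacing is an assumed ordering; tail bits are omitted.
 * out must have room for 3*n entries. */
void mux_rate_third(const int *xs, const int *xp1, const int *xp2,
                    int n, int *out)
{
    for (int k = 0; k < n; k++) {
        out[3 * k + 0] = xs[k];    /* systematic bit       */
        out[3 * k + 1] = xp1[k];   /* parity of encoder 1  */
        out[3 * k + 2] = xp2[k];   /* parity of encoder 2  */
    }
}
```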

    2.3 Turbo Decoder

In this section the iterative decoding process of the turbo decoder is described. The maximum a posteriori (MAP) algorithm is used in the turbo decoder. Three types of algorithm are used in turbo decoders, namely MAP, Max-Log-MAP and Log-MAP. The MAP algorithm is a forward-backward recursion algorithm which minimizes the probability of bit error, but it has a high computational complexity and suffers from numerical instability. The solution to these problems is to operate in the log domain, where multiplication becomes addition. Addition, however, is not straightforward: in the log domain it becomes a maximization function plus a correction term. The Max-Log-MAP algorithm approximates this addition solely as a maximization, and it is the algorithm used in our work [38].

    2.3.1 Decoding Algorithm

    The block diagram of the turbo decoder is shown in Figure 2.3. There are two decoders cor-

    responding to the two encoders. The inputs to the first decoder are the observed systematic

    bits, the parity bit stream from the first encoder and the deinterleaved extrinsic information

    from the second decoder. The inputs to the second decoder are the interleaved systematic

    bit stream, the observed parity bit stream from the second RSC and the interleaved extrinsic

information from the first decoder. The main task of the iterative decoding procedure in each component decoder is to compute the a posteriori probability (APP) of the information symbols, which is the reliability value for each information symbol. The sequence of reliability values generated by one decoder is passed to the other; in this way, each decoder takes advantage of the suggestions of the other. To improve the correctness of its decisions, each decoder has to be fed with information that does not originate from itself. The concept of extrinsic information was introduced to identify the component of the overall reliability value that depends on the redundant information introduced by the considered constituent code. A natural reliability value, in the binary case, is the log likelihood ratio (llr).

    Each decoder has a number of compute intensive tasks to be done during decoding.

    There are five main computations to be performed during each iteration in the decoding

    stage as shown in Figure 2.3. Detailed derivations of the algorithm can be found in [59, 60,

    38]. The computations are as follows (computations in one decoder per iteration):


[The figure shows the iterative decoding loop: Decoder 1 and Decoder 2 exchange extrinsic information through the interleaver and deinterleaver; the systematic bits and the parity bits of encoder 1 and encoder 2 feed the two decoders, and the final llr estimate yields the retrieved message. The computations per iteration are: 1. Branch Metric Computation, 2. Forward Recursion, 3. Backward Recursion, 4. Log likelihood Computation, 5. Extrinsic Computation.]

    Figure 2.3: Decoder schematic diagram


Branch metric computation - γ unit

In the algorithm for turbo decoding, the first computational block is the branch metric computation. The branch metrics are computed based on the knowledge of the input and output associated with each branch during the transition from one state to another (Fig. 2.1 (d)). There are four states and each state has two branches, which gives a total of eight branch metrics. The computation of the branch metric is done using (2.1). Upon simplifying this equation as shown in [61, 62], it is observed that only two values are sufficient to derive the branch metrics for all the state transitions.

γ_k(s′, s) = exp( ½ u_k (z_k + L_c y_k^s) + ½ L_c x_k^p y_k^p )        (2.1)

where γ_k(s′, s) is the branch metric at time k for the transition from state s′ to state s, u_k ∈ {+1, −1} are the systematic bits of information with frame-length N, z_k is the information that is fed back from one decoder to the other decoder, L_c is the channel estimate which corresponds to the maximum signal to distortion ratio, x_k^p ∈ {+1, −1} are the encoded parity bits of the encoder, y_k^p are the noisy observed values of the encoded parity bits and y_k^s are the observed values of the encoded systematic bits.

The γ unit takes the noisy systematic bit stream, the parity bits from encoder 1 and encoder 2 (going to decoder 1 and decoder 2 respectively) and the apriori information to compute the branch metrics. The branch metrics γ_k(s′, s) for all branches in the trellis are computed and stored.
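In the log domain, the exponent of (2.1) reduces to the two terms noted above, and Max-Log-MAP works with this exponent directly. A sketch, assuming the antipodal ±1 mapping of the bit labels:

```c
/* Log-domain branch metric for one trellis branch: the exponent of
 * (2.1).  u is the hypothesised input bit and xp the parity label of
 * the branch, both 0/1 here and mapped to -1/+1 inside; ys and yp are
 * the noisy received systematic/parity values, z the apriori value
 * fed from the other decoder and Lc the channel estimate. */
double branch_metric(int u, int xp, double ys, double yp,
                     double z, double Lc)
{
    double us = u  ? 1.0 : -1.0;   /* antipodal mapping of the labels */
    double ps = xp ? 1.0 : -1.0;
    return 0.5 * us * (z + Lc * ys) + 0.5 * Lc * ps * yp;
}
```

Only the two half-terms vary across branches, which is the simplification noted in [61, 62].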

Forward metric computation - α unit

The forward metric α is the next computation in the algorithm; it represents the probability of being in a state at time k, given the probabilities of the states at the previous time instance. α_k(s) is calculated using equation (2.2).

α_k(s) = Σ_{s′} α_{k−1}(s′) · γ_k(s′, s)        (2.2)

where the summation is over all the state transitions s′ → s. α is computed at each node at time instance k, traversing the trellis in the forward direction. In a 4 state decoder, α is computed for states 00, 01, 10 and 11 (refer Fig 2.1 (d)). In an 8 state decoder (3GPP version), α is computed for states 000, 001, 010, 011, 100, 101, 110 and 111 respectively.

Observing a section of the trellis diagram in Figure 2.1, four metrics are to be computed, namely α_k(S0), α_k(S1), α_k(S2) and α_k(S3), where k ranges over [0, 1, 2, ..., N−1]. This metric is termed the forward metric because the computation order runs from 0, 1, 2, ..., N−1, and the value for index k = 0 is initialized to zero. The α unit recursively computes the metric using the γ values computed in the above step. A forward recursion on the trellis is performed by computing α for each node in the trellis. α is the sum of the previous α values times the branch metrics along the branches from the two previous nodes to the current node.
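One step of the forward recursion (2.2) can be sketched as follows. The predecessor table is trellis-dependent; the layout used in the usage example below is an assumption standing in for Fig. 2.1 (d).

```c
/* One forward recursion step of (2.2) on a four state trellis: for
 * each state s, alpha_next[s] sums alpha_prev[s'] * gamma[s'][s] over
 * the two predecessor states s' listed in prev[s]. */
void alpha_step(const double *alpha_prev,
                double gamma[4][4],
                int prev[4][2],
                double alpha_next[4])
{
    for (int s = 0; s < 4; s++) {
        double acc = 0.0;
        for (int i = 0; i < 2; i++) {
            int sp = prev[s][i];                 /* predecessor state */
            acc += alpha_prev[sp] * gamma[sp][s];
        }
        alpha_next[s] = acc;
    }
}
```

Iterating this step for k = 1 to N−1, starting from the initialized α at k = 0, fills the whole forward-metric table.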

Backward metric computation - β unit

The backward state probability of being in each state of the trellis at each time k, given the knowledge of all the future received symbols, is recursively calculated and stored. The backward metric β is computed using equation (2.3) in the backward direction, going from the end to the beginning of the trellis: β at time instance k−1 is obtained from the probabilities at time instance k.

β_{k−1}(s′) = Σ_{s} β_k(s) · γ_k(s′, s)        (2.3)

where the state transition is from s′ to s. In a 4 state decoder, β is computed for states 00, 01, 10 and 11 (refer Fig 2.1 (d)). In an 8 state decoder (3GPP version), β is computed for states 000, 001, 010, 011, 100, 101, 110 and 111.


Backward metric computation can start only after the completion of the computation by the γ unit. Observing the trellis diagram, there are four β values to be computed, but now in backward order, from [N−1 down to 0]. The four values are β_k(S0), β_k(S1), β_k(S2) and β_k(S3), utilizing the γ values and the initialized β values for index k equal to N−1. A backward recursion on the trellis is performed by computing β for each node in the trellis. The computation is the same as for α, but starting at the end of the trellis and going in the reverse direction.
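The backward step of (2.3) mirrors the forward one, this time driven by a successor table; the table layout in the usage example is again an assumption in place of Fig. 2.1 (d).

```c
/* One backward recursion step of (2.3) on a four state trellis: for
 * each state s', beta_prev[s'] sums beta_next[s] * gamma[s'][s] over
 * the two successor states s listed in next[s']. */
void beta_step(const double *beta_next,
               double gamma[4][4],
               int next[4][2],
               double beta_prev[4])
{
    for (int sp = 0; sp < 4; sp++) {
        double acc = 0.0;
        for (int i = 0; i < 2; i++) {
            int s = next[sp][i];                 /* successor state */
            acc += beta_next[s] * gamma[sp][s];
        }
        beta_prev[sp] = acc;
    }
}
```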

    Log likelihood ratio llr unit

The log likelihood ratio llr is the output of the turbo decoder. The output llr for each symbol at time k is calculated as

llr[k] = ln ( Σ_{(s′,s): u_k = 1} α_{k−1}(s′) γ_k(s′, s) β_k(s) ) − ln ( Σ_{(s′,s): u_k = 0} α_{k−1}(s′) γ_k(s′, s) β_k(s) )        (2.4)

where the first summation is over all the state transitions s′ → s with input message bit u_k = 1, and the second summation is over all the state transitions s′ → s with input message bit u_k = 0.

The α values, the γ unit output and the β values obtained from the above steps are used to compute the llr values. The main operations are comparison, addition and subtraction. Finally, these values are de-interleaved at the second decoder output, after the required number of iterations, to make the hard decision and retrieve the information that was transmitted. The log likelihood ratio llr[k] is computed for each k. The llr is a convenient measure since it encapsulates both the soft and hard bit information in one number. The sign of the number corresponds to the hard decision while the magnitude gives a reliability estimate.

    Extrinsic unit

The extrinsic information that is to be fed to the next decoder in the iteration sequence is computed; this is the llr minus the input probability estimate. The extrinsic information is obtained from the log likelihood ratio llr[k] by subtracting the weighted channel systematic bits and the information fed from the other decoder. The extrinsic computation uses the llr outputs, the systematic bits and the apriori information to compute the extrinsic value. This is the value that is fed back to the other decoder as the apriori information.
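A minimal sketch of the subtraction just described:

```c
/* Extrinsic information: the llr with the weighted channel systematic
 * observation (Lc * ys) and the apriori value z from the other decoder
 * subtracted out, leaving only this decoder's own contribution. */
double extrinsic(double llr, double Lc, double ys, double z)
{
    return llr - Lc * ys - z;
}
```

Subtracting the two input terms is what prevents a decoder from being fed back its own information in the next half-iteration.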

    This sequence of computations is repeated for each iteration by each of the two de-

    coders. After all iterations are complete, the decoded information bits can be retrieved by

    simply looking at the sign bit of the llr: if it is positive the bit is a one, if it is negative

    the bit is a zero. This is because the llr is defined to be the logarithm of the ratio of the

probability that the bit is a one to the probability that the bit is a zero. Complete derivations of the equations for the MAP turbo decoding algorithm can be found in a number of references [38, 61, 40, 39].

    2.4 Experimental set up

To study the effects of varying various parameters on the BER, a C version of the application is developed as the software solution of the application.

To compare the performance of the four state and the eight state (3GPP) turbo codes, a floating point version of the 3GPP turbo code is developed. It assists in observing the effect of the type of encoder design used in each case.

A floating point version of the four state turbo encoder and decoder is developed. This helps us to obtain the BER for various signal to noise ratios and to test its functionality. The test bench modeled during the application characterization is used later during the hardware solution of the design.

Effects of varying the application parameters (frame length and iterations) are assessed using the developed C model. These are used later to correlate BER against area/power under architectural optimization.

To study the effect of quantization, fixed point versions of the four state and the eight state turbo codes are developed. The word-length effects of the various data elements in the design on the BER can be observed using the fixed point model. This helps in fixing the bit-widths of the input parameters of the decoder to be used during the architectural optimization (bit-width analysis).
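The fixed point study can be mimicked with a small quantization helper. The Q-format (total bits, fractional bits) and the round-and-saturate policy below are illustrative assumptions, not the word lengths actually chosen in this work.

```c
/* Quantize a value to a signed fixed point format with `total` bits,
 * `frac` of them fractional, with rounding and saturation -- the kind
 * of word-length model used in a bit-width study. */
double quantize(double x, int total, int frac)
{
    double scale = (double)(1 << frac);
    long   max_q = (1L << (total - 1)) - 1;   /* e.g. +127 for 8 bits */
    long   min_q = -max_q - 1;                /* e.g. -128 for 8 bits */
    double v = x * scale;
    long   q = (long)(v >= 0.0 ? v + 0.5 : v - 0.5);   /* round */
    if (q > max_q) q = max_q;                 /* saturate */
    if (q < min_q) q = min_q;
    return (double)q / scale;
}
```

Running the floating point decoder with its inputs passed through such a helper exposes how the BER degrades as the word length shrinks.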

    To compare the results of different interleaver designs, a module for the random

    interleaver and the RC symmetric interleaver was developed. These versions were

used to compare our results with the work of other researchers in the turbo coding

    area.

    2.5 Simulation Results

    Different parameters which affect the performance of turbo codes are considered. Some of

    these parameters are:

    The size of the input message (Frame-length).

    The number of decoding iterations.

    The type of the encoder used (4 state vs 8 state encoder).

    Effect of interleaver design.

    Word-length optimization (Quantization effects).

    All our results presented in this section consider turbo codes over the additive white

    Gaussian noise (AWGN) channel.



Figure 2.4: Impact on performance for the different frame-lengths used, simulated for six iterations varying the SNR.


    2.5.1 Effect of varying frame-length on code performance

An increase in the size of the input frame length has an impact on the interleaver size. When the size of the interleaver increases, it adds to the complexity of the design and increases the decoding latency and the power consumption.

Figure 2.4 depicts how the performance of the turbo codes depends on the frame-length N used in the encoder. It can be seen that there is a clear trend in the performance of the turbo codes as the frame-length increases: the codes exhibit better performance for larger frame-lengths. This is in accordance with other published literature. Since our goal is to develop a low power framework for this application specific design, the algorithmic study of the impact of varying N is essential. It assists in the power-performance trade-off studies when the design is ported to an ASIC or FPGA solution. Next, the effect of varying the number of iterations in the turbo codes is considered.

    2.5.2 Effect of varying number of iterations on performance

The effect of varying the number of iterations during the decoding process is an interesting observation in system level studies. Increasing the number of iterations should give a performance improvement, but at the expense of added complexity. The latency of obtaining the decision output after completion of the decoding will increase. This will also have an impact on the power consumption when the design is considered at the implementation level. As seen from Figure 2.5, going from one iteration, which is the case of no feedback, to three iterations, a substantial gain in SNR for a given BER is obtained. In our study it is inferred that around four to six iterations are sufficient in most cases. The number of iterations has been fixed at six in all our experiments, unless otherwise stated. These observations agree with the results from [63]. Feedback drives the decision progressively further in one particular direction, which is an improvement over the previous iteration. The extent of this improvement slows down as the iterations increase.



Figure 2.5: Impact on performance for varying number of iterations, N = 1024, 8 state turbo code


    2.5.3 4 state vs 8 state (3GPP) turbo decoder performance

Figure 2.6 shows the performance of the turbo decoder using the 4 state version and the 8 state (3GPP) version. For comparison, the 4 state floating point model and the 3GPP floating point model were used, simulated under the same AWGN channel conditions. The third generation partnership project (3GPP) has recommended turbo coding as the de facto standard in the 3G standards due to its better performance.

    The performance of the four state vs the eight state turbo codes is compared. In all of

    the simulated cases, 8-state (3GPP) outperforms 4-state turbo codes. The gains obtained by

    8-state with 6 iterations with respect to 4-state with 6 iterations show a difference of almost

    one order of magnitude for a 1024 bit interleaver. The performance of the 8 state turbo

    codes is better than the 4 state, primarily as a result of superior weight distribution of the 8

state turbo codes. Furthermore, an 8-state PCCC with i iterations is approximately equivalent to a 4-state SCCC with 3i/2 iterations in terms of computational cost, gate count and power consumption [64].

    2.5.4 Interleaver design considerations

The performance of the turbo codes depends on different parameters such as the frame-length, the number of iterations, the choice of encoder and the type of interleaver used. Interleaving is basically the rearrangement of the incoming bit stream in an order specified by the interleaver module.

    Random interleavers

A random interleaver is based on a random permutation. For large values of N, most

    random interleavers utilized in turbo codes perform well. As the interleaver block size

    increases, the performance of a turbo code improves substantially. The method to generate

    a random interleaver is as follows:

    Generate an interleaver array of size N. Input to this array is the information bit

    stream of frame-length N.



Figure 2.6: Comparing the 4 state and the 8 state turbo code (floating point versions), N = 1024, RC symmetric interleaver, six iterations.


    Generate a random number array of size N.

    Use the random number array as the index to rearrange the position of the bits in the

    interleaver array to obtain the new interleaved array.
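The steps above can be sketched in C. This is a minimal illustration, not the thesis model: the Fisher-Yates shuffle and the seed handling are our own choices, and the thesis C model may generate its permutation differently.

```c
#include <stdlib.h>

/* Build a random interleaver of size n: start from the identity index
   array and shuffle it into a random permutation (Fisher-Yates). */
void make_random_interleaver(int *perm, int n, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n; i++)
        perm[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);          /* pick a position in [0, i] */
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
}

/* Rearrange the incoming frame according to the permutation:
   out[i] = in[perm[i]], i.e. the random array indexes the input. */
void interleave(const int *in, int *out, const int *perm, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[perm[i]];
}
```

Deinterleaving with such an interleaver needs the inverse permutation, which is why the random interleaver later requires a separate lookup table for the decoder side.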

    To verify the turbo code functionality and to compare it with similar work conducted

    by other researchers, a turbo coding system using a random interleaver was also developed.

    This model forms the basis for comparison with the simulated performance of Bruce's work

    [61]. Interleaver size N = 1024, 4 state turbo coding and six iterations are used in both the

    cases in order to have a fair comparison. Our results tally with their model. The results are

    shown in Figure 2.7, with bit error rate (BER) on the y axis and signal-to-noise ratio in dB on the

    x axis.

    RC symmetric interleaver

    A symmetric interleaver is used in the remaining experiments of our work. The interleaver

    size is N, which is also the frame length of the information that is handled. The interleaver

    is arranged in the form of a matrix with specified rows and columns. The possible

    interleaver configuration can be N = 128 with 4 rows and 32 columns, or N = 32 with 4

    rows and 8 columns. The scrambling of the bit positions is done as follows; the procedure

    is illustrated in Fig. 2.8.

    Conceptually, bits are written into a two dimensional matrix row-wise. Row interleaving

    is done first, followed by column interleaving, to obtain the interleaved data. The rows

    are shuffled according to a bit-reversal rule, and the elements within each row are permuted

    by the same bit-reversal procedure. Deinterleaving is done in exactly the same way, since

    this is a symmetric interleaver. Consider row interleaving with 4 rows. The row numbers

    are R = [0, 1, 2, 3], with binary equivalents [00, 01, 10, 11]. After bit reversal the binary

    values are [00, 10, 01, 11], which is R = [0, 2, 1, 3]. The elements in rows zero

    and three are kept as they are, and the elements in rows one and two are interchanged.
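The bit-reversal permutation and the resulting RC interleave can be sketched as follows. The function names and the flat row-wise array layout are our own; the row and column counts must be powers of two.

```c
/* Reverse the low `nbits` bits of x: reverse_bits(1, 2) = 2 and
   reverse_bits(2, 2) = 1, so rows 1 and 2 swap while 0 and 3 stay put. */
unsigned reverse_bits(unsigned x, int nbits)
{
    unsigned r = 0;
    for (int i = 0; i < nbits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* RC interleave of a rows x cols frame stored row-wise in a flat array:
   bit-reverse the row order, then the column order within each row.
   Applying it twice restores the input, which is why the same unit can
   be reused for deinterleaving. */
void rc_interleave(const int *in, int *out, int rows, int cols,
                   int row_bits, int col_bits)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            out[r * cols + c] =
                in[reverse_bits((unsigned)r, row_bits) * cols +
                   reverse_bits((unsigned)c, col_bits)];
}
```

Since bit reversal is an involution, out[r][c] = in[rev(r)][rev(c)] applied twice gives back in[r][c]; this is the symmetry property used below.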

    Before interleaving can commence it is essential that all the data elements are present.

    The advantage of using a symmetric interleaver is that the same interleaver can be used for


    Figure 2.7: Random interleaver design - our work compared with Bruce's 4 state turbo code [61]; BER vs SNR Eb/No, N = 1024, six iterations.


    Figure 2.8: Illustration of interleaving the data elements. The elements are arranged in matrix form (Row 0 to Row 3, Col 0 to Col 31).

    Step 1: Take the input message of 128 bits and arrange it in 4 rows and 32 columns.

    Step 2: Perform the row interchange, using the bit reversal algorithm to obtain the interchange pattern. E.g. row 1 is binary 01; its bit reversal is 10, which is row 2, so these row elements are interchanged.

    Step 3: Column interleaving. The bit reversal algorithm is applied to the column numbers as well; after row interleaving, column interleaving is done.


    deinterleaving also. A disadvantage of random interleavers is that they require separate

    interleave and deinterleave units, with their own hardware and lookup tables. It is shown in [41]

    that although the random interleavers give good performance, symmetric interleavers in no

    way degrade the performance of the turbo coding system.

    Our turbo encoder/decoder simulation is compared with the results published by other

    independent researchers. Our simulation results after six iterations are shown in Figure

    2.9, together with the results simulated by Mark Ho et al. [41]. There is some difference

    in BER, which can be attributed to the different symmetric interleavers used in the two

    cases: an S-symmetric interleaver is used in [41], whereas an RC-symmetric interleaver is used in our

    simulations.

    2.5.5 Effect of Quantization

    Next, the effect of quantization on the performance of the turbo decoder is studied. Floating

    point computations are first converted to fixed point computations by the use of quantizers.

    Experiments are conducted to compare the performance of the floating point and the

    fixed point versions of the turbo code. Since it is planned

    to use this application further for architectural analysis, on chip memory exploration and

    the FPGA implementation study, this step is a pre-requisite. Further, to investigate the word

    length effects on power consumption the fixed point implementation study is needed. The

    performance of the turbo code for bit width of 32 (integer), 6 bit and 4 bit quantization is

    simulated.

    Numerical precision has an impact on the performance in terms of the bit error rate

    of the turbo codes, as shown in Fig. 2.10. It can be observed that the floating point

    implementation has better performance than the fixed point implementation. However,

    the floating point implementation is considerably more complex than the fixed point

    one. Hence, fixed point precision is the preferred choice. Now,

    in fixed point implementation, it is interesting to observe the word length requirement of

    the input data to the decoder (output of the channel estimates). In the same Figure 2.10,


    Figure 2.9: Comparing our work with Mark Ho's [41] using a symmetric interleaver; BER vs SNR Eb/No, 4 state, N = 1024, six iterations.


    Figure 2.10: Impact of word length variation on the bit error rate of the turbo codes; BER vs SNR Eb/No (dB), N = 4096, with curves for floating point, fixed point (32 bit), 6 bit and 4 bit quantization.


    a comparison of various bit widths is depicted. The performance with a bit width of

    six is very close to the fixed point (32 bit) performance; hence a bit width of six is

    an appropriate choice for the input data. Reducing the bit width from six to four,

    however, degrades the performance of the turbo codes, indicating that a four bit width is

    not a good choice. These observations tally with other published literature on quantization

    effects [65].
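A minimal sketch of such a quantizer, mapping a floating-point channel estimate to a signed b-bit integer, is given below. The clipping range and the rounding rule are our assumptions; the thesis fixes only the bit widths.

```c
/* Uniform quantizer sketch: map channel value x to a signed `bits`-bit
   integer, saturating at +/- `range`. For bits = 6 the output lies in
   [-32, 31]; for bits = 4 it lies in [-8, 7]. */
int quantize(double x, int bits, double range)
{
    int qmax = (1 << (bits - 1)) - 1;            /* e.g. +31 for 6 bits  */
    int q = (int)(x / range * qmax + (x >= 0 ? 0.5 : -0.5)); /* round    */
    if (q > qmax)      q = qmax;                 /* clip positive values */
    if (q < -qmax - 1) q = -qmax - 1;            /* clip negative values */
    return q;
}
```

With six bits the quantized metrics track the 32-bit fixed point values closely, while four bits discards too much of the channel information, consistent with the BER degradation observed above.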

    2.6 Summary

    In this chapter, some applications of turbo codes are first discussed to give an

    insight into their different areas of utility. The applications are characterized and the

    operational details of the turbo coding application are studied. The algorithmic details of

    the turbo encoder and the decoder are presented next, which enables one to understand the

    operational flow of the application. BER is a measure of the performance of the turbo codes

    at different signal to noise ratios, and the impact of the varying parameters on the BER is

    studied. The experimental setup and results are presented for the effect of the number of iterations

    and of the frame length. The study on the effect of the type of encoder (4 state vs 8

    state(3GPP)) shows that the 3GPP version has better performance than the four state turbo

    decoder as expected. Our results are compared with those published in literature. The

    effect of quantization on the BER is also studied. The C simulation models constructed are

    used to test the functional correctness.


    Chapter 3

    On chip Memory Exploration

    3.1 Introduction

    Processor-based information processing can be used conveniently in many embedded sys-

    tems. Embedded systems themselves have been evolving in complexity over the years.

    From simple 4-bit and 8-bit processors in the 1970s and 1980s for simple controller

    applications, the trend is towards 16-bit and 32-bit processors for more complex

    applications involving audio and video processing. The detailed architecture of such sys-

    tems is very much application driven to meet power and performance requirements. In

    particular, this applies to the memory architecture, since memory does have a major impact

    on the overall performance and the power consumption.

    In this chapter, the different options that exist for the implementation of memory

    architectures are analyzed, and alternatives to on-chip cache realization are explored. The

    performance and energy consumption of scratch pad memory (SPM) vis-a-vis

    cache as on-chip memory alternatives are compared. This is because memory requirements

    in application programs have increased. This is especially true in digital signal pro-

    cessing, multimedia and wireless mobile applications. This also happens to

    be a major bottleneck, since a large memory implies slower access time to a given mem-

    ory element, higher power dissipation and more silicon area. Memory hierarchy consists


    3.2 Related Work

    Recently, interest has been focused on having on-chip scratch pad memory to reduce power

    consumption and improve performance. Scratch pad memories are considerably more en-

    ergy and area efficient than caches. On the other hand, they can replace caches only if

    they are supported by an effective compiler. The mapping of memory elements of the ap-

    plication benchmarks to the scratch pad memory can be done only for data [70], only for

    instructions [71] or for both data and instructions [72, 73]. Current embedded processors

    particularly in the area of multimedia applications and graphics controllers have on-chip

    scratch pad memories. Panda et al. [70] associate the use of scratch pad memory with data

    organization and demonstrate its effectiveness. Benini et al. [74] focus on memory synthesis

    for low power embedded systems. They map the frequently accessed locations onto

    a small application specific memory. Most of these works address only the performance

    issues.

    Focusing on energy consumption, it is shown that scratch pad memory is a preferable

    design alternative to the cache for on-chip memory. To complete the trade-off analysis, an

    area model for the cache and the scratch pad memory is developed. It is concluded that scratch

    pad memory occupies less silicon area.

    3.3 Scratch pad memory as a design alternative to cache

    memory

    In this section, experiments on DSPstone benchmarks are conducted, using the methodology

    developed, to show that SPM is more energy efficient than traditional on-chip cache

    memory.


    Figure 3.1: Cache memory organization [75], showing the address decoder, the tag and data arrays with their word lines and bit lines, sense amplifiers, column multiplexers, tag comparators, multiplexer drivers, and the output drivers producing the data output and the valid (hit/miss) output.

    3.3.1 Cache memory

    The basic organization of the cache is taken from [75] and is shown in Fig. 3.1. The

    decoder first decodes the address and selects the appropriate row by driving one wordline

    in the data array and one wordline in the tag array. The information read from the tag array

    is compared to the tag bits of the address. The results of comparisons are used to drive a

    valid (hit/miss) output as well as to drive the output multiplexers. These output multiplexers

    select the proper data from the data array and drive the selected data out of the cache.

    The area model used in our work is based on the transistor count in the circuitry. All


    transistor counts are computed from the designs of the circuits. From the organization shown

    in Fig. 3.1, the area of the cache (A_c) is the sum of the areas occupied by the tag array (A_t)

    and the data array (A_d):

    A_c = A_t + A_d    (3.1)

    A_t is computed using the areas of its components:

    A_t = A_t,dec + A_t,arr + A_t,mux + A_t,pre + A_t,sense + A_t,cmp + A_t,drv    (3.2)

    where A_t,dec, A_t,arr, A_t,mux, A_t,pre, A_t,sense, A_t,cmp and A_t,drv are the areas of the tag

    decoder unit, tag array, column multiplexer, pre-charge circuitry, sense amplifiers, tag

    comparators and multiplexer driver units respectively. Similarly,

    A_d = A_d,dec + A_d,arr + A_d,mux + A_d,pre + A_d,sense + A_d,out    (3.3)

    where A_d,dec, A_d,arr, A_d,mux, A_d,pre, A_d,sense and A_d,out are the areas of the data decoder

    unit, data array, column multiplexer, pre-charge circuitry, data sense amplifiers and output

    driver units respectively. The estimation of power can be done at different levels, from the transistor level

    to the architectural level [30]. In CACTI [75], transistor level power estimation is done.

    The energy consumption per access in a cache is the sum of energy consumptions of all the

    components identified above. It is assumed that the cache is a write through cache. There

    are four cases of cache access that are considered in our model. Table 3.1 lists the accesses

    required in each of four possible cases i.e. read hit, read miss, write hit and write miss.

    Access type   Cache read   Cache write   Main memory read   Main memory write

    Read hit          1            0               0                  0

    Read miss         1            L               L                  0

    Write hit         0            1               0                  1

    Write miss        1            0               0                  1

    Table 3.1: Cache memory interaction model (L is the cache line size in words)


    Cache read hit : When the CPU requires some data, the tag array of the cache is

    accessed. If there is a cache read hit, then data is read from the cache. No write to

    the cache is done and main memory is not accessed for a read or write. (row 2 - table

    3.1)

    Cache read miss : When there is a cache read miss, it implies that the data is not in

    the cache and the line has to be brought from main memory to cache. In this case,

    there is a cache read operation, followed by L words to be written in the cache, where

    L is the line size. Hence there will be a main memory read event of size L with no

    main memory write. (row 3 - table 3.1)

    Cache write hit : If there is a cache write hit, there is a cache write followed by a

    main memory write as this is a write through cache. (row 4 - table 3.1)

    Cache write miss : In case of a cache write miss, a cache tag read (to establish the

    miss) is followed by the main memory write. There is no cache update in this case.

    For simplicity, a tag-only read is not distinguished from a full cache read of

    tag and data. (row 5 - table 3.1)
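The event counts implied by these four cases can be sketched directly from Table 3.1. The struct and function names below are ours, and L is the line size in words:

```c
/* Per-event totals implied by the write-through access model of Table 3.1. */
typedef struct {
    long cache_read, cache_write, mm_read, mm_write;
} mem_events;

/* rhit/rmiss/whit/wmiss are counts of each access outcome from a trace. */
mem_events count_events(long rhit, long rmiss, long whit, long wmiss, int L)
{
    mem_events e;
    e.cache_read  = rhit + rmiss + wmiss;    /* tag read also on write miss */
    e.cache_write = (long)rmiss * L + whit;  /* line fill of L words + hit  */
    e.mm_read     = (long)rmiss * L;         /* fetch L words on read miss  */
    e.mm_write    = whit + wmiss;            /* write-through to memory     */
    return e;
}
```

Each total is the column sum of Table 3.1 weighted by how often the corresponding row occurred; multiplying each total by the per-access energy of that component gives the cache energy.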

    Using this model the cache energy equation is derived as

    E_cache = (N_cr + N_cw) × E    (3.4)

    where E_cache is the energy spent in the cache, N_cr is the number of cache read accesses

    and N_cw is the number of cache write accesses. The per-access energy E is computed using equation

    (3.7).

    3.3.2 Scratch pad memory

    The scratch pad is a memory array with the decoding and the column circuitry logic. This

    model is designed keeping in mind that the memory objects are mapped to the scratch pad

    in the last stage of the compiler. The assumption here is that the scratch pad memory


    Figure 3.2: Scratch pad memory array: (a) the memory array, (b) the memory cell, (c) the six transistor static RAM cell with its word select line and bit/bit_bar lines.

    occupies one distinct part of the memory address space with the rest of the space occupied

    by main memory. Thus, the availability of the data/instruction in the scratch pad need not

    be checked. This eliminates the comparator and the hit/miss acknowledging circuitry,

    which contributes to both energy and area reduction.

    The scratch pad memory array cell is shown in Fig. 3.2(a) and the memory cell in 3.2(b).

    The 6 transistor static RAM cell [75, 2] is shown in Fig. 3.2(c). The cell has one R/W

    port. Each cell has two bit-lines, bit and bit_bar, and one word-line. The complete scratch

    pad organization is as shown in Fig. 3.3. From the organization shown in Fig. 3.3, the area

    of the scratch pad is the sum of the areas occupied by the decoder, the data array and the column

    circuitry. Let A_sp be the area of the scratch pad memory:

    A_sp = A_s,dec + A_s,arr + A_s,mux + A_s,pre + A_s,sense + A_s,out    (3.5)

    where A_s,dec, A_s,arr, A_s,mux, A_s,pre, A_s,sense and A_s,out are the areas of the data decoder,

    data array, column multiplexer, pre-charge circuitry, data sense amplifiers and output driver

    units respectively. The scratch pad memory energy consumption can be estimated from the energy

    consumption of its components, i.e. the decoder (E_dec) and the memory columns (E_col).


    cycle [124].

    C_total = (C_pre + C_rw) × N_col    (3.8)

    C_total is computed from equation (3.8). It is the sum of the capacitances due to pre-charge

    and read/write access to the scratch pad memory. C_pre is the effective load capacitance of

    the bit-lines during pre-charging, C_rw is the effective load capacitance of a cell

    read/write, and N_col is the number of columns in the memory.

    In preparation for an access, the bit-lines are pre-charged; during the actual read/write,

    one side of the bit-line pair is pulled down. Energy is therefore dissipated in the bit-lines due

    to pre-charging and the read/write access. When the scratch pad memory is accessed, the

    address decoder first decodes the address bits to find the desired row. The transition in the

    address bits causes the charging and discharging of capacitances in the decoder path. This

    brings about energy dissipation in the decoder path. The transition in the last stage, that is,

    the word-line driver stage, triggers the switching of the word-line. Regardless of how many

    address bits change, only two word-lines among all will be switched. One will be logic 0

    and the other will be logic 1. The equations are derived based on [75].

    E_sp = N_sp × E_access    (3.9)

    where E_sp is the total energy spent in the scratch pad memory. In a scratch pad, in contrast

    to a cache, there are no read miss or write miss events; the only cases are read and

    write accesses. N_sp is the number of accesses to the scratch pad memory and E_access

    is the energy per access, obtained from our analytical scratch pad model.


    3.3.3 Estimation of Energy and Performance

    As the scratch pad is assumed to occupy part of the total memory address space, from the

    address values obtained by the trace analyzer, the access is classified as going to scratch

    pad or main memory and an appropriate latency is added to the overall program delay. One

    cycle is assumed if it is a scratch pad read or write access. For main memory, two modes of

    access - 16 bits and 32 bits are considered. The modes and latencies are based on an ARM

    processor with thumb instruction set which has been used for our experimentation. This is

    because for processors designed for energy sensitive applications, multiple access modes

    are provided. A main memory 16 bit access takes one cycle plus 1 wait state, and a main

    memory 32 bit access takes one cycle plus 3 wait states (refer to Table 3.2).

    The total time in number of clock cycles is used to determine the performance. The scratch

    pad energy consumption is the number of accesses multiplied by the energy per access as

    described in equation 3.9.

    Access                   Number of cycles

    Cache                    Using Table 3.1

    Scratch pad              1 cycle

    Main memory 16 bit       1 cycle + 1 wait state

    Main memory 32 bit       1 cycle + 3 wait states

    Table 3.2: Memory access cycles
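Given per-region access counts from the trace analyzer, the cycle accounting of Table 3.2 and the scratch pad energy of equation (3.9) reduce to a few lines; the function names below are our own:

```c
/* Total delay in cycles for a trace, using the latencies of Table 3.2:
   scratch pad = 1 cycle, main memory 16 bit = 1 cycle + 1 wait state,
   main memory 32 bit = 1 cycle + 3 wait states. */
long total_cycles(long n_spm, long n_mm16, long n_mm32)
{
    return n_spm * 1 + n_mm16 * (1 + 1) + n_mm32 * (1 + 3);
}

/* Scratch pad energy per equation (3.9): accesses times energy/access. */
double spm_energy(long n_spm, double energy_per_access)
{
    return (double)n_spm * energy_per_access;
}
```

The total cycle count gives the performance figure, while the energy figure is compared against the cache energy of equation (3.4) in the trade-off analysis.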

    The special compiler support required in SPM address space allocation is to identify

    the data