Low Power Vlsi Papers

7/21/2019 Low Power Vlsi Papers


1. Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay

In this paper, we present an efficient architecture for the implementation of a delayed least

mean square adaptive filter. For achieving lower adaptation-delay and area-delay-power

efficient implementation, we use a novel partial product generator and propose a strategy for

optimized balanced pipelining across the time-consuming combinational blocks of the

structure. From synthesis results, we find that the proposed design offers nearly 17% less

area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of

the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an

efficient fixed-point implementation scheme of the proposed architecture, and derive the

expression for steady-state error. We show that the steady-state mean squared error obtained

from the analytical result matches with the simulation result. Moreover, we have proposed a

bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and

9% saving in EDP over the proposed structure before pruning without noticeable degradation

of steady-state-error performance.
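
The delayed LMS update that this abstract builds on can be sketched in software as follows. This is a minimal floating-point illustration, not the paper's fixed-point pipelined architecture; the function name `delayed_lms` and the parameters `mu` (step size) and `delay` (adaptation delay in samples) are assumptions made for the example.

```python
def delayed_lms(x, d, num_taps, mu, delay):
    """Delayed LMS: the weights at step n are updated with the error and
    input vector from step n - delay (delay = 0 gives ordinary LMS)."""
    w = [0.0] * num_taps
    pending = []   # (input vector, error) pairs waiting out the delay
    errors = []
    for n in range(len(x)):
        # Tap-delay line: newest sample first, zero-padded at the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))
        e = d[n] - y
        errors.append(e)
        pending.append((u, e))
        if len(pending) > delay:
            u_old, e_old = pending.pop(0)
            # w(n+1) = w(n) + mu * e(n - delay) * u(n - delay)
            w = [wk + mu * e_old * uk for wk, uk in zip(w, u_old)]
    return w, errors
```

With a noise-free desired signal generated by a short FIR system, the returned weights should approach that system's coefficients for a small enough `mu`; a larger `delay` generally calls for a smaller `mu`.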

2. Critical-Path Analysis and Low-Complexity Implementation of the LMS

Adaptive Algorithm

This paper presents a precise analysis of the critical path of the least-mean-square (LMS)

adaptive filter for deriving its architectures for high-speed and low-complexity

implementation. It is shown that the direct-form LMS adaptive filter has nearly the same

critical path as its transpose-form counterpart, but provides much faster convergence and

lower register complexity. From the critical-path evaluation, it is further shown that no

pipelining is required for implementing a direct-form LMS adaptive filter for most practical

cases, and can be realized with a very small adaptation delay in cases where a very high

sampling rate is required. Based on these findings, this paper proposes three structures of the

LMS adaptive filter: (i) Design 1 having no adaptation delays, (ii) Design 2 with only one

adaptation delay, and (iii) Design 3 with two adaptation delays. Design 1 involves the

minimum area and the minimum energy per sample (EPS). The best of existing direct-form

structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2

and 3 involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF)

at a cost of 55.0% and 60.6% more area, respectively.

3. Efficient Integer DCT Architectures for HEVC

In this paper, we present area- and power-efficient architectures for the implementation of

integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency

Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can

be used to derive parallel architectures for 1-D integer DCT of different lengths. We also

show that the proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a

throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the

proposed architecture could be pruned to reduce the complexity of implementation

substantially with only a marginal effect on the coding performance. We propose power-

efficient structures for folded and full-parallel implementations of 2-D DCT. From the

synthesis result, it is found that the proposed architecture involves nearly 14% less area-delay

product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation

of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an

additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed

pruning algorithm with nearly the same throughput rate. The proposed architecture is found

to support ultrahigh definition 7680 × 4320 at 60 frames/s video, which is one of the applications of HEVC.
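
As a small illustration of the constant matrix multiplication involved, the sketch below applies the 4-point HEVC core transform matrix to one input vector, first directly and then with an even-odd (butterfly) decomposition that shares additions. The function names are assumptions for the example, the scaling and rounding stages of a real codec are omitted, and this is not the architecture proposed in the paper.

```python
# 4-point HEVC integer core transform matrix.
HEVC_DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

def dct4_direct(x):
    """Plain matrix-vector product: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in HEVC_DCT4]

def dct4_butterfly(x):
    """Even-odd decomposition: sums feed the even rows, differences the odd rows."""
    e0, e1 = x[0] + x[3], x[1] + x[2]
    o0, o1 = x[0] - x[3], x[1] - x[2]
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]
```

Both functions return the same coefficients; the butterfly form needs fewer constant multiplications, and in hardware each of those would typically become a shift-and-add network.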

4. An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply

Operator

Complex arithmetic operations are widely used in Digital Signal Processing (DSP)

applications. In this work, we focus on optimizing the design of the fused Add-Multiply

(FAM) operator for increasing performance. We investigate techniques to implement the

direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a

structured and efficient recoding technique and explore three different schemes by

incorporating them in FAM designs. Comparing them with the FAM designs which use

existing recoding schemes, the proposed technique yields considerable reductions in terms of

critical delay, hardware complexity and power consumption of the FAM unit.
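
For readers unfamiliar with Modified Booth form, the sketch below recodes a two's-complement integer into radix-4 digits from the set {-2, -1, 0, 1, 2} and uses them to evaluate X·(A + B). It forms the sum with an ordinary addition first, which is the conventional baseline; the direct recoding of the sum described in the abstract is not reproduced here. The function names are assumptions for the example.

```python
def modified_booth_digits(value, bits):
    """Radix-4 MB digits (least significant first) of a two's-complement
    integer that fits in `bits` bits; `bits` must be even."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    u = value & ((1 << bits) - 1)  # two's-complement bit pattern

    def bit(i):
        return (u >> i) & 1 if i >= 0 else 0

    # digit_j = -2*b(2j+1) + b(2j) + b(2j-1), with b(-1) = 0
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(bits // 2)]

def add_multiply(a, b, x, bits):
    """Compute x * (a + b) from the MB digits of the sum.
    `bits` must be wide enough to hold a + b."""
    digits = modified_booth_digits(a + b, bits)
    # Each digit selects one partial product from {0, +-x, +-2x}.
    return sum((d * x) * (4 ** j) for j, d in enumerate(digits))
```

Because each digit covers two bits, an n-bit operand produces n/2 partial products instead of n.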

5. Improved design of high-frequency sequential decimal multipliers

Hardware implementation of decimal arithmetic operations has become a hot topic for

research during the last decade. Among various operations, decimal multiplication is



considered as one of the most complicated dyadic operations, which requires high-cost

hardware implementation. Therefore, the processor industry has opted to use the sequential

decimal multipliers to reduce the high cost of parallel architectures. However, the main

drawback of iterative multipliers is their high latency. In this reported work, the focus has

been on reducing the latency of decimal sequential multipliers while maintaining a low cost

of area. Consequently, a high-frequency sequential decimal multiplier is proposed whose

cycle time is reduced to the latency of a binary half-adder plus that of a decimal multiply-by-

two operation, which overall is less than that of a decimal carry-save adder. The synthesis

results reveal that the proposed sequential multiplier works with a higher clock frequency

than the fastest previous decimal multiplier which in turn leads to overall latency advantage.

6. On-Chip Codeword Generation to Cope With Crosstalk

Capacitive and inductive coupling between bus lines results in crosstalk induced delays.

Many bus encoding techniques have been proposed to improve the performance. Existing

implementation techniques and mapping algorithms in the literature apply only to a specific

encoding. This paper presents the first generalized framework for a stall-free on-chip

codeword generation strategy that is scalable and easy to automate. It is applicable to the

coupling aware encoding techniques that allow recursive codeword generation. The proposed

implementation strategy iteratively generates codewords without explicitly enumerating them. Codeword mapping relies on graph-based representation that is unique to the given

encoding technique. The codewords are calculated on-chip using basic function blocks, such

as adders and multiplexers. Three encoding techniques were implemented using the proposed

strategy. Experimental results show significant reduction in the area overhead and power

dissipation over the existing method that uses random logic to implement the codec.
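
To make the idea of generating a codeword without enumerating the codebook concrete, here is a toy sketch. It maps a data index to a codeword that contains no two adjacent 1s, using Fibonacci weights and only comparisons and subtractions. The no-adjacent-1s rule is a stand-in chosen for illustration; it is an assumption, not one of the three encoding techniques implemented in the paper, and all names are hypothetical.

```python
def fib_weights(width):
    """Weights 1, 2, 3, 5, 8, ... for bit positions 0 .. width-1."""
    w = [1, 2]
    while len(w) < width:
        w.append(w[-1] + w[-2])
    return w[:width]

def capacity(width):
    """Number of width-bit words with no two adjacent 1s."""
    return fib_weights(width + 1)[-1]

def encode(index, width):
    """Greedy (Zeckendorf) mapping of index to a codeword, MSB first."""
    assert 0 <= index < capacity(width)
    bits = []
    for w in reversed(fib_weights(width)):
        take = index >= w      # comparator
        bits.append(int(take))
        if take:
            index -= w         # subtractor selected by a multiplexer
    return bits

def decode(bits):
    weights = list(reversed(fib_weights(len(bits))))
    return sum(w for w, b in zip(weights, bits) if b)
```

Each output bit depends only on a running remainder, so the mapping can be unrolled into a chain of small arithmetic blocks rather than stored as a lookup table.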

7. Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal

Filters

The implementation of transversal filters requires basic circuit elements such as adders,

multipliers and (unit) delay elements. The filters designed under infinite precision of these

elements may behave differently when implemented with components with limited accuracy.

In fact, the effects of the coefficient inaccuracies in analog and digital transversal filters have

been investigated extensively in the literature [1], [2]. On the other hand, the effects of the

unit delays with limited precision have not received similar attention. In this paper, we find

that such effects, especially in very high frequency continuous-time semi-digital transversal

filters, cannot be ignored. As an example, we analyze the impact of delay errors in the



implementation of the direct modulation transmitter. Specifically, we provide the analytical

statistical performance bounds and confirm the results with simulations.

8. Digitally Synthesized Stochastic Flash ADC Using Only Standard Digital Cells

It is demonstrated in this paper that it is possible to synthesize a stochastic flash ADC entirely

from Verilog code and a standard digital library. An analog comparator is introduced that is

constructed from two cross-coupled 3-input digital NAND gates, and can be described in

Verilog. The synthesized comparators have random, Gaussian offsets that are used as virtual

voltage references to make a flash ADC. A piecewise-linear inverse Gaussian CDF function

is used to correct the nonlinearity introduced by the Gaussian offset distribution. The

prototype IC is fabricated in 90 nm CMOS and implements a 2047-comparator version of the

proposed architecture. All components, including the comparators, the ones adder, and the

piecewise inverse Gaussian function, are implemented in Verilog. Conventional digital

synthesis and place-and-route is then used to generate the physical layout, making this the

first fully synthesized ADC. SNDR of 35.9 dB (without calibration) is achieved at 210 MSPS

from the Verilog synthesized design.
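
The conversion principle can be simulated in a few lines. In this sketch each comparator has a random Gaussian offset, a ones adder counts how many comparators trip, and the inverse Gaussian CDF maps that count back to a voltage estimate. It uses the exact inverse CDF from Python's standard library rather than the piecewise-linear approximation described in the abstract, and the offset spread `sigma` and all function names are assumptions for the example.

```python
import random
from statistics import NormalDist

def make_comparator_offsets(count, sigma, seed=0):
    """Random input-referred offsets, one per comparator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(count)]

def stochastic_adc(v_in, offsets, sigma):
    """Estimate v_in from the number of comparators whose offset lies below it."""
    n = len(offsets)
    ones = sum(1 for off in offsets if v_in > off)   # ones adder
    # The expected fraction of ones is Phi(v_in / sigma); keep the fraction
    # strictly inside (0, 1) so the inverse CDF is defined.
    frac = min(max(ones / n, 0.5 / n), 1.0 - 0.5 / n)
    return sigma * NormalDist().inv_cdf(frac)
```

Without the inverse-CDF step the raw count follows the S-shaped Gaussian CDF, which is the nonlinearity the abstract refers to.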

9. Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite

Impulse Response Filters

We have analyzed memory footprint and combinational complexity to arrive at a systematic

design strategy to derive area-delay-power-efficient architectures for two-dimensional (2-D)

finite impulse response (FIR) filter. We have presented novel block-based structures for

separable and non-separable filters with less memory footprint by memory sharing and

memory-reuse along with appropriate scheduling of computations and design of storage

architecture. The proposed structures involve L times less storage per output (SPO), and

nearly L times less energy consumption per output (EPO) compared with the existing

structures, where L is the input block-size. They involve L times more arithmetic resources

than the best of the corresponding existing structures, and produce L times more throughput

with less memory band-width (MBW) than others. We have also proposed separate generic

structures for separable and non-separable filter-banks, and a unified structure of filter-bank

constituting symmetric and general filters. The proposed unified structure for 6 parallel filters

involves nearly 3.6L times more multipliers, 3L times more adders, and (N² − N + 2) fewer registers

than the similar existing unified structure, and computes 6L times more filter outputs per cycle

with 6L times less MBW than the existing design, where N is the FIR filter size in each

dimension. ASIC synthesis results show that for filter size (4 × 4), input-block size L=4, and



image-size (512 × 512), proposed block-based non-separable and generic non-separable

structures, respectively, involve 5.95 times and 11.25 times less area-delay-product (ADP),

and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The

proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the

corresponding existing structure.

11. High Step-Up High-Efficiency Interleaved Converter With Voltage Multiplier

Module for Renewable Energy System

A novel high step-up converter, which is suitable for renewable energy system, is proposed in

this paper. Through a voltage multiplier module composed of switched capacitors and

coupled inductors, a conventional interleaved boost converter obtains high step-up gain

without operating at extreme duty ratio. The configuration of the proposed converter not only

reduces the current stress but also constrains the input current ripple, which decreases the

conduction losses and lengthens the lifetime of the input source. In addition, due to the

lossless passive clamp performance, leakage energy is recycled to the output terminal. Hence,

large voltage spikes across the main switches are alleviated, and the efficiency is improved.

Moreover, the low voltage stress allows low-voltage-rated MOSFETs to be adopted, reducing

conduction losses and cost. Finally, the prototype circuit with 40-V input voltage, 380-V

output, and 1000-W output power is operated to verify its performance. The highest

efficiency is 97.1%.
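
A quick calculation shows why a plain boost converter would need an extreme duty ratio for this voltage step. The sketch assumes the textbook ideal boost gain 1 / (1 − D) in continuous conduction, with lossless components; it does not model the proposed voltage multiplier module, and the function name is an assumption.

```python
def ideal_boost_duty(v_in, v_out):
    """Duty ratio D of an ideal boost converter, from V_out / V_in = 1 / (1 - D)."""
    assert 0 < v_in <= v_out
    return 1.0 - v_in / v_out

# Prototype ratings from the abstract: 40 V in, 380 V out.
duty = ideal_boost_duty(40.0, 380.0)
print(f"ideal duty ratio: {duty:.3f}")
```

For the 40-V to 380-V prototype this gives a duty ratio of roughly 0.89, which is the kind of operating point the voltage multiplier module is meant to avoid.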

12. Ultra-High Throughput Low-Power Packet Classification

Packet classification is used by networking equipment to sort packets into flows by

comparing their headers to a list of rules, with packets placed in the flow determined by the

matched rule. A flow is used to decide a packet's priority and the manner in which it is

processed. Packet classification is a difficult task due to the fact that all packets must be

processed at wire speed and rulesets can contain tens of thousands of rules. The contribution

of this paper is a hardware accelerator that can classify up to 433 million packets per second

when using rulesets containing tens of thousands of rules with a peak power consumption of

only 9.03 W when using a Stratix III field-programmable gate array (FPGA). The hardware

accelerator uses a modified version of the HyperCuts packet classification algorithm, with a

new pre-cutting process used to reduce the amount of memory needed to save the search

structure for large rulesets so that it is small enough to fit in the on-chip memory of an FPGA.

The modified algorithm also removes the need for floating point division to be performed

when classifying a packet, allowing higher clock speeds and thus obtaining higher

throughputs.

13. Low-Cost Low-Power ASIC Solution for Both DAB+ and DAB Audio Decoding

DAB+ is the upgraded version of digital audio broadcasting (DAB). DAB and DAB+ coexist

in many countries, so receivers are required to be compatible with both standards. In this

paper, a solution integrating an MPEG1-LayerII (MP2) decoder and an advanced audio

coding (AAC) low-complexity (AAC LC) decoder is proposed to provide basic audio

decoding for both DAB and DAB+. It also utilizes simple methods to improve high

frequencies and stereo quality instead of complicated spectrum band replication and

parametric stereo. A highly integrated low-power audio decoder design compatible with

DAB/DAB+ and using a purely ASIC approach is presented. As a result of the system

structure optimization and hardware sharing, the audio decoder is fabricated in 1P4M 0.18-μm

CMOS technology using only 3.2 mm² of silicon area (including 147 456 bits of RAM and

170 496 bits of ROM). The power consumption of the audio decoder is 10.4 mW for DAB audio



decoding and 8.5 mW for DAB+ audio decoding. Laboratory and field tests show that the

function is correct and the audio quality is good for receiving both DAB and DAB+. The

audio decoder is thus proven to be a low-cost low-power solution for the two existing DAB

standards.

14. Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes

Radio communication exhibits the highest energy consumption in wireless sensor nodes.

Given their limited energy supply from batteries or scavenging, these nodes must trade data

communication for on-the-node computation. Currently, they are designed around off-the-

shelf low-power microcontrollers. But by employing a more appropriate processing element,

the energy consumption can be significantly reduced. This paper describes the design and

implementation of the newly proposed folded-tree architecture for on-the-node data

processing in wireless sensor networks, using parallel prefix operations and data locality in

hardware. Measurements of the silicon implementation show an improvement of 10-20× in

terms of energy as compared to traditional modern micro-controllers found in sensor nodes.
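
A parallel prefix operation of the kind mentioned above can be sketched as a generic inclusive scan. The version below is a Hillis-Steele style scan that finishes in about log2(n) rounds, with every element of a round computable in parallel; it is a software illustration under that assumption and not the folded-tree hardware itself. The function name is hypothetical.

```python
import operator

def parallel_prefix(values, op):
    """Inclusive scan: out[i] = values[0] op values[1] op ... op values[i].
    `op` must be associative."""
    out = list(values)
    step = 1
    while step < len(out):
        # All positions in one round are independent of each other.
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out

running_sum = parallel_prefix([3, 1, 4, 1, 5], operator.add)
running_max = parallel_prefix([3, 1, 4, 1, 5], max)
```

Running sums, running maxima, and similar reductions over sensor samples all fit this pattern by changing `op`.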

15. Area-Delay-Power Efficient Carry-Select Adder

In this brief, the logic operations involved in conventional carry select adder (CSLA) and

binary to excess-1 converter (BEC)-based CSLA are analyzed to study the data dependence

and to identify redundant logic operations. We have eliminated all the redundant logic

operations present in the conventional CSLA and proposed a new logic formulation for

CSLA. In the proposed scheme, the carry select (CS) operation is scheduled before the

calculation of final-sum, which is different from the conventional approach. Bit patterns of

two anticipating carry words (corresponding to $c_{\mathrm{in}} = 0$ and $1$) and

fixed $c_{\mathrm{in}}$ bits are used for logic optimization of CS and generation units. An

efficient CSLA design is obtained using optimized logic units. The proposed CSLA design

involves significantly less area and delay than the recently proposed BEC-based CSLA. Due

to the small carry-output delay, the proposed CSLA design is a good candidate for square-

root (SQRT) CSLA. A theoretical estimate shows that the proposed SQRT-CSLA involves

nearly 35% less area-delay product (ADP) than the BEC-based SQRT-CSLA, which is the best

among the existing SQRT-CSLA designs, on average, for different bit-widths. The

application-specific integrated circuit (ASIC) synthesis result shows that the BEC-based

SQRT-CSLA design involves 48% more ADP and consumes 50% more energy than the

proposed SQRT-CSLA, on average, for different bit-widths.
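
The conventional carry-select scheme that the brief starts from can be modeled at bit level as follows. Each block computes two candidate sums, one for carry-in 0 and one for carry-in 1, and the incoming carry selects between them. This is a sketch of the conventional CSLA only; the proposed logic formulation, which schedules the carry-select step before the final-sum calculation, is not reproduced. The function names and the uniform `block` size are assumptions.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """Ripple-carry addition of two little-endian bit lists."""
    sums = []
    c = carry_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return sums, c

def carry_select_add(a, b, width, block):
    """Add two unsigned `width`-bit integers block by block.
    Returns (sum modulo 2**width, carry out)."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    carry = 0
    result = 0
    for lo in range(0, width, block):
        hi = min(lo + block, width)
        # Both candidates are computed without waiting for the real carry.
        s0, c0 = ripple_add(a_bits[lo:hi], b_bits[lo:hi], 0)
        s1, c1 = ripple_add(a_bits[lo:hi], b_bits[lo:hi], 1)
        sums, carry = (s1, c1) if carry else (s0, c0)   # carry-select mux
        for i, bit in enumerate(sums):
            result |= bit << (lo + i)
    return result, carry
```

A square-root CSLA would use growing block sizes instead of a fixed `block`, so that each block's candidates are ready just as its select carry arrives.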

18. Bit-Level Optimization of Adder-Trees for Multiple Constant Multiplications for

Efficient FIR Filter Implementation

The multiple constant multiplication (MCM) scheme is widely used for implementing transposed

direct-form FIR filters. While the research focus of MCM has been on more effective



common subexpression elimination, the optimization of adder-trees, which sum up the

computed sub-expressions for each coefficient, is largely omitted. In this paper, we have

identified the resource minimization problem in the scheduling of adder-tree operations for

the MCM block, and presented a mixed integer programming (MIP) based algorithm for

more efficient MCM-based implementation of FIR filters. Experimental results show that up

to 15% reduction of area and 11.6% reduction of power (with an average of 8.46% and

5.96% respectively) can be achieved on top of the already optimized adder/subtractor

network of the MCM block.
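
The sketch below shows what an MCM block does in a transposed direct-form FIR filter: one input sample is multiplied by all coefficients at once using only shifts and additions, with intermediate terms shared between constants. The coefficients 5, 13, and 45 are hypothetical values picked for the example, and the adder-tree scheduling optimization described in the abstract is not modeled.

```python
def mcm_block(x):
    """Products of x with the constants 5, 13, and 45 using three adders."""
    t5 = (x << 2) + x        # 5x  = 4x + x
    t13 = (x << 3) + t5      # 13x = 8x + 5x   (reuses 5x)
    t45 = (t5 << 3) + t5     # 45x = 8*5x + 5x (reuses 5x)
    return {5: t5, 13: t13, 45: t45}

def transposed_fir(samples, coeffs=(5, 13, 45)):
    """3-tap transposed direct-form FIR filter fed by the MCM block."""
    regs = [0, 0]            # delay registers between the tap adders
    out = []
    for x in samples:
        p = mcm_block(x)
        out.append(p[coeffs[0]] + regs[0])
        regs[0] = p[coeffs[1]] + regs[1]
        regs[1] = p[coeffs[2]]
    return out
```

Feeding in a unit impulse returns the coefficients in order, which is a quick check that the shared shift-add terms are wired correctly.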

19. Improved matrix multiplier design for high-speed digital signal processing

applications

A transistor level implementation of an improved matrix multiplier for high-speed digital

signal processing applications based on matrix element transformation and multiplication is

reported in this study. The improvement in speed was achieved by rearranging the matrix

element into a two-dimensional array of processing elements interconnected as a mesh. The

edges of each row and column were interconnected in torus structure, facilitating

simultaneous implementation of several multiplications. The functionality of the circuitry

was verified and the performance parameters, for example, propagation delay and dynamic

switching power consumption, were calculated using SPICE (Spectre) in 90 nm CMOS

technology. The proposed methodology ensures substantial reduction in propagation delay

compared with the conventional algorithm, systolic array and pseudo number theoretic

transformation (PNTT)-based implementation, which are the most commonly used

techniques, for matrix multiplication. The propagation delay of the implemented 4 × 4

matrix multiplier was only ~2 μs, whereas the power consumption of the implemented 4 × 4

matrix multiplier was ~3.12 mW only. Improvement in speed compared with earlier reported

matrix multipliers, for example, conventional algorithm, systolic array and PNTT-based

implementation was found to be ~67, ~56 and ~65%, respectively.
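
One well-known way to multiply matrices on a mesh of processing elements with wrap-around (torus) links is Cannon's algorithm, sketched below. The abstract does not say which data movement scheme the reported design uses, so treating it as Cannon's algorithm is an assumption made purely to illustrate how torus connections allow one multiplication per processing element in every step.

```python
def cannon_multiply(a, b):
    """Multiply two n x n matrices on a simulated n x n torus of PEs."""
    n = len(a)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every PE performs one multiply-accumulate per step.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Torus shifts: A moves one PE left, B moves one PE up, with wrap-around.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

For a 4 × 4 product, the 16 simulated processing elements finish in four multiply-accumulate steps.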

20. A Novel Distortion Model and Lagrangian Multiplier for Depth Maps Coding

In three-dimensional video (3-DV) coding systems, depth maps are not used for viewing but

for rendering virtual views. Therefore, the traditional rate distortion criterion (including

distortion criterion, and Lagrangian multiplier) is not suitable for depth map coding. In order

to design an effective rate distortion criterion for depth maps, the relationship between the

distortion of synthesized virtual view and the coding error of depth maps is analyzed in detail.



Through the analysis, a polynomial model revealing the relationship between the coding error

of depth maps and the distortion of synthesized virtual view is derived. Model parameters are

estimated by utilizing camera parameters and features of the texture video corresponding to

the depth map. Based on the model, a virtual view-based Lagrangian multiplier for depth map

coding is also proposed. Experimental results demonstrated the accuracy of the model. The

squared correlation coefficients between the actual distortion of virtual view and the

estimated distortion are all larger than 0.98 for all tested sequences. When incorporating the

proposed model and Lagrangian multiplier into the mode decision procedure of joint model

version 18.5 (JM18.5) of H.264/AVC, a maximum 0.470 dB BD PSNR and an average 0.251

dB BD PSNR can be achieved.

21. Dual-Basis Superserial Multipliers for Secure Applications and Lightweight

Cryptographic Architectures

Cryptographic algorithms utilize finite-field arithmetic operations in their computations. Due

to the constraints of the nodes which benefit from the security and privacy advantages of

these algorithms in sensitive applications, these algorithms need to be lightweight. One of the

well-known bases used in sensitive computations is dual basis (DB). In this brief, we present

low-complexity superserial architectures for the DB multiplication over GF(2^m). To the best

of our knowledge, this is the first time that such a multiplier is proposed in the open

literature. We have performed complexity analysis for the proposed lightweight architectures,

and the results show that the hardware complexity of the proposed superserial multiplier is

reduced compared with that of regular serial multipliers. This has been also confirmed

through our application-specific integrated circuit hardware- and time-equivalent estimations.

The proposed superserial architecture is a step forward toward efficient and lightweight

cryptographic algorithms and is suitable for constrained implementations of cryptographic

primitives in applications such as smart cards, handheld devices, life-critical wearable and

implantable medical devices, and constrained nodes in the blooming notion of Internet of

nano-Things.
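
For orientation, the sketch below performs a bit-serial multiplication over GF(2^m), processing one bit of the second operand per iteration. It works in the ordinary polynomial basis, not the dual basis used in the brief, and it does not reproduce the superserial architecture; it only shows the shift, conditional XOR, and modular reduction steps that such multipliers are built from. The function name and argument layout are assumptions.

```python
def gf2m_multiply(a, b, m, poly):
    """Multiply a and b in GF(2^m), polynomial basis.
    Elements are integers whose bits are polynomial coefficients;
    `poly` is the irreducible polynomial including its x^m term."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= poly            # reduce modulo the field polynomial
    return result
```

With `m = 8` and `poly = 0x11B` this is the field used by AES, which makes published test vectors such as 0x57 · 0x83 = 0xC1 convenient for checking.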

22. Multifunction Residue Architectures for Cryptography

A design methodology for incorporating Residue Number System (RNS) and Polynomial

Residue Number System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n)

respectively, as well as a VLSI architecture of a dual-field residue arithmetic

Montgomery multiplier are presented in this paper. An analysis of input/output conversions



to/from residue representation, along with the proposed residue Montgomery multiplication

algorithm, reveals common multiply-accumulate data paths both between the converters and

between the two residue representations. A versatile architecture is derived that supports all

operations of Montgomery multiplication in GF(p) and GF(2^n), input/output conversions,

Mixed Radix Conversion (MRC) for integers and polynomials, dual-field modular

exponentiation and inversion in the same hardware. Detailed comparisons with state-of-the-

art implementations prove the potential of residue arithmetic exploitation in dual-field

modular multiplication.
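
Montgomery modular multiplication in GF(p), the operation this architecture accelerates, can be sketched with plain integers as below. This is the standard single-modulus form with R = 2^bits, not the RNS or PRNS variant developed in the paper, and the helper names are assumptions. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def montgomery_setup(p, bits):
    """Return R = 2**bits and -p^{-1} mod R for an odd modulus p < R."""
    r = 1 << bits
    assert p % 2 == 1 and p < r
    p_inv_neg = (-pow(p, -1, r)) % r
    return r, p_inv_neg

def montgomery_multiply(a, b, p, bits, p_inv_neg):
    """Return a * b * R^{-1} mod p for 0 <= a, b < p, without dividing by p."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * p_inv_neg) & mask   # makes t + m*p divisible by R
    u = (t + m * p) >> bits               # exact division by R is a shift
    return u - p if u >= p else u

def mod_multiply(a, b, p, bits):
    """a * b mod p via conversion to and from the Montgomery domain."""
    r, p_inv_neg = montgomery_setup(p, bits)
    a_m, b_m = (a * r) % p, (b * r) % p
    prod_m = montgomery_multiply(a_m, b_m, p, bits, p_inv_neg)
    return montgomery_multiply(prod_m, 1, p, bits, p_inv_neg)
```

The conversions into and out of the Montgomery domain are only worthwhile when many multiplications are chained, as in modular exponentiation.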
