
Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Ramnad

Pondicherry | Salem | Erode | Tirunelveli

http://www.elysiumtechnologies.com, [email protected]






ETPL NT-001 Answering “What-If” Deployment and Configuration Questions With WISE: Techniques and Deployment Experience

ETPL NT-002 Complexity Analysis and Algorithm Design for Advance Bandwidth Scheduling in Dedicated Networks

ETPL NT-003 Diffusion Dynamics of Network Technologies With Bounded Rational Users: Aspiration-Based Learning

ETPL NT-004 Delay-Based Network Utility Maximization

ETPL NT-005 A Distributed Control Law for Load Balancing in Content Delivery Networks

ETPL NT-006 Efficient Algorithms for Neighbor Discovery in Wireless Networks

ETPL NT-007 Stochastic Game for Wireless Network Virtualization

ETPL NT-008 ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification

ETPL NT-009 A Utility Maximization Framework for Fair and Efficient Multicasting in Multicarrier Wireless Cellular Networks

ETPL NT-010 Achieving Efficient Flooding by Utilizing Link Correlation in Wireless Sensor Networks

ETPL NT-011 Random Walks and Green's Function on Digraphs: A Framework for Estimating Wireless Transmission Costs

ETPL NT-012 A Flexible Platform for Hardware-Aware Network Experiments and a Case Study on Wireless Network Coding

ETPL NT-013 Exploring the Design Space of Multichannel Peer-to-Peer Live Video Streaming Systems

ETPL NT-014 Secondary Spectrum Trading—Auction-Based Framework for Spectrum Allocation and Profit Sharing

ETPL NT-015 Towards Practical Communication in Byzantine-Resistant DHTs

ETPL NT-016 Semi-Random Backoff: Towards Resource Reservation for Channel Access in Wireless LANs

ETPL NT-017 Entry and Spectrum Sharing Scheme Selection in Femtocell Communications Markets

ETPL NT-018 On Replication Algorithm in P2P VoD

ETPL NT-019 Back-Pressure-Based Packet-by-Packet Adaptive Routing in Communication Networks

ETPL NT-020 Scheduling in a Random Environment: Stability and Asymptotic Optimality

ETPL NT-021 An Empirical Interference Modeling for Link Reliability Assessment in Wireless Networks

ETPL NT-022 On Downlink Capacity of Cellular Data Networks With WLAN/WPAN Relays

ETPL NT-023 Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management

ETPL NT-024 Localization of Wireless Sensor Networks in the Wild: Pursuit of Ranging Quality

ETPL NT-025 Control of Wireless Networks With Secrecy

ETPL NT-026 ICTCP: Incast Congestion Control for TCP in Data-Center Networks

ETPL NT-027 Context-Aware Nanoscale Modeling of Multicast Multihop Cellular Networks

ETPL NT-028 Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information

ETPL NT-029 Internet-Scale IPv4 Alias Resolution With MIDAR

ETPL NT-030 Time-Bounded Essential Localization for Wireless Sensor Networks

ETPL NT-031 Stability of FIPP p-Cycles Under Dynamic Traffic in WDM Networks

ETPL NT-032 Cooperative Carrier Signaling: Harmonizing Coexisting WPAN and WLAN Devices

ETPL NT-033 Mobility Increases the Connectivity of Wireless Networks

ETPL NT-034 Topology Control for Effective Interference Cancellation in Multiuser MIMO Networks

ETPL NT-035 Distortion-Aware Scalable Video Streaming to Multinetwork Clients

ETPL NT-036 Combined Optimal Control of Activation and Transmission in Delay-Tolerant Networks

ETPL NT-037 A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless















A fast, area-efficient very large scale integration (VLSI) architecture is proposed [...]

of interpolation modules. The algorithm has a regular structure which makes it

suitable for VLSI implementation. The circuitry is simplified as the decoding algorithm directly gives the

message word at the end of the decoding algorithm without separate [...]

Further speed improvements can be achieved by combining the main idea of Guruswami list decoding with

the Lee-O'Sullivan algorithm. In

terms of hardware, the addition of this concept will further reduce the running time of the algorithm and

make the circuitry [...]

[...] Lee-O'Sullivan algorithms on Xilinx Virtex-5 shows that the proposed

decoder can be operated at a higher clock frequency with almost the same area complexity.

ETPL

VLSI - 001

Efficient VLSI Architecture For Interpolation Decoding Of Hermitian

Codes

This paper presents a practical method for designing fixed-point FIR filters. The proposed method takes both

the filter's magnitude response and its hardware cost into consideration in the design process. The method

constructs a basis set based on the fixed-point coefficients that have been synthesized already. The elements in

the basis set are used to synthesize the undetermined fixed-point coefficients later. Thus, this basis set expands

gradually along with the progress of the coefficient design. The method employs some strategies to speed up

the design process. For example, a complexity estimation strategy helps us stop digging deeper in some

branches of the search tree, and a solution prediction strategy for high-order FIR filters helps us design

fixed-point FIR filters with lengths of a few hundred. Applying the proposed method to design twenty

benchmark cases, we can obtain hardware-efficient results in a reasonable design time. In two long filter

design cases, our design results are better than those designed by the other methods.

ETPL

VLSI - 002

Designing Hardware-Efficient Fixed-Point FIR Filters In An Expanding

Subexpression Space
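
As a rough illustration of the hardware-cost side of this design problem, the sketch below quantizes FIR coefficients to fixed point and estimates a multiplierless adder count from the canonical signed digit (CSD) form of each coefficient. CSD costing is a common textbook proxy and is an assumption here. It is not the expanding-subexpression search of the listed paper, and all function names are hypothetical.

```python
def to_csd(n: int) -> list[int]:
    """CSD digits of integer n, least significant first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n % 4)  # +1 when n = 1 (mod 4), -1 when n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def quantize(coeffs: list[float], frac_bits: int) -> list[int]:
    """Round real coefficients to integers scaled by 2**frac_bits."""
    return [round(c * (1 << frac_bits)) for c in coeffs]


def adder_cost(int_coeffs: list[int]) -> int:
    """Adders needed if each coefficient is built from shifts and adds alone,
    with no sharing between coefficients (an upper bound on a shared design)."""
    cost = 0
    for c in int_coeffs:
        nonzero = sum(1 for d in to_csd(c) if d != 0)
        cost += max(nonzero - 1, 0)
    return cost
```

For instance, the coefficient 7 has CSD form 8 - 1, so it costs one adder instead of the two that the plain binary form 4 + 2 + 1 would need. Sharing subexpressions across coefficients, which is the subject of the abstract, can reduce the total further.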








Emerging non-volatile memories (NVMs) based on resistive switching (RS) mechanisms, such as STT-MRAM, OxRRAM,

and CBRAM, are under intense R&D investigation by both academia and industry. They provide high

write/read speed, low power, and good endurance (e.g., > 10^12) beyond mainstream NVMs, which allows them

to be embedded directly with logic units for computing purposes. This integration could significantly increase

the power/die area efficiency and thereby definitively overcome the power/speed bottlenecks of modern VLSIs.

This paper first presents a theoretical investigation of synchronous NV logic gates based on RS memories

(RS-NVL). Special design techniques and strategies are proposed to optimize the structure according to the different

resistive characteristics of NVMs. To validate this study, we simulated a non-volatile full-adder (NVFA) with two

types of NVMs, STT-MRAM and OxRRAM, using a CMOS 40 nm design kit and compact models, which include

related physics and experimental parameters. They show interesting power, speed, and area gains compared

with a synchronized CMOS FA while keeping good reliability.

ETPL

VLSI - 003

Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching

Memories


Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work,

we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We

investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth

(MB) form. We introduce a structured and efficient recoding technique and explore

three different schemes by incorporating them in FAM designs. Comparing them with the FAM designs which

use existing recoding schemes, the proposed technique yields considerable reductions in terms of critical

delay, hardware complexity and power consumption of the FAM unit.

ETPL

VLSI - 004

An Optimized Modified Booth Recoder for Efficient Design of the Add-

Multiply Operator
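
To make the Modified Booth (MB) form concrete, here is a minimal sketch of standard radix-4 MB recoding and of an add-multiply that recodes the sum A+B before multiplying. It adds first and recodes afterwards, which is the conventional arrangement. The direct recoding of the sum proposed in the listed paper is not reproduced, and the function names and the default bit width are assumptions.

```python
def modified_booth_digits(x: int, nbits: int) -> list[int]:
    """Radix-4 MB digits (each in -2..2) of a two's-complement value x that
    fits in nbits bits (nbits even), least significant digit first."""
    if nbits % 2 or not -(1 << (nbits - 1)) <= x < (1 << (nbits - 1)):
        raise ValueError("x must fit in an even number of bits")

    def bit(i: int) -> int:
        # Bit -1 is defined as 0; Python's >> sign-extends negative integers.
        return 0 if i < 0 else (x >> i) & 1

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(nbits // 2)]


def fused_add_multiply(a: int, b: int, x: int, nbits: int = 8) -> int:
    """Compute (a + b) * x as a sum of MB-selected partial products."""
    digits = modified_booth_digits(a + b, nbits)
    return sum(d * x * 4**j for j, d in enumerate(digits))
```

Each MB digit selects one partial product from {0, ±X, ±2X}, so an n-bit operand needs only n/2 partial products.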








In this brief, a low-power flip-flop (FF) design featuring an explicit type pulse-triggered structure and a

modified true single phase clock latch based on a signal feed-through scheme is presented. The proposed

design successfully solves the long discharging path problem in conventional explicit type pulse-triggered FF

(P-FF) designs and achieves better speed and power performance. Based on post-layout simulation results

using TSMC CMOS 90-nm technology, the proposed design outperforms the conventional P-FF design

data-close-to-output (ep-DCO) by 8.2% in data-to-Q delay. Meanwhile, the performance edges on power and

power-delay-product metrics are 22.7% and 29.7%, respectively.

ETPL

VLSI - 005

Low-Power Pulse-Triggered Flip-Flop Design Based on a Signal Feed-Through

In this paper, we present an efficient architecture for the implementation of a delayed least mean square

adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient implementation, we use a

novel partial product generator and propose a strategy for optimized balanced pipelining across the time-

consuming combinational blocks of the structure. From synthesis results, we find that the proposed design

offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the

best of the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an efficient

fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state

error. We show that the steady-state mean squared error obtained from the analytical result matches with the

simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which

provides nearly 20% saving in ADP and 9% saving in EDP over the

proposed structure before pruning without noticeable degradation of steady-state-error performance.

ETPL

VLSI - 006

Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low

Adaptation-Delay
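
The delayed LMS update that this entry builds on can be sketched in a few lines of floating-point Python. The weight update at time n uses the error and input vector from `delay` samples earlier. The two-tap test system, the step size, and all names below are illustrative assumptions. The fixed-point arithmetic, pipelining, and pruning described in the abstract are not modelled.

```python
import random


def delayed_lms(x, d, num_taps, mu, delay):
    """Delayed LMS: w(n+1) = w(n) + mu * e(n - delay) * x(n - delay)."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps   # most recent input samples, newest first
    pending = []             # (error, input vector) pairs awaiting use
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))
        e = dn - y
        errors.append(e)
        pending.append((e, buf))
        if len(pending) > delay:
            e_old, x_old = pending.pop(0)
            w = [wi + mu * e_old * xi for wi, xi in zip(w, x_old)]
    return w, errors


def demo():
    """Identify a hypothetical noise-free 2-tap system with adaptation delay 2."""
    rng = random.Random(0)
    h = [0.5, -0.3]
    x = [rng.uniform(-1, 1) for _ in range(2000)]
    d = [h[0] * x[n] + (h[1] * x[n - 1] if n else 0.0) for n in range(len(x))]
    return delayed_lms(x, d, num_taps=2, mu=0.05, delay=2)[0]
```

Setting `delay=0` gives the ordinary LMS algorithm. Larger delays tolerate more pipelining in hardware but generally require a smaller step size to stay stable.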








The reliability of data stored in high-density Flash memory devices tends to decrease rapidly because of the

reduced cell size and multilevel cell technology. Soft-decision error correction algorithms that use multiple-

precision sensing for reading memory can solve this problem; however, they require very complex hardware

for high-throughput decoding. In this paper, we present a rate-0.96 (68254, 65536) shortened Euclidean

geometry low-density parity-check code and its VLSI implementation for high-throughput NAND Flash

memory systems. The design employs the normalized a posteriori probability (APP)-based algorithm, serial

schedule, and conditional update, which lead to simple functional units, halved decoding iterations, and low-

power consumption, respectively. A pipelined-parallel architecture is adopted for high-throughput decoding,

and memory-reduction techniques are employed to minimize the chip size. The proposed decoder is

implemented in 0.13-μm CMOS [...] of the decoder are

compared with those of a BCH (Bose-Chaudhuri-Hocquenghem) decoding circuit showing comparable error-

correcting performance and throughput.

ETPL

VLSI - 007

Rate-0.96 LDPC Decoding VLSI for Soft-Decision Error Correction of NAND

Flash Memory

The excellent finite wordlength (FWL) property of lattice digital filters is well known. The four-multiplier

normalized lattice, with signal power at all delay elements normalized to unity, has particular advantage in its

overflow property. However, when used to implement an Nth-order digital filter, the normalized lattice

implementation requires 5N+1 multipliers. There exists another lattice structure with excellent FWL property

called the injected numerator lattice structure. In this paper, we combine the injected numerator lattice and

tapped numerator lattice to form a new hybrid lattice structure, which is not only canonic in the number of

multipliers resulting in a significant reduction in overall implementation cost but also exhibits much better

FWL properties than the normalized lattice. [...]

for application where the input signal has a strong time varying sinusoidal component. The new structure

requires a few additional adders; it can be used to implement any causal and stable z-transform transfer

function. Two numerical examples are presented to demonstrate the performance of the proposed structure.

ETPL

VLSI - 008

A Generalized Lattice Filter for Finite Wordlength Implementation With

Reduced Number of Multipliers

















provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning without

noticeable degradation of steady-state-error performance.

ETPL

VLSI - 009


Adaptation-Delay

Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit

approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these

codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that

occurs at very low error rates. Recently, a new type of decoder, termed finite alphabet iterative decoders

(FAIDs), was introduced. FAIDs use simple Boolean maps for variable node processing and can surpass

the BP-based decoders in the error floor region with very short word length. We restrict the scope of this paper

to regular dv=3 LDPC codes on the BSC channel. This paper develops a low-complexity implementation

architecture for the FAIDs by making use of their properties. Particularly, an innovative bit-serial check node

unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in

the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware

utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% area to

reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807,

7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to

an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with

45% area, and still leads to possible performance improvement in the error-floor region.

ETPL

VLSI - 010

Finite Alphabet Iterative Decoders for LDPC Codes: Optimization,

Architecture and Analysis








A multiplier-less architecture based on algebraic integer representation for computing the Daubechies

6-tap wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on

previous designs in a sense that it minimizes the number of parallel 2-input adder circuits. The

algorithm was achieved using brute-force numerical optimization of the algebraic integer

representation. The proposed architecture furnishes exact computation up to the final reconstruction

step, which is the operation that maps the exactly computed filtered results from algebraic integer

representation to fixed-point. Compared to our recent work, this architecture shows a reduction of

$27 \cdot n - 16$ adder circuits, where $n$ is the number of wavelet decomposition levels. The design

is physically implemented for a 4-level 1-D/2-D decomposition using a Xilinx Virtex-6 vcx240t-

1ff1156 field programmable gate array (FPGA) device operating at up to a maximum clock

frequency of 344/168 MHz. The FPGA implementations of the 1-D/2-D designs are tested using hardware

co-simulation on an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs

shows an improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.

ETPL

VLSI - 011

Precise VLSI Architecture for AI Based 1-D/ 2-D Daub-6 Wavelet Filter Banks

With Low Adder-Count

Quantum-dot cellular automata (QCA) are an attractive emerging technology suitable for the development of

ultra-dense low-power high-performance digital circuits. Efficient solutions have recently been proposed for

several arithmetic circuits, such as adders, multipliers, and comparators. Nevertheless, since the design of

digital circuits in QCA still poses several challenges, novel implementation strategies and methodologies are

highly desirable. This paper proposes a new design approach oriented to the implementation of binary

comparators in QCA. New formulations of basic logic equations required to perform the comparison function

are proposed. The new strategy has been exploited in the design of two different comparator architectures and

for several operand word lengths. With respect to existing counterparts, the comparators proposed here

exhibit significantly higher speed and reduced overall area.

ETPL

VLSI - 012

Design of Efficient Binary Comparators in Quantum-Dot Cellular Automata
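
For orientation, the comparison function itself can be written with the textbook bitwise equations: A > B when, at the most significant bit position where the operands differ, A has a 1 and B has a 0. The sketch below is that textbook formulation only. It does not reproduce the new logic formulations or the QCA layouts of the listed paper, and the function name is hypothetical.

```python
def compare(a: int, b: int, nbits: int) -> tuple[bool, bool, bool]:
    """Return (a > b, a == b, a < b) for unsigned nbits-wide operands."""
    gt = lt = False
    eq = True  # True while all more significant bits examined so far match
    for i in reversed(range(nbits)):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gt = gt or (eq and ai == 1 and bi == 0)
        lt = lt or (eq and ai == 0 and bi == 1)
        eq = eq and ai == bi
    return gt, eq, lt
```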








A transistor level implementation of an improved matrix multiplier for high-speed digital signal processing

applications based on matrix element transformation and multiplication is reported in this study. The

improvement in speed was achieved by rearranging the matrix element into a two-dimensional array of

processing elements interconnected as a mesh. The edges of each row and column were interconnected in

torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the

circuitry was verified, and performance parameters such as propagation delay and dynamic switching

power consumption were calculated using SPICE/Spectre with 90 nm CMOS technology. The proposed

methodology ensures substantial reduction in propagation delay compared with the conventional algorithm,

systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most

commonly used techniques, for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix

[...] ~2 μ[...] × [...]

~3.12 mW only. Improvement in speed compared with earlier reported matrix multipliers, for example,

conventional algorithm, systolic array and PNTT-based implementation was found to be ~67, ~56 and ~65%,

respectively.

ETPL

VLSI - 013

Improved matrix multiplier design for high-speed digital signal processing

applications
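
A well-known way to multiply matrices on a mesh of processing elements with torus wrap-around links is Cannon's algorithm, sketched below as a software model. It is offered only as an illustration of the mesh and torus idea. Whether the listed paper uses this exact schedule is not stated in the abstract, so treat the choice of algorithm and the function name as assumptions.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices on a simulated n x n torus of PEs."""
    n = len(a)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every PE (i, j) multiplies its local pair and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Rotate A one step left along rows and B one step up along columns,
        # using the wrap-around (torus) links at the mesh edges.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

In each of the n steps all n² processing elements perform one multiply-accumulate at the same time, which is where the speed-up over a sequential loop comes from.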

We experimentally demonstrated high-speed logic operations of adiabatic quantum-flux-parametron (AQFP)

gates through the use of quantum-flux-latches (QFLs). In QFL-based high-speed test circuits (QHTCs), the

output data of the circuits under test (CUTs), which are driven by high-speed excitation currents, are stored in

QFLs and are slowly read out using low-speed excitation currents. We designed and fabricated three types of

QHTCs using QFLs with different circuit parameters, where the CUTs were buffer gates and AND gates. We

confirmed the correct operation of the buffer gates and AND gates at 1 GHz. The obtained bias margins of the 1

GHz excitation currents were more than ±30% for each QHTC, which is wide enough for high-speed logic

operations of AQFP gates.

ETPL

VLSI - 014

High-Speed Experimental Demonstration of Adiabatic Quantum-Flux-

Parametron Gates Using Quantum-Flux-Latches








A novel nonvolatile flip-flop based on spin-orbit torque magnetic tunnel junctions (SOT-MTJs) is proposed

for fast and ultralow energy applications. A case study of this nonvolatile flip-flop is considered. In addition to

the independence between writing and reading paths, which offers high reliability, the low-resistance writing

path enables high-speed and energy-efficient WRITE operation. We compare the SOT-MTJ performance

metrics with those of the spin transfer torque (STT)-MTJ. Based on accurate compact models, simulation results show

an improvement, which attains 20× in terms of WRITE energy per bit cell. At the same writing current and

supply voltage, the SOT-MTJ achieves a writing frequency 4× higher than the STT-MTJ.

ETPL

VLSI - 015

Spin Orbit Torque Non-Volatile Flip-Flop for High Speed and Low Energy

Applications

This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for

deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form

LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much

faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no

pipelining is required for implementing a direct-form LMS adaptive filter for most practical cases, and can be

realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on

these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no

adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays.

Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of existing direct-

form structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2 and 3

involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF) at a cost of 55.0% and

60.6% more area, respectively.

ETPL

VLSI - 016

Critical-Path Analysis and Low-Complexity Implementation of the LMS

Adaptive Algorithm








Algebraic side-channel attack (ASCA) is a typical technique that relies on a general solver to solve the

equations of a cipher and its side-channel leaks. It falls under analytical side-channel attack and can recover

the entire key at once. Many ASCAs are proposed against the AES, and they utilize the Gröbner basis-based,

SAT-based, or optimizer-based solver. The advantage of the general solver approach is its generic feature,

which can be easily applied to different cryptographic algorithms. The disadvantage is that it is difficult to take

into account the specialized properties of the targeted cryptographic algorithms. The results vary depending on

what type of solver is used, and the time complexity is quite high when considering the error-tolerant attack

scenarios. Thus, we were motivated to find a new approach that would lessen the influence of the general

solver and reduce the time complexity of ASCA. This paper proposes a new analytical side-channel attack on

AES by exploiting the incomplete diffusion feature in one AES round. We named our technique incomplete

diffusion analytical side-channel analysis (IDASCA). Different from previous ASCAs, IDASCA adopts a

specialized approach to recover the secret key of AES instead of the general solver. Extensive attacks are

performed against the software implementation of AES on an 8-bit microcontroller. Experimental results show

that: 1) IDASCA can exploit the side-channel leaks in all AES rounds using a single power trace; 2) it has less

time complexity and more robustness than previous ASCAs, especially when considering the error-tolerant

attack scenarios; and 3) it can calculate the reduced key search space of AES for the given amount of side-

channel leaks. IDASCA can also interpret the mechanism behind previous ASCAs on AES from a quantitative

perspective, such as why ASCA can work under unknown plaintext/ciphertext scenarios and what are the

extreme cases in ASCAs.

ETPL

VLSI - 017

Exploiting the Incomplete Diffusion Feature: A Specialized Analytical Side-Channel

Attack Against the AES and Its Application to Microcontroller Implementations








This paper presents the design and simulation of a power-efficient traffic light controller (PTLC). The main focus

is on the simulation and optimization of the PTLC design and on computing its speed of operation. In the conventional

system, power consumption and cost are high. The PTLC design is better than the conventional one in terms of

LUTs (number of gates), complexity, size, and cost. In this research paper, a novel PTLC is presented with a

minimum number of LEDs, which improves its performance and makes the design efficient in terms of

power and speed with respect to the conventional design. The conventional traffic light controller has been

implemented using microcontrollers and FPGAs. The research paper by Parasmani in 2013 described the use of an

FPGA to design an advanced traffic light controller that uses sensors to maintain continuous traffic

flow; as a result, its power consumption is high, which the PTLC design can reduce. The novel PTLC design

is economical and offers high integration, low power, and flexibility. The PTLC

has been implemented using an FPGA, which has many advantages, such as speed, number of input/output ports,

and performance. The system has been successfully tested and implemented in hardware using the Xilinx v 10.1

software packages and Very High Speed Integrated Circuit Hardware Description Language (VHDL). RTL and

technology schematics are included to validate the simulation results.

ETPL

VLSI - 019

Design and simulation of power efficient traffic light controller (PTLC)
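
At its core a traffic light controller is a small finite-state machine driven by a clock. The sketch below models one as a Python generator. The four states and their durations in clock ticks are hypothetical values chosen for illustration. The listed work describes its controller in VHDL, and none of its design details are reproduced here.

```python
from itertools import islice

# Hypothetical state sequence for a two-road junction: (state name, ticks).
SEQUENCE = [
    ("NS_GREEN", 3),
    ("NS_YELLOW", 1),
    ("EW_GREEN", 3),
    ("EW_YELLOW", 1),
]


def controller():
    """Yield the active state once per clock tick, cycling forever."""
    while True:
        for state, ticks in SEQUENCE:
            for _ in range(ticks):
                yield state


def first_ticks(count: int) -> list[str]:
    """Collect the controller output for the first `count` clock ticks."""
    return list(islice(controller(), count))
```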

Modern superscalar processors implement register renaming using either random access memory (RAM) or

content-addressable memories (CAM) tables. The design of these structures should address both access time

and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are

more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents

them from scaling with the number of physical registers and pipeline width, negatively impacting

performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–

CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM

provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM

enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar

processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme,

while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less

energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid

RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.

ETPL

VLSI - 018

Efficient Register Renaming and Recovery for High-Performance Processors
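
The recovery advantage of a CAM-style rename table can be seen in a toy model: the table has one entry per physical register holding a logical tag and a valid bit, so a checkpoint only needs to snapshot the valid bits and the free list. This is a generic sketch of CAM-based renaming under simplifying assumptions (no commit stage, so physical registers are never recycled). It is not the hybrid RAM–CAM scheme of the listed paper, and the class and method names are hypothetical.

```python
class CamRenamer:
    """Toy CAM-style rename table: one entry per physical register."""

    def __init__(self, num_logical: int, num_physical: int):
        # Initially logical register r lives in physical register r.
        self.tag = [p if p < num_logical else None for p in range(num_physical)]
        self.valid = [p < num_logical for p in range(num_physical)]
        self.free = list(range(num_logical, num_physical))

    def lookup(self, logical: int) -> int:
        # Associative search: the one valid entry whose tag matches.
        for p, (tag, valid) in enumerate(zip(self.tag, self.valid)):
            if valid and tag == logical:
                return p
        raise KeyError(logical)

    def rename_dest(self, logical: int) -> int:
        # Allocate a new physical register and invalidate the old mapping.
        old = self.lookup(logical)
        new = self.free.pop(0)
        self.valid[old] = False
        self.tag[new] = logical
        self.valid[new] = True
        return new

    def checkpoint(self):
        return list(self.valid), list(self.free)

    def recover(self, ckpt) -> None:
        # Restoring the valid bits re-exposes the old mappings at once.
        self.valid, self.free = list(ckpt[0]), list(ckpt[1])
```

A RAM-style table would instead be indexed by logical register number, which makes lookups cheaper but means a checkpoint has to copy the whole mapping.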








Energy efficiency has become an important design goal for networking equipment. Traditionally routers and

switches have been designed to minimize peak power consumption but they operate most of the time with

settings and traffic that is far from that peak. Therefore, many elements and functions of networking

equipment are being redesigned to improve energy efficiency. A common functionality in networking is flow

identification that is needed in many applications. Flow identification can be implemented with Content

Addressable Memories (CAMs) or alternatively with several data structures. Among those, one efficient

option is Cuckoo hashing that enables fast searches and high memory utilization at the cost of complicating

the insertion procedure. In this letter, first the energy efficiency of exact matching using Cuckoo hashing is

analyzed and then a technique is presented to improve the energy efficiency of Cuckoo hashing. The proposed

scheme is evaluated using a traffic monitoring application and compared with the traditional Cuckoo hashing.

The results show that significant energy savings can be obtained by using the proposed technique.

ETPL

VLSI - 020

Energy Efficient Exact Matching for Flow Identification with Cuckoo Affinity

Hashing
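
The trade-off described in this entry, fast searches at the cost of a more involved insertion, is visible in a minimal two-table cuckoo hash. The sketch below is plain cuckoo hashing for non-negative integer flow IDs. The two toy hash functions, the table size, and the kick limit are assumptions, and the energy-saving technique of the listed letter is not reproduced.

```python
class CuckooTable:
    """Two-table cuckoo hash keyed by non-negative integer flow IDs."""

    def __init__(self, size: int = 11, max_kicks: int = 16):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which: int, key: int) -> int:
        # Toy hash functions (assumption); real designs use stronger hashes.
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key: int):
        # A search reads at most one slot in each table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key: int, value) -> bool:
        """Insert a key that is not yet present, evicting occupants as needed."""
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._slot(which, item[0])
            item, self.tables[which][slot] = self.tables[which][slot], item
            if item is None:
                return True
            which = 1 - which  # the evicted item tries its other table
        # Gave up: the last evicted item is dropped in this sketch. A real
        # table would rehash or keep it in a small stash instead.
        return False
```

With the default size, keys 1, 12, and 23 all hash to slot 1 of the first table, so inserting them in that order exercises the eviction path while each lookup still touches at most two slots.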

Advanced computing systems suffer from high static power due to the rapidly rising leakage currents in deep

sub-micron MOS technologies. Fast access non-volatile memories (NVM) are under intense investigation to

be integrated in Flip-Flops or computing memories to allow system power-off in standby state and save power.

Spin Transfer Torque MRAM (STT-MRAM) is considered the most promising NVM to address this issue

thanks to its high speed, low power, and infinite endurance. However, one of the disadvantages of STT-

MRAM for the computing purpose is its relatively high write energy to build up Magnetic Flip-Flop (MFF). In

this paper, we propose a power-efficient MFF design architecture to address this challenge based on the

combination of checkpointing operation, power gating and self-enable mechanisms. Multi non-volatile

storages can be integrated locally in a conventional FF without significant area overhead benefiting from the

3-D implementation of STT-MRAM. We performed electrical simulations (i.e. transient and statistical) to

validate its functional behaviors and evaluate its performance by using an accurate SPICE model of

STT-MRAM and an industrial 40 nm CMOS design kit. The simulation results confirm its lower power

consumption compared to conventional CMOS FF and the other structures.

ETPL

VLSI - 021

Ultra Low Power Magnetic Flip-Flop Based on Checkpointing/Power Gating

and Self-Enable Mechanisms








A five-transistor dynamic ternary content addressable memory (CAM) is presented for high-density data

search applications. The data path and the search path are separated to avoid unwanted capacitive coupling at

the storage node. To increase the data retention time, the data lines are grounded and dummy search lines are

implemented for refresh operations. The proposed CAM cell is fabricated using a 130 nm CMOS process, and

[...] 8.99 μm². A [...] 64 × 128 search memory has a retention time of 2.84 ms at

room temperature with a 1.2 V supply voltage. The hardware search performance is compared with a

conventional software-based search scheme, running on two different systems with clock frequencies of more

than an order of magnitude faster. The hardware search engine exhibits comparable search speeds while

dissipating only 149 mW.

ETPL

VLSI - 023

Dynamic ternary CAM for hardware search engine
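
Functionally, a ternary CAM compares a search key against every stored word at once, where each stored bit is 0, 1, or "don't care". A minimal software model, assuming each entry is stored as a (value, care mask) pair of integers, is sketched below. The representation and the function name are illustrative and say nothing about the five-transistor cell circuit itself.

```python
def tcam_search(entries: list[tuple[int, int]], key: int) -> list[int]:
    """Return the indices of all entries matching key.

    Each entry is (value, care): a care bit of 1 means the key bit must
    equal the value bit, and a care bit of 0 means "don't care".
    """
    return [i for i, (value, care) in enumerate(entries)
            if (key ^ value) & care == 0]
```

In hardware all entries are compared in parallel in a single search cycle. The list comprehension here only models the result, not the timing.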

Due to the ubiquitous open-air links and the complex electromagnetic environment in satellite

communications, ensuring the security and reliability of information sent through satellite

links is an urgent problem. This paper combines AES (Advanced Encryption Standard) with

LDPC (Low-Density Parity-Check) codes to design a secure and reliable error correction method,

SEEC (Satellite Encryption and Error Correction). This method selects LDPC codes that are suitable for

satellite communications, uses the AES round key to control the encoding process, and

proposes a new algorithm for round key generation. Building on good error-correction performance in

satellite communications, the method improves the security of the system, achieves a shorter key size, and

thereby makes key management easier. MATLAB simulation shows that the method has strong error correction capability and

encryption effect.

ETPL

VLSI - 022

A joint encryption and error correction method used in satellite

communications









This paper introduces a reordered overlapped search mechanism for high-throughput low-energy content-

addressable memories (CAMs). Most mismatches can be found by searching a few bits of a search word. To

lower power dissipation, a word circuit is often divided into two sections that are sequentially searched or even

pipelined. Because of this process, most of the match lines in the second section are unused. Since searching the

last few bits is very fast compared to searching the rest of the bits, we propose to increase throughput by

asynchronously initiating second-stage searches on the unused match lines as soon as a first-stage search is

complete. In our circuit implementation, each word circuit is independently controlled by a locally generated

timing signal rather than a global signal. This allows the circuits to be in the required phase for their own local

operation: evaluate or precharge, instead of having to synchronize their phase to the rest of the word circuits,

which greatly reduces the cycle time. As a design example, a 128 × 64-bit CAM is implemented and evaluated

by HSPICE simulation under a 90 nm CMOS technology. The proposed asynchronous CAM operates 5.98

times faster than a synchronous CAM with 14.2% smaller energy dissipation. The post-layout proposed CAM

achieves 385-ps cycle delay time and 0.773 fJ/bit/search and is also evaluated under different corner

conditions and PVT variations to guarantee it operates properly.

ETPL

VLSI - 024

High-Throughput Low-Energy Self-Timed CAM Based on Reordered

Overlapped Search Mechanism
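The two-section search described in the abstract above can be illustrated with a small functional model. This is only a sketch: the function name, the 8-bit first section, and the integer word encoding are assumptions for illustration, and the model runs sequentially, so it shows why few second-stage match lines are active but does not model the self-timed overlap of stages.

```python
def segmented_cam_search(words, key, width=64, first_bits=8):
    """Functional model of a two-section CAM search.

    Returns (indices of matching words, number of second-stage comparisons).
    The split into `first_bits` and the remaining bits is an assumed example.
    """
    low_mask = (1 << first_bits) - 1
    high_mask = ((1 << width) - 1) & ~low_mask
    matches = []
    second_stage = 0
    for index, word in enumerate(words):
        diff = word ^ key
        if diff & low_mask:
            # Stage 1 mismatch: the second section of this word is never searched.
            continue
        second_stage += 1
        if not (diff & high_mask):
            matches.append(index)
    return matches, second_stage


stored = [0x1234, 0xFF34, 0x1235, 0x1234]
hits, stage2_count = segmented_cam_search(stored, 0x1234, width=16, first_bits=8)
```

In this example only three of the four words reach the second stage, which is the idle capacity the abstract proposes to reuse.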

This paper presents a novel high-speed BCH decoder that corrects double-adjacent and single-bit errors in

parallel and serially corrects multiple-bit errors other than double-adjacent errors. Its operation is based on

extending an existing parallel BCH decoder that can only correct single-bit errors and serially corrects double-

adjacent errors at low speed. The proposed decoder is constructed by a novel design and is suitable for

nanoscale memory systems, in which multiple-bit errors occur at a probability comparable to single-bit errors

and double-adjacent errors occur at a higher probability (nearly two orders of magnitude) than other multiple-

bit errors. Extensive simulation results are reported. Compared with the existing scheme, the area and delay

time of the proposed decoder are on average 11% and 6% higher, but its power consumption is reduced by 9%

on average. This paper also shows that the area, delay, and power overheads incurred by the proposed scheme

are significantly lower than traditional fully parallelized BCH decoders capable of correcting any double-bit

errors in parallel.

ETPL

VLSI - 025

A Single-Bit and Double-Adjacent Error Correcting Parallel Decoder for

Multiple-Bit Error Correcting BCH Codes








This paper presents an error compensation bias circuit added to a modified encoded Booth multiplier to

produce a high accuracy fixed-width multiplier. Fixed-width multipliers are employed in many digital signal

processing applications, as most of these systems employ iterative structures with fixed precision. The design

has been implemented in TSMC 180nm technology. The design is 14.6% faster than the fixed-width

multipliers. The design has 37.2% less truncation error as compared to direct truncated fixed width multiplier

(DTFM). The design is embedded with operand isolator technique to ensure low power operation when

employed in DSP applications.

ETPL

VLSI - 026

Design and implementation of high speed and high accuracy fixed-width

modified booth multiplier for DSP application

We have designed a full adder using the hybrid-CMOS logic style by dividing it into three modules so that it can

be optimized at various levels. The first module is an XOR-XNOR circuit, which generates full swing XOR and

XNOR outputs simultaneously and has good driving capability. It also consumes minimum power and

provides better delay performance. The second module is a sum circuit, which is also an XOR circuit and uses the carry

input and the output of the first module as inputs to generate the sum output. The third module is a carry circuit, which

uses the output of the first stage and the other inputs to generate the carry output. The proposed

full adder circuit reduces the power consumption, the delay between carry out and carry in, and the

PDP by 12 to 100% [...] 0.18 μm [...]

ETPL

VLSI - 027

A new design of low power high speed hybrid CMOS full adder

Low power digital circuits in cellular phones, laptops, or tablet computers have critical power consumption

limitations. Power consumption at process corners can vary as much as 50%. In order to optimize high-speed

logic circuit designs for low power needs, we need to accurately predict device to product aging across

process, temperature and voltage corners. In this talk, we focus on the impact of BTI aging at corners, the

Fmax guardband and its trade-off with power and performance.

ETPL

VLSI - 028

Design for reliability for low power digital circuits








Multipliers are the key block in high speed arithmetic logic units, multiply and accumulate units, digital

signal processing units etc. With the increasing constraints on delay, more and more emphasis is being laid on

the design of faster multipliers. To enhance speed, many modifications to the standard modified Booth

algorithm and Wallace tree methods for multiplier design have been made, and several new techniques are being

worked upon. Amongst these, Vedic multipliers based on Vedic mathematics are presently under focus due to

their being among the fastest and lowest power multipliers. There are sixteen sutras in Vedic multiplication, of which

"Urdhva Tiryakbhyam" is the most efficient one in terms of speed. A large number

of high speed Vedic multipliers have been proposed with the Urdhva Tiryakbhyam sutra. A few of them are

presented in this paper giving an insight into their methodology, merits and demerits. Compressor based Vedic

Multipliers show considerable improvements in speed and area efficiency over the conventional ones.

ETPL

VLSI - 029

High speed vedic multiplier designs-A review
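As an illustration of the Urdhva Tiryakbhyam ("vertical and crosswise") method named in the abstract above, the following sketch forms each result column from the crosswise products of operand bits and then propagates the carry. The function name and the 8-bit default width are assumptions; this is a software model of the arithmetic, not any of the reviewed hardware designs.

```python
def urdhva_multiply(a, b, n=8):
    """Multiply two n-bit unsigned integers column by column.

    Column k collects every product a_i * b_j with i + j == k
    (the "vertical and crosswise" step), plus the incoming carry.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result = 0
    carry = 0
    for k in range(2 * n - 1):
        first = max(0, k - n + 1)
        last = min(k, n - 1)
        column = carry + sum(a_bits[i] * b_bits[k - i] for i in range(first, last + 1))
        result |= (column & 1) << k   # low bit of the column sum is the result bit
        carry = column >> 1           # the rest moves to the next column
    return result | (carry << (2 * n - 1))


product = urdhva_multiply(13, 11, n=4)
```

In hardware all column sums can be generated in parallel, which is the usual argument for the speed of this scheme; the loop above only reproduces the arithmetic.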

This paper proposes the design of an energy efficient, high speed and low power full subtractor using Gate

Diffusion Input (GDI) technique. The entire design has been performed in 150nm technology and on

comparison with a full subtractor employing the conventional CMOS transistors, transmission gates and

Complementary Pass-Transistor Logic (CPL), respectively it has been found that there is a considerable

amount of reduction in Average Power consumption (Pavg), delay time as well as Power Delay Product

(PDP). Pavg is as low as 13.96 nW while the delay time is found to be 18.02 ps, thereby giving a PDP

as low as $2.51\times10^{-19}$ J for a 1 V power supply. In addition to this there is a significant reduction in

transistor count compared to traditional full subtractor employing CMOS transistors, transmission gates and

CPL, accordingly implying minimization of area. The simulation of the proposed design has been carried out

in Tanner SPICE and the layout has been designed in Microwind.

ETPL

VLSI - 030

Design of an energy efficient, high speed, low power full subtractor using GDI

technique
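The power-delay product quoted in the abstract above follows directly from the average power and the delay, PDP = Pavg × delay. A minimal check, using a hypothetical helper name and the two figures from the abstract:

```python
def power_delay_product(p_avg_watts, delay_seconds):
    """Power-delay product in joules."""
    return p_avg_watts * delay_seconds


# Figures from the abstract: 13.96 nW average power, 18.02 ps delay.
pdp = power_delay_product(13.96e-9, 18.02e-12)
```

The product evaluates to roughly 2.516e-19 J, which is close to the quoted value.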








We present designs of all-optical SR, clocked-SR, D and T flip-flops, simultaneous single-bit comparator-

decoder and reconfigurable logic unit based on all-optical switching by two-photon absorption induced free-

carrier injection in silicon 2 × 2 add-drop microring resonators. The proposed circuits have been theoretically

analyzed using time-domain coupled-mode theory and all-optical switching has been optimized for ultrafast

(~25 ps), low-power operation (~25 mW) and high modulation (> 85%), enabling logic operations at 40 Gb/s.

The designs are attractive due to advantages of high Q-factor, tunability, compactness, cascadibility,

scalability, reconfigurability, simplicity and minimal number of switches and inputs for realization of the

desired logic.

ETPL

VLSI - 031

Ultrafast All-Optical Flip-Flops, Simultaneous Comparator-Decoder and

Reconfigurable Logic Unit With Silicon Microring Resonator Switches

Reversible logic has presented itself as a prominent technology which plays an imperative role in Quantum

Computing. Quantum computing devices theoretically operate at ultra high speed and consume infinitesimally

little power. Research done in this paper aims to utilize the idea of reversible logic to break the conventional

speed-power trade-off, thereby getting a step closer to realise Quantum computing devices. To authenticate

this research, various combinational and sequential circuits are implemented such as a 4-bit Ripple-carry

Adder, an 8-bit × 8-bit Wallace Tree Multiplier, and the Control Unit of an 8-bit GCD processor using

Reversible gates. The power and speed parameters for the circuits have been indicated, and compared with

their conventional non-reversible counterparts. The comparative study indicates that circuits

employing reversible logic are faster and more power efficient. The designs presented in this paper were

simulated using Xilinx 9.2 software.

ETPL

VLSI - 032

Implementation of high speed low power combinational and sequential circuits

using reversible logic
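The abstract above does not say which reversible gates are used. As an assumed example, the sketch below builds a one-bit full adder from two Peres gates, a common reversible primitive, and shows that the gate is a bijection on its three inputs. The function names are illustrative.

```python
def peres(a, b, c):
    """Peres gate: (a, b, c) -> (a, a XOR b, (a AND b) XOR c)."""
    return a, a ^ b, (a & b) ^ c


def reversible_full_adder(a, b, cin):
    """Full adder from two Peres gates; the first outputs are garbage lines."""
    _, propagate, generate = peres(a, b, 0)      # a XOR b, a AND b
    _, total, cout = peres(propagate, cin, generate)
    return total, cout


# Every distinct input maps to a distinct output, so the gate is reversible.
peres_outputs = {peres(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)}
```

A 4-bit ripple-carry adder like the one mentioned in the abstract could be modeled by chaining four such adders through `cout`.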

This paper presents an all-digital delay-locked loop with the novel digital delay line for high-speed memory

interface applications. The proposed digital delay line has a smaller tuning step and better tuning linearity than

prior art. The proposed ADDLL, used inside the DDR3 PHY for the 90-degree phase shift and

read leveling, is fabricated in a 40 nm low-power CMOS process. The test chip is successfully verified at

data rates of 800 to 1600 Mbps. The measured peak-to-peak and rms jitter of the write DQS are 60 ps and 10 ps at

the data rate of 1600 Mbps, respectively.

ETPL

VLSI - 033

An all-digital delay-locked loop for high-speed memory interface applications








In this paper, low power square and cube architectures are proposed using Vedic sutras. The low power, low

area square and cube architectures use the Dwandwa yoga (duplex combination) properties of the Urdhva

Tiryagbhyam sutra and the Anurupyena sutra of Vedic mathematics. Simulation results for the 8-bit square and 8-bit

cube show that the proposed architectures lower the total power consumption by 45% and area by 63% when

compared to the conventional architecture. The reduction in power consumption also increases with

increasing bit width. Comparison is made between conventional and Vedic method implementations of the square

and cube architectures. Implementation results show a significant improvement in terms of area, power and

delay. The proposed square and cube architectures can be used for high speed and low power applications.

Synthesis is done for a Xilinx Spartan 3E FPGA device, speed grade -4. The propagation

delay of the proposed 8-bit square is 4 ns and its area is 22 slices; for the 8-bit cube, the

propagation delay is 7.72 ns and the area is 58 slices. Dynamic power estimates for the square

and cube are 13 mW and 16 mW respectively.

ETPL

VLSI - 034

Low Power Square and Cube Architectures Using Vedic Sutras
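The duplex (Dwandwa yoga) property mentioned in the abstract above can be sketched in software: the duplex of a group of bits is twice the sum of products of mirrored pairs, plus the square of the middle bit when the group has odd length, and the square of a number is the weighted sum of the duplexes of its bit groups. The function names and the 8-bit default are assumptions, and this models the arithmetic only, not the proposed circuit.

```python
def duplex(bits, lo, hi):
    """Duplex of bits[lo..hi]: 2 * (mirrored pair products) + middle bit squared."""
    total = 0
    while lo < hi:
        total += 2 * bits[lo] * bits[hi]
        lo += 1
        hi -= 1
    if lo == hi:
        total += bits[lo] * bits[lo]
    return total


def duplex_square(x, n=8):
    """Square an n-bit unsigned integer as a weighted sum of duplexes."""
    bits = [(x >> i) & 1 for i in range(n)]
    result = 0
    for k in range(2 * n - 1):
        lo = max(0, k - n + 1)
        hi = min(k, n - 1)
        result += duplex(bits, lo, hi) << k   # group whose index pairs sum to k
    return result


square_of_13 = duplex_square(13, n=4)
```

Because mirrored pairs are computed once and doubled, a duplex needs about half the partial products of a general multiplier column, which is consistent with the area savings the abstract reports.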

The objective of this paper is to design a power efficient, high performance 4×4 1T

DRAM cell array using conventional MOS, fully depleted SOI/SON and CNFET devices. As the CMOS

technology is being scaled down, there has been a major need to improve the performance and robustness of

the memory extensively used in today's hand-held devices. Dynamic Random Access Memory (DRAM) is the

main memory used for all desktop and larger computers. In modern VLSI circuit designing, power dissipation

is also a crucial issue. New emerging devices with improved technology promise low power

operation. In this paper, we present a comparative circuit level analysis between Metal Oxide

Semiconductor (MOS), fully depleted Silicon on Insulator (FD-SOI), fully depleted Silicon on Nothing (FD-

SON) and Carbon Nanotube Field Effect transistor (CNFET) in 32nm technology node using HSpice tool.

ETPL

VLSI - 035

Performance analysis of a high speed, energy efficient 4×4 dynamic RAM cell

array using 32nm fully depleted SOI/SON and CNFET








A high-performance 64-bit binary comparator is proposed in this brief. Comparison is the most basic arithmetic

operation; it determines whether one number is greater than, equal to, or less than another. The comparator is

the fundamental component that performs the comparison operation. This brief presents a comparison of

modified and existing 64-bit binary comparator designs, concentrating on power consumption and delay.

Some modifications have been made to the existing 64-bit binary comparator design to improve the

performance of the circuit. The comparison between the modified and existing 64-bit binary comparator designs is

based on simulation performed at the 90 nm technology node in the Tanner EDA tool.

ETPL

VLSI - 036

High-performance 64-bit binary comparator

This paper proposes a partially reconfigurable FIR filter design using a systolic Distributed Arithmetic (DA)

architecture optimized for FPGAs. To implement a computationally efficient, low power, high speed Finite

Impulse Response (FIR) filter, a two dimensional fully pipelined structure is used. To reduce the partial

reconfiguration time a new architecture for the Look-Up Table (LUT) in distributed arithmetic is proposed.

The FIR filter is dynamically reconfigured to realize low pass and high pass filter characteristics by changing

the filter coefficients in the partial reconfiguration module. The design is implemented using XUP Virtex 5

LX110T FPGA kit. The FIR filter design shows improvement in configuration time and efficiency.

ETPL

VLSI - 037

FPGA based partial reconfigurable fir filter design
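To make the distributed arithmetic idea in the abstract above concrete, the sketch below computes one FIR output from a look-up table of coefficient sums, processing the input samples one bit plane at a time. The function names, the 8-bit two's complement sample format, and the example coefficients are assumptions; the systolic pipelining and the proposed LUT architecture are not modeled.

```python
def build_da_lut(coeffs):
    """LUT entry `addr` holds the sum of coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(lut, samples, bits=8):
    """Inner product of coefficients and `bits`-bit two's complement samples."""
    acc = 0
    for b in range(bits):
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> b) & 1) << k      # bit b of every sample forms the address
        term = lut[addr] << b
        acc += -term if b == bits - 1 else term   # the sign bit has negative weight
    return acc


coeffs = [1, -2, 3, 4]            # hypothetical filter taps
samples = [5, -7, 100, -128]      # most recent input samples
y = da_dot(build_da_lut(coeffs), samples)
```

In this model, switching between low pass and high pass behaviour amounts to rebuilding the table from a different coefficient set, which is the part the abstract places in the partial reconfiguration module.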








The need for ultra low-power, area efficient, and high speed analog-to-digital converters is pushing toward the

use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis

on the delay of the dynamic comparators will be presented and analytical expressions are derived. From the

analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay

and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic

comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power

and fast operation even in small supply voltages. Without complicating the design and by adding few

transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced

delay time. Post-layout simulation results in a 0.18-μm CMOS technology confirm the analysis results. It is

shown that in the proposed dynamic comparator both the power consumption and delay time are significantly

reduced. The maximum clock frequency of the proposed comparator can be increased to 2.5 and 1.1 GHz at

supply voltages of 1.2 and 0.6 V, while consuming 1.4 mW and 153 μW, respectively. The standard deviation

of the input-referred offset is 7.8 mV at 1.2 V supply.

ETPL

VLSI - 038

Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator

Design fingerprinting is a means to trace the illegally redistributed intellectual property (IP) by creating a

unique IP instance with a different signature for each user. Existing fingerprinting techniques for hardware IP

protection focus on lowering the design effort to create a large number of different IP instances without paying

much attention to the ease of fingerprint detection upon IP integration. This paper presents the first dynamic

fingerprinting technique on sequential circuit IPs to enable both the owner and legal buyers of an IP embedded

in a chip to be readily identified in the field. The proposed fingerprint is an oblivious ownership watermark

independently endorsed by each user through a blind signature protocol. Thus, the authorship can also be

proved through the detection of different users' fingerprints without the need to separately embed an identical

IP owner's signature in all fingerprinted instances. The proposed technique is applicable to both application-

specific integrated circuit and field-programmable gate array IPs. Our analyses show that the fingerprint is

immune to collusion attack and can withstand all perceivable attacks, with a lower probability of removal than

state-of-the-art FSM watermarking schemes. The probability of coincidence of a 32-bit fingerprint is on the

order of $10^{-10}$, and up to $10^{35}$ 32-bit fingerprinted instances can be generated for a small design of 100 flip-flops.

ETPL

VLSI - 039

A Blind Dynamic Fingerprinting Technique for Sequential Circuit Intellectual

Property Protection


















provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning without

noticeable degradation of steady-state-error performance.

ETPL

VLSI - 040


Adaptation-Delay

This paper proposes a low-complexity min-sum algorithm for decoding low-density parity-check codes. It is

an improved version of the single-minimum algorithm where the two-minimum calculation is replaced by one

minimum calculation and a second minimum emulation. In the proposed algorithm, variable correction factors that

depend on the iteration number are introduced and the second minimum emulation is simplified, thereby

reducing the decoder complexity. This proposal improves the performance of the single-minimum algorithm,

approaching the normalized min-sum performance in the water-fall region. Also, the error-floor region is

analyzed for the code of the IEEE 802.3an standard, showing that the trapping sets are decoded due to a

slowdown of the convergence of the algorithm. An error-floor free operation below $\mathrm{BER}=10^{-15}$ is

shown for this code by means of a field-programmable gate array (FPGA)-based hardware emulator. A layered

decoder is implemented in a 90-nm CMOS technology achieving 12.8 Gbps with an area of 3.84 mm$^2$.

ETPL

VLSI - 041

Reduced-Complexity Min-Sum Algorithm for Decoding LDPC Codes With Low

Error-Floor
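The difference between a two-minimum and a single-minimum check-node update, as discussed in the abstract above, can be sketched as follows. The normalized min-sum rule is standard; the emulation used here (first minimum plus a fixed `delta`) is a hypothetical stand-in, since the abstract does not give the paper's emulation rule or its iteration-dependent correction factors.

```python
def check_node_update(llrs, alpha=0.75, emulate_second_min=False, delta=0.5):
    """Normalized min-sum check-node update for one check node.

    Each outgoing message has magnitude alpha * min(|other inputs|) and the
    sign of the product of the other inputs. With emulate_second_min=True the
    second minimum is not searched but approximated as min1 + delta
    (an assumed rule, for illustration only).
    """
    mags = [abs(v) for v in llrs]
    idx1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx1]
    if emulate_second_min:
        min2 = min1 + delta
    else:
        min2 = min(m for i, m in enumerate(mags) if i != idx1)
    sign_all = 1
    for v in llrs:
        if v < 0:
            sign_all = -sign_all
    out = []
    for i, v in enumerate(llrs):
        mag = min2 if i == idx1 else min1          # exclude the edge's own input
        sign = sign_all * (-1 if v < 0 else 1)     # remove the edge's own sign
        out.append(alpha * sign * mag)
    return out


exact = check_node_update([2.0, -1.0, 4.0, -3.0], alpha=1.0)
emulated = check_node_update([2.0, -1.0, 4.0, -3.0], alpha=1.0,
                             emulate_second_min=True)
```

Only the message on the edge that supplied the first minimum differs between the two variants, which is why dropping the second-minimum search saves hardware at a limited cost in accuracy.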








The need for low energy consumption in video processing systems such as HEVC for the multimedia market

has led to extensive development of fast algorithms for the efficient approximation of 2-D DCT transforms.

The DCT is employed in a multitude of compression standards due to its remarkable energy compaction

properties. Multiplier-free approximate DCT transforms have been proposed that offer superior compression

performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware

using additions and subtractions only, leading to significant reductions in chip area and power consumption

compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-point DCT

approximation that requires only 14 addition operations and no multiplications. The proposed transform

possesses low computational complexity and is compared to state-of-the-art DCT approximations in terms of

both algorithm complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for

reconfigurable video standards such as HEVC. The proposed transform and several other DCT approximations

are mapped to systolic-array digital architectures and physically realized as digital prototype circuits using

FPGA technology and mapped to 45 nm CMOS technology.

ETPL

VLSI - 042

Improved 8-Point Approximate DCT for Image and Video Compression

Requiring Only 14 Additions

This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The

adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce

computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined

architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput

requirements for advanced wireless communication systems. The proposed efficient folding scheme strikes a

suitable balance between complexity and throughput. This paper also presents a discussion on the scalability

of this architecture with respect to the setting of QAM size, K values, and antenna number. One 4×4 MIMO

detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm

CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08

Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout

simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core

voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized

power and normalized hardware efficiencies.

ETPL

VLSI - 043

Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors

With Hardware-Efficient Architecture








Demands have been placed on a dynamic random access memory (DRAM) to not only have increased

memory capacity and data transfer speed, but also have reduced operating and standby currents. When a

system uses a DRAM, a refresh operation is necessary because of its data retention time restriction: each bit of

the DRAM is stored as an amount of electrical charge in a storage capacitor that is discharged by the leakage

current. Power consumption for the refresh operation increases in proportion to the memory capacity. We

propose a new method to reduce the refresh power consumption by effectively extending the memory cell

retention time. Conversion from 1 cell/bit to $2^{N}$ cells/bit reduces the variation in the retention time

among memory cells. Although active power increases by a factor of $2^{N}$, the refresh time increases by

more than $2^{N}$ as a consequence of the fact that the majority decision does better than averaging for the

tail distribution of retention time. The conversion can be realized very simply from the structure of the DRAM

array circuit, and it reduces the frequency of disturbance and power consumption by two orders of magnitude.

On the basis of this conversion method, we propose a partial access mode to reduce power consumption

dynamically when the full memory capacity is not required.

ETPL

VLSI - 044

Partial Access Mode: New Method for Reducing Power Consumption of

Dynamic Random Access Memory

Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern

circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on

the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing

tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the

first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and

flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine

the tree topology, which maintains load balance and considers the wirelength between pulse generators and

pulsed latches. Experimental results indicate that the proposed migration approach can improve the power

consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most

recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.

ETPL

VLSI - 045

Pulsed-Latch Utilization for Clock-Tree Power Optimization








Montgomery modular multiplication is widely used in public-key cryptosystems. This work shows how to

relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the

current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling

scheme to reduce the number of memory accesses in the developed scalable architecture. Analytical results

show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times

that of conventional scalable architectures, where w denotes word size. The proposed one also retains a latency

of exactly one cycle between the operations of the same words in two consecutive iterations of the

Montgomery modular multiplication algorithm when employing enough processing elements. Compared to the

design in the related work, experimental results demonstrate that the proposed one achieves an almost 54

percent reduction in power consumption with no degradation in throughput. The reduced number of memory

accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for

any precision of operands.

ETPL

VLSI - 046

Scalable Montgomery Modular Multiplication Architecture with Low-Latency

and Low-Memory Bandwidth Requirement
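For readers unfamiliar with word-based Montgomery multiplication, the sketch below shows a conventional word-serial version of the kind the abstract above takes as its starting point. It is not the proposed scheduling scheme; the function name, the 8-bit word size, and the example modulus are assumptions. It relies on `pow(n, -1, m)`, available from Python 3.8.

```python
def montgomery_multiply(a, b, n, w=8, words=None):
    """Return a * b * R^-1 mod n, with R = 2^(w * words), for odd n and a, b < n.

    One w-bit word of `a` is consumed per iteration; each iteration adds a
    multiple of n that clears the low word, then shifts right by w bits.
    """
    if words is None:
        words = (n.bit_length() + w - 1) // w
    base = 1 << w
    n_prime = (-pow(n, -1, base)) % base       # -n^-1 mod 2^w
    t = 0
    for i in range(words):
        a_i = (a >> (w * i)) & (base - 1)
        t += a_i * b
        m = ((t & (base - 1)) * n_prime) & (base - 1)
        t = (t + m * n) >> w                   # low word is zero by construction
    return t - n if t >= n else t


modulus = 1000003                              # hypothetical odd modulus
r = 1 << 24                                    # R for three 8-bit words
mont = montgomery_multiply(123456, 654321, modulus)
```

Every iteration reads the words of `b` and `n` again, which is the memory traffic the abstract's scheduling scheme aims to cut.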

This paper presents a low-power coordinate rotation digital computer (CORDIC)-based reconfigurable

discrete cosine transform (DCT) architecture. The main idea of this paper is based on the interesting fact that

all the computations in DCT are not equally important in generating the frequency domain outputs.

Considering the importance difference in the DCT coefficients, the number of CORDIC iterations can be

dynamically changed to efficiently tradeoff image quality for power consumption. Thus, the computational

energy can be significantly reduced without seriously compromising the image quality. The proposed

CORDIC-based 2-D DCT architecture is implemented using a 0.13 μm CMOS process, and the experimental

results show that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the

CORDIC-based Loeffler DCT at the cost of minor image quality degradations.

ETPL

VLSI - 047

Reconfigurable CORDIC-Based Low-Power DCT Architecture Based on Data

Priority
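The trade-off in the abstract above, fewer CORDIC iterations for lower energy at some cost in accuracy, can be seen in a floating-point model of a CORDIC rotation. The function name and the iteration counts are assumptions; the sketch does not model the DCT data path or how iterations are assigned to coefficients.

```python
import math


def cordic_rotate(x, y, angle, iterations):
    """Rotate (x, y) by `angle` radians using `iterations` CORDIC micro-rotations.

    Valid for |angle| up to about 1.74 rad. The accumulated gain of the
    micro-rotations is divided out at the end.
    """
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # Shift-and-add micro-rotation by +/- atan(2^-i)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x / gain, y / gain


def rotation_error(iterations, angle=math.pi / 4):
    """Distance between the CORDIC result and the exact rotation of (1, 0)."""
    cx, cy = cordic_rotate(1.0, 0.0, angle, iterations)
    return math.hypot(cx - math.cos(angle), cy - math.sin(angle))


error_16 = rotation_error(16)
error_4 = rotation_error(4)
```

The error with 4 iterations is much larger than with 16, so a design in the spirit of the abstract would spend more iterations on the coefficients that matter most for image quality.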








On-chip routers typically have buffers dedicated to their input or output ports for temporarily storing packets

in case contention occurs on output physical channels. Buffers, unfortunately, consume significant portions of

router area and power budgets. While running a traffic trace, however, not all input ports of routers have

incoming packets needed to be transferred simultaneously. Therefore, a large number of buffer queues in the

network are empty and other queues are mostly busy. This observation motivates us to design a router

architecture with shared queues (RoShaQ), a router architecture that maximizes buffer utilization by allowing

the sharing of multiple buffer queues among input ports. Sharing queues, in fact, makes using buffers more

efficient and hence achieves higher throughput when the network load becomes heavy. On the other hand,

at light traffic load, our router achieves low latency by allowing packets to effectively bypass these shared

queues. Experimental results on a 65-nm CMOS standard-cell process show that over synthetic traffic

RoShaQ has 17% less latency and 18% higher saturation throughput than a typical virtual-channel (VC) router.

Because of its higher performance, RoShaQ consumes 9% less energy per transferred packet than a VC router

given the same buffer space capacity. Over real multitask applications and E3S embedded benchmarks using

the near-optimal NMAP mapping algorithm, RoShaQ has 32% lower latency than a VC router and, when targeting the

same application throughput, 30% lower energy per packet.

ETPL

VLSI - 048

Achieving High-Performance On-Chip Networks With Shared-Buffer Routers

As transistors decrease in size more and more of them can be accommodated in a single die, thus increasing

chip computational capabilities. However, transistors cannot get much smaller than their current size. The

quantum-dot cellular automata (QCA) approach represents one of the possible solutions in overcoming this

physical limit, even though the design of logic modules in QCA is not always straightforward. In this brief, we

propose a new adder that outperforms all state-of-the-art competitors and achieves the best area-delay tradeoff.

The above advantages are obtained by using an overall area similar to the cheaper designs known in literature.

The 64-bit version of the novel adder spans over 18.72 μm$^2$ of active area and shows a delay of only nine clock

cycles, that is just 36 clock phases.

ETPL

VLSI - 049

Area-Delay Efficient Binary Adders in QCA








This brief presents a new low-complexity reconfigurable fast filter bank (RFFB) for wireless communication

applications such as spectrum sensing and channelization. In RFFB, the bandwidth and center frequency of

sub-bands can be varied with high frequency resolution without hardware reimplementation. This is achieved

with an improved modified frequency transformation-based variable digital filter (MFT-VDF) at the first stage

of the proposed multistage implementation. Existing second-order frequency transformation-based low-pass

VDFs have limited cutoff frequency range which is approximately 12.5% of the sampling frequency. The

proposed low-pass MFT-VDF offers unabridged control over the cutoff frequency over a wide frequency range,

thereby improving the cutoff frequency range of existing VDFs. The design example shows that the RFFB is

easy to design and offers substantial savings in gate counts over other filter banks.

ETPL

VLSI - 050

Low-Complexity Reconfigurable Fast Filter Bank for Multi-Standard Wireless

Receivers

Input vector monitoring concurrent built-in self test (BIST) schemes perform testing during the normal

operation of the circuit without imposing a need to set the circuit offline to perform the test. These schemes are

evaluated based on the hardware overhead and the concurrent test latency (CTL), i.e., the time required for the

test to complete while the circuit operates normally. In this brief, we present a novel input vector

monitoring concurrent BIST scheme, which is based on the idea of monitoring a set (called window) of

vectors reaching the circuit inputs during normal operation, and the use of a static-RAM-like structure to store

the relative locations of the vectors that reach the circuit inputs in the examined window; the proposed scheme

is shown to perform significantly better than previously proposed schemes with respect to the hardware

overhead and CTL tradeoff.

ETPL

VLSI - 051

Input Vector Monitoring Concurrent BIST Architecture Using SRAM Cells








A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this

brief to reduce latency and complexity. Based on the fact that the codeword of an ECC is usually represented

in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed

architecture parallelizes the comparison of the data and that of the parity information. To further reduce the

latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the

efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines

whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a

(40, 33) code, the proposed architecture reduces the latency and the hardware complexity by approximately 32%

and 9%, respectively, compared with the most recent implementation.

ETPL

VLSI - 052

Low-Complexity Low-Latency Architecture for Matching of Data Encoded

With Hard Systematic Error-Correcting Codes
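The matching rule in the abstract above can be sketched for a systematic codeword, where the data and parity fields are compared separately and their bit differences are added to form the Hamming distance. The function names and the example field values are assumptions, and the plain bit count stands in for the butterfly-formed weight accumulator, which is not modeled.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def ecc_match(in_data, in_parity, stored_data, stored_parity, max_errors=1):
    """True if the incoming codeword is within `max_errors` bits of the stored one.

    The two XOR comparisons are independent, so hardware can run them in parallel.
    """
    data_distance = popcount(in_data ^ stored_data)
    parity_distance = popcount(in_parity ^ stored_parity)
    return data_distance + parity_distance <= max_errors


clean = ecc_match(0b1011, 0b110, 0b1011, 0b110)
one_error = ecc_match(0b1011, 0b110, 0b1010, 0b110)
two_errors = ecc_match(0b1011, 0b110, 0b1010, 0b100)
```

With `max_errors=1`, a stored word with a single flipped bit still matches, while two flipped bits are reported as a mismatch.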

As dynamic random access memories (DRAMs) are becoming denser with technology scaling, more complex

fault behaviors emerge; examples are leakage, coupling effects, and cell neighborhoods interaction. The

neighborhood pattern sensitive fault (NPSF) model is suitable to address such faulty behaviors and identify

them during the characterization and/or test of new DRAM chips. However, NPSF test algorithms are

extremely time-consuming and therefore not economically affordable. In this brief, we show how layout

information can be used to refine and significantly simplify the NPSF model and reduce the test time

complexity. As a case study, the folded DRAM array is considered. A realistic NPSF model, the $\Delta$-type

neighborhood, is introduced together with a time efficient test algorithm which is more than two times cheaper

than traditional ones. Even when incorporating bit-line influence and word-line coupling effects, along with

NPSFs, the test algorithm time complexity almost remains unaltered. Therefore, the proposed approach makes

NPSF testing economically affordable, and hence, suitable for the characterization/test of dense DRAMs in the

nanoera.

ETPL

VLSI - 053

Layout-Based Refined NPSF Model for DRAM Characterization and Testing








Multigate FET technology is the most viable successor to planar CMOS technology at the 22-nm node and

beyond. Prior research on multigate SRAMs is generally confined to the optimization of DC targets. However,

on account of the nonplanar nature of multigate FETs, it is highly questionable whether multigate SRAM DC

metrics can guide bitcell designers, as parasitic capacitances for two topologically equivalent bitcells can be

very different - due to various issues such as fin pitches - resulting in widely varying transient characteristics.

In this paper, we evaluate several known symmetric gate-workfunction (Symm-ΦG) FinFET SRAM bitcells and,

for the first time, asymmetric gate-workfunction (Asymm-ΦG) FinFET SRAM bitcells head-to-head in a 22-nm

silicon-on-insulator process, from the perspective of transient behavior, using a unified 3-D/mixed-mode 2-D

TCAD technology-circuit co-design methodology. We accomplish the latter by capturing bitcell parasitics

accurately through transport analysis-based 3-D TCAD capacitance extractions that leverage automated

layout-3-D TCAD structure synthesis algorithms. Mixed-mode transient device simulations (incorporating

back-annotated 3-D TCAD parasitics) indicate that a design guided by DC metrics alone can lead to erroneous

conclusions and suboptimal bitcell choices. Overall, from the perspective of area and performance, in

single-ΦG FinFET SRAMs, shorted-gate (or vanilla) configurations are superior to topologies employing independent-gate

configurations, even though the latter often have better DC metrics. In a larger design space encompassing

dual/Asymm-ΦG FinFETs, Asymm-ΦG FinFET SRAMs outperform Symm-ΦG FinFET SRAM

topologies in terms of DC metrics and have better dynamic write-ability, even at low VDD.

ETPL

VLSI - 054

Parasitics-Aware Design of Symmetric and Asymmetric Gate-Workfunction

FinFET SRAMs

Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected

for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size

must then be synthesized. We show how to do this by factored form matching, in which gating functions in

factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing

combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating

functions that are not matched. Strong matching identifies matches that are explicitly present in the factored

forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover.

Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean

division only achieves an average reduction of 8%.

ETPL

VLSI - 055

Simplifying Clock Gating Logic by Matching Factored Forms








Clock gating is a predominant technique used for power saving. It is observed that the commonly used

synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to

disable these. To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a

common clock enabling signal. The question of what is the group size maximizing the power savings is

answered in a previous paper. Here we answer the question of which FFs should be placed in a group to

maximize the power reduction. We propose a practical solution based on the toggling activity correlations of

FFs and their physical position proximity constraints in the layout. Our data-driven clock gating is integrated

into an Electronic Design Automation (EDA) commercial backend design flow, achieving total power

reduction of 15%-20% for various types of large-scale state-of-the-art industrial and academic designs in 40

and 65 nanometer process technologies. These savings are achieved on top of the savings obtained by clock

gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer

level design.

ETPL

VLSI - 056

Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating
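
The grouping question in the abstract above can be illustrated with a small sketch. It is not the authors' algorithm: it is a simple greedy heuristic that groups FFs whose toggle traces are most alike, and it ignores the physical proximity constraints the abstract mentions. The names `redundant_pulses` and `greedy_group`, and the toggle traces in the usage note, are hypothetical.

```python
from typing import Dict, List


def redundant_pulses(group: List[str], toggles: Dict[str, List[int]]) -> int:
    """Count clock pulses delivered to FFs that do not toggle.

    A group shares one enable, so in any cycle where at least one FF
    toggles, every FF in the group is clocked.
    """
    n_cycles = len(toggles[group[0]])
    wasted = 0
    for t in range(n_cycles):
        active = [toggles[ff][t] for ff in group]
        if any(active):
            wasted += len(group) - sum(active)
    return wasted


def greedy_group(toggles: Dict[str, List[int]], k: int) -> List[List[str]]:
    """Greedy grouping (assumed heuristic): grow each group of up to k FFs
    by adding the FF that adds the fewest redundant pulses."""
    remaining = sorted(toggles)
    groups: List[List[str]] = []
    while remaining:
        group = [remaining.pop(0)]
        while len(group) < k and remaining:
            best = min(
                remaining,
                key=lambda ff: redundant_pulses(group + [ff], toggles),
            )
            remaining.remove(best)
            group.append(best)
        groups.append(group)
    return groups
```

For example, with traces `{"a": [1, 1, 0, 0], "b": [1, 1, 0, 0], "c": [0, 0, 1, 1], "d": [0, 0, 1, 0]}` and `k = 2`, the sketch pairs `a` with `b` and `c` with `d`, since those pairs toggle in nearly the same cycles.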








Polar codes have recently received a lot of attention because of their capacity-achieving performance and low

encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of the

polar codes highly depends on that of the partial-sum network (PSN) implementation. Hence, in this work, an

efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum

updating algorithm and the corresponding PSN architecture are introduced which achieve a delay performance

independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-

performance and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented to

integrate seamlessly with the folded processing element architecture. This is achieved by using a novel folded

decoding schedule. As a result, both the critical path delay and the area (excluding the memory for folding) of

the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are

implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that

for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the

area is reduced by as much as 50.4%, compared with the state-of-the-art designs.

ETPL

VLSI - 057

An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes

Decoder Implementation
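
As background for the abstract above: the partial sums used by a successive cancellation decoder are modulo-2 combinations of already-decided bits, formed by the same XOR butterfly as the polar encoder. The sketch below shows only that butterfly in software, in natural (non-bit-reversed) order. It is not the proposed PSN architecture or its folded schedule, and the function name `polar_transform` is an assumption.

```python
from typing import List


def polar_transform(u: List[int]) -> List[int]:
    """Compute x = u * F^(kron n) over GF(2), with F = [[1, 0], [1, 1]].

    The length of u must be a power of two. The transform is applied as
    log2(N) stages of XOR butterflies.
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = list(u)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]  # upper branch takes the XOR of both
        step *= 2
    return x
```

Because the matrix is its own inverse over GF(2), applying `polar_transform` twice returns the original bits, which is a convenient sanity check.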

An analog inner hair cell and auditory nerve circuit using a dual AGC model has been implemented using 0.35

micron mixed-signal technology. A fully-differential current-mode architecture is used and the ability to

correct channel mismatch is evaluated with matched layouts as well as with digital current tuning. The Meddis

test paradigm is used to examine the analog implementation's auditory processing capabilities and investigate

the circuit's ability to correct DC mismatch. The correction techniques used demonstrate the analog inner hair

cell and auditory nerve circuit's potential use in low-power, multiple-sensor analog biomimetic systems with

highly reproducible signal processing blocks on a single massively parallel integrated circuit.

ETPL

VLSI - 058

An Analog VLSI Implementation of the Inner Hair Cell and Auditory Nerve

Using a Dual AGC Model








This brief presents two original implementations of improved accuracy current-mode multiplier/divider

circuits. Besides the advantage of their simplicity, these original multiplier/divider structures present the

advantage of very small linearity errors that can be obtained as a result of the proposed design techniques

(0.75% and 0.9%, respectively, for an extended range of the input currents). The original multiplier/divider

circuits permit a facile reconfiguration, the presented structures representing the functional basis for

implementing complex function synthesizer circuits. The proposed computational structures are designed for

implementing in 0.18-μm CMOS technology, with low-voltage operation (a supply voltage of 1.2 V). The

circuits' power consumptions are 60 and 75 μW, while their frequency bandwidths are 79.6 and

59.7 MHz, respectively.

ETPL

VLSI - 059

Improved Accuracy Current-Mode Multiplier Circuits With Applications in

Analog Signal Processing

A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this

brief to reduce latency and complexity. Based on the fact that the codeword of an ECC is usually represented

in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed

architecture parallelizes the comparison of the data and that of the parity information. To further reduce the

latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the

efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines

whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a

(40, 33) code, the proposed architecture reduces the latency and the hardware complexity by approximately 32%

and 9%, respectively, compared with the most recent implementation.

ETPL

VLSI - 060

Low-Complexity Low-Latency Architecture for Matching of Data Encoded

With Hard Systematic Error-Correcting Codes
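
The matching rule described in the abstract above can be sketched in software. This is a minimal illustration under stated assumptions, not the proposed hardware: a toy systematic (7,4) Hamming code stands in for the (40, 33) code, a plain popcount stands in for the butterfly-formed weight accumulator, and the names `encode_7_4`, `hamming_distance`, and `matches` are hypothetical. The data and parity parts are compared separately, mirroring the parallel comparison, and a match is declared when the total distance is within the number of correctable errors `t`.

```python
def encode_7_4(data: int) -> int:
    """Toy systematic (7,4) Hamming encoder (assumption, for illustration).

    Returns the 4 data bits in the high positions followed by 3 parity bits.
    """
    d = [(data >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p0 = d[0] ^ d[1] ^ d[3]
    p1 = d[0] ^ d[2] ^ d[3]
    p2 = d[1] ^ d[2] ^ d[3]
    return (data << 3) | (p2 << 2) | (p1 << 1) | p0


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def matches(stored_codeword: int, incoming_data: int, t: int = 1) -> bool:
    """True if the stored (possibly corrupted) codeword would decode to
    incoming_data, i.e. it lies within distance t of the encoded input."""
    incoming_codeword = encode_7_4(incoming_data)
    # Data part and parity part are compared independently, then combined.
    data_dist = hamming_distance(stored_codeword >> 3, incoming_codeword >> 3)
    parity_dist = hamming_distance(stored_codeword & 0b111, incoming_codeword & 0b111)
    return data_dist + parity_dist <= t
```

The point of the design is that the stored codeword never has to be decoded: the incoming data is encoded once and the comparison works directly on codewords.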








Spin-transfer torque magnetic RAM (STT-MRAM) is a promising memory technology for lower level caches

because of its high density and nonvolatile nature. However, the high write latency is a bottleneck to its

widespread adoption as the future on-chip memory. In this paper, we propose a new cache architecture-

asymmetric write architecture with redundant blocks (AWARE)-that can improve the write latency by taking

advantage of the asymmetric write characteristics of 1T-1MTJ STT-MRAM bit-cells. Due to the nature of the

storage element in STT-MRAM, the time required for the two state transitions (1→0 and 0→1) is not

identical. In other words, one of the state transitions is slower than the other direction. In conventional cache

architecture, the overall write latency is limited by the slower transition. However, the AWARE cache design

introduces redundant blocks in each row, and they are preset to the initial state that enables the faster

transition. Hence the write operations performed in these redundant blocks are much faster than the

conventional write scheme. The write latency in AWARE is improved by 30% over conventional cache

architecture with no area penalty in the data array. Moreover, the additional tag bits introduced in this

technique result in penalty on the total cache area. In addition, the write energy increases modestly by 7% in

the proposed cache design. However, this write-energy increase can be mitigated by sacrificing the cache

capacity.

ETPL

VLSI - 061

AWARE (Asymmetric Write Architecture With REdundant Blocks): A High

Write Speed STT-MRAM Cache Architecture

A low-voltage low-dropout (LDO) regulator that converts an input of 1 V to an output of 0.85–0.5 V, with 90-

nm CMOS technology is proposed. A simple symmetric operational transconductance amplifier is used as the

error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the

closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise

cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast

responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the

proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current

efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly

50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 mm², because of

the compact architecture.

ETPL

VLSI - 062

Design of a Low-Voltage Low-Dropout Regulator








Strain engineering for performance enhancement is an integral part of a state-of-the-art CMOS process flow.

However, use of stressors makes the performance of CMOS devices layout dependent. Performance variability

arising due to the use of stressor materials is often referred to as Layout Dependent Effect (LDE) variability.

The existing delay models do not take LDE into consideration, which results in unaccounted changes

in performance and degraded design robustness. In this paper we propose an analytical delay model for

Inverter, 2-input NAND and NOR gates while considering LDE variability due to the use of strain engineered

devices. We compare our derived model with TCAD calibrated HSPICE simulation results and observe that

our model estimates delay well for varying transistor sizes, load capacitances and input signal transition times.

ETPL

VLSI - 063

An Analytical Delay Model for Mechanical Stress Induced Systematic

Variability Analysis in Nanoscale Circuit Design

This paper presents an energy efficient programmable hardware accelerator that targets multiple-input-

multiple-output (MIMO) decoding tasks of orthogonal frequency-division multiplexing (OFDM) systems. The

work is motivated by the adoption of MIMO and OFDM by almost all existing and emerging high-speed

wireless data communication systems. The accelerator was fabricated in 65-nm CMOS technology and

occupies a core area of 2.48 mm². It delivers full programmability across different wireless

standards (i.e., WiFi, 3G-long term evolution, and WiMax) as well as different MIMO decoding algorithms

(i.e., minimum mean square error, singular value decomposition, and maximum likelihood) with extreme

energy efficiency. The energy efficiency of our MIMO accelerator chip was compared against dedicated

application specific integrated circuits for 4 × 4 QR decomposition, 4 × 4 singular value

decomposition, and 2 × 2 minimum mean square error decoding. Despite the programmable nature of

our design, it delivered energy efficiencies that were 18% to 28% better than the dedicated solutions reported

in the literature. This paper presents the VLSI implementation of the architecture discussed in [14]–[16]. It

discusses the implementation decisions and tradeoffs used to ensure minimum overall energy consumption of

the resulting accelerator chip without sacrificing programmability. Given its programmability and extreme

energy efficiency, the accelerator is an ideal solution for today's smart phones that implement multiple

MIMO-OFDM waveforms on the same platform.

ETPL

VLSI - 064

Energy Efficient Programmable MIMO Decoder Accelerator Chip in 65-nm

CMOS








In this paper, we propose an iterative linear interpolation (ILI) algorithm, which produces quadratic ILI

polynomials to perform the most cost-effective interpolation among state-of-the-art quadratic and cubic

methods. Unlike traditional point and area pixel models, the ILI adopts the fuzzy gradient model to estimate

gradients of the target point according to its neighbor sample points in different directions. By weighing the

gradients using fuzzy membership grades, the ILI estimates the difference between the target point and its

neighbor sample points and finally obtains the target point. In 1-D signal reconstructions, using only three

multipliers, the ILI obviously outperforms both conventional quadratic Lagrange interpolation and cubic

interpolation. To approximate 2-D signals, we use five 1-D ILIs, which costs only eight multipliers to obtain

similar peak signal-to-noise ratio (PSNR) performance but better robustness compared with bi-cubic

interpolation. Reusing the ILI polynomials of the previous target point, we further reduce the cost of ILI to

three multipliers and eight adders. The VLSI implementation using TSMC 0.18-μm technology

shows that only 7256 gates are required for running a 200-MHz, 8-bit input/output, 15-bit fixed-point data path,

and 10-stage pipelined 2-D ILI, which is the quadratic interpolation of lowest cost but with PSNR

performance closest to state-of-the-art bi-cubic methods.

ETPL

VLSI - 065

Iterative Linear Interpolation Based on Fuzzy Gradient Model for Low-Cost

VLSI Implementation

Gate-level clock gating starts with a netlist, with partial or no gating applied; some flip-flops are then selected

for further gating to reduce the circuit's power consumption, and a gating logic of the smallest possible size

must then be synthesized. We show how to do this by factored form matching, in which gating functions in

factored forms are matched, as far as possible, with factored forms of the Boolean functions of existing

combinational nodes in the circuit; additional gates are then introduced, but only for the portion of gating

functions that are not matched. Strong matching identifies matches that are explicitly present in the factored

forms, and weak matching seeks matches that are implicit in the logic and thus are more difficult to discover.

Factored form matching reduces gating logic by an average of 24%, over a few test circuits, for which Boolean

division only achieves an average reduction of 8%.

ETPL

VLSI - 066

Simplifying Clock Gating Logic by Matching Factored Forms








This paper presents an extensive statistical study on the impact of bias temperature instability (BTI) on digital

circuits. A statistical framework for the evaluation of BTI at the electrical (SPICE) level, enhanced by an

atomistic model for BTI, is introduced. This framework is then employed to perform the timing analysis of

different combinational paths using cells from a given library, aiming to statistically model BTI at the higher

abstraction level. A statistical static timing analysis (SSTA) method is then performed and the results are

compared to detailed simulations using atomistic models based on experimental data. The comparison between

the two methods shows that for large paths both methods converge to the same distribution for the delay while

for short paths the delay distributions are different causing the SSTA method to generate misleading results.

An analysis is then performed in order to understand and formalize the results.

ETPL

VLSI - 067

Use of SSTA Tools for Evaluating BTI Impact on Combinational Circuits

A multiplier-less architecture based on algebraic integer representation for computing the Daubechies 6-tap

wavelet transform for 1-D/2-D signal processing is proposed. This architecture improves on previous designs

in a sense that it minimizes the number of parallel 2-input adder circuits. The algorithm was achieved using

brute-force numerical optimization of the algebraic integer representation. The proposed architecture furnishes

exact computation up to the final reconstruction step, which is the operation that maps the exactly computed

filtered results from algebraic integer representation to fixed-point. Compared to our recent work, this

architecture shows a reduction of $27 \cdot n - 16$ adder circuits, where $n$ is the number of wavelet

decomposition levels. The design is physically implemented for a 4-level 1-D/2-D decomposition using a

Xilinx Virtex-6 vcx240t-1ff1156 field programmable gate array (FPGA) device operating at up to a maximum

clock frequency of 344/168 MHz. The 1-D/2-D FPGA implementations are tested using hardware

co-simulation on an ML605 board with a clock of 100 MHz. A 45 nm CMOS synthesis of the 2-D designs shows

improved clock frequency of better than 306 MHz for a supply voltage of 1.1 V.

ETPL

VLSI - 068

Precise VLSI Architecture for AI Based 1-D/2-D Daub-6 Wavelet Filter Banks

With Low Adder-Count








Complex arithmetic operations are widely used in Digital Signal Processing (DSP) applications. In this work,

we focus on optimizing the design of the fused Add-Multiply (FAM) operator for increasing performance. We

investigate techniques to implement the direct recoding of the sum of two numbers in its Modified Booth

(MB) form. We introduce a structured and efficient recoding technique and explore three different schemes by

incorporating them in FAM designs. Comparing them with the FAM designs which use existing recoding

schemes, the proposed technique yields considerable reductions in terms of critical delay, hardware

complexity and power consumption of the FAM unit.

ETPL

VLSI - 069

An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply

Operator
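
For reference, conventional radix-4 Modified Booth recoding of a single two's-complement operand can be sketched as follows. The abstract above concerns recoding the sum of two numbers directly, without first forming the sum; this sketch shows only the standard single-operand recoding that such a technique produces as output. The function names are assumptions.

```python
from typing import List


def booth_radix4_digits(value: int, n_bits: int) -> List[int]:
    """Modified Booth (radix-4) digits of an n_bits two's-complement value.

    Returns n_bits // 2 digits, least significant first, each in
    {-2, -1, 0, 1, 2}. Digit j is -2*b[2j+1] + b[2j] + b[2j-1], with b[-1] = 0.
    """
    if n_bits <= 0 or n_bits % 2 != 0:
        raise ValueError("n_bits must be a positive even number")
    if not -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1)):
        raise ValueError("value does not fit in n_bits")
    u = value & ((1 << n_bits) - 1)  # two's-complement bit pattern

    def bit(i: int) -> int:
        return (u >> i) & 1 if i >= 0 else 0

    return [
        -2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
        for j in range(n_bits // 2)
    ]


def digits_to_int(digits: List[int]) -> int:
    """Reassemble the value: sum of digit_j * 4**j."""
    return sum(d * 4 ** j for j, d in enumerate(digits))
```

Recoding halves the number of partial products in a multiplier, since each digit selects one of 0, ±X, or ±2X. For instance, 7 in 4 bits recodes to the digits `[-1, 2]`, that is, 2·4 − 1.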

Emerging non-volatile memories (NVM) based on resistive switching mechanism (RS) such as STT-MRAM,

OxRRAM and CBRAM etc., are under intense R&D investigation by both academics and industries. They

provide high write/read speed, low power and good endurance (e.g., > 10^12) beyond mainstream NVMs,

which allow them to be embedded directly with logic units for computing purpose. This integration could

increase significantly the power/die area efficiency, and then overcome definitively the power/speed

bottlenecks of modern VLSIs. This paper first presents a theoretical investigation of synchronous NV logic

gates based on RS memories (RS-NVL). Special design techniques and strategies are proposed to optimize the

structure according to different resistive characteristics of NVMs. To validate this study, we simulated a non-

volatile full-adder (NVFA) with two types of NVMs: STT-MRAM and OxRRAM by using CMOS 40 nm

design kit and compact models, which include related physics and experimental parameters. They show

interesting power, speed and area gain compared with synchronized CMOS FA while keeping good reliability.

ETPL

VLSI - 070

Synchronous Non-Volatile Logic Gate Design Based on Resistive Switching

Memories








This paper proposes a low-cost high-throughput multistandard transform (MST) core, which can support

MPEG-1/2/4 (8 × 8), H.264 (8 × 8, 4 × 4), and VC-1 (8 × 8, 8 × 4, 4 × 8, 4 × 4) transforms. Common sharing

distributed arithmetic (CSDA) combines factor sharing and distributed arithmetic sharing techniques,

efficiently reducing the number of adders for high hardware-sharing capability. This achieves a 44.5%

reduction in adders in the proposed MST, compared with the direct implementation method. With eight

parallel computation paths, the proposed MST core has an eightfold operation frequency throughput rate.

Measurements show that the proposed CSDA-MST core achieves a high-throughput rate of 1.28 G-pels/s,

supporting the (4928 × 2048@24 Hz) digital cinema or ultrahigh resolution format. This is possible only with

30 k gate counts when implemented in a TSMC 0.18-μm CMOS process. The CSDA-MST core thus achieves

a high-throughput rate supporting multistandard transformations at low cost.

ETPL

VLSI - 071

High-Throughput Multistandard Transform Core Supporting

MPEG/H.264/VC-1 Using Common Sharing Distributed Arithmetic

Nonlinear activation function is one of the main building blocks of artificial neural networks. Hyperbolic

tangent and sigmoid are the most used nonlinear activation functions. Accurate implementation of these

transfer functions in digital networks faces certain challenges. In this paper, an efficient approximation scheme

for hyperbolic tangent function is proposed. The approximation is based on a mathematical analysis

considering the maximum allowable error as design parameter. Hardware implementation of the proposed

approximation scheme is presented, which shows that the proposed structure compares favorably with

previous architectures in terms of area and delay. The proposed structure requires fewer output bits for the same

maximum allowable error when compared to the state-of-the-art. The number of output bits of the activation

function determines the bit width of multipliers and adders in the network. Therefore, the proposed activation

function results in reduction in area, delay, and power in VLSI implementation of artificial neural networks

with hyperbolic tangent activation function.

ETPL

VLSI - 072

Efficient VLSI Implementation of Neural Networks With Hyperbolic Tangent

Activation Function
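
The idea of treating the maximum allowable error as a design parameter can be illustrated with a simple lookup-table approximation. This is not the scheme proposed in the paper above; it is a generic piecewise-constant table, swapped in for illustration, whose bin width is derived from the error bound `eps`. It relies on two facts: the slope of tanh never exceeds 1, and tanh is odd. The names `make_tanh_lut` and `tanh_approx` are hypothetical.

```python
import math
from typing import List, Tuple


def make_tanh_lut(eps: float) -> Tuple[float, List[float]]:
    """Build a table for x >= 0 such that the approximation error is <= eps.

    Bins have width 2*eps and store tanh at the bin centre; since the slope
    of tanh is at most 1, the error inside a bin is at most eps. Beyond
    atanh(1 - eps) the output saturates to 1.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    step = 2.0 * eps
    x_sat = math.atanh(1.0 - eps)
    n_bins = math.ceil(x_sat / step)
    table = [math.tanh((i + 0.5) * step) for i in range(n_bins)]
    return step, table


def tanh_approx(x: float, step: float, table: List[float]) -> float:
    """Look up tanh(|x|) in the table and restore the sign (odd symmetry)."""
    sign = -1.0 if x < 0 else 1.0
    idx = int(abs(x) / step)
    y = table[idx] if idx < len(table) else 1.0
    return sign * y
```

Halving `eps` roughly doubles the table size in this sketch, which is the kind of accuracy-versus-area tradeoff the abstract refers to when it links output precision to hardware cost.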








This paper seeks to combine linear time-invariant (LTI) filtering and sparsity-based denoising in a principled

way in order to effectively filter (denoise) a wider class of signals. LTI filtering is most suitable for signals

restricted to a known frequency band, while sparsity-based denoising is suitable for signals admitting a sparse

representation with respect to a known transform. However, some signals cannot be accurately categorized as

either band-limited or sparse. This paper addresses the problem of filtering noisy data for the particular case

where the underlying signal comprises a low-frequency component and a sparse or sparse-derivative

component. A convex optimization approach is presented and two algorithms derived: one based on

majorization-minimization (MM), and the other based on the alternating direction method of multipliers

(ADMM). It is shown that a particular choice of discrete-time filter, namely zero-phase noncausal recursive

filters for finite-length data formulated in terms of banded matrices, makes the algorithms effective and

computationally efficient. The efficiency stems from the use of fast algorithms for solving banded systems of

linear equations. The method is illustrated using data from a physiological-measurement technique (i.e., near

infrared spectroscopic time series imaging) that in many cases yields data that is well-approximated as the sum

of low-frequency, sparse or sparse-derivative, and noise components.

ETPL

VLSI - 073

Simultaneous Low-Pass Filtering and Total Variation Denoising

The implementation of transversal filters requires basic circuit elements such as adders, multipliers and (unit)

delay elements. The filters designed under infinite precision of these elements may behave differently when

implemented with components with limited accuracy. In fact, the effects of the coefficient inaccuracies in

analog and digital transversal filters have been investigated extensively in the literature [1], [2]. On the other

hand, the effects of the unit delays with limited precision have not received similar attention. In this paper, we

find that such effects especially in very high frequency continuous-time semi-digital transversal filters may not

be ignored. As an example, we analyze the impact of delay errors in the implementation of the direct

modulation transmitter. Specifically, we provide the analytical statistical performance bounds and confirm the

results with simulations.

ETPL

VLSI - 074

Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal

Filters








This paper introduces two polynomial finite-length impulse response (FIR) digital filter structures with

simultaneously variable fractional delay (VFD) and phase shift (VPS). The structures are reconfigurable

(adaptable) online without redesign and do not exhibit transients when the VFD and VPS parameters are

altered. The structures can be viewed as generalizations of VFD structures in the sense that they offer a VPS in

addition to the regular VFD. The overall filters are composed of a number of fixed subfilters and a few

variable multipliers whose values are determined by the desired FD and PS values. A systematic design

algorithm, based on iteratively reweighted ℓ1-norm minimization, is proposed. It generates fixed subfilters

with many zero-valued coefficients, typically located in the impulse response tails. The paper considers two

different structures, referred to as the basic structure and common-subfilters structure, and compares these

proposals as well as the existing cascaded VFD and VPS structures, in terms of arithmetic complexity, delay,

memory cost, and transients. In general, the common-subfilters structure is superior when all of these aspects

are taken into account. Further, the paper shows and exemplifies that the VFDPS filters under consideration

can be used for simultaneous resampling and frequency shift of signals.

ETPL

VLSI - 075

Two Polynomial FIR Filter Structures With Variable Fractional Delay and

Phase Shift

In a broadband MIMO-OFDM wireless communication system, embedded buffering memories occupy a large

portion of the chip area and a significant amount of power consumption. Due to process variations of advanced

CMOS technologies, it becomes both challenging and costly to maintain perfectly functioning memories under

all anticipated operating conditions. Thus, Voltage over Scaling (VoS) has emerged as a means to achieve

energy efficient systems resulting in a tradeoff between energy efficiency and reliability. In this paper we

present the algorithm and VLSI architecture of a novel error-resilient K-Best MIMO detector based on the

combined distribution of channel noise and induced errors due to VoS. The simulation results show that,

compared with a conventional MIMO detector design, the proposed algorithm provides up to 4.5 dB gain to

achieve the near-optimal Packet Error Rate (PER) performance in the 4 × 4 64-QAM system.

Furthermore, based on experimental results, when jointly considering the detector and memory power

consumption, the proposed resilient scheme with VoS memory can achieve up to 32.64% savings compared to

the conventional K-Best detector with perfect memory.

ETPL

VLSI - 076

Algorithms and Architectures of Energy-Efficient Error-Resilient MIMO

Detectors for Memory-Dominated Wireless Communication Systems








Cryptocircuits can be attacked by third parties using differential power analysis (DPA), which uses power

consumption dependence on data being processed to reveal critical information. To protect security devices

against this issue, differential logic styles with (almost) constant power dissipation are widely used. However,

to use such circuits effectively for secure applications it is necessary to eliminate any energy-secure flaw in

security in the shape of memory effects that could leak information. This paper proposes a design

methodology to improve pull-down logic configuration for secure differential gates by redistributing the

charge stored in internal nodes and thus, removing memory effects that represent a significant threat to

security. To evaluate the methodology, it was applied to the design of AND/NAND and XOR/XNOR gates in

a 90 nm technology, adopting the sense amplifier based logic (SABL) style for the pull-up network. The

proposed solutions leak less information than typical SABL gates, increasing security by at least two orders of

magnitude and with negligible performance degradation. A simulation-based DPA attack on the Sbox9

cryptographic module used in the Kasumi algorithm, implemented with complementary metal–oxide–

semiconductor, SABL and proposed gates, was performed. The results obtained illustrate that the number of

measurements needed to disclose the key increased by much more than one order of magnitude when using

our proposal. This paper also discusses how the effectiveness of DPA attacks is influenced by operating

temperature and details how to ensure energy-secure operations in the new proposals.

ETPL

VLSI - 077

A Methodology for Optimized Design of Secure Differential Logic Gates for

DPA Resistant Circuits

As the feature size shrinks to the nanometer scale, SRAM-based FPGAs will become increasingly vulnerable

to soft errors. Existing reliability-oriented placement and routing approaches primarily focus on reducing the

fault occurrence probability (node error rate) of soft errors. However, our analysis shows that, besides the fault

occurrence probability, the propagation probability (error propagation probability) plays an important role and

should be taken into consideration. In this paper, we first propose a cube-based analysis algorithm to

efficiently and accurately estimate the error propagation probability. Based on such a model, we propose a

novel reliability-oriented placement and routing algorithm that combines both the fault occurrence probability

and the error propagation probability together to enhance system-level robustness against soft errors.

Experimental results show that, compared with the baseline versatile place and route technique, the proposed

scheme can reduce the failure rate by 20.73%, and increase the mean time between failures by 39.44%.

ETPL

VLSI - 078

Reliability-Oriented Placement and Routing Algorithm for SRAM-Based

FPGAs








Modern multicore systems have a large number of components operating in different clock domains and

communicating through asynchronous interfaces. These interfaces use synchronizer circuits, which guard

against metastability failures but introduce latency in processing the asynchronous input. We propose a

speculative method that hides synchronization latency by overlapping it with computation cycles. We verify

the correctness of our approach through a field programmable gate array implementation and apply it to a

number of synthesized benchmarks. Synthesis results reveal that our approach achieves average savings of

135% and 204% in area costs and nearly 100% in power costs compared to two similar speculative techniques.

ETPL

VLSI - 079

Eliminating Synchronization Latency Using Sequenced Latching














ETPL

VLSI - 080










Minimizing the size of a clock tree is known as an effective approach to reduce power dissipation in modern

circuit designs. However, most existing power-aware clock-tree minimization algorithms optimize power on

the basis of flip-flops alone, which may result in limited power savings. To achieve a power and timing

tradeoff, this paper investigates the pulsed-latch utilization in a clock tree for further power savings. This is the

first paper to propose a migration approach to efficiently construct a clock tree with both pulsed-latches and

flip-flops. The proposed method is based on minimum-cost maximum-flow formulation to globally determine

the tree topology, which maintains load balance and considers the wirelength between pulse generators and

pulsed latches. Experimental results indicate that the proposed migration approach can improve the power

consumption by 12% and 13% with 7% and 70% skew improvements on average compared with the most

recent paper on the industrial circuits and ISPD-2010 benchmarks, respectively.

ETPL

VLSI - 081

Pulsed-Latch Utilization for Clock-Tree Power Optimization

This paper presents a hardware-efficient architecture for 4×4 and 8×8 high-throughput MIMO detectors. The

adopted non-constant K-best algorithm tends to keep more survival nodes in top search tree layers and reduce

computational complexity in bottom layers as opposed to the conventional K-best algorithm. A pipelined

architecture is used to generate one detection output per clock cycle, thus meeting multi-gigabit throughput

requirements for advanced wireless communication systems. The proposed efficient folding scheme strikes a

suitable balance between complexity and throughput. This paper also presents a discussion on the scalability

of this architecture with respect to the setting of QAM size, K values, and antenna number. One 4×4 MIMO

detector IC has been manufactured and one 8×8 MIMO detector layout has been realized, both in 90-nm

CMOS technology. The 4×4 detector IC has 232 kilogates (KG). Its maximum measured throughput is 4.08

Gbps at 170-MHz operating frequency and 1.3-V core voltage. The 8×8 detector has 665 KG. Its post-layout

simulation results show that it achieves 4.37-Gbps throughput at 182-MHz operating frequency and 0.9-V core

voltage. Compared to earlier hard-output detectors, both implemented detectors demonstrate good normalized

power and normalized hardware efficiencies.

ETPL

VLSI - 082

Toward Multi-Gigabit Wireless: Design of High-Throughput MIMO Detectors

With Hardware-Efficient Architecture
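
As a quick consistency check on the figures quoted in this abstract, the sketch below divides each reported throughput by its clock frequency to obtain the bits delivered per clock cycle. The function name is illustrative. Reading the result as four streams of 64-QAM symbols is an assumption; the abstract does not state the QAM size used for these measurements.

```python
def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Average number of detected bits delivered per clock cycle."""
    return throughput_bps / clock_hz


# Figures quoted in the abstract.
bpc_4x4 = bits_per_cycle(4.08e9, 170e6)  # 4x4 detector, measured
bpc_8x8 = bits_per_cycle(4.37e9, 182e6)  # 8x8 detector, post-layout simulation

# Assumption (not stated in the abstract): 4 spatial streams of 64-QAM,
# i.e. 6 bits per symbol, with one detection output per clock cycle.
assumed_bits_per_detection = 4 * 6

print(f"4x4: {bpc_4x4:.2f} bits/cycle")
print(f"8x8: {bpc_8x8:.2f} bits/cycle")
print(f"assumed 4x4 64-QAM payload: {assumed_bits_per_detection} bits")
```

Both detectors come out at roughly the same number of bits per cycle, so the difference in throughput between them tracks the difference in clock frequency.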








Low-density parity-check (LDPC) codes are adopted in many applications due to their Shannon-limit

approaching error-correcting performance. Nevertheless, belief-propagation (BP) based decoding of these

codes suffers from the error-floor problem, i.e., an abrupt change in the slope of the error-rate curve that

occurs at very low error rates. Recently, a new type of decoder, termed the finite alphabet iterative decoder

(FAID), was introduced. FAIDs use simple Boolean maps for variable node processing, and can surpass

the BP-based decoders in the error floor region with very short word length. We restrict the scope of this paper

to regular dv=3 LDPC codes on the BSC channel. This paper develops a low-complexity implementation

architecture for the FAIDs by making use of their properties. Particularly, an innovative bit-serial check node

unit is designed for the FAIDs, and a small-area variable node unit is proposed by exploiting the symmetry in

the Boolean maps. Moreover, an optimized data scheduling scheme is proposed to increase the hardware

utilization efficiency. From synthesis results, the proposed FAID implementation needs only 52% area to

reach the same throughput as one of the most efficient standard Min-Sum decoders for an example (7807,

7177) LDPC code, while achieving better error-correcting performance in the error-floor region. Compared to

an offset Min-Sum decoder with longer word length, the proposed design can achieve higher throughput with

45% area, and still leads to possible performance improvement in the error-floor region.

ETPL

VLSI - 083

Finite Alphabet Iterative Decoders for LDPC Codes: Optimization,

Architecture and Analysis

Reducing the interconnection length of VLSI arrays leads to less capacitance, power dissipation and dynamic

communication cost between the processing elements (PEs). This paper develops efficient algorithms for

constructing tightly-coupled subarrays from the mesh-connected VLSI arrays with faulty PEs. For a given size

r·s of the target (logical) array, the proposed algorithm searches and reroutes a physical r×s subarray that has

the least number of faults, resulting in an approximate target array, which is subsequently extended to the

desired target array. Experimental results show that over 65 percent redundant interconnects can be reduced

for a 64×64 target array on the 512×512 host array with no more than 1 percent faults. In addition, we propose

a recursive divide-and-conquer algorithm for constructing the maximum target array (MTA). The lower bound

of the total interconnection length of the MTA has been established. Experimental results show that the

proposed algorithm is capable of reducing the long interconnects by over 33 percent for the MTA derived

from the 512×512 host array with no more than 1 percent faults. Moreover, the proposed total interconnection

length of the target array is close to the lower bound for cases with relatively few faults.

ETPL

VLSI - 084

Constructing Sub-Arrays with Short Interconnects from Degradable VLSI

Arrays
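
The search step described in this abstract (finding the physical r×s subarray with the fewest faulty PEs) can be illustrated with a two-dimensional prefix-sum scan. This is only a minimal sketch of that one step under assumed names (`least_faulty_window`, a 0/1 fault map); it is not the paper's algorithm and it does not perform the rerouting or the extension to the target array.

```python
from typing import List, Tuple


def least_faulty_window(faulty: List[List[int]], r: int, s: int) -> Tuple[int, int, int]:
    """Return (fault_count, top_row, left_col) of the r x s window with fewest faults.

    faulty[i][j] is 1 if the PE at row i, column j is faulty, else 0.
    Ties are resolved in favour of the first window in row-major order.
    """
    rows, cols = len(faulty), len(faulty[0])
    if r > rows or s > cols:
        raise ValueError("window larger than host array")

    # prefix[i][j] = number of faults in rows 0..i-1 and columns 0..j-1.
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (faulty[i][j] + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])

    best = None
    for i in range(rows - r + 1):
        for j in range(cols - s + 1):
            # Fault count of the window via inclusion-exclusion on the prefix table.
            count = (prefix[i + r][j + s] - prefix[i][j + s]
                     - prefix[i + r][j] + prefix[i][j])
            if best is None or count < best[0]:
                best = (count, i, j)
    return best


# Hypothetical 4x4 host array with three faulty PEs.
host = [
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(least_faulty_window(host, 2, 2))
```

The prefix table makes each window count a constant-time lookup, so the whole scan is linear in the size of the host array.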








In this paper, a VLSI implementation of a complete MIMO channel equalization ASIC based on lattice

reduction-aided linear detection is presented. The architecture performs preprocessing steps at channel rate and

low-complexity linear data detection at symbol rate. Preprocessing is based on Seysen's algorithm for lattice

reduction. We present algorithmic improvements of the lattice reduction preprocessing in terms of area and

throughput of the VLSI implementation with minor impact on the error-rate. Due to the low-complexity

implementation of the lattice reduction-aided data detection stage, our architecture is able to achieve very low

power in typical packet-based MIMO wireless data transmission scenarios. The final 90 nm CMOS ASIC

achieves an energy efficiency for the detection of 24 pJ/bit at a throughput of 720 Mbps with near-optimal

error-rate performance.

ETPL

VLSI - 085

A Lattice Reduction-Aided MIMO Channel Equalizer in 90 nm CMOS

Achieving 720 Mb/s
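
The energy-per-bit and throughput figures in this abstract together imply an average detection power, since power is energy per bit multiplied by bits per second. The short sketch below works that out; the function name is illustrative, and the result covers only the detection stage that the 24 pJ/bit figure refers to.

```python
def detection_power_watts(energy_per_bit_j: float, throughput_bps: float) -> float:
    """Average power = energy per bit x bits per second."""
    return energy_per_bit_j * throughput_bps


# Figures quoted in the abstract: 24 pJ/bit at 720 Mbps.
power_w = detection_power_watts(24e-12, 720e6)
print(f"{power_w * 1e3:.2f} mW")
```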

Wire delays and leakage energy consumption are both growing problems in designing large on-chip caches.

Nonuniform cache architecture (NUCA) is a wire-delay aware design paradigm based on the sub-banking of a

cache, which allows the banks closer to the controller to be accessed with reduced latencies with respect to the

other banks. This feature is leveraged by dynamic NUCA (D-NUCA) caches via a migration mechanism

which speeds up frequently used data access, further reducing the effect wire delays have on performance. To

reduce leakage power consumption of static random access memory caches, various micro-architectural

techniques have been proposed. In this brief, we compare the benefits and limits of the application of some of

these techniques to a D-NUCA cache memory, and propose a novel hybrid scheme based on the Drowsy and

Way Adaptable techniques. Such a scheme allows further improvement in leakage reduction and limits the

impact of process variation on the effectiveness of the Drowsy technique.

ETPL

VLSI - 086

Evaluation of Leakage Reduction Alternatives for Deep Submicron Dynamic

Nonuniform Cache Architecture Caches








This paper presents the linearity analysis of a successive approximation register (SAR) analog-to-digital

converter (ADC) with a split DAC structure based on two switching methods: conventional charge-redistribution

and Vcm-based switching. The static linearity performance, namely the integral nonlinearity and

differential nonlinearity, as well as the parasitic effects of the split DAC, are analyzed hereunder. In addition, a

code-randomized calibration technique is proposed to correct the conversion nonlinearity in the conventional

SAR ADC, which is verified by behavioral simulations, as well as measured results. Performances of both

switching methods are demonstrated in 90 nm CMOS. Measurement results of power, speed, and linearity

clearly show the benefits of using Vcm-based switching.

ETPL

VLSI - 087

Split-SAR ADCs: Improved Linearity With Power and Speed Optimization

We present a hybrid analog/digital very large scale integration (VLSI) implementation of a spiking neural

network with programmable synaptic weights. The synaptic weight values are stored in an asynchronous Static

Random Access Memory (SRAM) module, which is interfaced to a fast current-mode event-driven DAC for

producing synaptic currents with the appropriate amplitude values. These currents are further integrated by

current-mode integrator synapses to produce biophysically realistic temporal dynamics. The synapse output

currents are then integrated by compact and efficient integrate and fire silicon neuron circuits with spike-

frequency adaptation and adjustable refractory period and spike-reset voltage settings. The fabricated chip

comprises a total of 32 × 32 SRAM cells, 4 × 32 synapse circuits and 32 × 1 silicon neurons. It acts as a

transceiver, receiving asynchronous events in input, performing neural computation with hybrid analog/digital

circuits on the input spikes, and eventually producing digital asynchronous events in output. Input, output, and

synaptic weight values are transmitted to/from the chip using a common communication protocol based on the

Address Event Representation (AER). Using this representation it is possible to interface the device to a

workstation or a micro-controller and explore the effect of different types of Spike-Timing Dependent

Plasticity (STDP) learning algorithms for updating the synaptic weight values in the SRAM module. We

present experimental results demonstrating the correct operation of all the circuits present on the chip.

ETPL

VLSI - 088

An Event-Based Neural Network Architecture With an Asynchronous

Programmable Synaptic Memory








Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given their limited

energy supply from batteries or scavenging, these nodes must trade data communication for on-the-node

computation. Currently, they are designed around off-the-shelf low-power microcontrollers. But by employing

a more appropriate processing element, the energy consumption can be significantly reduced. This paper

describes the design and implementation of the newly proposed folded-tree architecture for on-the-node data

processing in wireless sensor networks, using parallel prefix operations and data locality in hardware.

Measurements of the silicon implementation show an improvement of 10-20× in terms of energy as compared

to traditional modern micro-controllers found in sensor nodes.

ETPL

VLSI - 089

Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
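
The abstract names parallel prefix operations as the basis of the folded-tree architecture. As background, the sketch below shows a generic tree-structured prefix sum (an up-sweep followed by a down-sweep, in the style usually attributed to Blelloch), simulated sequentially. It is a software illustration only; the function name and the power-of-two length restriction are assumptions, and it does not model the paper's hardware.

```python
from typing import List


def tree_prefix_sum(values: List[int]) -> List[int]:
    """Exclusive prefix sum computed with a binary-tree up-sweep and down-sweep.

    In hardware, all additions on one tree level could run in parallel;
    here each level is simulated with an ordinary loop.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)

    # Up-sweep: build partial sums towards the root of the tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: push prefixes back down from the root to the leaves.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a


print(tree_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Each sweep visits log2(n) tree levels, which is what makes the operation a natural fit for a tree of processing elements.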














ETPL

VLSI - 090










Spin-transfer-torque magnetoresistive random access memory (STT-MRAM) is an emerging type of

nonvolatile memory with compelling advantages in endurability, scalability, speed, and energy consumption.

As the process technology shrinks, STT-MRAM has limited sensing margin due to the decrease in supply

voltage and increase in process variation. Furthermore, the relatively smaller resistance difference of two

states in STT-MRAM poses challenges for its read/write circuit design to maintain an acceptable sensing

margin. The proposed reference circuits optimization scheme solves the reference resistance distribution issue

to maximize the sensing margin and minimize the read disturbance, with low power consumption. Simulation

results show that the optimization scheme is able to significantly improve the read reliability in the presence

of one or a few reference cell failures, thus eliminating the requirement of additional circuits for failure

detection of reference cell or referencing to neighboring blocks.

ETPL

VLSI - 091

Optimization Scheme to Minimize Reference Resistance Distribution of Spin-

Transfer-Torque MRAM

Since the very beginning of RF and microwave integrated techniques and energy harvesting, Schottky diodes

μW-harvesting applications,

the Schottky diode technique fails to provide a satisfactory RF-dc conversion efficiency mainly because of its

high zero-bias junction resistance. This paper examines the state-of-the-art low-power microwave-to-dc

energy conversion techniques. A comprehensive picture of the state-of-the-art on this aspect is given

graphically, which compares different technologies such as transistor, diode, and CMOS schemes. Subsequent

to the highlighted limitations of current devices, this work introduces, for the first time, a nonlinear component

for low-power rectification based on a recent discovery in spintronics, namely, the spindiode. Along with an

analysis of the role of nonlinearity and zero bias resistance in the rectification process of the spindiode, it is

shown how the spindiode could enhance the rectification efficiency even at a very low-power level and how

this technique would shift the design paradigms of diode-based devices and circuits.

ETPL

VLSI - 092

Towards Low-Power High-Efficiency RF and Microwave Energy Harvesting








Superconductor electronics offers logic circuits for high-speed data processing and high-performance

computing. The main barrier to practical application is the lack of high-speed and low-power memory. It is

widely believed that the most reliable and functional bit cell for superconducting memory is the vortex

transitional bit cell, which was successfully used by Nagasawa in a 4-kb memory. This paper reviews existing

challenges in this type of Josephson memory devices and discusses engineering issues in implementing a

model single flux quantum random access memory. We evaluate the contributions that various components of

the memory system make to delay and power dissipation. The 256-bit memory provides an experimentally

confirmed read access time of 190 ps. We found that delay and power dissipation arise largely

in the address decoder, line drivers, bit-selection scheme, and the data readout circuitry. With these circuits

being similar for various magnetic memory devices, our findings provide essential data for a comprehensive

assessment of new concepts for bit cells, readout, and write in superconducting memories.

ETPL

VLSI - 093

Access Time and Power Dissipation of a Model 256-Bit Single Flux Quantum

RAM

We present a theoretical model to analyze all-optical switching by two-photon absorption induced free-carrier

injection in silicon 2 × 2 add-drop microring resonators. The theoretical simulations are in good agreement

with experimental results. The results have been used to design all-optical ultrafast (i) reconfigurable De-

multiplexer/Multiplexer logic circuits using three microring resonator switches and (ii) universal, conservative

and reversible Fredkin and Toffoli logic gates with only one and two microring resonator switches

respectively. Switching has been optimized for low-power (25 mW) ultrafast (25 ps) operation with high

modulation depth (85%) to enable logic operations at 40 Gb/s. The combined advantages of high Q-factor,

tunability, compactness, cascadability, reversibility and reconfigurability make the designs favorable for

practical applications. The proposed designs provide a new paradigm for ultrafast CMOS-compatible all-

optical reversible computing circuits in silicon.

ETPL

VLSI - 094

All-Optical Ultrafast Switching in 2 × 2 Silicon Microring Resonators and its

Application to Reconfigurable DEMUX/MUX and Reversible Logic Gates
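
The logical behaviour of the Fredkin and Toffoli gates mentioned in this abstract can be checked exhaustively in a few lines. The sketch below models only the Boolean truth tables, not the optical switching; the function and variable names are illustrative.

```python
from itertools import product


def toffoli(a: int, b: int, c: int):
    """Controlled-controlled-NOT: flips c when both a and b are 1."""
    return a, b, c ^ (a & b)


def fredkin(c: int, x: int, y: int):
    """Controlled swap: exchanges x and y when the control c is 1."""
    return (c, y, x) if c else (c, x, y)


inputs = list(product((0, 1), repeat=3))

# Reversible: every gate maps the 8 input patterns onto 8 distinct outputs.
toffoli_reversible = len({toffoli(*v) for v in inputs}) == len(inputs)
fredkin_reversible = len({fredkin(*v) for v in inputs}) == len(inputs)

# Conservative: the Fredkin gate never changes the number of 1s.
fredkin_conservative = all(sum(fredkin(*v)) == sum(v) for v in inputs)

# Universal: with the target preset to 1, the Toffoli gate computes NAND.
toffoli_gives_nand = all(toffoli(a, b, 1)[2] == 1 - (a & b)
                         for a, b in product((0, 1), repeat=2))

print(toffoli_reversible, fredkin_reversible, fredkin_conservative, toffoli_gives_nand)
```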








This paper advocates a lifetime-aware progressive programming concept to improve single-level per cell

NAND flash memory write endurance. NAND flash memory program/erase (P/E) cycling gradually degrades

memory cell storage noise margin, and sufficiently strong fault tolerance must be used to ensure the memory

P/E cycling endurance. As a result, the relatively large cell storage noise margin in early memory lifetime is

essentially wasted in conventional design practice. This paper proposes to always fully utilize the available

cell storage noise margin by adaptively adjusting the number of storage levels per cell, and progressively use

these levels to realize multiple 1-bit programming operations between two consecutive erase operations. This

simple progressive programming design concept is realized by two different implementation strategies, which

are discussed and compared in detail. On the basis of an approximate NAND flash memory device model, we

carried out simulations to quantitatively evaluate this design concept. The results show that it can improve the

write endurance by 35.9% while also improving the average programming speed by 12% without

sacrificing read speed.

ETPL

VLSI - 095

Using Lifetime-Aware Progressive Programming to Improve SLC NAND Flash

Memory Write Endurance








A low-voltage low-dropout (LDO) regulator that converts an input of 1 V to an output of 0.85–0.5 V, with 90-

nm CMOS technology is proposed. A simple symmetric operational transconductance amplifier is used as the

error amplifier (EA), with a current splitting technique adopted to boost the gain. This also enhances the

closed-loop bandwidth of the LDO regulator. In the rail-to-rail output stage of the EA, a power noise

cancellation mechanism is formed, minimizing the size of the power MOS transistor. Furthermore, a fast

responding transient accelerator is designed through the reuse of parts of the EA. These advantages allow the

proposed LDO regulator to operate over a wide range of operating conditions while achieving 99.94% current

efficiency, a 28-mV output variation for a 0–100 mA load transient, and a power supply rejection of roughly

50 dB over 0–100 kHz. The area of the proposed LDO regulator is only 0.0041 mm², because of

the compact architecture.

ETPL

VLSI - 096

Design of a Low-Voltage Low-Dropout Regulator
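
To put the quoted 99.94% current efficiency in context, the sketch below applies the usual definitions for a linear regulator: current efficiency is the load current divided by the total input current, and power efficiency is the voltage ratio times the current efficiency. The abstract does not say at which load the efficiency was measured, so evaluating it at the full 100 mA load is an assumption, and the function names are illustrative.

```python
def power_efficiency(v_in: float, v_out: float, current_efficiency: float) -> float:
    """Power efficiency of a linear regulator: (Vout / Vin) x current efficiency."""
    return (v_out / v_in) * current_efficiency


def implied_quiescent_current(i_load: float, current_efficiency: float) -> float:
    """Quiescent current implied by eta = I_load / (I_load + I_q)."""
    return i_load * (1.0 / current_efficiency - 1.0)


# 1 V input and 0.85 V output are from the abstract.
eta_power = power_efficiency(1.0, 0.85, 0.9994)

# Assumption: the 99.94% figure applies at the 100 mA load.
i_q = implied_quiescent_current(0.100, 0.9994)

print(f"power efficiency: {eta_power * 100:.2f} %")
print(f"implied quiescent current: {i_q * 1e6:.1f} uA")
```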

An approach to test application called transparent scan provides an opportunity to share tests among different

logic blocks whose primary inputs and outputs are included in scan chains even if the blocks have different

numbers of state variables. A transparent-scan sequence for one block is likely to detect faults in other blocks

since transparent scan does not distinguish between functional and scan clock cycles, and allows faults to be

detected at all the clock cycles of the sequence. Such sharing of tests is not meaningful with conventional

scan-based tests, especially when the blocks have different numbers of state variables. Transparent scan thus

enhances the ability to produce a compact test set for a group of logic blocks. The static test compaction

procedure described in this paper uses transparent-scan sequences that follow the application of conventional

scan-based tests precisely. The procedure obtains a set of transparent-scan sequences for a group of logic

blocks from compacted test sets for the logic blocks in the group. From this set, it selects a subset that detects

all the target faults, which are detected by the complete set.

ETPL

VLSI - 097

Test Compaction by Sharing of Transparent-Scan Sequences Among Logic

Blocks








A design methodology for incorporating Residue Number System (RNS) and Polynomial Residue Number

System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n), respectively, as well as a VLSI

architecture of a dual-field residue arithmetic Montgomery multiplier are presented in this paper. An analysis

of input/output conversions to/from residue representation, along with the proposed residue Montgomery

multiplication algorithm, reveals common multiply-accumulate data paths both between the converters and

between the two residue representations. A versatile architecture is derived that supports all operations of

Montgomery multiplication in GF(p) and GF(2^n), input/output conversions, Mixed Radix Conversion (MRC)

for integers and polynomials, dual-field modular exponentiation and inversion in the same hardware. Detailed

comparisons with state-of-the-art implementations prove the potential of residue arithmetic exploitation in

dual-field modular multiplication.

ETPL

VLSI - 098

Multifunction Residue Architectures for Cryptography
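
For readers unfamiliar with residue arithmetic, the sketch below shows the basic idea of a Residue Number System for integers: a value is held as its remainders modulo several pairwise-coprime moduli, and multiplication proceeds independently in each channel. The moduli and function names are illustrative assumptions. Reconstruction here uses the Chinese Remainder Theorem rather than the Mixed Radix Conversion named in the abstract, and the example does not cover the polynomial (PRNS) case or the paper's architecture.

```python
from math import prod

# Illustrative pairwise-coprime moduli (assumption); dynamic range M = 7*11*13 = 1001.
MODULI = (7, 11, 13)


def to_rns(x: int, moduli=MODULI):
    """Convert an integer to its tuple of residues."""
    return tuple(x % m for m in moduli)


def rns_mul(a, b, moduli=MODULI):
    """Multiply channel by channel; no carries cross between channels."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI) -> int:
    """Reconstruct the integer in [0, M) with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m


product_rns = rns_mul(to_rns(25), to_rns(30))
print(product_rns, from_rns(product_rns))
```

The result is exact only while the true product stays below the dynamic range M, which is why practical designs choose moduli to cover the operand width.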

Montgomery modular multiplication is widely used in public-key cryptosystems. This work shows how to

relax the data dependency in conventional word-based algorithms to maximize the possibility of reusing the

current words of variables. With the greatly relaxed data dependency, we then propose a novel scheduling

scheme to reduce the number of memory accesses in the developed scalable architecture. Analytical results

show that the memory bandwidth requirement of the proposed scalable architecture is almost 1/(w - 1) times

that of conventional scalable architectures, where w denotes word size. The proposed one also retains a latency

of exactly one cycle between the operations of the same words in two consecutive iterations of the

Montgomery modular multiplication algorithm when employing enough processing elements. Compared to the

design in the related work, experimental results demonstrate that the proposed one achieves an almost 54

percent reduction in power consumption with no degradation in throughput. The reduced number of memory

accesses not only leads to lower power consumption, but also facilitates the design of scalable architectures for

any precision of operands.

ETPL

VLSI - 099

Scalable Montgomery Modular Multiplication Architecture with Low-Latency

and Low-Memory Bandwidth Requirement
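
As background for this entry, the sketch below shows textbook Montgomery modular multiplication on whole integers. It is not the word-based, scalable algorithm or the scheduling scheme the abstract describes; the function names and the small example modulus are assumptions chosen for illustration.

```python
def montgomery_setup(modulus: int, bits: int):
    """Return (R, N') with R = 2**bits and N * N' = -1 (mod R)."""
    r = 1 << bits
    if modulus % 2 == 0 or modulus >= r:
        raise ValueError("modulus must be odd and smaller than R")
    n_prime = (-pow(modulus, -1, r)) % r  # modular inverse, Python 3.8+
    return r, n_prime


def montgomery_multiply(a_bar: int, b_bar: int, modulus: int, bits: int, n_prime: int) -> int:
    """Return a_bar * b_bar * R^-1 mod N using only shifts, masks and multiplies."""
    mask = (1 << bits) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask  # makes t + m*N divisible by R
    u = (t + m * modulus) >> bits      # exact division by R
    return u - modulus if u >= modulus else u


def modmul(a: int, b: int, modulus: int, bits: int) -> int:
    """Compute a*b mod N by converting to and from the Montgomery domain."""
    r, n_prime = montgomery_setup(modulus, bits)
    a_bar = (a * r) % modulus
    b_bar = (b * r) % modulus
    c_bar = montgomery_multiply(a_bar, b_bar, modulus, bits, n_prime)
    # Multiplying by 1 strips the remaining factor of R.
    return montgomery_multiply(c_bar, 1, modulus, bits, n_prime)


print(modmul(123, 456, 1009, 10), (123 * 456) % 1009)
```

The domain conversions only pay off when many multiplications are chained, as in modular exponentiation, where operands stay in Montgomery form throughout.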








Polar codes have recently received a lot of attention because of their capacity-achieving performance and low

encoding and decoding complexity. The performance of the successive cancellation decoder (SCD) of the

polar codes highly depends on that of the partial-sum network (PSN) implementation. Hence, in this work, an

efficient PSN architecture is proposed, based on the properties of polar codes. First, a new partial-sum

updating algorithm and the corresponding PSN architecture are introduced which achieve a delay performance

independent of the code length. Moreover, the area complexity is also reduced. Second, for a high-

performance and area-efficient semi-parallel SCD implementation, a folded PSN architecture is presented to

integrate seamlessly with the folded processing element architecture. This is achieved by using a novel folded

decoding schedule. As a result, both the critical path delay and the area (excluding the memory for folding) of

the semi-parallel SCD are approximately constant for a large range of code lengths. The proposed designs are

implemented in both FPGA and ASIC and compared with the existing designs. Experimental results show that

for polar codes with large code length, the decoding throughput is improved by more than 1.05 times and the

area is reduced by as much as 50.4%, compared with the state-of-the-art designs.

ETPL

VLSI - 100

An Efficient Partial-Sum Network Architecture for Semi-Parallel Polar Codes

Decoder Implementation
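
In successive cancellation decoding, partial sums are modulo-2 combinations of already decoded bits, formed with the same XOR butterfly that the polar encoder uses. The sketch below shows that butterfly as a plain software transform (natural bit order, no bit-reversal permutation). It is a generic illustration under assumed names, not the partial-sum network architecture proposed in the paper.

```python
from typing import List


def polar_transform(u: List[int]) -> List[int]:
    """Apply the n-fold Kronecker power of F = [[1, 0], [1, 1]] over GF(2).

    The input length must be a power of two. Each pass of the outer loop is
    one butterfly stage: the upper element of a pair absorbs the lower one.
    """
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = list(u)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x


bits = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = polar_transform(bits)
# The transform is its own inverse over GF(2).
print(codeword, polar_transform(codeword) == bits)
```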

An implementation of a reduced complexity matching pursuit channel estimator for LTE is presented. The

design contains an FFT/IFFT module with non-radix-2 units and a core estimator. The module is flexible

enough to perform FFT and IFFT at different resolutions needed, using the same hardware. Based on prior

work the needed internal word lengths are found. Internal shifts are employed to maximize the use of available

resources. The design is implemented in a 65 nm low power process from STMicroelectronics. The total area

of the implementation is 1 mm², including input pads and extra control logic. The algorithmic

improvements reduce the complexity by up to 56% compared to prior art. At the same time, the estimator shows

great improvement in speed, allowing over 6 times the number of estimations in the same time. Power

consumption of the estimator is simulated at approximately 20 mW, running at 70 MHz.

ETPL

VLSI - 101

Improved Matching-Pursuit Implementation for LTE Channel Estimation








We present a tree router for multichip systems that guarantees deadlock-free multicast packet routing without

dropping packets or restricting their length. Multicast routing is required to efficiently connect massively

parallel systems' computational units when each unit is connected to thousands of others residing on multiple

chips, which is the case in neuromorphic systems. Our tree router implements this one-to-many routing by

branching recursively, broadcasting the packet within a specified subtree. Within this subtree, the packet is

only accepted by chips that have been programmed to do so. This approach boosts throughput because

memory look-ups are avoided en route, and keeps the header compact because it only specifies the route to the

subtree's root. Deadlock is avoided by routing in two phases (an upward phase and a downward phase) and by

restricting branching to the downward phase. This design is the first fully implemented wormhole router with

packet-branching that can never deadlock. The design's effectiveness is demonstrated in Neurogrid, a million-

neuron neuromorphic system consisting of sixteen chips. Each chip has a 256 × 256 silicon-neuron array

integrated with a full-custom asynchronous VLSI implementation of the router that delivers up to 1.17 G

words/s across the sixteen- 1 μ

ETPL

VLSI - 102

A Multicast Tree Router for Multichip Neuromorphic Systems


This study presents a parallel very large scale integrated circuits architecture for an intra-predictor based on a

fast 4 × 4 algorithm. For real-time scheduling, the proposed algorithm overcomes the data dependency

between intra-prediction and intra-coding, thereby improving coding performance and reducing the number of

coding cycles. The high-speed architecture for intra-prediction includes configurable computation cores to

process YUV components using 10 pixel parallelism. Prediction for one macro-block (MB) coding

(luminance: 4 × 4 and 16 × 16 block modes; chrominance: 8 × 8 block modes) can all be completed within 256

cycles. The proposed architecture achieves throughput of 410 kMB/s, suitable for 1920 × 1080/35 Hz 4:2:0

HDTV encoder at a working frequency of 105 MHz.

ETPL

VLSI - 103

VLSI implementation of high-throughput parallel H.264/AVC baseline intra-

predictor
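
The throughput claim in this abstract can be checked from the other figures it quotes: at most 256 cycles per macroblock at 105 MHz, against the macroblock rate of a 1920 × 1080 stream at 35 Hz. The sketch below does that arithmetic. The names are illustrative, and rounding the frame height up to a whole number of 16 × 16 macroblocks is an assumption about how the frame is padded.

```python
import math


def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of mb_size x mb_size macroblocks needed to cover a frame."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


CLOCK_HZ = 105e6      # working frequency from the abstract
CYCLES_PER_MB = 256   # cycle budget per macroblock from the abstract

throughput_mb_per_s = CLOCK_HZ / CYCLES_PER_MB
required_mb_per_s = macroblocks_per_frame(1920, 1080) * 35  # 35 Hz frame rate

print(f"available: {throughput_mb_per_s / 1e3:.0f} kMB/s")
print(f"required:  {required_mb_per_s / 1e3:.1f} kMB/s")
print("meets requirement:", throughput_mb_per_s >= required_mb_per_s)
```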




