[IEEE Eighth International Application Specific Integrated Circuits Conference - Austin, TX, USA (18-22 Sept. 1995)] Proceedings of Eighth International Application Specific Integrated

A Single-Chip, Asynchronous Echo Canceller for High-speed Data Communication

Richard P. Mackey, Jeffrey J. Rodriguez, Jo Dale Carothers, and Sarma B. K. Vrudhula

Dept. of Electrical and Computer Engineering The University of Arizona

Tucson, Arizona 85721 Internet: rodriguez@ece.arizona.edu

Abstract- A single-chip, 128-coefficient, asynchronous echo canceller has been developed. Cancellation is performed by an FIR filter whose coefficients are adapted using the power-of-two modified LMS algorithm. The pipelined circuit updates all coefficients and generates the filtered output every cycle while allowing a sampling rate greater than 205 kHz.

I. INTRODUCTION

In a communication system, echo occurs when the received signal is coupled onto the desired output signal. This occurs due to transmission line mismatches and reflections off objects. The frequency response of the echo path is time-varying and dependent upon the environment; hence, echo cancellation requires adaptive filtering. Echo cancellation involves two subtasks: echo removal and echo signature reestimation. Hardware implementations normally perform these tasks using separate processors. Fixed digital filter chips are commercially available, but adaptive filtering hardware is currently limited to chip sets and/or boards [1], [5].

The LMS algorithm is a gradient descent algorithm for estimating the echo signature [9]. The original LMS algorithm suffers from either long processing delay or high hardware requirements; hence the need for separate processors in most implementations. In this project, the order of computation in the LMS algorithm was modified to achieve a single-processor implementation.

Echo cancellation is generally performed using a synchronous digital finite-impulse-response (FIR) filter. This approach is limited by clock skew and distribution, pipeline balancing, and power. Asynchronous design removes these limitations. In asynchronous designs, handshaking signals are used to communicate data between register pairs. These signals are locally generated to compensate for hardware delays. Unlike synchronous design, asynchronous design does not have a clock; hence, a speedup of any combinational unit will decrease the latency. Thus, by designing an asynchronous echo canceller, a higher sampling rate can be achieved.

II. ECHO CANCELLATION

Fig. 1 shows a two-way communication system with echo. The main signals are the received data, $r(n)$; the desired transmit data, $t(n)$; the distorted transmit data, $d(n)$; the echo, $y(n)$; the echo estimate, $\hat{y}(n)$; and the output/error, $e(n)$. The echo, $y(n)$, can only be estimated; therefore, $e(n) = t(n) + \epsilon(n)$, where $\epsilon(n)$ is the uncancelled error. Echo cancellation is done by adapting the filter to minimize $\epsilon(n)$.

Fig. 1. Block diagram of communication system.

Echo can be caused by line transformer imbalances, reflections, and objects around the listener [1]. The first two sources occur where signals travel through many lines of different impedance. The third source occurs in speakerphone communication. Two methods to eliminate echo are half-duplex communication and use of an anechoic room. Neither of these methods eliminates echoes generated by transformer variations and impedance mismatches. An adaptive FIR filter can eliminate these effects. The echo signature is determined during the adaptation phase. Then, the adaptation sensitivity is reduced or inhibited. If adaptation is inhibited, periodic readaptation is performed to track any signature variations.

III. THE LMS ALGORITHM

Filter coefficients must adapt to maintain the characteristics of the echo signature. The LMS (least-mean-square) algorithm uses a gradient descent approach to minimize error [9]. The coefficient adaptation equation is $h(n + 1, k) = h(n, k) + \mu\, e(n)\, r(n - k)$, where $h(n, k)$ is the $k$th filter coefficient at time $n$.
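The adaptation equation can be illustrated with a minimal floating-point sketch. This is not the authors' hardware: the function name `lms_echo_canceller`, the 4-tap echo path `true_path`, the step size `mu = 0.1`, and the signal length are all illustrative assumptions, and the transmit signal is set to zero so that $d(n)$ is pure echo.

```python
import random

def lms_echo_canceller(r, d, num_taps, mu):
    """Adapt an FIR echo estimate with h(n+1,k) = h(n,k) + mu*e(n)*r(n-k)."""
    h = [0.0] * num_taps            # filter coefficients h(n, k)
    history = [0.0] * num_taps      # r(n), r(n-1), ..., r(n-N+1)
    errors = []
    for r_n, d_n in zip(r, d):
        history = [r_n] + history[:-1]
        y_hat = sum(hk * rk for hk, rk in zip(h, history))   # echo estimate
        e_n = d_n - y_hat                                     # output/error
        h = [hk + mu * e_n * rk for hk, rk in zip(h, history)]
        errors.append(e_n)
    return h, errors

# Hypothetical echo path and a uniformly distributed received signal.
random.seed(0)
true_path = [0.5, -0.3, 0.2, 0.1]
r = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(true_path[k] * r[n - k] for k in range(len(true_path)) if n - k >= 0)
     for n in range(len(r))]        # t(n) = 0, so d(n) is the echo alone

h, errors = lms_echo_canceller(r, d, num_taps=4, mu=0.1)
print([round(hk, 4) for hk in h])   # should approach true_path
```

With no transmit signal and no noise, the coefficients settle on the assumed echo path and the error decays toward zero; a larger `mu` converges faster until the stability bound is reached.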

In conventional implementations, adaptation is normally performed by either a separate processor or on a periodic basis to reduce overhead time. Periodic coefficient updating causes rapid echo signature changes to be missed until the next update, thereby increasing the residual echo in the output. On the other hand, addition of a coefficient update processor increases hardware cost, software cost, and the area required. These approaches do not facilitate a compact implementation.

0-7803-2707-1/95 $4.00 © 1995 IEEE

The convergence factor, $\mu$, defines how quickly the filter coefficients will respond to changes in the echo signature. Widrow determined bounds on the convergence factor, based on the filter length and reference power [9]. Wang and Chen proposed the use of two convergence factors to decrease convergence time [8]. During the adaptation phase, a convergence factor near the upper bound is used. During the transmission phase, a significantly reduced convergence factor is used.

The adaptation term, $\mu\, e(n)\, r(n - k)$, requires two multiplications. Several modifications have been proposed to reduce this complexity, trading convergence speed against accuracy. The most dramatic hardware and time savings occur if the convergence factor is constrained to be a power of two. The resulting multiplication is replaced by an arithmetic shift. Since the convergence factor must be less than the upper bound, this option yields only a minor increase in convergence time [7]. The update term has also been modified using the signum function. While these modifications result in a smaller implementation, they suffer from higher residual error and longer convergence time, and they converge in fewer environments.

IV. SYNCHRONOUS vs. ASYNCHRONOUS DESIGN

In synchronous design, the clock speed is governed by the longest path in any combinational logic block (CLB). Most systems today process data in a pipeline. A pipeline is broken into stages consisting of a register-CLB pair. As CLB delays decrease, clock frequencies can increase. Due to design issues like clock distribution and skew, the maximum possible clock frequencies are not being achieved. Skew must be included in the calculation of the final clock frequency, hence reducing the overall clock frequency.

Asynchronous design eliminates clock skew and distribution problems by using handshaking instead of a clock. The normal flow of an asynchronous system is as follows: (1) presentation of valid data, (2) request line activation, and (3) acknowledge line activation when data has been taken. Sutherland showed that synchronously pipelined circuits can also be asynchronously pipelined. Furthermore, the resulting asynchronously pipelined circuit can have lower latency [6].
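The three-step flow can be mimicked in software with a toy channel model. The sketch below is only a behavioral illustration: the class `Channel`, the helper `run_pipeline`, and the return-to-zero `release` step are assumptions made for this example, and the source does not state which signaling convention the module library uses.

```python
class Channel:
    """Toy request/acknowledge channel between two pipeline stages."""

    def __init__(self):
        self.data = None
        self.req = False
        self.ack = False

    def send(self, value):
        # (1) present valid data, then (2) activate the request line.
        if self.req or self.ack:
            raise RuntimeError("previous transfer not finished")
        self.data = value
        self.req = True

    def receive(self):
        # (3) activate acknowledge once the data has been taken.
        if not self.req or self.ack:
            raise RuntimeError("no pending request")
        value = self.data
        self.ack = True
        return value

    def release(self):
        # Assumed return-to-zero step: both lines go inactive again.
        self.req = False
        self.ack = False


def run_pipeline(samples, stage_fns):
    """Push each sample through the stages, one handshake per stage."""
    channels = [Channel() for _ in stage_fns]
    results = []
    for sample in samples:
        value = sample
        for channel, fn in zip(channels, stage_fns):
            channel.send(value)
            value = fn(channel.receive())
            channel.release()
        results.append(value)
    return results


outputs = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
print(outputs)
```

Each stage proceeds as soon as its own handshake completes, so no global clock appears anywhere in the model; a sender that tries to issue a second request before the first is acknowledged and released is rejected.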

Sutherland’s initial module library has been modified and expanded by several researchers. For example, the basic elements used in this design are Forks, Joins, Selectors, Distributors, Registers, Atomics [10], and the new startactive element [4].

The echo canceller required a new library element, called startactive. It causes request-out to become active after reset and ignores the first acknowledge-out. It is primarily used in memory, selection, and distribution circuits. Fig. 2 shows the hardware implementation for this element. The “in out” blocks are chains of inverters used to produce the required clock pulse width.

Fig. 2. The startactive schematic symbol and implementation.

V. AN ASYNCHRONOUS ADAPTIVE ECHO CANCELLER

Conventional LMS echo cancellers are governed by

$$\hat{y}(n) = \sum_{k=0}^{N-1} h(n, k)\, r(n - k) \qquad (1)$$

$$e(n) = d(n) - \hat{y}(n) \qquad (2)$$

$$h(n + 1, k) = h(n, k) + \mu\, e(n)\, r(n - k), \quad \forall k \qquad (3)$$

where $N$ is the filter length. However, echo cancellers governed by (2)-(3) cannot update their filter coefficients until $e(n)$ is calculated. When rewritten as follows, $\hat{y}(n)$ and the new filter coefficients can be pipelined, allowing the single-chip specification to be achieved [4]:

$$h(n, k) = h(n - 1, k) + e(n - 1)\, r(n - 1 - k)\, 2^{-p}, \quad p \in \{1, 2, \ldots, 16\}, \quad k = 0 \text{ to } N - 1 \qquad (4)$$

$$\hat{y}(n) = \sum_{k=0}^{N-1} h(n, k)\, r(n - k) \qquad (5)$$

$$e(n) = d(n) - \hat{y}(n) \qquad (6)$$

Here, $\mu = 2^{-p}$. In the new processing sequence, a filter coefficient is updated and used to generate the next partial sum in the echo estimate. After all the filter coefficients have been processed, the final output is generated.
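The reordering in (4)-(6) changes when each coefficient is updated, not what is computed. The following floating-point sketch compares the two orderings on the same data; the function names, the 3-tap `echo_path`, the filter length of 8, and the choice `p = 3` are assumptions for illustration and do not model the chip's fixed-point arithmetic.

```python
import random

def conventional_lms(r, d, num_taps, p):
    """Eqs. (1)-(3): form e(n) first, then update every coefficient."""
    mu = 2.0 ** -p
    h = [0.0] * num_taps
    hist = [0.0] * num_taps                 # hist[k] = r(n - k)
    errors = []
    for r_n, d_n in zip(r, d):
        hist = [r_n] + hist[:-1]
        y_hat = 0.0
        for k in range(num_taps):
            y_hat += h[k] * hist[k]         # (1)
        e_n = d_n - y_hat                   # (2)
        for k in range(num_taps):
            h[k] += e_n * hist[k] * mu      # (3)
        errors.append(e_n)
    return errors

def reordered_lms(r, d, num_taps, p):
    """Eqs. (4)-(6): update coefficient k, then use it in the partial sum."""
    mu = 2.0 ** -p
    h = [0.0] * num_taps
    hist = [0.0] * (num_taps + 1)           # hist[j] = r(n - j), j = 0..N
    e_prev = 0.0                            # e(n - 1); zero before the first sample
    errors = []
    for r_n, d_n in zip(r, d):
        hist = [r_n] + hist[:-1]
        y_hat = 0.0
        for k in range(num_taps):
            h[k] += e_prev * hist[k + 1] * mu   # (4): uses r(n - 1 - k)
            y_hat += h[k] * hist[k]             # (5): running partial sum
        e_prev = d_n - y_hat                    # (6)
        errors.append(e_prev)
    return errors

# Hypothetical test signals: uniform data through a made-up echo path.
random.seed(1)
echo_path = [0.4, -0.25, 0.1]
r = [random.uniform(-1.0, 1.0) for _ in range(500)]
d = [sum(c * r[n - k] for k, c in enumerate(echo_path) if n - k >= 0)
     for n in range(len(r))]

e_conv = conventional_lms(r, d, num_taps=8, p=3)
e_new = reordered_lms(r, d, num_taps=8, p=3)
max_diff = max(abs(a - b) for a, b in zip(e_conv, e_new))
print(max_diff)
```

In the reordered loop, each coefficient is touched once per sample, which is what lets the update and the multiply-accumulate share a single pipeline.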

The implementation is shown in Fig. 3. Here, cntrlr is the controller, r-gen stores and distributes the received data sample, intermem stores the filter coefficients, update1 calculates the adaptation value and updates each coefficient, convolve generates $\hat{y}(n)$ and $e(n)$, and output distributes $e(n)$ to the listener and to internal components.

Fig. 3. Asynchronous echo canceller interconnection diagram

The DELTIC filter is similar to the proposed implementation [2]. The DELTIC filter updates its coefficients and generates its echo estimate in parallel. The filter is based on the following parallel equations:

$$h(n, k) = h(n - 1, k) + \mu\, e(n - 1)\, r(n - k)$$

$$\hat{y}(n, k) = \hat{y}(n, k - 1) + h(n - 1, k)\, r(n - k)$$

Here, the filter coefficients used to calculate $\hat{y}(n)$ are delayed by one sampling period because the coefficients are updated in parallel with the summation. Hence, the echo estimate is less accurate than the estimate generated by (5)-(6). The DELTIC filter does eliminate the delay imposed to update the filter coefficient before it is used; however, for a pipelined design, this delay is only incurred once and is small compared to the sampling period.

VI. IMPLEMENTATION ISSUES

The cntrlr ensures that the entire time line is traversed and controls the synchronization of the overall echo cancellation process. Synchronization in an asynchronous design refers to the generation of the required internal requests and the alignment of those requests with the data path requests such that no stalling or starvation occurs.

The controller requests must be aligned properly in time. If buffer registers are not added to compensate for the hardware stages, control request starvation will occur. Specifically, when a request is forked, all places must take the output request before another request can be generated. If the request is fed forward, it will not be absorbed until the predecessor tasks have been completed. This results in only one request flowing in a pipeline at a time. By adding buffer registers, the number of time steps saved equals the number of buffer stages. For a system with $m$ compensation registers and $N$ data requests, the total number of time steps saved is $(N - 1)m$.

The intermem block contains 128 asynchronous registers grouped into four interleaved sets of 32 registers to eliminate output stall and input blockage. The asynchronous registers form a circular register file. The register file could not be constructed using only asynchronous registers. Asynchronous registers use latches to hold data, whereas synchronous register files use flip-flops. The 128 data requests must be placed in the register file to make them equivalent. A startactive element is placed between each of the asynchronous registers to do this. The overhead incurred is approximately seven latches per asynchronous register, or approximately a 33% hardware increase for a 16-bit register.

The r-gen block stores and distributes the received data samples. Data is stored in a synchronous register file. An internally generated signal is used to clock all of the registers. This signal is also used to generate the request signals which receive the stored data. Three factors led to the decision to have a synchronous section in an asynchronous design: all of the coefficients move in lockstep, similar to a standard shift register; the delay in the long chain of asynchronous registers made it necessary to add a large number of extra registers to prevent stalls and starvation; and the additional hardware needed to make an asynchronous register file would be inefficient in area.


The startactive element solved another issue involving the Selector’s data valid request line. A Selector must have a request on its select input line, valid data on the selection line(s), a request on the selected request line, and no active request-out before generating another request-out. Hence, two input requests are required. Typically, the select data valid request must be internally generated; however, after reset, there are no initial internal requests. The new startactive element generates the internal request needed. Immediately after reset, the startactive element places a request on a register’s request line. The data and request-out from the register are fed to the Selector’s data valid request and data selection line(s) to select the desired input port.

Finally, since the convergence factor is chosen to be a power of two, convergence factor multiplication is implemented with an arithmetic right barrel shifter. The shift is from 1 to 15 bits. The inputs are a 4-bit shift value and the output from the multiplier. If a zero shift value is specified, then the coefficients are frozen by zeroing the update term. This implementation allows the user to customize the echo canceller for a particular application. The user must ensure that the convergence factor is stable before the start of the next cycle.
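A behavioral sketch of the shifter is given below, assuming the multiplier output is a signed integer in some fixed-point format. The function name `update_term` and the integer representation are assumptions for illustration; the source does not give the word lengths or rounding behavior of the actual datapath.

```python
def update_term(product, shift):
    """Scale the multiplier output by 2**-shift using an arithmetic right shift.

    product: signed integer (assumed fixed-point multiplier output).
    shift:   4-bit control value; 0 freezes the coefficients.
    """
    if not 0 <= shift <= 15:
        raise ValueError("shift must fit in 4 bits")
    if shift == 0:
        return 0                 # zero the update term: coefficients are frozen
    return product >> shift      # Python's >> on ints preserves the sign


print(update_term(1024, 5))      # positive product scaled by 2**-5
print(update_term(-1024, 5))     # sign is preserved
print(update_term(12345, 0))     # frozen: no update
```

An arithmetic right shift rounds toward negative infinity, so small negative products shift to -1 rather than 0 in this model; whether the hardware compensates for that bias is not stated in the source.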

VII. PERFORMANCE

This implementation was captured and simulated using software by Compass Design Automation and contains approximately 312 600 transistors. Due to the design size and the considerable time required to characterize the coefficients, chip-level timing simulation was performed only to determine throughput and to validate hardware functionality. The design was not fabricated due to chip turnaround time, cost, and the additional design effort required to build a test platform. The following procedures were used to validate the design, determine the operating rate, and calculate the area estimate.

Timing simulation was performed to verify operation and to estimate system performance. Table I shows a comparison of the theoretical synchronous and the nominal asynchronous 128-coefficient design. The first coefficient pair completion delay, the intercoefficient delay, and the throughput are shown in the table. Throughput governs the ADC sampling rate and the allowed signal bandwidth. The first coefficient pair completion can be considered to be the latency of the pipeline. For operating rate comparisons, the intercoefficient delay can be considered to be the circuit’s pipeline period. Throughput equals the time to complete the first coefficient pair plus $N - 2$ times the intercoefficient delay. Each functional block was independently routed and pieced together using the place and route tool. The final implementation can be fabricated on a die not larger than 9.25 mm by 7.25 mm.

A method for comparing an asynchronous design to a synchronous design is to determine the longest combinational logic path and add a small compensation factor to account for register delay, setup time, and hold time. The resulting delay is the clock period for the equivalent ideal synchronous circuit. Throughput for a synchronous design equals $N - 1 + m$ clock periods, where $m$ is the number of stages. For this design, the operating period would be approximately 33.2 ns (approximately 30.1 MHz).

TABLE I
THEORETICAL SYNCHRONOUS VS. ASYNCHRONOUS PERFORMANCE.

                                 Asynchronous Design   Synchronous Design
No. of Coefficients              128                   128
First Coefficient Pair Delay     658 ns                1859 ns
Intercoefficient Delay/Period    33.2 ns               33.2 ns
No. of Stages                    55                    55
Throughput                       4842 ns               6042 ns
Speedup [3]                      25%                   N/A
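The Table I figures can be cross-checked from the two throughput formulas given in the text. The short calculation below uses only values quoted in the paper; the variable names are chosen for this example, and small differences from the tabulated numbers are expected because the tabulated delays are rounded.

```python
N = 128                       # number of coefficients
m = 55                        # number of pipeline stages
period_ns = 33.2              # intercoefficient delay / clock period
first_pair_async_ns = 658.0   # asynchronous first coefficient pair delay

# Asynchronous: first pair delay plus (N - 2) intercoefficient delays.
async_ns = first_pair_async_ns + (N - 2) * period_ns
# Synchronous: (N - 1 + m) clock periods.
sync_ns = (N - 1 + m) * period_ns

speedup = sync_ns / async_ns - 1.0
sample_rate_khz = 1e6 / async_ns      # one output per async_ns nanoseconds
clock_mhz = 1e3 / period_ns

print(f"asynchronous: {async_ns:.1f} ns per sample")
print(f"synchronous:  {sync_ns:.1f} ns per sample")
print(f"speedup:      {100 * speedup:.1f} %")
print(f"sample rate:  {sample_rate_khz:.1f} kHz")
print(f"clock:        {clock_mhz:.1f} MHz")
```

The results come out at roughly 4841 ns and 6042 ns per sample, a speedup just under 25%, and a sampling rate slightly above 205 kHz, in line with the values reported.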

Simulation showed that $p = 5$ gave the fastest stable convergence factor for a uniformly distributed $[-1, 1)$ echo signal, and the average normalized error power was less than $e^{-3}$ after 250 samples. The error power can be decreased by simply increasing the length of the adaptation period. Following the technique used by [1], the transmission convergence factor was determined to be $p \approx 13.3$ for Bellcore physical transmission loop 2.

Fig. 4 shows the echo canceller’s normalized error power. It was generated by coupling a uniformly distributed $[-1, 1)$ echo signal onto the input signal, $t(n)$, and averaging the results over 1000 simulation runs. During the adaptation phase (samples 0-299), $t(n) = 0$ and $p = 5$. During the transmission phase (samples 300-499), adaptation was inhibited and $t(n)$ was generated by another uniformly distributed $[-1, 1)$ random number generator. Finally, during readaptation (samples 500-1499), $t(n) = 0$ and $p = 14$.

Fig. 5 plots the frequency response of the echo path used in [1] and the estimated echo spectrum at the end of the adaptation phase.

VIII. CONCLUSION

An asynchronous, single-chip echo canceller was presented. Cancellation is performed using a 128-coefficient, power-of-two, LMS-based, adaptive FIR filter. The echo canceller is implemented in a pipelined configuration which enables filter coefficient updating and echo cancellation every cycle. Unlike many other implementations, the implementation presented generates the filter output based completely on the most recent coefficients.

As integration densities continue to increase, clock skew and distribution become paramount issues. Because asynchronous design eliminates the clock, the overall circuit performance can be increased. This is due to the fact that the circuit latency will be decreased if any combinational logic block delay is decreased.

The performance of the asynchronous echo canceller was compared to its theoretical synchronous counterpart. The asynchronous canceller’s performance was 25% faster, enabling a sampling rate of about 205 kHz. The echo canceller’s chip size is approximately 9.25 mm by 7.25 mm.

Fig. 4. Average error power, Bellcore transmission loop 2 (horizontal axis: samples, 0 to 1400).

Fig. 5. Bellcore transmission loop 2 echo spectrum and estimate.

REFERENCES

[1] W. Y. Chen, J. L. Dixon and D. L. Waring, “High Bit Rate Digital Line Echo Cancellation,” IEEE Journal on Selected Areas in Communications, vol. 9, Aug. 1991, pp. 848-860.

[2] C. F. Cowan, et al., “An Evaluation of Analogue and Digital Adaptive Filter Realisations,” Int’l. Specialist Seminar on Case Studies in Advanced Signal Processing, Sept. 1979, pp. 178-183.

[3] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann Publishers, 1990.

[4] R. P. Mackey, An Asynchronous, Single-Chip, LMS Based, Adaptive FIR Echo Canceller. M.S. thesis, Dept. of Electrical & Computer Engineering, The Univ. of Arizona, May 1995.

[5] F. Lu and H. Samueli, “A 60-MBaud Adaptive Transversal Equalizer in 1.0 µm CMOS for QAM Digital Modems,” Proc. of the 1993 Custom Integrated Circuits Conference, March 1993, pp. 16.6.1-16.6.4.

[6] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, June 1989, pp. 720-738.

[7] J. R. Treichler, C. R. Johnson and M. G. Larimore, Theory and Design of Adaptive Filters. New York: John Wiley and Sons Publications, 1987.

[8] C.-L. Wang and R.-Y. Chen, “Optimum Design of the LMS Algorithm Using Two Step Sizes for Adaptive FIR Filtering,” Signal Processing, vol. 26, Feb. 1992, pp. 197-204.

[9] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice Hall, Inc., 1985.

[10] T.-Y. Wuu, Synthesis of Asynchronous Systems from Data Flow Specifications. Ph.D. diss., Dept. of Electrical Engineering Systems, The Univ. of Southern California, July 1994.
