ELEG-636: Statistical Signal Processing
Gonzalo R. Arce
Department of Electrical and Computer Engineering, University of Delaware
Spring 2010
Course Objectives & Structure

Objective: Given a discrete-time sequence {x(n)}, develop
- Statistical and spectral signal representations
- Filtering, prediction, and system identification algorithms
- Optimization methods that are statistical and adaptive

Course Structure:
- Weekly lectures [notes: www.ece.udel.edu/~arce]
- Periodic homework (theory & Matlab implementations) [15%]
- Midterm & Final examinations [85%]

Textbook: Haykin, Adaptive Filter Theory.
Broad applications in communications, imaging, and sensors. Emerging applications in:
- Brain-imaging techniques
- Brain-machine interfaces
- Implantable devices

Neurofeedback presents real-time physiological signals from MRIs in a visual or auditory form to provide information about brain activity. These signals are used to train the patient to alter neural activity in a desired direction. Traditionally, feedback using EEGs or other mechanisms has not focused on the brain because the resolution is not good enough.
Adaptive Optimization and Filtering Methods

Motivation
Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when:
- Signal statistics are not known a priori and must be "learned" from observed or representative samples
- Signal statistics evolve over time
- Time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed-form expressions

The following algorithms are considered:
- Steepest Descent (SD) – deterministic
- Least Mean Squares (LMS) – stochastic
- Recursive Least Squares (RLS) – deterministic
Steepest Descent

Definition (Steepest Descent (SD))
Steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point.

- SD is an old, deterministic method that is the basis for stochastic gradient-based methods
- SD is a feedback approach to finding a local minimum of an error performance surface
- The error surface must be known a priori
- In the MSE case, SD converges to the optimal solution, w₀ = R⁻¹p, without inverting a matrix

Question: Why in the MSE case does this converge to the global minimum rather than a local minimum?
Example
Consider a well-structured cost function with a single minimum. The optimization proceeds as follows:
[Contour plot showing the evolution of the optimization]
Example
Consider a gradient ascent example in which there are multiple minima/maxima:
[Surface plot showing the multiple minima and maxima; contour plot illustrating that the final result depends on the starting value]
To derive the approach, consider the FIR case:
- {x(n)} – the WSS input samples
- {d(n)} – the WSS desired output
- d̂(n) – the estimate of the desired signal, given by

d̂(n) = w^H(n)x(n)

where
x(n) = [x(n), x(n−1), ···, x(n−M+1)]^T   [observation vector]
w(n) = [w₀(n), w₁(n), ···, w_{M−1}(n)]^T   [time-indexed filter coefficients]
Then, similarly to previously considered cases,

e(n) = d(n) − d̂(n) = d(n) − w^H(n)x(n)

and the MSE at time n is

J(n) = E{|e(n)|²} = σ_d² − w^H(n)p − p^H w(n) + w^H(n)Rw(n)

where
- σ_d² – variance of the desired signal
- p – cross-correlation between x(n) and d(n)
- R – correlation matrix of x(n)

Note: The weight vector and cost function are time indexed (functions of time).
When w(n) is set to the (optimal) Wiener solution,

w(n) = w₀ = R⁻¹p

and

J(n) = J_min = σ_d² − p^H w₀

Use the method of steepest descent to iteratively find w₀. The optimal result is achieved since the cost function is a second-order polynomial with a single unique minimum.
Example
Let M = 2. The MSE is a bowl-shaped surface, which is a function of the 2-D weight vector w(n). A contour of the MSE is given as:

[Surface plot of J(w) over (w₁, w₂); contour plot showing the negative gradient −∇J(w), with components −∂J/∂w₁ and −∂J/∂w₂]

Imagine dropping a marble at any point on the bowl-shaped surface. The ball will reach the minimum point by going through the path of steepest descent.
Observation: Set the direction of the filter update as −∇J(n). Resulting update:

w(n+1) = w(n) + ½µ[−∇J(n)]

or, since ∇J(n) = −2p + 2Rw(n),

w(n+1) = w(n) + µ[p − Rw(n)],   n = 0, 1, 2, ···

where w(0) = 0 (or another appropriate value) and µ is the step size.
Observation: SD uses feedback, which makes it possible for the system to be unstable.
- Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970)
Convergence Analysis

Define the error vector for the tap weights as

c(n) = w(n) − w₀

Then, using p = Rw₀ in the update,

w(n+1) = w(n) + µ[p − Rw(n)]
       = w(n) + µ[Rw₀ − Rw(n)]
       = w(n) − µRc(n)

and subtracting w₀ from both sides,

w(n+1) − w₀ = w(n) − w₀ − µRc(n)
⇒ c(n+1) = c(n) − µRc(n) = [I − µR]c(n)
Using the unitary similarity transform

R = QΛQ^H

we have

c(n+1) = [I − µR]c(n) = [I − µQΛQ^H]c(n)
⇒ Q^H c(n+1) = [Q^H − µQ^H QΛQ^H]c(n) = [I − µΛ]Q^H c(n)   (∗)

Define the transformed coefficients as

v(n) = Q^H c(n) = Q^H(w(n) − w₀)

Then (∗) becomes

v(n+1) = [I − µΛ]v(n)
Consider the initial condition of v(n):

v(0) = Q^H(w(0) − w₀) = −Q^H w₀   [if w(0) = 0]

Consider the kth term (mode) in

v(n+1) = [I − µΛ]v(n)

- Note [I − µΛ] is diagonal
- Thus all modes are updated independently
- The update for the kth term can be written as

v_k(n+1) = (1 − µλ_k)v_k(n),   k = 1, 2, ···, M

or, using recursion,

v_k(n) = (1 − µλ_k)ⁿ v_k(0)
Observation: Convergence to the optimal solution requires

lim_{n→∞} w(n) = w₀
⇒ lim_{n→∞} c(n) = lim_{n→∞} w(n) − w₀ = 0
⇒ lim_{n→∞} v(n) = lim_{n→∞} Q^H c(n) = 0
⇒ lim_{n→∞} v_k(n) = 0,   k = 1, 2, ···, M   (∗)

Result: According to the recursion

v_k(n) = (1 − µλ_k)ⁿ v_k(0)

the limit in (∗) holds if and only if

|1 − µλ_k| < 1 for all k

Thus, since the eigenvalues are nonnegative, 0 < µλ_max < 2, or

0 < µ < 2/λ_max
Observation: The kth mode has geometric decay:

v_k(n) = (1 − µλ_k)ⁿ v_k(0)

The rate of decay is characterized by the time it takes to decay to e⁻¹ of the initial value. Let τ_k denote this time for the kth mode:

v_k(τ_k) = (1 − µλ_k)^{τ_k} v_k(0) = e⁻¹ v_k(0)
⇒ e⁻¹ = (1 − µλ_k)^{τ_k}
⇒ τ_k = −1/ln(1 − µλ_k) ≈ 1/(µλ_k)   for µ ≪ 1

Result: The overall rate of decay is bounded by

−1/ln(1 − µλ_max) ≤ τ ≤ −1/ln(1 − µλ_min)
Example
Consider the typical behavior of a single mode.
Error Analysis
Recall that

J(n) = J_min + (w(n) − w₀)^H R (w(n) − w₀)
     = J_min + (w(n) − w₀)^H QΛQ^H (w(n) − w₀)
     = J_min + v^H(n)Λv(n)
     = J_min + Σ_{k=1}^{M} λ_k |v_k(n)|²   [substituting v_k(n) = (1 − µλ_k)ⁿ v_k(0)]
     = J_min + Σ_{k=1}^{M} λ_k (1 − µλ_k)^{2n} |v_k(0)|²

Result: If 0 < µ < 2/λ_max, then

lim_{n→∞} J(n) = J_min
Example: Predictor

Example
Consider a two-tap predictor for real-valued input.
Analyze the effects of the following cases:
- Varying the eigenvalue spread χ(R) = λ_max/λ_min while keeping µ fixed
- Varying µ while keeping the eigenvalue spread χ(R) fixed
SD loci plots (with J(n) contours shown) as a function of [v₁(n), v₂(n)] for step size µ = 0.3:
- Eigenvalue spread χ(R) = 1.22: small eigenvalue spread ⇒ modes converge at a similar rate
- Eigenvalue spread χ(R) = 3: moderate eigenvalue spread ⇒ modes converge at moderately similar rates
SD loci plots (with J(n) contours shown) as a function of [v₁(n), v₂(n)] for step size µ = 0.3:
- Eigenvalue spread χ(R) = 10: large eigenvalue spread ⇒ modes converge at different rates
- Eigenvalue spread χ(R) = 100: very large eigenvalue spread ⇒ modes converge at very different rates; principal-direction convergence is fastest
SD loci plots (with J(n) contours shown) as a function of [w₁(n), w₂(n)] for step size µ = 0.3:
- Eigenvalue spread χ(R) = 1.22: small eigenvalue spread ⇒ modes converge at a similar rate
- Eigenvalue spread χ(R) = 3: moderate eigenvalue spread ⇒ modes converge at moderately similar rates
SD loci plots (with J(n) contours shown) as a function of [w₁(n), w₂(n)] for step size µ = 0.3:
- Eigenvalue spread χ(R) = 10: large eigenvalue spread ⇒ modes converge at different rates
- Eigenvalue spread χ(R) = 100: very large eigenvalue spread ⇒ modes converge at very different rates; principal-direction convergence is fastest
Learning curves of the steepest-descent algorithm with step-size parameter µ = 0.3 and varying eigenvalue spread.
SD loci plots (with J(n) contours shown) as a function of [v₁(n), v₂(n)] with χ(R) = 10 and varying step sizes:
- Step size µ = 0.3: over-damped ⇒ slow convergence
- Step size µ = 1: under-damped ⇒ fast (erratic) convergence
SD loci plots (with J(n) contours shown) as a function of [w₁(n), w₂(n)] with χ(R) = 10 and varying step sizes:
- Step size µ = 0.3: over-damped ⇒ slow convergence
- Step size µ = 1: under-damped ⇒ fast (erratic) convergence
Example
Consider a system identification problem:

[Block diagram: the input {x(n)} drives both the unknown system, producing d(n), and the adaptive filter w(n), producing d̂(n); the error e(n) = d(n) − d̂(n) drives the adaptation]

Suppose M = 2 and

R_x = [1 0.8; 0.8 1],   p = [0.8; 0.5]
From eigen-analysis we have

λ₁ = 1.8, λ₂ = 0.2 ⇒ µ < 2/1.8

Also,

q₁ = (1/√2)[1; 1],   q₂ = (1/√2)[1; −1]

and

Q = (1/√2)[1 1; 1 −1]

Also,

w₀ = R⁻¹p = [1.11; −0.389]
Thus

v(n) = Q^H[w(n) − w₀]

Noting that

v(0) = −Q^H w₀ = −(1/√2)[1 1; 1 −1][1.11; −0.389] = [0.51; 1.06]

we have

v₁(n) = (1 − µ(1.8))ⁿ · 0.51
v₂(n) = (1 − µ(0.2))ⁿ · 1.06
SD convergence properties for two µ values:
- Step size µ = 0.5: over-damped ⇒ slow convergence
- Step size µ = 1: under-damped ⇒ fast (erratic) convergence
Least Mean Squares (LMS)

Definition (Least Mean Squares (LMS) Algorithm)
Motivation: The error performance surface used by the SD method is not always known a priori.
Solution: Use estimated values. We will use the following instantaneous estimates:

R̂(n) = x(n)x^H(n)
p̂(n) = x(n)d*(n)

Result: The estimates are random variables, and thus this leads to a stochastic optimization.

Historical Note: Invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff.
Recall the SD update

w(n+1) = w(n) + ½µ[−∇J(n)]

where the gradient of the error surface at w(n) was shown to be

∇J(n) = −2p + 2Rw(n)

Using the instantaneous estimates,

∇̂J(n) = −2x(n)d*(n) + 2x(n)x^H(n)w(n)
       = −2x(n)[d*(n) − x^H(n)w(n)]
       = −2x(n)[d*(n) − d̂*(n)]
       = −2x(n)e*(n)

where e*(n) is the complex conjugate of the estimation error.
Utilizing ∇̂J(n) = −2x(n)e*(n) in the update:

w(n+1) = w(n) + ½µ[−∇̂J(n)] = w(n) + µx(n)e*(n)   [LMS update]

- The LMS algorithm belongs to the family of stochastic gradient algorithms
- The update is extremely simple
- Although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates
- The simplicity and good performance of the LMS algorithm make it the benchmark against which other optimization algorithms are judged; a minimal sketch follows
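A minimal implementation sketch of the LMS recursion for an M-tap filter (function and variable names here are illustrative, not from the course notes):

```matlab
% LMS adaptive filter (sketch). x: input column vector, d: desired, M: taps.
function [w, e] = lms_sketch(x, d, M, mu)
    N = length(x);
    w = zeros(M, 1);                     % w(0) = 0
    e = zeros(N, 1);
    for n = M:N
        xn   = x(n:-1:n-M+1);            % x(n) = [x(n), ..., x(n-M+1)]^T
        e(n) = d(n) - w' * xn;           % e(n) = d(n) - w^H(n) x(n)
        w    = w + mu * xn * conj(e(n)); % w(n+1) = w(n) + mu x(n) e*(n)
    end
end
```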
Convergence Analysis

Independence Theorem
The following conditions hold:
1. The vectors x(1), x(2), ···, x(n) are statistically independent
2. x(n) is independent of d(1), d(2), ···, d(n−1)
3. d(n) is statistically dependent on x(n), but is independent of d(1), d(2), ···, d(n−1)
4. x(n) and d(n) are mutually Gaussian

- The independence theorem is invoked in the LMS algorithm analysis
- The independence theorem is justified in some cases, e.g., beamforming, where we receive independent vector observations
- In other cases it is not well justified, but it allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions)
We will invoke the independence theorem to show that w(n) converges to the optimal solution in the mean:

lim_{n→∞} E{w(n)} = w₀

To prove this, evaluate the update:

w(n+1) = w(n) + µx(n)e*(n)
⇒ w(n+1) − w₀ = w(n) − w₀ + µx(n)e*(n)
⇒ c(n+1) = c(n) + µx(n)(d*(n) − x^H(n)w(n))
          = c(n) + µx(n)d*(n) − µx(n)x^H(n)[w(n) − w₀ + w₀]
          = c(n) + µx(n)d*(n) − µx(n)x^H(n)c(n) − µx(n)x^H(n)w₀
          = [I − µx(n)x^H(n)]c(n) + µx(n)[d*(n) − x^H(n)w₀]
          = [I − µx(n)x^H(n)]c(n) + µx(n)e₀*(n)
Take the expectation of the update, noting that
- w(n) is based on past inputs and desired values
- ⇒ w(n), and consequently c(n), are independent of x(n) (independence theorem)

Thus

c(n+1) = [I − µx(n)x^H(n)]c(n) + µx(n)e₀*(n)
⇒ E{c(n+1)} = (I − µR)E{c(n)} + µE{x(n)e₀*(n)}   [second term = 0, why?]
            = (I − µR)E{c(n)}

Using arguments similar to the SD case, we have

lim_{n→∞} E{c(n)} = 0   if 0 < µ < 2/λ_max

or equivalently

lim_{n→∞} E{w(n)} = w₀   if 0 < µ < 2/λ_max
Noting that Σ_{i=1}^{M} λ_i = trace[R],

⇒ λ_max ≤ trace[R] = M r(0) = M σ_x²

Thus a more conservative bound (and one easier to determine) is

0 < µ < 2/(M σ_x²)

Convergence in the mean,

lim_{n→∞} E{w(n)} = w₀

is a weak condition that says nothing about the variance, which may even grow. A stronger condition is convergence in the mean square, which says

lim_{n→∞} E{|c(n)|²} = constant
Proving convergence in the mean square is equivalent to showing that

lim_{n→∞} J(n) = lim_{n→∞} E{|e(n)|²} = constant

To evaluate the limit, write e(n) as

e(n) = d(n) − d̂(n) = d(n) − w^H(n)x(n)
     = d(n) − w₀^H x(n) − [w^H(n) − w₀^H]x(n)
     = e₀(n) − c^H(n)x(n)

Thus

J(n) = E{|e(n)|²} = E{(e₀(n) − c^H(n)x(n))(e₀*(n) − x^H(n)c(n))}
     = J_min + E{c^H(n)x(n)x^H(n)c(n)}   [cross terms → 0, why?]
     = J_min + J_ex(n)

where J_ex(n) is the excess MSE.
Since J_ex(n) is a scalar,

J_ex(n) = E{c^H(n)x(n)x^H(n)c(n)}
        = E{trace[c^H(n)x(n)x^H(n)c(n)]}
        = E{trace[x(n)x^H(n)c(n)c^H(n)]}
        = trace[E{x(n)x^H(n)c(n)c^H(n)}]

Invoking the independence theorem,

J_ex(n) = trace[E{x(n)x^H(n)}E{c(n)c^H(n)}] = trace[RK(n)]

where K(n) = E{c(n)c^H(n)}.
Thus

J(n) = J_min + J_ex(n) = J_min + trace[RK(n)]

Recall

Q^H R Q = Λ, or R = QΛQ^H

Set

S(n) ≜ Q^H K(n) Q

where S(n) need not be diagonal. Then

K(n) = QQ^H K(n) QQ^H   [since Q⁻¹ = Q^H]
     = QS(n)Q^H
Utilizing R = QΛQ^H and K(n) = QS(n)Q^H in the excess error expression:

J_ex(n) = trace[RK(n)]
        = trace[QΛQ^H QS(n)Q^H]
        = trace[QΛS(n)Q^H]
        = trace[Q^H QΛS(n)]
        = trace[ΛS(n)]

Since Λ is diagonal,

J_ex(n) = trace[ΛS(n)] = Σ_{i=1}^{M} λ_i s_i(n)

where s₁(n), s₂(n), ···, s_M(n) are the diagonal elements of S(n).
The previously derived recursion

E{c(n+1)} = (I − µR)E{c(n)}

can be modified to yield a recursion on S(n):

S(n+1) = (I − µΛ)S(n)(I − µΛ) + µ²J_min Λ

which for the diagonal elements is

s_i(n+1) = (1 − µλ_i)² s_i(n) + µ²J_min λ_i,   i = 1, 2, ···, M

Suppose J_ex(n) converges; then s_i(n+1) = s_i(n)

⇒ s_i(n) = (1 − µλ_i)² s_i(n) + µ²J_min λ_i
⇒ s_i(n) = µ²J_min λ_i/[1 − (1 − µλ_i)²] = µ²J_min λ_i/(2µλ_i − µ²λ_i²) = µJ_min/(2 − µλ_i),   i = 1, 2, ···, M
Consider again

J_ex(n) = trace[ΛS(n)] = Σ_{i=1}^{M} λ_i s_i(n)

Taking the limit and utilizing s_i(n) = µJ_min/(2 − µλ_i),

lim_{n→∞} J_ex(n) = J_min Σ_{i=1}^{M} µλ_i/(2 − µλ_i)

The LMS misadjustment is defined as

MA = lim_{n→∞} J_ex(n)/J_min = Σ_{i=1}^{M} µλ_i/(2 − µλ_i)

Note: A misadjustment of 10% or less is generally considered acceptable.
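For a sense of scale (illustrative numbers, not from the slides): a single-tap filter with µλ₁ = 0.1 has MA = 0.1/(2 − 0.1) ≈ 5.3%, comfortably within the 10% guideline, while µλ₁ = 0.4 gives MA = 0.4/1.6 = 25%, which would normally be rejected.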
Example: First-Order Predictor

Example
This is a one-tap predictor:

x̂(n) = w(n)x(n−1)

Take the underlying process to be a real first-order AR process:

x(n) = −a x(n−1) + v(n)

The weight update is

w(n+1) = w(n) + µx(n−1)e(n)   [LMS update for observation x(n−1)]
       = w(n) + µx(n−1)[x(n) − w(n)x(n−1)]
Since

x(n) = −a x(n−1) + v(n)   [AR model]

and

x̂(n) = w(n)x(n−1)   [one-tap predictor]

⇒ w₀ = −a

Note that

E{x(n−1)e₀(n)} = E{x(n−1)v(n)} = 0

which proves the optimality. Set µ = 0.05 and consider two cases:

  a        σ_x²
−0.99      0.93627
 0.99      0.995

A simulation sketch follows.
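A minimal sketch of this experiment (the driving-noise variance is an assumption, chosen so the process variance matches the listed σ_x²):

```matlab
% One-tap LMS predictor on an AR(1) process x(n) = -a*x(n-1) + v(n)
a = -0.99; mu = 0.05; N = 5000;
sigv2 = (1 - a^2) * 0.93627;   % assumed: noise variance matching sigma_x^2
v = sqrt(sigv2) * randn(N, 1);
x = filter(1, [1 a], v);       % AR(1) synthesis
w = 0;
for n = 2:N
    e = x(n) - w * x(n-1);     % prediction error
    w = w + mu * x(n-1) * e;   % LMS update; E{w(n)} -> w0 = -a
end
```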
Figure: Transient behavior of the adaptive first-order predictor weight w(n) for µ = 0.05.
Figure: Transient behavior of the adaptive first-order predictor squared error for µ = 0.05.
Figure: Mean-squared error learning curves for an adaptive first-order predictor with varying step-size parameter µ.
Consider the expected trajectory of w(n). Recall

w(n+1) = w(n) + µx(n−1)e(n)
       = w(n) + µx(n−1)[x(n) − w(n)x(n−1)]
       = [1 − µx(n−1)x(n−1)]w(n) + µx(n−1)x(n)

In this example, x(n) = −a x(n−1) + v(n). Substituting in:

w(n+1) = [1 − µx(n−1)x(n−1)]w(n) + µx(n−1)[−a x(n−1) + v(n)]
       = [1 − µx(n−1)x(n−1)]w(n) − µa x(n−1)x(n−1) + µx(n−1)v(n)

Taking the expectation and invoking the independence theorem,

E{w(n+1)} = (1 − µσ_x²)E{w(n)} − µσ_x² a
Figure: Comparison of experimental results with theory, based on w(n).
Next, derive a theoretical expression for J(n). Note that the initial value of J(n) is

J(0) = E{(x(0) − w(0)x(−1))²} = E{(x(0))²} = σ_x²

and the final value is

J(∞) = J_min + J_ex
     = E{(x(n) − w₀x(n−1))²} + J_ex
     = E{(v(n))²} + J_ex
     = σ_v² + J_min µλ₁/(2 − µλ₁)

Note λ₁ = σ_x². Thus,

J(∞) = σ_v² + σ_v²[µσ_x²/(2 − µσ_x²)] = σ_v²[1 + µσ_x²/(2 − µσ_x²)]
And if µ is small,

J(∞) = σ_v²[1 + µσ_x²/(2 − µσ_x²)] ≈ σ_v²[1 + µσ_x²/2]

Putting all the components together:

J(n) = [σ_x² − σ_v²(1 + (µ/2)σ_x²)]·(1 − µσ_x²)^{2n} + σ_v²(1 + (µ/2)σ_x²)

where the first bracketed factor is J(0) − J(∞), the term (1 − µσ_x²)^{2n} decays to 0, and the last term is J(∞).

Also, the time constant is

τ = −1/[2 ln(1 − µλ₁)] = −1/[2 ln(1 − µσ_x²)] ≈ 1/(2µσ_x²)
Figure: Comparison of experimental results with theory for the adaptive predictor, based on the mean-square error for µ = 0.001.
Example: Adaptive Equalization

Objective: Pass a known signal through an unknown channel and adapt a filter to invert the effects the channel and noise have on the signal.
The signal is a Bernoulli sequence:

x_n = +1 with probability 1/2, −1 with probability 1/2

The additive noise is ~ N(0, 0.001). The channel has a raised-cosine response:

h_n = (1/2)[1 + cos((2π/W)(n − 2))] for n = 1, 2, 3;  0 otherwise

⇒ W controls the eigenvalue spread χ(R)
⇒ h_n is symmetric about n = 2 and thus introduces a delay of 2
We will use an M = 11 tap filter, which is symmetric about n = 5 ⇒ it introduces a delay of 5.
Thus an overall delay of Δ = 5 + 2 = 7 is added to the system; a simulation sketch follows.
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Channel response and filter response:
Figure: (a) Impulse response of the channel; (b) impulse response of the optimum transversal equalizer.
Consider three W values. Note that the step size is bounded by the W = 3.5 case:

µ ≤ 2/(M r(0)) = 2/(11 · 1.3022) = 0.14

Choose µ = 0.075 in all cases.
Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter µ = 0.075, and varying eigenvalue spread χ(R).
Ensemble-average impulse response of the adaptive equalizer (after 1000 iterations) for each of four different eigenvalue spreads.
Figure: Learning curves of the LMS algorithm for an adaptive equalizer with the number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter µ.
Example: LMS Directionality

Example
Directionality of the LMS algorithm:
- The speed of convergence of the LMS algorithm is faster in certain directions in the weight space
- If the convergence is in the appropriate direction, the convergence can be accelerated by increased eigenvalue spread

To investigate this phenomenon, consider the deterministic signal

x(n) = A₁cos(ω₁n) + A₂cos(ω₂n)

Even though it is deterministic, a correlation matrix can be determined:

R = (1/2)[A₁² + A₂²,  A₁²cos(ω₁) + A₂²cos(ω₂);  A₁²cos(ω₁) + A₂²cos(ω₂),  A₁² + A₂²]
Determining the eigenvalues and eigenvectors yields

λ₁ = (1/2)A₁²(1 + cos(ω₁)) + (1/2)A₂²(1 + cos(ω₂))
λ₂ = (1/2)A₁²(1 − cos(ω₁)) + (1/2)A₂²(1 − cos(ω₂))

and

q₁ = [1; 1],   q₂ = [−1; 1]

Case 1: A₁ = 1, A₂ = 0.5, ω₁ = 1.2, ω₂ = 0.1

x_a(n) = cos(1.2n) + 0.5cos(0.1n) and χ(R) = 2.9

Case 2: A₁ = 1, A₂ = 0.5, ω₁ = 0.6, ω₂ = 0.23

x_b(n) = cos(0.6n) + 0.5cos(0.23n) and χ(R) = 12.9
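A quick numerical check of the two spreads:

```matlab
% Eigenvalue spread chi(R) = lambda1/lambda2 for the sinusoidal input
chi = @(A1, A2, w1, w2) (A1^2*(1+cos(w1)) + A2^2*(1+cos(w2))) / ...
                        (A1^2*(1-cos(w1)) + A2^2*(1-cos(w2)));
chi_a = chi(1, 0.5, 1.2, 0.1);   % case 1: ~2.9
chi_b = chi(1, 0.5, 0.6, 0.23);  % case 2: ~12.9
```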
Since p is undefined, set p = λᵢqᵢ. Then, since p = Rw₀, we see (two cases):

p = λ₁q₁ ⇒ Rw₀ = λ₁q₁ ⇒ w₀ = q₁ = [1; 1]
p = λ₂q₂ ⇒ Rw₀ = λ₂q₂ ⇒ w₀ = q₂ = [−1; 1]

Utilize 200 iterations of the algorithm.
- Consider first the minimum eigenfilter, w₀ = q₂ = [−1; 1]
- Consider second the maximum eigenfilter, w₀ = q₁ = [1; 1]
Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the "slow" eigenvector (i.e., the minimum eigenfilter):
[Left: input x_a(n), χ(R) = 2.9; right: input x_b(n), χ(R) = 12.9]
Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the "fast" eigenvector (i.e., the maximum eigenfilter):
[Left: input x_a(n), χ(R) = 2.9; right: input x_b(n), χ(R) = 12.9]
Normalized LMS Algorithm

Observation: The LMS correction is proportional to µx(n)e*(n):

w(n+1) = w(n) + µx(n)e*(n)

- If x(n) is large, the LMS update suffers from gradient noise amplification
- The normalized LMS (NLMS) algorithm seeks to avoid gradient noise amplification

The step size is made time varying, µ(n), and optimized to minimize the next-step error:

w(n+1) = w(n) + ½µ(n)[−∇J(n)] = w(n) + µ(n)[p − Rw(n)]

Choose µ(n) such that w(n+1) produces the minimum MSE,

J(n+1) = E{|e(n+1)|²}
Let ∇(n) ≜ ∇J(n) and note

e(n+1) = d(n+1) − w^H(n+1)x(n+1)

Objective: Choose µ(n) such that it minimizes J(n+1).
The optimal step size, µ₀(n), will be a function of R and ∇(n) ⇒ use instantaneous estimates of these values.

To determine µ₀(n), expand J(n+1):

J(n+1) = E{e(n+1)e*(n+1)}
       = E{(d(n+1) − w^H(n+1)x(n+1))(d*(n+1) − x^H(n+1)w(n+1))}
       = σ_d² − w^H(n+1)p − p^H w(n+1) + w^H(n+1)Rw(n+1)
Now use the fact that w(n+1) = w(n) − ½µ(n)∇(n):

J(n+1) = σ_d² − [w(n) − ½µ(n)∇(n)]^H p − p^H[w(n) − ½µ(n)∇(n)]
         + [w(n) − ½µ(n)∇(n)]^H R [w(n) − ½µ(n)∇(n)]

where the quadratic term expands as

w^H(n)Rw(n) − ½µ(n)w^H(n)R∇(n) − ½µ(n)∇^H(n)Rw(n) + ¼µ²(n)∇^H(n)R∇(n)
J(n+1) = σ_d² − [w(n) − ½µ(n)∇(n)]^H p − p^H[w(n) − ½µ(n)∇(n)]
         + w^H(n)Rw(n) − ½µ(n)w^H(n)R∇(n) − ½µ(n)∇^H(n)Rw(n)
         + ¼µ²(n)∇^H(n)R∇(n)

Differentiating with respect to µ(n),

∂J(n+1)/∂µ(n) = ½∇^H(n)p + ½p^H∇(n) − ½w^H(n)R∇(n) − ½∇^H(n)Rw(n) + ½µ(n)∇^H(n)R∇(n)   (∗)
Setting (∗) equal to 0:

µ₀(n)∇^H(n)R∇(n) = w^H(n)R∇(n) − p^H∇(n) + ∇^H(n)Rw(n) − ∇^H(n)p

⇒ µ₀(n) = [w^H(n)R∇(n) − p^H∇(n) + ∇^H(n)Rw(n) − ∇^H(n)p]/[∇^H(n)R∇(n)]
        = {[w^H(n)R − p^H]∇(n) + ∇^H(n)[Rw(n) − p]}/[∇^H(n)R∇(n)]
        = {[Rw(n) − p]^H∇(n) + ∇^H(n)[Rw(n) − p]}/[∇^H(n)R∇(n)]
        = [½∇^H(n)∇(n) + ½∇^H(n)∇(n)]/[∇^H(n)R∇(n)]
        = ∇^H(n)∇(n)/[∇^H(n)R∇(n)]
Using the instantaneous estimates

R̂ = x(n)x^H(n) and p̂ = x(n)d*(n)

⇒ ∇(n) = 2[R̂w(n) − p̂]
       = 2[x(n)x^H(n)w(n) − x(n)d*(n)]
       = 2[x(n)(d̂*(n) − d*(n))]
       = −2x(n)e*(n)

Thus

µ₀(n) = ∇^H(n)∇(n)/[∇^H(n)R∇(n)]
      = [4x^H(n)x(n)|e(n)|²]/[4|e(n)|²(x^H(n)x(n))²]
      = 1/(x^H(n)x(n)) = 1/||x(n)||²
Result: The NLMS update is

w(n+1) = w(n) + [µ/||x(n)||²]x(n)e*(n)

where the time-varying step is µ(n) = µ/||x(n)||²; the constant µ is introduced to scale the update.
To avoid problems when ||x(n)||² ≈ 0, we add an offset:

w(n+1) = w(n) + [µ/(a + ||x(n)||²)]x(n)e*(n)

where a > 0. A sketch follows.
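A minimal NLMS sketch mirroring the LMS sketch given earlier (the offset a and all names are illustrative):

```matlab
% Normalized LMS (sketch); converges for 0 < mu < 2, with a > 0 for safety
function [w, e] = nlms_sketch(x, d, M, mu, a)
    N = length(x);
    w = zeros(M, 1);
    e = zeros(N, 1);
    for n = M:N
        xn   = x(n:-1:n-M+1);
        e(n) = d(n) - w' * xn;
        w    = w + (mu / (a + xn' * xn)) * xn * conj(e(n)); % normalized step
    end
end
```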
NLMS Convergence

Objective: Analyze the NLMS convergence:

w(n+1) = w(n) + [µ/||x(n)||²]x(n)e*(n)

Substituting e(n) = d(n) − w^H(n)x(n),

w(n+1) = w(n) + [µ/||x(n)||²]x(n)[d*(n) − x^H(n)w(n)]
       = [I − µx(n)x^H(n)/||x(n)||²]w(n) + µx(n)d*(n)/||x(n)||²
Objective: Compare the NLMS and LMS algorithms.

NLMS:  w(n+1) = [I − µx(n)x^H(n)/||x(n)||²]w(n) + µx(n)d*(n)/||x(n)||²
LMS:   w(n+1) = [I − µx(n)x^H(n)]w(n) + µx(n)d*(n)

By observation, we see the following corresponding terms:

LMS              NLMS
µ                µ
x(n)x^H(n)       x(n)x^H(n)/||x(n)||²
x(n)d*(n)        x(n)d*(n)/||x(n)||²
LMS case:

0 < µ < 2/trace[E{x(n)x^H(n)}] = 2/trace[R]

guarantees stability. By analogy,

0 < µ < 2/trace[E{x(n)x^H(n)/||x(n)||²}]

guarantees stability of the NLMS.
To analyze the bound, make the following approximation:

E{x(n)x^H(n)/||x(n)||²} ≈ E{x(n)x^H(n)}/E{||x(n)||²}

Then

trace[E{x(n)x^H(n)/||x(n)||²}] ≈ trace[E{x(n)x^H(n)}]/E{||x(n)||²}
                               = E{trace[x(n)x^H(n)]}/E{||x(n)||²}
                               = E{trace[x^H(n)x(n)]}/E{||x(n)||²}
                               = E{||x(n)||²}/E{||x(n)||²} = 1
Thus

0 < µ < 2/trace[E{x(n)x^H(n)/||x(n)||²}] = 2

Final Result: The NLMS update

w(n+1) = w(n) + µ[x(n)/||x(n)||²]e*(n)

will converge if 0 < µ < 2.
Note:
- The NLMS has a simpler convergence criterion than the LMS
- The NLMS generally converges faster than the LMS algorithm
ELEG 636 - Statistical Signal Processing
Gonzalo R. Arce
Variants of the LMS algorithm
Department of Electrical and Computer Engineering, University of Delaware
Newark, DE 19716
Fall 2013
Standard LMS Algorithm
FIR filters:

y(n) = w₀(n)u(n) + w₁(n)u(n−1) + ... + w_{M−1}(n)u(n−M+1)
     = Σ_{k=0}^{M−1} w_k(n)u(n−k) = w(n)^T u(n),   n = 0, 1, ..., ∞

Error between the filter output y(n) and a desired signal d(n):

e(n) = d(n) − y(n) = d(n) − w(n)^T u(n)

Update the filter parameters according to

w(n+1) = w(n) + µu(n)e(n)
1. Normalized LMS Algorithm
Modify at time n the parameter vector from w(n) to w(n+1), where λ will result from the constraint

d(n) = Σ_{i=0}^{M−1} w_i(n+1)u(n−i)

In order to add an extra degree of freedom to the adaptation strategy, a constant µ controlling the step size is introduced:

w_j(n+1) = w_j(n) + µ·[1/Σ_{i=0}^{M−1} u(n−i)²]·e(n)u(n−j) = w_j(n) + [µ/||u(n)||²]e(n)u(n−j)
To overcome possible numerical difficulties when ||u(n)|| is close to zero, a constant a > 0 is used:

w_j(n+1) = w_j(n) + [µ/(a + ||u(n)||²)]e(n)u(n−j)

This is the update used in the Normalized LMS algorithm.

The Normalized LMS algorithm converges if

0 < µ < 2
Comparison of LMS and NLMS
- The LMS was run with three different step sizes: µ = [0.075; 0.025; 0.0075].
- The NLMS was run with three different step sizes: µ = [1.0; 0.5; 0.1].
- The larger the step size, the faster the convergence. The smaller the step size, the better the steady-state square error.
- LMS with µ = 0.0075 and NLMS with µ = 0.1 achieved a similar average steady-state square error. However, NLMS was faster.
- LMS with µ = 0.075 and NLMS with µ = 1.0 had a similar convergence speed. However, NLMS achieved a lower steady-state average square error.
- Conclusion: NLMS offers better trade-offs than LMS.
- The computational complexity of NLMS is slightly higher than that of LMS.
2. LMS Algorithm with Time-Variable Adaptation Step

Heuristics: combine the benefits of two different situations:
- The convergence time constant is small for large µ.
- The mean-square error in steady state is low for small µ.

The initial adaptation step µ is kept large, then it is monotonically reduced: µ(n) = 1/(n+c).
Disadvantage for non-stationary data: the algorithm will not react to changes in the optimum solution for large values of n.

Variable Step algorithm:

w(n+1) = w(n) + M(n)u(n)e(n)

where M(n) = diag(µ₀(n), µ₁(n), ..., µ_{M−1}(n)).

Each filter parameter w_i(n) is updated using an independent adaptation step µ_i(n); a sketch of the scalar special case follows.
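A sketch of the scalar decaying-step special case, µ(n) = µ₀/(n+c) (the per-tap matrix version would replace the scalar with a vector of per-coefficient steps):

```matlab
% LMS with a time-variable (decaying) adaptation step (sketch)
function w = vslms_sketch(u, d, M, mu0, c)
    N = length(u);
    w = zeros(M, 1);
    for n = M:N
        un = u(n:-1:n-M+1);
        e  = d(n) - w' * un;
        w  = w + (mu0 / (n + c)) * un * e; % step shrinks as n grows
    end
end
```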
Comparison of LMS and variable-step LMS, µ(n) = µ₀/(0.01n + c), with c = [10; 20; 50].
3. Sign Algorithms

In high-speed communication time is critical, so faster adaptation processes are needed.

sgn(a) = 1 if a > 0;  0 if a = 0;  −1 if a < 0

The Sign algorithm (other names: pilot LMS, or Sign Error):

w(n+1) = w(n) + µu(n)sgn(e(n))

The Clipped LMS (or Signed Regressor):

w(n+1) = w(n) + µ sgn(u(n))e(n)

The Zero-Forcing LMS (or Sign-Sign):

w(n+1) = w(n) + µ sgn(u(n))sgn(e(n))

The Sign algorithm can be derived as an LMS algorithm for minimizing the mean absolute error (MAE) criterion

J(w) = E[|e(n)|] = E[|d(n) − w^T u(n)|]
Properties of Sign Algorithms

- Fast computation: if µ is constrained to the form µ = 2^m, only shifting and addition operations are required.
- Drawback: the update mechanism is degraded, compared to the LMS algorithm, by the crude quantization of the gradient estimates:
  - The steady-state error will increase
  - The convergence rate decreases
- The fastest of them, Sign-Sign, is used in the CCITT ADPCM standard for the 32,000 bps system.
Comparison of LMS and Sign LMS

The Sign LMS algorithm should be operated at smaller step sizes to obtain behavior similar to the standard LMS algorithm.
4. Linear Smoothing of LMS Gradient Estimates

Lowpass filtering the noisy gradient: rename the noisy gradient as

g(n) = ∇_w J = −2e(n)u(n),   i.e.,   g_i(n) = −2e(n)u(n−i)

Passing the signals g_i(n) through lowpass filters will prevent large fluctuations of direction during the adaptation process:

b_i(n) = LPF(g_i(n))

The updating process will use the filtered noisy gradient:

w(n+1) = w(n) − µb(n)

The following versions are well known:

Averaged LMS algorithm: the LPF is the filter with impulse response h(m) = 1/N, m = 0, 1, ..., N−1. Then

w(n+1) = w(n) + (µ/N) Σ_{j=n−N+1}^{n} e(j)u(j)
Momentum LMS algorithm: the LPF is a first-order IIR filter with h(0) = 1−γ, h(1) = γh(0), h(2) = γ²h(0), ...; then

b_i(n) = LPF(g_i(n)) = γb_i(n−1) + (1−γ)g_i(n)
b(n) = γb(n−1) + (1−γ)g(n)

The resulting algorithm can be written as a second-order recursion:

w(n+1) = w(n) − µb(n)
γw(n) = γw(n−1) − γµb(n−1)
w(n+1) − γw(n) = w(n) − γw(n−1) − µb(n) + γµb(n−1)
w(n+1) − γw(n) = w(n) − γw(n−1) − µb(n) + γµb(n−1)
w(n+1) = w(n) + γ(w(n) − w(n−1)) − µ(b(n) − γb(n−1))
w(n+1) = w(n) + γ(w(n) − w(n−1)) − µ(1−γ)g(n)
w(n+1) = w(n) + γ(w(n) − w(n−1)) + 2µ(1−γ)e(n)u(n)

that is,

w(n+1) − w(n) = γ(w(n) − w(n−1)) + 2µ(1−γ)e(n)u(n)

Drawback: the convergence rate may decrease.
Advantage: the momentum term keeps the algorithm active even in regions close to the minimum; a sketch follows.
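A sketch of the momentum recursion in its second-order form (γ and all names illustrative):

```matlab
% Momentum LMS (sketch):
% w(n+1) = w(n) + gamma*(w(n)-w(n-1)) + 2*mu*(1-gamma)*e(n)*u(n)
function w = momentum_lms_sketch(u, d, M, mu, gamma)
    N = length(u);
    w = zeros(M, 1); w_prev = zeros(M, 1);
    for n = M:N
        un = u(n:-1:n-M+1);
        e  = d(n) - w' * un;
        w_next = w + gamma*(w - w_prev) + 2*mu*(1 - gamma)*e*un;
        w_prev = w;
        w = w_next;
    end
end
```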
5. Nonlinear Smoothing of LMS Gradient Estimates

- Impulsive interference in either d(n) or u(n) drastically degrades LMS performance.
- Remedy: smooth the noisy gradient components using a nonlinear filter.

The Median LMS Algorithm
The adaptation equation can be implemented as

w_i(n+1) = w_i(n) + µ·med(e(n)u(n−i), e(n−1)u(n−1−i), ..., e(n−N)u(n−N−i))

- The smoothing effect in an impulsive noise environment is very strong.
- If the environment is not impulsive, the performance of Median LMS is comparable with that of LMS.
- The convergence rate is slower than in LMS.
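A sketch of the median-smoothed update over a sliding window of Nwin gradient samples (the window length and all names are illustrative):

```matlab
% Median LMS (sketch): each tap is updated with the median of recent e*u products
function w = median_lms_sketch(u, d, M, mu, Nwin)
    N = length(u);
    w = zeros(M, 1);
    G = zeros(M, Nwin);              % sliding window of samples e(j)*u(j-i)
    for n = M:N
        un = u(n:-1:n-M+1);
        e  = d(n) - w' * un;
        G  = [e*un, G(:, 1:end-1)];  % newest column first
        w  = w + mu * median(G, 2);  % per-tap median smoothing
    end
end
```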