Estimation Theory Presentation



    Estimation Theory

    Alireza Karimi

Laboratoire d'Automatique, MEC2 397, email: [email protected]

    Spring 2011


    Course Objective

Extract information from noisy signals.

Parameter Estimation Problem : Given a set of measured data

{x[0], x[1], . . . , x[N-1]}

which depends on an unknown parameter vector θ, determine an estimator

θ̂ = g(x[0], x[1], . . . , x[N-1])

where g is some function.

Applications : Image processing, communications, biomedicine, system identification, state estimation in control, etc.


    References

    Main reference :

Fundamentals of Statistical Signal Processing: Estimation Theory

by Steven M. Kay, Prentice-Hall, 1993 (available in the Library de La Fontaine, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and Chapter 9.

Other references :

Lessons in Estimation Theory for Signal Processing, Communications and Control. By Jerry M. Mendel, Prentice-Hall, 1995.

Probability, Random Processes and Estimation Theory for Engineers. By Henry Stark and John W. Woods, Prentice-Hall, 1986.


    Review of Probability and Random Variables


    Random Variables

Random Variable : A rule X(ω) that assigns to every element ω of a sample space Ω a real value is called a RV. So X is not really a variable that varies randomly but a function whose domain is Ω and whose range is some subset of the real line.

    Example : Consider the experiment of throwing a coin twice. The sample

    space (the possible outcomes) is :

Ω = {HH, HT, TH, TT}

    We can define a random variable X such that

    X(HH) = 1, X(HT) = 1.1, X(TH) = 1.6, X(TT) = 1.8

Random variable X assigns to each event (e.g. E = {HT, TH}) a subset of the real line (in this case B = {1.1, 1.6}).


    Probability Distribution Function

For any element ω in Ω, the event {ω | X(ω) ≤ x} is an important event. The probability of this event,

Pr[{ω | X(ω) ≤ x}] = PX(x)

is called the probability distribution function of X.

Example : For the random variable defined earlier, we have :

PX(1.5) = Pr[{ω | X(ω) ≤ 1.5}] = Pr[{HH, HT}] = 0.5

PX(x) can be computed for all x ∈ R. It is clear that 0 ≤ PX(x) ≤ 1.

Remark : For the same experiment (throwing a coin twice) we could define another random variable that would lead to a different PX(x).

In most engineering problems the sample space is a subset of the real line, so X(ω) = ω and PX(x) is a continuous function of x.


    Probability Density Function (PDF)

The Probability Density Function, if it exists, is given by :

pX(x) = dPX(x)/dx

When we deal with a single random variable the subscripts are removed :

p(x) = dP(x)/dx

Properties :

(i) ∫_{-∞}^{∞} p(x) dx = P(∞) - P(-∞) = 1

(ii) Pr[{ω | X(ω) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{-∞}^{x} p(ξ) dξ

(iii) Pr[x1 < X ≤ x2] = ∫_{x1}^{x2} p(x) dx


    Some other common PDF

Chi-square χ²_n : p(x) = [1/(2^{n/2} Γ(n/2))] x^{n/2-1} exp(-x/2) for x > 0, and p(x) = 0 for x < 0

Exponential (λ > 0) : p(x) = (1/λ) exp(-x/λ) u(x)

Rayleigh (σ > 0) : p(x) = (x/σ²) exp(-x²/(2σ²)) u(x)

Uniform (b > a) : p(x) = 1/(b-a) for a < x < b, and 0 otherwise

where Γ(z) = ∫_0^∞ t^{z-1} e^{-t} dt and u(x) is the unit step function.


    Joint, Marginal and Conditional PDF

Joint PDF : Consider two random variables X and Y; then :

Pr[x1 < X ≤ x2 and y1 < Y ≤ y2] = ∫_{x1}^{x2} ∫_{y1}^{y2} p(x, y) dy dx

Marginal PDF :

p(x) = ∫_{-∞}^{∞} p(x, y) dy   and   p(y) = ∫_{-∞}^{∞} p(x, y) dx

Conditional PDF : p(x|y) is defined as the PDF of X conditioned on knowing the value of Y.

Bayes Formula : Consider two RVs defined on the same probability space; then we have :

p(x, y) = p(x|y)p(y) = p(y|x)p(x)   or   p(x|y) = p(x, y)/p(y)


    Independent Random Variables

Two RVs X and Y are independent if and only if :

p(x, y) = p(x)p(y)

A direct conclusion is that :

p(x|y) = p(x, y)/p(y) = p(x)p(y)/p(y) = p(x)   and   p(y|x) = p(y)

which means conditioning does not change the PDF.

Remark : For a joint Gaussian PDF the contours of constant density are ellipses centered at (µx, µy). For independent X and Y the major (or minor) axis is parallel to the x or y axis.


    Expected Value of a Random Variable

The expected value, if it exists, of a random variable X with PDF p(x) is defined by :

E(X) = ∫_{-∞}^{∞} x p(x) dx

Some properties of the expected value :

E{X + Y} = E{X} + E{Y}

E{aX} = aE{X}

The expected value of Y = g(X) can be computed by :

E(Y) = ∫_{-∞}^{∞} g(x) p(x) dx

Conditional expectation : The conditional expectation of X given that a specific value of Y has occurred is :

E(X|Y) = ∫_{-∞}^{∞} x p(x|y) dx


    Moments of a Random Variable

The rth moment of X is defined as :

E(X^r) = ∫_{-∞}^{∞} x^r p(x) dx

The first moment of X is its expected value or the mean (µ = E(X)).

Moments of Gaussian RVs : A Gaussian RV with distribution N(µ, σ²) has moments of all orders in closed form :

E(X) = µ        E(X²) = µ² + σ²

E(X³) = µ³ + 3µσ²

E(X⁴) = µ⁴ + 6µ²σ² + 3σ⁴

E(X⁵) = µ⁵ + 10µ³σ² + 15µσ⁴

E(X⁶) = µ⁶ + 15µ⁴σ² + 45µ²σ⁴ + 15σ⁶
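These closed-form moments are easy to sanity-check numerically. A minimal sketch (not from the slides; numpy and the particular values of µ and σ are assumptions) comparing empirical moments of simulated Gaussian samples with the formulas above :

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 2.0                      # arbitrary example values
    x = mu + sigma * rng.standard_normal(1_000_000)

    # closed-form moments of N(mu, sigma^2) from the slide
    closed_form = {
        2: mu**2 + sigma**2,
        3: mu**3 + 3*mu*sigma**2,
        4: mu**4 + 6*mu**2*sigma**2 + 3*sigma**4,
    }
    for r, target in closed_form.items():
        print(r, np.mean(x**r), target)       # empirical moment vs. closed form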


    Central Moments

The rth central moment of X is defined as :

E[(X - µ)^r] = ∫_{-∞}^{∞} (x - µ)^r p(x) dx

The second central moment (the variance) is denoted by σ² or var(X).

Central Moments of Gaussian RVs :

E[(X - µ)^r] = 0 if r is odd, and E[(X - µ)^r] = σ^r (r - 1)!! if r is even

where n!! denotes the double factorial, that is the product of every odd number from n down to 1.


    Some properties of Gaussian RVs

If X ~ N(µx, σx²) then Z = (X - µx)/σx ~ N(0, 1).

If Z ~ N(0, 1) then X = σx Z + µx ~ N(µx, σx²).

If X ~ N(µx, σx²) then Z = aX + b ~ N(aµx + b, a²σx²).

If X ~ N(µx, σx²) and Y ~ N(µy, σy²) are two independent RVs, then

aX + bY ~ N(aµx + bµy, a²σx² + b²σy²)

The sum of squares of n independent RVs with standard normal distribution N(0, 1) has a χ²_n distribution with n degrees of freedom. For large values of n, χ²_n converges to N(n, 2n).

The Euclidean norm √(X² + Y²) of two independent RVs with standard normal distribution has the Rayleigh distribution.


    Covariance

For two RVs X and Y, the covariance is defined as

σxy = E[(X - µx)(Y - µy)] = ∫∫ (x - µx)(y - µy) p(x, y) dx dy

If X and Y are zero mean then σxy = E{XY}.

var(X + Y) = σx² + σy² + 2σxy

var(aX) = a²σx²

Important formula : The relation between the variance and the mean of X is given by

σ² = E[(X - µ)²] = E(X²) - 2µE(X) + µ² = E(X²) - µ²

The variance is the mean of the square minus the square of the mean.


    Independence, Uncorrelatedness and Orthogonality

If σxy = 0, then X and Y are uncorrelated and

E{XY} = E{X}E{Y}

X and Y are called orthogonal if E{XY} = 0.

If X and Y are independent then they are uncorrelated :

p(x, y) = p(x)p(y)  ⇒  E{XY} = E{X}E{Y}

Uncorrelatedness does not imply independence. For example, if X is a normal RV with zero mean and Y = X², we have p(y|x) ≠ p(y) but

σxy = E{XY} - E{X}E{Y} = E{X³} - 0 = 0

Correlation only captures the linear dependence between two RVs, so it is weaker than independence.

For jointly Gaussian RVs, independence is equivalent to being uncorrelated.


    Random Vectors

Random Vector : is a vector of random variables¹ :

x = [x1, x2, . . . , xn]ᵀ

Expectation Vector : µx = E(x) = [E(x1), E(x2), . . . , E(xn)]ᵀ

Covariance Matrix : Cx = E[(x - µx)(x - µx)ᵀ]

Cx is an n × n symmetric matrix which is assumed to be positive definite and so invertible.

The elements of this matrix are : [Cx]ij = E{[xi - E(xi)][xj - E(xj)]}.

If the random variables are uncorrelated then Cx is a diagonal matrix.

Multivariate Gaussian PDF :

p(x) = [1/√((2π)ⁿ det(Cx))] exp(-(1/2)(x - µx)ᵀ Cx⁻¹ (x - µx))

¹In some books (including our main reference) there is no distinction between the random variable X and its specific value x. From now on we adopt the notation of our reference.


    Random Processes

Discrete Random Process : x[n] is a sequence of random variables defined for every integer n.

Mean value : is defined as E(x[n]) = µx[n].

Autocorrelation Function (ACF) : is defined as

rxx[k, n] = E(x[n] x[n + k])

Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its autocorrelation function (ACF) do not depend on n.

Autocovariance function : is defined as

cxx[k] = E[(x[n] - µx)(x[n + k] - µx)] = rxx[k] - µx²

Cross-correlation Function (CCF) : is defined as

rxy[k] = E(x[n] y[n + k])

Cross-covariance function : is defined as

cxy[k] = E[(x[n] - µx)(y[n + k] - µy)] = rxy[k] - µx µy


    Discrete White Noise

Some properties of the ACF and CCF :

rxx[0] ≥ |rxx[k]|,   rxx[-k] = rxx[k],   rxy[-k] = ryx[k]

Power Spectral Density : The Fourier transform of the ACF and CCF gives the Auto-PSD and Cross-PSD :

Pxx(f) = Σ_{k=-∞}^{∞} rxx[k] exp(-j2πfk)

Pxy(f) = Σ_{k=-∞}^{∞} rxy[k] exp(-j2πfk)

Discrete White Noise : is a discrete random process with zero mean and rxx[k] = σ²δ[k], where δ[k] is the Kronecker impulse function. The PSD of white noise becomes Pxx(f) = σ² and is completely flat with frequency.
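As a quick illustration (not part of the slides; numpy and the chosen σ² and N are assumptions), the sample ACF of simulated white noise is close to σ²δ[k] and its periodogram is roughly flat :

    import numpy as np

    rng = np.random.default_rng(1)
    sigma2, N = 2.0, 4096
    w = np.sqrt(sigma2) * rng.standard_normal(N)

    # sample ACF for a few lags: close to sigma^2 at k = 0 and close to 0 otherwise
    acf = [np.mean(w[:N-k] * w[k:]) for k in range(5)]
    print(np.round(acf, 3))

    # periodogram as a crude PSD estimate: roughly flat at the level sigma^2
    psd = np.abs(np.fft.rfft(w))**2 / N
    print(psd.mean())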


Introduction and Minimum Variance Unbiased Estimation


    The Mathematical Estimation Problem

Parameter Estimation Problem : Given a set of measured data

x = {x[0], x[1], . . . , x[N-1]}

which depends on an unknown parameter vector θ, determine an estimator

θ̂ = g(x[0], x[1], . . . , x[N-1])

where g is some function.

The first step is to find the PDF of the data as a function of θ : p(x; θ).

Example : Consider the problem of a DC level in white Gaussian noise with a single observed data point x[0] = θ + w[0], where w[0] has the PDF N(0, σ²). Then the PDF of x[0] is :

p(x[0]; θ) = [1/√(2πσ²)] exp(-[1/(2σ²)](x[0] - θ)²)


    The Mathematical Estimation Problem

Example : Consider a data sequence that can be modeled with a linear trend in white Gaussian noise

x[n] = A + Bn + w[n],   n = 0, 1, . . . , N-1

Suppose that w[n] ~ N(0, σ²) and is uncorrelated with all the other samples. Letting θ = [A B]ᵀ and x = [x[0], x[1], . . . , x[N-1]]ᵀ, the PDF is :

p(x; θ) = Π_{n=0}^{N-1} p(x[n]; θ) = [1/(2πσ²)^{N/2}] exp(-[1/(2σ²)] Σ_{n=0}^{N-1} (x[n] - A - Bn)²)

The quality of any estimator for this problem is related to the assumptions on the data model; in this example, the linear trend and the WGN PDF assumption.


    The Mathematical Estimation Problem

    Classical versus Bayesian estimation

If we assume θ is deterministic we have a classical estimation problem. The following methods will be studied : MVU, MLE, BLUE, LSE.

If we assume θ is a random variable with a known PDF, then we have a Bayesian estimation problem. In this case the data are described by the joint PDF

p(x, θ) = p(x|θ)p(θ)

where p(θ) summarizes our knowledge about θ before any data is observed and p(x|θ) summarizes the knowledge provided by the data x conditioned on knowing θ. The following methods will be studied : MMSE, MAP, Kalman Filter.


    Assessing Estimator Performance

Consider the problem of estimating a DC level A in uncorrelated noise :

x[n] = A + w[n],   n = 0, 1, . . . , N-1

Consider the following estimators :

Â1 = (1/N) Σ_{n=0}^{N-1} x[n]

Â2 = x[0]

Suppose that A = 1, Â1 = 0.95 and Â2 = 0.98. Which estimator is better ?

An estimator is a random variable, so its performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation).
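A Monte-Carlo experiment of the kind mentioned above can be sketched in a few lines (numpy and the chosen values of A, σ and N are assumptions, not part of the slides) :

    import numpy as np

    rng = np.random.default_rng(2)
    A, sigma, N, trials = 1.0, 1.0, 50, 10_000

    x = A + sigma * rng.standard_normal((trials, N))
    A1 = x.mean(axis=1)    # sample-mean estimator
    A2 = x[:, 0]           # single-sample estimator

    # both are unbiased, but the sample mean has a much smaller variance
    print(A1.mean(), A1.var())    # close to A and to sigma^2 / N
    print(A2.mean(), A2.var())    # close to A and to sigma^2

This anticipates the comparison that follows : the spread of Â1 around A is far smaller than that of Â2.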


    Unbiased Estimators

An estimator that on the average yields the true value is unbiased. Mathematically,

E(θ̂) - θ = 0   for a < θ < b

Let's compute the expectation of the two estimators Â1 and Â2 :

E(Â1) = (1/N) Σ_{n=0}^{N-1} E(x[n]) = (1/N) Σ_{n=0}^{N-1} E(A + w[n]) = (1/N) Σ_{n=0}^{N-1} (A + 0) = A

E(Â2) = E(x[0]) = E(A + w[0]) = A + 0 = A

Both estimators are unbiased. Which one is better ?

Now, let's compute the variance of the two estimators :

var(Â1) = var( (1/N) Σ_{n=0}^{N-1} x[n] ) = (1/N²) Σ_{n=0}^{N-1} var(x[n]) = (1/N²) Nσ² = σ²/N

var(Â2) = var(x[0]) = σ² > var(Â1)


    Unbiased Estimators

Remark : When several unbiased estimators of the same parameter obtained from independent sets of data are available, i.e., θ̂1, θ̂2, . . . , θ̂n, a better estimator can be obtained by averaging :

θ̂ = (1/n) Σ_{i=1}^{n} θ̂i,   E(θ̂) = θ

Assuming that the estimators have the same variance, we have :

var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂i) = (1/n²) n var(θ̂i) = var(θ̂i)/n

By increasing n, the variance will decrease (as n → ∞, θ̂ → θ).

This is not the case for biased estimators, no matter how many estimators are averaged.


    Minimum Variance Criterion

The most logical criterion for estimation is the Mean Square Error (MSE) :

mse(θ̂) = E[(θ̂ - θ)²]

Unfortunately this type of criterion leads to unrealizable estimators (the estimator will depend on θ) :

mse(θ̂) = E{[θ̂ - E(θ̂) + E(θ̂) - θ]²} = E{[θ̂ - E(θ̂) + b(θ)]²}

where b(θ) = E(θ̂) - θ is defined as the bias of the estimator. Therefore :

mse(θ̂) = E{[θ̂ - E(θ̂)]²} + 2b(θ)E[θ̂ - E(θ̂)] + b²(θ) = var(θ̂) + b²(θ)

Instead of minimizing the MSE we can minimize the variance of the unbiased estimators :

Minimum Variance Unbiased Estimator


Minimum Variance Unbiased Estimator

Existence of the MVU Estimator : In general the MVU estimator does not always exist. There may be no unbiased estimator, or none of the unbiased estimators has a uniformly minimum variance.

Finding the MVU Estimator : There is no known procedure which always leads to the MVU estimator. Three existing approaches are :

1 Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it.

2 Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).

3 Restrict to linear unbiased estimators.


    Cramer-Rao Lower Bound


The CRLB is a lower bound on the variance of any unbiased estimator :

var(θ̂) ≥ CRLB(θ)

Note that the CRLB is a function of θ.

It tells us what is the best performance that can be achieved (useful in feasibility studies and in comparison with other estimators).

It may lead us to compute the MVU estimator.


    Theorem (scalar case)

Assume that the PDF p(x; θ) satisfies the regularity condition

E[ ∂ ln p(x; θ)/∂θ ] = 0   for all θ

Then the variance of any unbiased estimator θ̂ satisfies

var(θ̂) ≥ 1 / ( -E[ ∂² ln p(x; θ)/∂θ² ] )

An unbiased estimator that attains the CRLB can be found iff :

∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)

for some functions g(x) and I(θ). The estimator is θ̂ = g(x) and the minimum variance is 1/I(θ).


Example : Consider x[0] = A + w[0] with w[0] ~ N(0, σ²).

p(x[0]; A) = [1/√(2πσ²)] exp(-[1/(2σ²)](x[0] - A)²)

ln p(x[0]; A) = -ln √(2πσ²) - [1/(2σ²)](x[0] - A)²

Then

∂ ln p(x[0]; A)/∂A = (1/σ²)(x[0] - A)

∂² ln p(x[0]; A)/∂A² = -1/σ²

According to the Theorem :

var(Â) ≥ σ²,   I(A) = 1/σ²,   and   Â = g(x[0]) = x[0]


Example : Consider multiple observations of a DC level in WGN :

x[n] = A + w[n],   n = 0, 1, . . . , N-1   with   w[n] ~ N(0, σ²)

p(x; A) = [1/(2πσ²)^{N/2}] exp(-[1/(2σ²)] Σ_{n=0}^{N-1} (x[n] - A)²)

Then

∂ ln p(x; A)/∂A = ∂/∂A { -ln[(2πσ²)^{N/2}] - [1/(2σ²)] Σ_{n=0}^{N-1} (x[n] - A)² }

= (1/σ²) Σ_{n=0}^{N-1} (x[n] - A) = (N/σ²) [ (1/N) Σ_{n=0}^{N-1} x[n] - A ]

According to the Theorem :

var(Â) ≥ σ²/N,   I(A) = N/σ²,   and   Â = g(x) = (1/N) Σ_{n=0}^{N-1} x[n]
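The bound var(Â) ≥ σ²/N is easy to verify empirically for the sample-mean estimator. A short sketch (numpy assumed; the values of A, σ and N are arbitrary) comparing the Monte-Carlo variance with the CRLB :

    import numpy as np

    rng = np.random.default_rng(3)
    A, sigma, N, trials = 3.0, 0.5, 100, 20_000

    x = A + sigma * rng.standard_normal((trials, N))
    A_hat = x.mean(axis=1)        # the efficient estimator g(x)

    print(A_hat.var())            # Monte-Carlo variance of the estimator
    print(sigma**2 / N)           # CRLB = sigma^2 / N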

    Transformation of Parameters


If it is desired to estimate α = g(θ), then the CRLB is :

var(α̂) ≥ (∂g/∂θ)² / ( -E[ ∂² ln p(x; θ)/∂θ² ] )

Example : Compute the CRLB for estimation of the power (A²) of a DC level in noise :

var(Â²) ≥ (σ²/N)(2A)² = 4A²σ²/N

Definition

Efficient estimator : An unbiased estimator that attains the CRLB is said to be efficient.

Example : Knowing that x̄ = (1/N) Σ_{n=0}^{N-1} x[n] is an efficient estimator for A, is x̄² an efficient estimator for A² ?


Solution : Knowing that x̄ ~ N(A, σ²/N), we have :

E(x̄²) = E²(x̄) + var(x̄) = A² + σ²/N ≠ A²

So the estimator Â² = x̄² is not even unbiased.

Let's look at the variance of this estimator :

var(x̄²) = E(x̄⁴) - E²(x̄²)

but we have, from the moments of Gaussian RVs (slide 15) :

E(x̄⁴) = A⁴ + 6A²(σ²/N) + 3(σ²/N)²

Therefore :

var(x̄²) = A⁴ + 6A²σ²/N + 3σ⁴/N² - (A² + σ²/N)² = 4A²σ²/N + 2σ⁴/N²
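A numerical check of this bias and variance computation (numpy assumed; A, σ and N are chosen arbitrarily) :

    import numpy as np

    rng = np.random.default_rng(4)
    A, sigma, N, trials = 2.0, 1.0, 20, 200_000

    xbar = (A + sigma * rng.standard_normal((trials, N))).mean(axis=1)
    est = xbar**2                                    # estimator of A^2

    print(est.mean(), A**2 + sigma**2/N)                      # biased: mean is A^2 + sigma^2/N
    print(est.var(), 4*A**2*sigma**2/N + 2*sigma**4/N**2)     # variance formula from the slide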


Remarks :

The estimator Â² = x̄² is biased and not efficient.

As N → ∞ the bias goes to zero and the variance of the estimator approaches the CRLB. Estimators of this type are called asymptotically efficient.

General Remarks :

If g(θ) = aθ + b is an affine function of θ, then ĝ = g(θ̂) is an efficient estimator. First, it is unbiased : E(aθ̂ + b) = aθ + b = g(θ); moreover :

var(g(θ̂)) ≥ (∂g/∂θ)² var(θ̂) = a² var(θ̂)

but var(g(θ̂)) = var(aθ̂ + b) = a² var(θ̂), so the CRLB is achieved.

If g(θ) is a nonlinear function of θ and θ̂ is an efficient estimator, then g(θ̂) is an asymptotically efficient estimator.


    Theorem (Vector Parameter)

Assume that the PDF p(x; θ) satisfies the regularity condition

E[ ∂ ln p(x; θ)/∂θ ] = 0   for all θ

Then the covariance matrix of any unbiased estimator θ̂ satisfies

C_θ̂ - I⁻¹(θ) ≥ 0

where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the Fisher information matrix and is given by :

[I(θ)]ij = -E[ ∂² ln p(x; θ)/∂θi ∂θj ]

An unbiased estimator that attains the CRLB can be found iff :

∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)


    CRLB Extension to Vector Parameter


Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of θ = [A σ²]ᵀ.

ln p(x; θ) = -(N/2) ln 2π - (N/2) ln σ² - [1/(2σ²)] Σ_{n=0}^{N-1} (x[n] - A)²

The Fisher information matrix is :

I(θ) = -E [ ∂² ln p(x;θ)/∂A²       ∂² ln p(x;θ)/∂A∂σ²  ]   =   [ N/σ²     0        ]
           [ ∂² ln p(x;θ)/∂σ²∂A    ∂² ln p(x;θ)/∂(σ²)² ]       [ 0        N/(2σ⁴)  ]

The matrix is diagonal (just for this example) and can be easily inverted to yield :

var(Â) ≥ σ²/N        var(σ̂²) ≥ 2σ⁴/N

Is there any unbiased estimator that achieves these bounds ?


    Transformation of Parameters


If it is desired to estimate α = g(θ), and the CRLB for the covariance of θ̂ is I⁻¹(θ), then :

C_α̂ - [∂g(θ)/∂θ] I⁻¹(θ) [∂g(θ)/∂θ]ᵀ ≥ 0

Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of the signal-to-noise ratio α = A²/σ².

We have θ = [A σ²]ᵀ and α = g(θ) = θ1²/θ2, so the Jacobian is :

∂g(θ)/∂θ = [ ∂g(θ)/∂θ1   ∂g(θ)/∂θ2 ] = [ 2A/σ²   -A²/σ⁴ ]

So the CRLB is :

var(α̂) ≥ [ 2A/σ²   -A²/σ⁴ ] diag(σ²/N, 2σ⁴/N) [ 2A/σ²   -A²/σ⁴ ]ᵀ = (4α + 2α²)/N


    Linear Models with WGN


If the N observed data samples can be modeled as

x = Hθ + w

where

x = N × 1 observation vector
H = N × p observation matrix (known, rank p)
θ = p × 1 vector of parameters to be estimated
w = N × 1 noise vector with PDF N(0, σ²I)

compute the CRLB and the MVU estimator that achieves this bound.

Step 1 : Compute ln p(x; θ).

Step 2 : Compute I(θ) = -E[ ∂² ln p(x; θ)/∂θ² ] and the covariance matrix of θ̂ : C_θ̂ = I⁻¹(θ).

Step 3 : Find the MVU estimator g(x) by factoring

∂ ln p(x; θ)/∂θ = I(θ)[g(x) - θ]


Step 1 : ln p(x; θ) = -ln[(2πσ²)^{N/2}] - [1/(2σ²)](x - Hθ)ᵀ(x - Hθ).

Step 2 :

∂ ln p(x; θ)/∂θ = -[1/(2σ²)] ∂/∂θ [ xᵀx - 2xᵀHθ + θᵀHᵀHθ ] = (1/σ²)[ Hᵀx - HᵀHθ ]

Then I(θ) = -E[ ∂² ln p(x; θ)/∂θ² ] = (1/σ²) HᵀH

Step 3 : Find the MVU estimator g(x) by factoring

∂ ln p(x; θ)/∂θ = I(θ)[g(x) - θ] = (HᵀH/σ²)[ (HᵀH)⁻¹Hᵀx - θ ]

Therefore :

θ̂ = g(x) = (HᵀH)⁻¹Hᵀx   and   C_θ̂ = I⁻¹(θ) = σ²(HᵀH)⁻¹
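A minimal numerical sketch of this estimator (numpy assumed; H, θ and σ are made-up example values). Solving the normal equations with solve or lstsq is preferable to forming the explicit inverse :

    import numpy as np

    rng = np.random.default_rng(5)
    N, sigma = 200, 0.3
    theta_true = np.array([1.0, -2.0])                  # example parameter vector
    H = np.column_stack([np.ones(N), np.arange(N)])     # e.g. DC level plus linear trend
    x = H @ theta_true + sigma * rng.standard_normal(N)

    # MVU estimator for the linear model with WGN: theta_hat = (H^T H)^{-1} H^T x
    theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
    C_theta = sigma**2 * np.linalg.inv(H.T @ H)         # covariance of the estimate

    print(theta_hat)
    print(np.sqrt(np.diag(C_theta)))                    # standard deviation of each estimate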


For a linear model with WGN represented by x = Hθ + w, the MVU estimator is :

θ̂ = (HᵀH)⁻¹Hᵀx

This estimator is efficient and attains the CRLB.

That the estimator is unbiased can be seen easily from :

E(θ̂) = (HᵀH)⁻¹Hᵀ E(Hθ + w) = θ

The statistical performance of θ̂ is completely specified because θ̂ is a linear transformation of the Gaussian vector x and hence has a Gaussian distribution : θ̂ ~ N(θ, σ²(HᵀH)⁻¹)


    Example (Curve Fitting)


Consider fitting the data x[n] by a p-th order polynomial function of n :

x[n] = θ0 + θ1 n + θ2 n² + · · · + θp n^p + w[n]

We have N data samples, so :

x = [x[0], x[1], . . . , x[N-1]]ᵀ
w = [w[0], w[1], . . . , w[N-1]]ᵀ
θ = [θ0, θ1, . . . , θp]ᵀ

so x = Hθ + w, where H is the N × (p+1) matrix

H = [ 1    0      0       . . .   0
      1    1      1       . . .   1
      1    2      4       . . .   2^p
      .    .      .       . . .   .
      1    N-1   (N-1)²   . . .   (N-1)^p ]

Hence the MVU estimator is : θ̂ = (HᵀH)⁻¹Hᵀx
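A sketch of this Vandermonde-type H and the resulting polynomial fit (numpy assumed; the order p, the coefficients and the noise level are illustrative) :

    import numpy as np

    rng = np.random.default_rng(6)
    N, p = 100, 2
    n = np.arange(N)
    theta_true = np.array([0.5, 0.1, -0.002])           # example coefficients theta_0..theta_p
    x = np.polyval(theta_true[::-1], n) + 0.2 * rng.standard_normal(N)

    # H[n, k] = n^k for k = 0..p, i.e. the matrix on the slide
    H = np.vander(n, p + 1, increasing=True).astype(float)

    theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)   # same solution as (H^T H)^{-1} H^T x
    print(theta_hat)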

    Example (Fourier Analysis)


Consider the Fourier analysis of the data x[n] :

x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]

so we have θ = [a1, a2, . . . , aM, b1, b2, . . . , bM]ᵀ and x = Hθ + w, where :

H = [ h_{a1}, h_{a2}, . . . , h_{aM}, h_{b1}, h_{b2}, . . . , h_{bM} ]

with the column vectors

h_{ak} = [ cos(2πkn/N) ]_{n=0,...,N-1}   and   h_{bk} = [ sin(2πkn/N) ]_{n=0,...,N-1}

Hence the MVU estimate of the Fourier coefficients is : θ̂ = (HᵀH)⁻¹Hᵀx


After simplification (noting that (HᵀH)⁻¹ = (2/N) I), we have :

θ̂ = (2/N) [ (h_{a1})ᵀx, . . . , (h_{aM})ᵀx, (h_{b1})ᵀx, . . . , (h_{bM})ᵀx ]ᵀ

which is the same as the standard solution :

âk = (2/N) Σ_{n=0}^{N-1} x[n] cos(2πkn/N),   b̂k = (2/N) Σ_{n=0}^{N-1} x[n] sin(2πkn/N)

From the properties of linear models the estimates are unbiased.

The covariance matrix is :

C_θ̂ = σ²(HᵀH)⁻¹ = (2σ²/N) I

Note that θ̂ is Gaussian and C_θ̂ is diagonal (the amplitude estimates are independent).


    Example (System Identification)


Consider identification of a Finite Impulse Response (FIR) model h[k], k = 0, 1, . . . , p-1, with input u[n] and output x[n] provided for n = 0, 1, . . . , N-1 :

x[n] = Σ_{k=0}^{p-1} h[k] u[n-k] + w[n],   n = 0, 1, . . . , N-1

The FIR model can be represented by the linear model x = Hθ + w, where

θ = [ h[0], h[1], . . . , h[p-1] ]ᵀ   (p × 1)

H = [ u[0]      0         . . .   0
      u[1]      u[0]      . . .   0
      .         .         . . .   .
      u[N-1]    u[N-2]    . . .   u[N-p] ]   (N × p)

The MVU estimate is θ̂ = (HᵀH)⁻¹Hᵀx with C_θ̂ = σ²(HᵀH)⁻¹.
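A small identification sketch along these lines (numpy assumed; the impulse response, input and noise level are invented for illustration) :

    import numpy as np

    rng = np.random.default_rng(8)
    N, p = 500, 4
    h_true = np.array([1.0, 0.5, -0.3, 0.1])            # example FIR coefficients
    u = rng.standard_normal(N)                           # known input
    x = np.convolve(u, h_true)[:N] + 0.05 * rng.standard_normal(N)

    # build the N x p convolution matrix with H[n, k] = u[n-k] (zero for n < k)
    H = np.zeros((N, p))
    for k in range(p):
        H[k:, k] = u[:N - k]

    h_hat = np.linalg.solve(H.T @ H, H.T @ x)            # MVU / least-squares estimate
    print(np.round(h_hat, 3))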


    Linear Models with Colored Gaussian Noise


Determine the MVU estimator for the linear model x = Hθ + w with w a colored Gaussian noise, w ~ N(0, C).

Whitening approach : Since C is positive definite, its inverse can be factored as C⁻¹ = DᵀD, where D is an invertible matrix. This matrix acts as a whitening transformation for w :

E[(Dw)(Dw)ᵀ] = E(DwwᵀDᵀ) = DCDᵀ = D D⁻¹ D⁻ᵀ Dᵀ = I

Now if we transform the linear model x = Hθ + w to

x' = Dx = DHθ + Dw = H'θ + w'

where w' = Dw ~ N(0, I) is white, we can compute the MVU estimator as :

θ̂ = (H'ᵀH')⁻¹H'ᵀx' = (HᵀDᵀDH)⁻¹HᵀDᵀDx

so, we have :

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x   with   C_θ̂ = (H'ᵀH')⁻¹ = (HᵀC⁻¹H)⁻¹
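A sketch of this prewhitening computation (numpy assumed; the colored-noise covariance used here is an arbitrary example) :

    import numpy as np

    rng = np.random.default_rng(9)
    N = 300
    H = np.column_stack([np.ones(N), np.arange(N) / N])
    theta_true = np.array([1.0, 2.0])

    # example colored-noise covariance: C[i, j] = 0.9 ** |i - j|
    idx = np.arange(N)
    C = 0.9 ** np.abs(idx[:, None] - idx[None, :])
    w = np.linalg.cholesky(C) @ rng.standard_normal(N)   # noise with covariance C
    x = H @ theta_true + w

    # MVU estimator for colored noise: theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x
    Ci_H = np.linalg.solve(C, H)
    Ci_x = np.linalg.solve(C, x)
    theta_hat = np.linalg.solve(H.T @ Ci_H, H.T @ Ci_x)
    print(theta_hat)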


    Linear Models with known components


Consider a linear model x = Hθ + s + w, where s is a known signal. To determine the MVU estimator let x' = x - s, so that x' = Hθ + w is a standard linear model. The MVU estimator is :

θ̂ = (HᵀH)⁻¹Hᵀ(x - s)   with   C_θ̂ = σ²(HᵀH)⁻¹

Example : Consider a DC level and an exponential in WGN : x[n] = A + r^n + w[n], where r is known. Then we have :

[x[0], x[1], . . . , x[N-1]]ᵀ = [1, 1, . . . , 1]ᵀ A + [1, r, . . . , r^{N-1}]ᵀ + [w[0], w[1], . . . , w[N-1]]ᵀ

so H = [1, 1, . . . , 1]ᵀ and s = [1, r, . . . , r^{N-1}]ᵀ. The MVU estimator is :

Â = (HᵀH)⁻¹Hᵀ(x - s) = (1/N) Σ_{n=0}^{N-1} (x[n] - r^n)   with   var(Â) = σ²/N


    Best Linear Unbiased Estimators (BLUE)


Problems with finding the MVU estimator :

The MVU estimator does not always exist or may be impossible to find.

The PDF of the data may be unknown.

BLUE is a suboptimal estimator that :

restricts the estimate to be linear in the data : θ̂ = Ax

restricts the estimate to be unbiased : E(θ̂) = A E(x) = θ

minimizes the variance of the estimate ;

needs only the mean and the variance of the data (not the PDF). As a result, in general, the PDF of the estimate cannot be computed.

Remark : The unbiasedness restriction implies a linear model for the data. However, BLUE may still be used if the data are transformed suitably or the model is linearized.


    Finding the BLUE (Scalar Case)


1 Choose a linear estimator for the observed data x[n], n = 0, 1, . . . , N-1 :

θ̂ = Σ_{n=0}^{N-1} an x[n] = aᵀx   where a = [a0, a1, . . . , a_{N-1}]ᵀ

2 Restrict the estimate to be unbiased :

E(θ̂) = Σ_{n=0}^{N-1} an E(x[n]) = θ

3 Minimize the variance

var(θ̂) = E{[θ̂ - E(θ̂)]²} = E{[aᵀx - aᵀE(x)]²} = E{aᵀ[x - E(x)][x - E(x)]ᵀa} = aᵀCa


Consider the problem of amplitude estimation of a known signal in noise :

x[n] = θ s[n] + w[n]

1 Choose a linear estimator : θ̂ = Σ_{n=0}^{N-1} an x[n] = aᵀx

2 Restrict the estimate to be unbiased : E(θ̂) = aᵀE(x) = θ aᵀs = θ

so aᵀs = 1, where s = [s[0], s[1], . . . , s[N-1]]ᵀ

3 Minimize aᵀCa subject to aᵀs = 1.

The constrained optimization can be solved using Lagrange multipliers :

Minimize J = aᵀCa + λ(aᵀs - 1)

The optimal solution is :

θ̂ = sᵀC⁻¹x / (sᵀC⁻¹s)   and   var(θ̂) = 1 / (sᵀC⁻¹s)

    Finding the BLUE (Vector Case)


Theorem (Gauss-Markov)

If the data are of the general linear model form

x = Hθ + w

where w is a noise vector with zero mean and covariance C (the PDF of w is arbitrary), then the BLUE of θ is :

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

and the covariance matrix of θ̂ is

C_θ̂ = (HᵀC⁻¹H)⁻¹

Remark : If the noise is Gaussian then the BLUE is the MVU estimator.


Example : Consider the problem of a DC level in white noise : x[n] = A + w[n], where w[n] has an unspecified PDF with var(w[n]) = σn².

We have θ = A and H = 1 = [1, 1, . . . , 1]ᵀ. The covariance matrix is :

C = diag(σ0², σ1², . . . , σ_{N-1}²)   and   C⁻¹ = diag(1/σ0², 1/σ1², . . . , 1/σ_{N-1}²)

and hence the BLUE is :

Â = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x = [ Σ_{n=0}^{N-1} 1/σn² ]⁻¹ Σ_{n=0}^{N-1} x[n]/σn²

and the minimum covariance is :

var(Â) = (HᵀC⁻¹H)⁻¹ = [ Σ_{n=0}^{N-1} 1/σn² ]⁻¹


    Maximum Likelihood Estimation


Problems : The MVU estimator often does not exist or cannot be found. The BLUE is restricted to linear models.

Maximum Likelihood Estimator (MLE) :

can always be applied if the PDF is known ;

is optimal for large data sizes ;

is computationally complex and may require numerical methods.

Basic Idea : Choose the parameter value that makes the observed data the most likely data to have been observed.

Likelihood Function : is the PDF p(x; θ) when θ is regarded as a variable (not a parameter).

ML Estimate : is the value of θ that maximizes the likelihood function.

Procedure : Form the log-likelihood function ln p(x; θ) ; differentiate with respect to θ ; set the derivative to zero and solve for θ̂.


Example : Consider a DC level in WGN with unknown variance, x[n] = A + w[n]. Suppose that A > 0 and σ² = A. The PDF is :

p(x; A) = [1/(2πA)^{N/2}] exp(-[1/(2A)] Σ_{n=0}^{N-1} (x[n] - A)²)

Taking the derivative of the log-likelihood function, we have :

∂ ln p(x; A)/∂A = -N/(2A) + (1/A) Σ_{n=0}^{N-1} (x[n] - A) + [1/(2A²)] Σ_{n=0}^{N-1} (x[n] - A)²

What is the CRLB ? Does an MVU estimator exist ?

The MLE can be found by setting the above equation to zero :

Â = -1/2 + √( (1/N) Σ_{n=0}^{N-1} x²[n] + 1/4 )
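A quick simulation of this estimator (numpy assumed; the value of A and the sample size are arbitrary) :

    import numpy as np

    rng = np.random.default_rng(10)
    A, N, trials = 2.0, 1000, 5000

    x = A + np.sqrt(A) * rng.standard_normal((trials, N))    # WGN with variance A
    A_mle = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)     # the MLE derived above

    print(A_mle.mean())   # close to A for large N (asymptotically unbiased)
    print(A_mle.var())    # approaches the CRLB as N grows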


Properties : The MLE may be biased and is not necessarily an efficient estimator. However :

The MLE is asymptotically unbiased, meaning that lim_{N→∞} E(θ̂) = θ.

The MLE asymptotically attains the CRLB.

The MLE asymptotically has a Gaussian PDF : θ̂ ~ N(θ, I⁻¹(θ)).

If an MVU estimator exists, then the ML procedure will find it.

Example : Consider a DC level in WGN with known variance σ². Then

p(x; A) = [1/(2πσ²)^{N/2}] exp(-[1/(2σ²)] Σ_{n=0}^{N-1} (x[n] - A)²)

∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A) = 0   ⇒   Σ_{n=0}^{N-1} x[n] - NA = 0

which leads to Â = x̄ = (1/N) Σ_{n=0}^{N-1} x[n]


MLE for Transformed Parameters (Invariance Property)

Find the MLE of α = g(θ) when the PDF p(x; θ) is given :


1 If α = g(θ) is a one-to-one function, then

α̂ = arg max_α p(x; g⁻¹(α)) = g(θ̂)

Example : Consider a DC level in WGN and find the MLE of α = exp(A). Since g(θ) is a one-to-one function, α̂ = exp(x̄).

2 If α = g(θ) is not a one-to-one function, then

pT(x; α) = max_{θ : α = g(θ)} p(x; θ)   and   α̂ = arg max_α pT(x; α)

Example : Consider a DC level in WGN and find the MLE of α = A². Since g(θ) is not a one-to-one function :

α̂ = arg max_{α ≥ 0} max{ p(x; √α), p(x; -√α) } = ( arg max_A p(x; A) )² = x̄²


Example : DC level in WGN with unknown variance. The vector parameter θ = [A σ²]ᵀ should be estimated. We have :

∂ ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A)

∂ ln p(x; θ)/∂σ² = -N/(2σ²) + [1/(2σ⁴)] Σ_{n=0}^{N-1} (x[n] - A)²

which leads to the following MLE :

Â = x̄,   σ̂² = (1/N) Σ_{n=0}^{N-1} (x[n] - x̄)²

Asymptotic properties of the MLE : Under some regularity conditions, the MLE is asymptotically normally distributed, θ̂ ~ N(θ, I⁻¹(θ)), even if the PDF of x is not Gaussian.


    MLE for General Gaussian Case


Consider the general Gaussian case where x ~ N(µ(θ), C(θ)). The partial derivative of the log-PDF is :

∂ ln p(x; θ)/∂θk = -(1/2) tr[ C⁻¹(θ) ∂C(θ)/∂θk ]
                   + [ ∂µ(θ)/∂θk ]ᵀ C⁻¹(θ) (x - µ(θ))
                   - (1/2) (x - µ(θ))ᵀ [ ∂C⁻¹(θ)/∂θk ] (x - µ(θ))

for k = 1, . . . , p.

By setting the above equations equal to zero, the MLE can be found.

A particular case is when C is known (the first and third terms become zero).

In addition, if µ(θ) is linear in θ, the general linear model is obtained.


    MLE for General Linear Models


Consider the general linear model x = Hθ + w, where w is a noise vector with PDF N(0, C) :

p(x; θ) = [1/√((2π)^N det(C))] exp(-(1/2)(x - Hθ)ᵀC⁻¹(x - Hθ))

Taking the derivative of ln p(x; θ) leads to :

∂ ln p(x; θ)/∂θ = HᵀC⁻¹(x - Hθ)

Then

HᵀC⁻¹(x - Hθ) = 0   ⇒   θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

which is the same as the MVU estimator. The PDF of θ̂ is :

θ̂ ~ N(θ, (HᵀC⁻¹H)⁻¹)

    MLE (Numerical Method)


Newton-Raphson : A closed-form estimator cannot always be computed by maximizing the likelihood function. However, the maximum can be computed by numerical methods like the iterative Newton-Raphson algorithm :

θ_{k+1} = θ_k - [ ∂² ln p(x; θ)/∂θ∂θᵀ ]⁻¹ [ ∂ ln p(x; θ)/∂θ ]   evaluated at θ = θ_k

Remarks :

The Hessian can be replaced by the negative of its expectation, the Fisher information matrix I(θ).

This method suffers from convergence problems (local maxima). Typically, for large data lengths, the log-likelihood function becomes more quadratic and the algorithm will produce the MLE.


    Least Squares Estimation
