
ECE Lecture Notes: Matrix Computations for

Signal Processing

James P. Reilly, © Department of Electrical and Computer Engineering

McMaster University

October 17, 2005

2 Lecture 2

This lecture discusses eigenvalues and eigenvectors in the context of the Karhunen-Loeve (KL) expansion of a random process. First, we discuss the fundamentals of eigenvalues and eigenvectors, then go on to covariance matrices. These two topics are then combined into the K-L expansion. An example from the field of array signal processing is given as an application of the algebraic ideas.

A major aim of this presentation is an attempt to de-mystify the concepts of eigenvalues and eigenvectors by showing a very important application in the field of signal processing.

2.1 Eigenvalues and Eigenvectors

Suppose we have a matrix A:

A = \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix}    (1)

We investigate its eigenvalues and eigenvectors.

Figure 1: Matrix-vector multiplication for various vectors. (The plot shows x and the products Ax_1, Ax_2, and Ax_3: Ax_1 is rotated counter-clockwise, Ax_2 is rotated clockwise, and Ax_3 is not rotated.)

Suppose we take the product Ax1, where x1 = [1, 0]T , as shown in Fig. 1.

Then,

Ax_1 = \begin{bmatrix} 4 \\ 1 \end{bmatrix}.    (2)

By comparing the vectors x_1 and Ax_1 we see that the product vector is scaled and rotated counter-clockwise with respect to x_1.

Now consider the case where x_2 = [0, 1]^T. Then Ax_2 = [1, 4]^T. Here, we note a clockwise rotation of Ax_2 with respect to x_2.

Now let's consider a more interesting case. Suppose x_3 = [1, 1]^T. Then Ax_3 = [5, 5]^T. Now the product vector points in the same direction as x_3. The vector Ax_3 is a scaled version of the vector x_3. Because of this property, x_3 = [1, 1]^T is an eigenvector of A. The scale factor (which in this case is 5) is given the symbol λ and is referred to as an eigenvalue.

Note that x = [1, −1]^T is also an eigenvector, because in this case, Ax = [3, −3]^T = 3x. The corresponding eigenvalue is 3.

Thus we have, if x is an eigenvector of A ∈ ℜn×n,

Ax = λx    (3)

where the scalar multiple λ is referred to as the eigenvalue;


i.e., the vector Ax is in the same direction as x but scaled by a factor λ.

Now that we have an understanding of the fundamental idea of an eigenvector, we proceed to develop the idea further. Eq. (3) may be written in the form

(A − λI)x = 0    (4)

where I is the n × n identity matrix. Eq. (4) is a homogeneous system of equations, and from fundamental linear algebra, we know that a nontrivial solution to (4) exists if and only if

det(A − λI) = 0    (5)

where det(·) denotes determinant. Eq. (5), when evaluated, becomes a polynomial in λ of degree n. For example, for the matrix A above we have

det\left[ \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix} − λ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right] = 0

det \begin{bmatrix} 4 − λ & 1 \\ 1 & 4 − λ \end{bmatrix} = (4 − λ)^2 − 1 = λ^2 − 8λ + 15 = 0.    (6)

It is easily verified that the roots of this polynomial are (5, 3), which correspond to the eigenvalues indicated above.

Eq. (5) is referred to as the characteristic equation of A, and the corresponding polynomial is the characteristic polynomial. The characteristic polynomial is of degree n.

More generally, if A is n × n, then there are n solutions of (5), or n roots of the characteristic polynomial. Thus there are n eigenvalues of A satisfying (3); i.e.,

A x_i = λ_i x_i,    i = 1, . . . , n.    (7)

If the eigenvalues are all distinct, there are n associated linearly independent eigenvectors, whose directions are unique, which span an n-dimensional Euclidean space.

Repeated Eigenvalues: In the case where there are, e.g., r repeated eigenvalues, a linearly independent set of n eigenvectors still exists, provided the matrix (A − λI) in (5), evaluated at the repeated eigenvalue, has rank n − r. In that case, the directions of the r eigenvectors associated with the repeated eigenvalue are not unique.


In fact, consider a set of r linearly independent eigenvectors v_1, . . . , v_r associated with the r repeated eigenvalues. Then, it may be shown that any vector in span[v_1, . . . , v_r] is also an eigenvector. This emphasizes the fact that the eigenvectors are not unique in this case.

Example 1: Consider the matrix given by

\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}

It may be easily verified that any vector in span[e_2, e_3] is an eigenvector associated with the repeated zero eigenvalue.

Example 2: Consider the n × n identity matrix. It has n repeated eigenvalues equal to one. In this case, any n-dimensional vector is an eigenvector, and the eigenvectors span an n-dimensional space.

—————–

Eq. (5) gives us a clue how to compute eigenvalues. We can formulate the characteristic polynomial and evaluate its roots to give the λ_i. Once the eigenvalues are available, it is possible to compute the corresponding eigenvectors v_i by evaluating the nullspace of the quantity A − λ_iI, for i = 1, . . . , n. This approach is adequate for small systems, but for those of appreciable size it is prone to significant numerical error. Later, we consider various orthogonal transformations which lead to much more effective techniques for finding the eigenvalues.
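A quick numerical check of the 2 × 2 example above, sketched in Python with NumPy (an assumption; it is not required by the notes): the roots of the characteristic polynomial and the direct eigendecomposition should both give the eigenvalues 5 and 3.

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 4.0]])

# Characteristic polynomial of a 2x2 matrix: lambda^2 - tr(A)*lambda + det(A)
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
print("roots of characteristic polynomial:", np.roots(coeffs))   # approx [5, 3]

# Direct computation (preferred for matrices of appreciable size).
# eigh is for symmetric (Hermitian) matrices; eigenvalues are returned in ascending order.
lam, V = np.linalg.eigh(A)
print("eigenvalues:", lam)           # [3, 5]
print("eigenvectors (columns):")
print(V)                             # proportional to [1, -1] and [1, 1]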

We now present some very interesting properties of eigenvalues and eigenvectors, to aid in our understanding.

Property 1 If the eigenvalues of a (Hermitian) symmetric matrix [1] are distinct, then the eigenvectors are orthogonal.

[1] A symmetric matrix is one where A = A^T, where the superscript T means transpose; i.e., for a symmetric matrix, an element a_ij = a_ji. A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A = A^H, where the superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element a_ij = a*_ji. In this course we will generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric.


Proof. Let {v_i} and {λ_i}, i = 1, . . . , n, be the eigenvectors and corresponding eigenvalues respectively of A ∈ ℜ^{n×n}. Choose any i, j ∈ [1, . . . , n], i ≠ j. Then

A v_i = λ_i v_i    (8)

and

A v_j = λ_j v_j.    (9)

Premultiply (8) by v_j^T and (9) by v_i^T:

v_j^T A v_i = λ_i v_j^T v_i    (10)

v_i^T A v_j = λ_j v_i^T v_j    (11)

The quantities on the left are equal when A is symmetric. We show this as follows. Since the left-hand side of (10) is a scalar, its transpose is equal to itself. Therefore, we get v_j^T A v_i = v_i^T A^T v_j. [2] But, since A is symmetric, A^T = A. Thus, v_j^T A v_i = v_i^T A^T v_j = v_i^T A v_j, which was to be shown.

Subtracting (11) from (10), we have

(λ_i − λ_j) v_j^T v_i = 0    (12)

where we have used the fact that v_j^T v_i = v_i^T v_j. But by hypothesis, λ_i − λ_j ≠ 0. Therefore, (12) is satisfied only if v_j^T v_i = 0, which means the vectors are orthogonal.

Here we have considered only the case where the eigenvalues are distinct. If an eigenvalue λ̃ is repeated r times, and rank(A − λ̃I) = n − r, then a mutually orthogonal set of n eigenvectors can still be found.

Another useful property of eigenvalues of symmetric matrices is as follows:

Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.

[2] Here, we have used the property that for matrices or vectors A and B of conformable size, (AB)^T = B^T A^T.


Proof [3] (by contradiction): First, we consider the case where A is real. Let λ be a non-zero complex eigenvalue of a symmetric matrix A. Then, since the elements of A are real, λ*, the complex conjugate of λ, must also be an eigenvalue of A, because the roots of the characteristic polynomial must occur in complex-conjugate pairs. Also, if v is a nonzero eigenvector corresponding to λ, then an eigenvector corresponding to λ* must be v*, the complex conjugate of v. But Property 1 requires that the eigenvectors be orthogonal; therefore, v^T v* = 0. But v^T v* = (v^H v)*, which is by definition the complex conjugate of the squared norm of v. The norm of a vector is a pure real number; hence, v^T v* must be greater than zero, since v is by hypothesis nonzero. We therefore have a contradiction. It follows that the eigenvalues of a symmetric matrix cannot be complex; i.e., they are real.

While this proof considers only the real symmetric case, it is easily extended to the case where A is Hermitian symmetric.

Property 3 Let A be a matrix with eigenvalues λ_i, i = 1, . . . , n, and eigenvectors v_i. Then the eigenvalues of the matrix A + sI are λ_i + s, with corresponding eigenvectors v_i, where s is any real number.

Proof: From the definition of an eigenvector, we have Av = λv. Further, we have sIv = sv. Adding, we have (A + sI)v = (λ + s)v. This new eigenvector relation on the matrix (A + sI) shows the eigenvectors are unchanged, while the eigenvalues are displaced by s.

Property 4 Let A be an n × n matrix with eigenvalues λ_i, i = 1, . . . , n. Then

• The determinant det(A) = \prod_{i=1}^{n} λ_i.

• The trace [4] tr(A) = \sum_{i=1}^{n} λ_i.

[3] From Lastman and Sinha, Microcomputer-based Numerical Methods for Science and Engineering.

The proof is straightforward, but because it is easier using concepts presented later in the course, it is not given here.
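Both facts are easy to check numerically; a minimal sketch, assuming NumPy, using a random symmetric test matrix:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T                      # make a symmetric test matrix

lam = np.linalg.eigvalsh(A)      # eigenvalues of the symmetric matrix

print(np.prod(lam), np.linalg.det(A))   # product of eigenvalues matches det(A)
print(np.sum(lam), np.trace(A))         # sum of eigenvalues matches tr(A)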

Property 5 If v is an eigenvector of a matrix A, then cv is also an eigenvector, where c is any real or complex constant.

The proof follows directly by substituting cv for v in Av = λv. This means that only the direction of an eigenvector can be unique; its norm is not unique.

2.1.1 Orthonormal Matrices

Before proceeding with the eigendecomposition of a matrix, we must develop the concept of an orthonormal matrix. This form of matrix has mutually orthogonal columns, each of unit norm. This implies that

q_i^T q_j = δ_ij,    (13)

where δ_ij is the Kronecker delta, and q_i and q_j are columns of the orthonormal matrix Q. With (13) in mind, we now consider the product Q^T Q. The result may be visualized with the aid of the diagram below:

Q^T Q = \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_N^T \end{bmatrix} \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix} = I.    (14)

(When i = j, the quantity q_i^T q_i defines the squared 2-norm of q_i, which has been defined as unity. When i ≠ j, q_i^T q_j = 0, due to the orthogonality of the q_i.) Eq. (14) is a fundamental property of an orthonormal matrix.

[4] The trace, denoted tr(·), of a square matrix is the sum of its elements on the main diagonal (also called the “diagonal” elements).


Thus, for an orthonormal matrix, (14) implies the inverse may be computed simply by taking the transpose of the matrix, an operation which requires almost no computational effort.

Eq. (14) follows directly from the fact that Q has orthonormal columns. It is not so clear that the quantity QQ^T should also equal the identity. We can resolve this question in the following way. Suppose that A and B are any two square invertible matrices such that AB = I. Then, BAB = B. By parsing this last expression, we have

(BA) · B = B.    (15)

Clearly, if (15) is to hold, then the quantity BA must be the identity; [5] hence, if AB = I, then BA = I. Therefore, if Q^T Q = I, then also QQ^T = I. From this fact, it follows that if a matrix has orthonormal columns, then it also must have orthonormal rows. We now develop a further useful property of orthonormal matrices:

Property 6 The vector 2-norm is invariant under an orthonormal transformation.

If Q is orthonormal, then

||Qx||_2^2 = x^T Q^T Q x = x^T x = ||x||_2^2.

Thus, because the norm does not change, an orthonormal transformation performs a rotation operation on a vector. We use this norm-invariance property later in our study of the least-squares problem.
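A short numerical check of this invariance, assuming NumPy; the orthonormal Q here is simply taken from a QR factorization of a random matrix:

import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # Q has orthonormal columns
x = rng.standard_normal(5)

print(np.linalg.norm(Q @ x), np.linalg.norm(x))   # identical up to round-off
print(np.allclose(Q.T @ Q, np.eye(5)))            # Q^T Q = I, eq. (14)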

Suppose we have a matrix U ∈ ℜ^{m×n}, where m > n, whose columns are orthonormal. We see in this case that U is a tall matrix, which can be formed by extracting only the first n columns of an arbitrary orthonormal matrix. (We reserve the term orthonormal matrix to refer to a complete m × m matrix.) Because U has orthonormal columns, it follows that the quantity U^T U = I_{n×n}. However, it is important to realize that the quantity UU^T ≠ I_{m×m} in this case, in contrast to the situation when m = n. The latter relation follows from the fact that the m column vectors of U^T, of length n, n < m, cannot all be mutually orthogonal. In fact, we see later that UU^T is a projector onto the subspace R(U).

[5] This only holds if A and B are square invertible.

Suppose we have a vector b ∈ ℜ^m. Because it is easiest, by convention we represent b using the basis [e_1, . . . , e_m], where the e_i are the elementary vectors (all zeros except for a one in the ith position). However, it is often convenient to represent b in a basis formed from the columns of an orthonormal matrix Q. In this case, the elements of the vector c = Q^T b are the coefficients of b in the basis Q. The orthonormal basis is convenient because we can restore b from c simply by taking b = Qc.

An orthonormal matrix is sometimes referred to as a unitary matrix (strictly, the term unitary is used for the complex case, where Q^H Q = I). Note that the determinant of an orthonormal matrix is ±1.

2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix

Almost all matrices on which EDs are performed (at least in signal processing) are symmetric. A good example is the covariance matrix, which is discussed in some detail in the next section.

Let A ∈ ℜ^{n×n} be symmetric. Then, for eigenvalues λ_i and eigenvectors v_i, we have

A v_i = λ_i v_i,    i = 1, . . . , n.    (16)

Let the eigenvectors be normalized to unit 2-norm. Then these n equations can be combined, or stacked side-by-side together, and represented in the following compact form:

AV = VΛ (17)

where V = [v1,v2, . . . ,vn] (i.e., each column of V is an eigenvector), and

Λ = \begin{bmatrix} λ_1 & & & 0 \\ & λ_2 & & \\ & & \ddots & \\ 0 & & & λ_n \end{bmatrix} = diag(λ_1, . . . , λ_n).    (18)

Corresponding columns from each side of (17) represent one specific value of the index i in (16). Because we have assumed A is symmetric, from Property 1 the v_i are orthogonal. Furthermore, since we have assumed ||v_i||_2 = 1, V is an orthonormal matrix. Thus, post-multiplying both sides of (17) by V^T, and using VV^T = I, we get

A = VΛV^T.    (19)

Eq. (19) is called the eigendecomposition (ED) of A. The columns of V are eigenvectors of A, and the diagonal elements of Λ are the corresponding eigenvalues. Any symmetric matrix may be decomposed in this way. This form of decomposition, with Λ being diagonal, is of extreme interest and has many interesting consequences. It is this decomposition which leads directly to the Karhunen-Loeve expansion which we discuss shortly.

Note that from (19), knowledge of the eigenvalues and eigenvectors of A is sufficient to completely specify A. Note further that if the eigenvalues are distinct, then the ED is unique (up to the ordering of the eigenvalues and the sign of each eigenvector): there is only one orthonormal V and one diagonal Λ which satisfy (19).
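The decomposition (19) and the diagonalization property are easy to verify numerically; a minimal sketch assuming NumPy:

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T                              # symmetric test matrix

lam, V = np.linalg.eigh(A)               # A = V Lambda V^T
Lam = np.diag(lam)

print(np.allclose(A, V @ Lam @ V.T))     # reconstruction, eq. (19)
print(np.allclose(V.T @ A @ V, Lam))     # V diagonalizes A, eq. (20)
print(np.allclose(V.T @ V, np.eye(4)))   # V is orthonormal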

Eq. (19) can also be written as

VT AV = Λ. (20)

Since Λ is diagonal, we say that the unitary (orthonormal) matrix V of eigenvectors diagonalizes A. No other orthonormal matrix can diagonalize A. The fact that only V diagonalizes A is the fundamental property of eigenvectors. If you understand that the eigenvectors of a symmetric matrix diagonalize it, then you understand the "mystery" behind eigenvalues and eigenvectors. That's all there is to it. We look at the K-L expansion later in this lecture in order to solidify this interpretation, and to show some very important signal processing concepts which fall out of the K-L idea. But the K-L analysis is just a direct consequence of the fact that only the eigenvectors of a symmetric matrix diagonalize it.

2.1.3 Conventional Notation on Eigenvalue Indexing

Let A ∈ ℜ^{n×n} have rank r ≤ n. Also assume A is positive semi-definite; i.e., all its eigenvalues are ≥ 0. This is not a very restrictive assumption, because most of the matrices on which the eigendecomposition is relevant are positive semi-definite. Then, as we see in the next section, we have r nonzero eigenvalues and n − r zero eigenvalues. It is common convention to order the eigenvalues so that

λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > λ_{r+1} = · · · = λ_n = 0,    (21)

where λ_1, . . . , λ_r are the r nonzero eigenvalues and λ_{r+1}, . . . , λ_n are the n − r zero eigenvalues; i.e., we order the columns of eq. (17) so that λ_1 is the largest, with the remaining nonzero eigenvalues arranged in descending order, followed by the n − r zero eigenvalues. Note that if A is full rank, then r = n and there are no zero eigenvalues. The quantity λ_n is the eigenvalue with the lowest value.

The eigenvectors are reordered to correspond with the ordering of the eigenvalues. For notational convenience, we refer to the eigenvector corresponding to the largest eigenvalue as the “largest eigenvector”. The “smallest eigenvector” is then the eigenvector corresponding to the smallest eigenvalue.

2.2 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces

In this section, we develop relationships between the eigendecomposition of a matrix and its range, null space and rank.

Here, we consider square symmetric positive semi-definite matrices A ∈ ℜ^{n×n}, whose rank r ≤ n. Let us partition the eigendecomposition of A in the following form:

A = VΛV^T = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 & 0 \\ 0 & Λ_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix},    (22)

where the blocks V_1 and V_2 have r and n − r columns respectively:

V_1 = [v_1, v_2, . . . , v_r] ∈ ℜ^{n×r},
V_2 = [v_{r+1}, . . . , v_n] ∈ ℜ^{n×(n−r)}.    (23)

The columns of V_1 are eigenvectors corresponding to the first r eigenvalues of A, and the columns of V_2 correspond to the n − r smallest eigenvalues.


We also have

Λ_1 = diag[λ_1, . . . , λ_r] ∈ ℜ^{r×r},    (24)

and

Λ_2 = diag[λ_{r+1}, . . . , λ_n] ∈ ℜ^{(n−r)×(n−r)}.    (25)

In the notation used above, the explicit absence of a matrix element in an off-diagonal position implies that element is zero. We now show that the partition (22) reveals a great deal about the structure of A.

2.2.1 Range

Let us look at R(A) in the light of the decomposition of (22). The definition of R(A), repeated here for convenience, is

R(A) = {y | y = Ax,x ∈ ℜn} , (26)

where x is arbitrary. The vector quantity Ax is therefore given as

Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 & 0 \\ 0 & Λ_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.    (27)

Let us define the vector c as

c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x,    (28)

where c_1 ∈ ℜ^r and c_2 ∈ ℜ^{n−r}. Then,

y = Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 & 0 \\ 0 & Λ_2 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}
       = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 c_1 \\ Λ_2 c_2 \end{bmatrix}
       = V_1 (Λ_1 c_1) + V_2 (Λ_2 c_2).    (29)


We are given that A is rank r ≤ n. Therefore, R(A) by definition can only span r linearly independent directions, and the dimension of R(A) is r. Since V_1 has r linearly independent columns, we can find a V_1 so that R(A) = R(V_1). Then, the vector Ax from (29) cannot contain any contributions from V_2. Since x and therefore c are arbitrary, Λ_2 = 0.

We therefore have the important result that a rank-r matrix (r ≤ n) must have n − r zero eigenvalues, and that V_1 is an orthonormal basis for R(A).

2.2.2 Nullspace

In this section, we explore the relationship between the partition of (22) and the nullspace of A. Recall that the nullspace N(A) of A is defined as

N(A) = {x ∈ ℜ^n, x ≠ 0 | Ax = 0}.    (30)

From (22), and the fact that Λ2 = 0, we have

Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} x.    (31)

We now choose x so that x ∈ span(V_2). Then x = V_2 c_2, where c_2 is any vector in ℜ^{n−r}. Then, since V_1 ⊥ V_2, we have

Ax = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} Λ_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ c_2 \end{bmatrix} = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0.    (32)

Thus, V_2 is an orthonormal basis for N(A). Since V_2 has n − r columns, the dimension of N(A) (i.e., the nullity of A) is n − r.
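The roles of V_1 and V_2 can be illustrated numerically; the sketch below (assuming NumPy) builds a rank-deficient symmetric matrix A = BB^T and checks that its eigenvectors split into a basis for R(A) and a basis for N(A):

import numpy as np

rng = np.random.default_rng(3)
n, r = 5, 3
B = rng.standard_normal((n, r))
A = B @ B.T                              # symmetric, positive semi-definite, rank r

lam, V = np.linalg.eigh(A)               # eigenvalues in ascending order
V1 = V[:, n - r:]                        # eigenvectors of the r nonzero eigenvalues
V2 = V[:, :n - r]                        # eigenvectors of the n-r (near-)zero eigenvalues

print(np.round(lam, 10))                 # n-r eigenvalues are essentially zero
print(np.allclose(A @ V2, 0))            # V2 spans the nullspace: A V2 = 0
# Each column of A lies in span(V1): projecting onto V1 reproduces A.
print(np.allclose(V1 @ (V1.T @ A), A))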


2.3 Matrix Norms

Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps ℜ^{m×n} into ℜ. A matrix norm must obey the same properties as a vector norm. Since a norm is only strictly defined for a vector quantity, a matrix norm is defined by mapping a matrix into a vector. This is accomplished by post-multiplying the matrix by a suitable vector. Some useful matrix norms are now presented.

Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix A, denoted ||A||_p, is defined as

||A||_p = sup_{x ≠ 0} \frac{||Ax||_p}{||x||_p}    (33)

where “sup” means supremum, i.e., the largest value of the argument over all values of x ≠ 0. Since a property of a vector norm is ||cx||_p = |c| ||x||_p for any scalar c, we can choose c in (33) so that ||x||_p = 1. Then, an equivalent statement to (33) is

||A||_p = max_{||x||_p = 1} ||Ax||_p.    (34)

We now provide some interpretation of the above definition for the specific case where p = 2 and for A square and symmetric, in terms of the eigendecomposition of A. To find the matrix 2-norm, we differentiate (34) and set the result to zero. Differentiating ||Ax||_2 directly is difficult. However, we note that finding the x which maximizes ||Ax||_2 is equivalent to finding the x which maximizes ||Ax||_2^2, and the differentiation of the latter is much easier. In this case, we have ||Ax||_2^2 = x^T A^T A x. To find the maximum, we use the method of Lagrange multipliers, since x is constrained by (34). Therefore we differentiate the quantity

x^T A^T A x + γ(1 − x^T x)    (35)

and set the result to zero. The quantity γ above is the Lagrange multiplier. The details of the differentiation are omitted here, since they will be covered in a later lecture. The interesting result of this process is that x must satisfy

A^T A x = γ x,    ||x||_2 = 1.    (36)


Therefore the stationary points of (34) are the eigenvectors of A^T A. When A is square and symmetric, the eigenvectors of A^T A are equivalent to those of A. [6] Therefore the stationary points of (34) are given by the eigenvectors of A. By substituting x = v_1 into (34) we find that ||Ax||_2 = λ_1.

It then follows that the solution to (34) is given by the eigenvector corresponding to the largest eigenvalue of A, and the resulting norm ||A||_2 is equal to the largest eigenvalue of A.

More generally, it is shown in the next lecture for an arbitrary matrix A that

||A||_2 = σ_1    (37)

where σ_1 is the largest singular value of A. This quantity results from the singular value decomposition, to be discussed next lecture.

Matrix norms for other values of p, for arbitrary A, are given as

||A||_1 = max_{1 ≤ j ≤ n} \sum_{i=1}^{m} |a_ij|    (maximum column sum)    (38)

and

||A||_∞ = max_{1 ≤ i ≤ m} \sum_{j=1}^{n} |a_ij|    (maximum row sum).    (39)

Frobenius Norm: The Frobenius norm is the 2-norm of the vector consisting of the 2-norms of the rows (or columns) of the matrix A:

||A||_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_ij|^2 \right)^{1/2}
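These norms are easy to compare numerically; a sketch assuming NumPy (for the symmetric positive semi-definite test matrix used here the 2-norm coincides with the largest eigenvalue; in general it is the largest singular value):

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B @ B.T                                    # symmetric, positive semi-definite

print(np.linalg.norm(A, 2), np.linalg.eigvalsh(A)[-1])         # 2-norm vs largest eigenvalue
print(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())       # maximum column sum, eq. (38)
print(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())  # maximum row sum, eq. (39)
print(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.T @ A)))    # Frobenius norm, eq. (43)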

2.3.1 Properties of Matrix Norms

1. Consider the matrix A ∈ ℜm×n and the vector x ∈ ℜn. Then,

||Ax||p ≤ ||A||p ||x||p (40)

[6] This proof is left as an exercise.


This property follows by dividing both sides of the above by ||x||_p, and applying (33).

2. If Q and Z are orthonormal matrices of appropriate size, then

||QAZ||2 = ||A||2 (41)

and

||QAZ||_F = ||A||_F    (42)

Thus, we see that the matrix 2-norm and Frobenius norm are invariant to pre- and post-multiplication by an orthonormal matrix.

3. Further,

||A||_F^2 = tr(A^T A)    (43)

where tr(·) denotes the trace of a matrix, which is the sum of its diagonal elements.

2.4 Covariance Matrices

Here, we investigate the concepts and properties of the covariance matrix R_xx corresponding to a stationary, discrete-time random process x[n]. We break the infinite sequence x[n] into windows of length m, as shown in Fig. 2. The windows generally overlap; in fact, they are typically displaced from one another by only one sample. The samples within the ith window become an m-length vector x_i, i = 1, 2, 3, . . .. Hence, the vector corresponding to each window is a vector sample from the random process x[n]. Processing random signals in this way is the fundamental first step in many forms of electronic system which deal with real signals, such as process identification, control, or any form of communication system including telephones, radio, radar, sonar, etc.

Figure 2: The received signal x[n] is decomposed into windows of length m. The samples in the ith window comprise the vector x_i, i = 1, 2, . . ..

The word stationary as used above means the random process is one for which the corresponding joint m-dimensional probability density function describing the distribution of the vector sample x does not change with time. This means that all moments of the distribution (i.e., quantities such as the mean, the variance, and all cross-correlations, as well as all other higher-order statistical characterizations) are invariant with time. Here, however, we deal with a weaker form of stationarity referred to as wide-sense stationarity (WSS). With these processes, only the first two moments (mean, variances and covariances) need be invariant with time. Strictly, the idea of a covariance matrix is only relevant for stationary or WSS processes, since expectations only have meaning if the underlying process is stationary.

The covariance matrix R_xx ∈ ℜ^{m×m} corresponding to a stationary or WSS process x[n] is defined as

R_xx ≜ E[(x − µ)(x − µ)^T]    (44)

where µ is the vector mean of the process and E(·) denotes the expectation operator over all possible windows of index i of length m in Fig. 2. Often we deal with zero-mean processes, in which case we have

R_xx = E[x_i x_i^T] = E\left[ \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \right]
     = E \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_m \\ x_2 x_1 & x_2 x_2 & \cdots & x_2 x_m \\ \vdots & \vdots & \ddots & \vdots \\ x_m x_1 & x_m x_2 & \cdots & x_m x_m \end{bmatrix},    (45)

where (x_1, x_2, . . . , x_m)^T = x_i. Taking the expectation over all windows, eq. (45) tells us that the element r(1, 1) of R_xx is by definition E(x_1^2), which is the mean-square value (the preferred term is variance, whose symbol is σ^2) of the first element x_1 of all possible vector samples x_i of the process. But because of stationarity, r(1, 1) = r(2, 2) = · · · = r(m, m), and these are all equal to σ^2. Thus all main-diagonal elements of R_xx are equal to the variance of the process. The element r(1, 2) = E(x_1 x_2) is the cross-correlation between the first element of x_i and the second element. Taken over all possible windows, we see this quantity is the cross-correlation of the process and itself delayed by one sample. Because of stationarity, r(1, 2) = r(2, 3) = · · · = r(m−1, m), and hence all elements on the first upper diagonal are equal to the cross-correlation for a time lag of one sample. Since multiplication is commutative, r(2, 1) = r(1, 2), and therefore all elements on the first lower diagonal are also equal to this same cross-correlation value. Using similar reasoning, all elements on the jth upper or lower diagonal are equal to the cross-correlation value of the process for a time lag of j samples. Thus we see that the matrix R_xx is highly structured.

Let us compare the process shown in Fig. 2 with that shown in Fig. 3. In the former case, we see that the process is relatively slowly varying. Because we have assumed x[n] to be zero mean, adjacent samples of the process in Fig. 2 will have the same sign most of the time, and hence E(x_i x_{i+1}) will be a positive number, coming close to the value E(x_i^2). The same can be said for E(x_i x_{i+2}), except it is not so close to E(x_i^2). Thus, we see that for the process of Fig. 2, the diagonals decay fairly slowly away from the main-diagonal value.

However, for the process shown in Fig. 3, adjacent samples are uncorrelated with each other. This means that adjacent samples are just as likely to have opposite signs as they are to have the same signs. On average, the terms with positive values have the same magnitude as those with negative values. Thus, when the expectations E(x_i x_{i+1}), E(x_i x_{i+2}), . . . are taken, the resulting averages approach zero. In this case, we see the covariance matrix concentrates around the main diagonal, and becomes equal to σ^2 I. We note that all the eigenvalues of R_xx are then equal to the value σ^2. Because of this property, such processes are referred to as “white”, in analogy to white light, whose spectral components are all of equal magnitude.

The sequence {r(1, 1), r(1, 2), . . . , r(1, m)} is equivalent to the autocorrelation function of the process, for lags 0 to m − 1. The autocorrelation function of the process characterizes the random process x[n] in terms of its variance, and how quickly the process varies over time. In fact, it may be shown [7] that the Fourier transform of the autocorrelation function is the power spectral density of the process. Further discussion on this aspect of random processes is beyond the scope of this treatment; the interested reader is referred to the reference.

[7] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd Ed.

Figure 3: An uncorrelated discrete-time process (x(k) plotted against time).

In practice, it is impossible to evaluate the covariance matrix R_xx using expectations as in (44). Expectations cannot be evaluated in practice: they require an infinite amount of data, which is never available, and furthermore, the data must be stationary over the observation interval, which is rarely the case. In practice, we evaluate an estimate R̂_xx of R_xx, based on an observation of finite length N of the process x[n], by replacing the ensemble average (expectation) with a finite temporal average over the N available data points as follows: [8]

R̂_xx = \frac{1}{N − m + 1} \sum_{i=1}^{N−m+1} x_i x_i^T.    (46)

If (46) is used to evaluate R̂_xx, then the process need only be stationary over the observation length. Thus, by using the covariance estimate given by (46), we can track slow changes in the true covariance matrix of the process with time, provided the change in the process is small over the observation interval N. Further properties and discussion of covariance matrices are given in Haykin. [9]

[8] Processes with this property are referred to as ergodic processes.

It is interesting to note that R̂_xx can be formed in an alternate way from (46). Let X ∈ ℜ^{m×(N−m+1)} be a matrix whose ith column is the vector sample x_i, i = 1, . . . , N − m + 1, of x[n]. Then R̂_xx is also given as

R̂_xx = \frac{1}{N − m + 1} XX^T.    (47)

The proof of this statement is left as an exercise.
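A sketch of the estimates (46) and (47), assuming NumPy; the white test sequence and window length used here are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(5)
N, m = 1000, 4
x = rng.standard_normal(N)               # a white test sequence

# Stack the overlapping windows x_i as the columns of X, as in eq. (47).
X = np.column_stack([x[i:i + m] for i in range(N - m + 1)])

R_sum = sum(np.outer(X[:, i], X[:, i]) for i in range(X.shape[1])) / (N - m + 1)
R_mat = X @ X.T / (N - m + 1)            # same estimate, matrix form

print(np.allclose(R_sum, R_mat))
print(np.round(R_mat, 2))                # close to sigma^2 * I for a white process; note the Toeplitz structure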

Some Properties of Rxx:

1. R_xx is (Hermitian) symmetric; i.e., r_ij = r*_ji, where * denotes complex conjugation.

2. If the process x[n] is stationary or wide-sense stationary, then R_xx is Toeplitz. This means that all the elements on a given diagonal of the matrix are equal. If you understand this property, then you have a good understanding of the nature of covariance matrices.

3. If R_xx is diagonal, then the elements of x are uncorrelated. If the magnitudes of the off-diagonal elements of R_xx are significant with respect to those on the main diagonal, the process is said to be highly correlated.

4. R_xx is positive semi-definite. This implies that all the eigenvalues are greater than or equal to zero. We will discuss positive definiteness and positive semi-definiteness later.

5. If the stationary or WSS random process x has a Gaussian probability distribution, then the vector mean and the covariance matrix R_xx are enough to completely specify the statistical characteristics of the process.

[9] Haykin, “Adaptive Filter Theory”, Prentice Hall, 3rd ed.


2.5 The Karhunen-Loeve Expansion of a Random Process

In this section we combine what we have learned about eigenvalues and eigenvectors, and covariance matrices, into the K-L orthonormal expansion of a random process. The KL expansion is extremely useful in the compression of images and speech signals.

An orthonormal expansion of a vector x ∈ ℜ^m involves expressing x as a linear combination of orthonormal basis vectors or functions as follows:

x = Qa    (48)

where a = [a_1, . . . , a_m] contains the coefficients or weights of the expansion, and Q = [q_1, . . . , q_m] is an m × m orthonormal matrix. [10] Because Q is orthonormal, we can write

a = Q^T x.    (49)

The coefficients a represent x in a coordinate system whose axes are the basis [q_1, . . . , q_m], instead of the conventional basis [e_1, . . . , e_m]. By using different basis functions Q, we can generate sets of coefficients with different properties. For example, we can express the discrete Fourier transform (DFT) in the form of (49), where the columns of Q are harmonically related rotating exponentials. With this basis, the coefficients a tell us how much of the frequency corresponding to q_i is contained in x.

For each vector observation x_i, the matrix Q remains constant but a new vector a_i of coefficients is generated. To emphasize this point, we re-write (48) as

x_i = Q a_i,    i = 1, . . . , N    (50)

where i is the vector sample index (corresponding to the window position in Fig. 2) and N is the number of vector observations.

2.5.1 Development of the K–L Expansion

Figure 4 shows a scatterplot corresponding to a slowly varying random process, of the type shown in Figure 2. A scatterplot is a collection of dots, where the ith dot is the point on the m-dimensional plane corresponding to the vector x_i. Because of obvious restrictions in drawing, we are limited here to the value m = 2. Because the process we have chosen in this case is slowly varying, the elements of each x_i are highly correlated; i.e., knowledge of one element implies a great deal about the value of the other. This forces the scatterplot to be elliptical in shape (ellipsoidal in higher dimensions), concentrating along the principal diagonal in the x_1 – x_2 plane. Let the quantities θ_1, θ_2, . . . , θ_m be the lengths of the m principal axes of the scatterplot ellipse. With highly correlated processes we find that θ_1 > θ_2 > . . . > θ_m. Typically we find that the values θ_i diminish quickly with increasing i in larger dimensional systems, when the process is highly correlated.

Figure 4: A scatterplot of vectors x_i ∈ ℜ^2, corresponding to a highly correlated (in this case, slowly varying) random process similar to that shown in Figure 2. Each dot represents a separate vector sample, where its first element x_1 is plotted against the second element x_2.

For the sake of contrast, Figure 5 shows a similar scatterplot, except the underlying random process is white. Here there is no correlation between adjacent samples of the process, so there is no diagonal concentration of the scatterplot in this case. This scatterplot is an m-dimensional spheroid.

[10] An expansion of x usually requires the basis vectors to be only linearly independent, not necessarily orthonormal. But orthonormal basis vectors are most commonly used because they can be inverted using the very simple form of (49).

Figure 5: Similar to Figure 4, except the underlying random process is white.

As we discuss later, there are many advantages to transforming the coordinate system of the process x into one which is aligned along the principal axes of the scatterplot ellipsoid. The proposed method of finding this coordinate system is to find a basis vector q_1 ∈ ℜ^m such that the corresponding coefficient θ_1 = q_1^T x has the maximum possible mean-squared value (variance). Then, we find a second basis vector q_2, constrained to be orthogonal to q_1, such that the variance of the coefficient θ_2 = q_2^T x is maximum. We continue in this way until we obtain a complete orthonormal basis Q = [q_1, . . . , q_m]. Heuristically, we see from Figure 4 that the desired basis is the set of principal axes of the scatterplot ellipse.

The procedure to determine the q_i is straightforward. The basis vector q_1 is given as the solution to the following problem:

q_1 = arg max_{||q||_2 = 1} E[ |q^T x_i|^2 ]    (51)

where the expectation is over all values of i. The constraint on the 2-norm of q is to prevent the solution from going to infinity. Eq. (51) can be written


as

q_1 = arg max_{||q||_2 = 1} E[ q^T x x^T q ]
    = arg max_{||q||_2 = 1} q^T E(x x^T) q
    = arg max_{||q||_2 = 1} q^T R_xx q,    (52)

where we have assumed a zero-mean process. The optimization problem above is precisely the same as that for the matrix norm of section 2.3, where it is shown that the stationary points of the argument in (52) are the eigenvectors of R_xx. Therefore, the solution to (52) is q_1 = v_1, the largest eigenvector of R_xx. Similarly, q_2, . . . , q_m are the remaining successively decreasing eigenvectors of R_xx. Thus, the desired orthonormal matrix is the eigenvector matrix V corresponding to the covariance matrix of the random process. The decomposition of the vector x in this way is called the Karhunen-Loeve (KL) expansion of a random process.

Thus, the K-L expansion can be written as follows:

x_i = V θ_i    (53)

and

θ_i = V^T x_i,    (54)

where V ∈ ℜ^{m×m} is the orthonormal matrix of eigenvectors, which is the basis of the KL expansion, and θ_i ∈ ℜ^m is the vector of KL coefficients.

Thus, the coefficient θ_1 of θ on average contains the most energy (variance) of all the coefficients in θ; θ_2 is the coefficient which contains the next-highest variance, etc. The coefficient θ_m contains the least variance. This is in contrast to the conventional coordinate system, in which all axes have equal variances.
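A small numerical sketch of the K-L expansion, assuming NumPy. The correlated process here is produced by a simple moving-average smoothing of white noise (an illustrative choice, not the filter used in the example later in these notes):

import numpy as np

rng = np.random.default_rng(6)
N, m = 5000, 8
w = rng.standard_normal(N)
x = np.convolve(w, np.ones(5) / 5, mode='same')   # slowly varying (correlated) process

X = np.column_stack([x[i:i + m] for i in range(N - m + 1)])   # windows as columns
Rxx = X @ X.T / X.shape[1]                                    # covariance estimate, eq. (47)

lam, V = np.linalg.eigh(Rxx)
lam, V = lam[::-1], V[:, ::-1]            # reorder: largest eigenvalue first, eq. (21)

Theta = V.T @ X                           # K-L coefficients for every window, eq. (54)
print(np.round(lam, 4))                   # variance concentrates in the first few coefficients
print(np.round(np.var(Theta, axis=1), 4)) # coefficient variances approximate the eigenvalues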

Question: Suppose the process x is white, so that R_xx = E(xx^T) is already diagonal, with equal diagonal elements; i.e., R_xx = σ^2 I, as in Figure 5. What is the K-L basis in this case?

To answer this, we see that all the eigenvalues of R_xx are repeated. Therefore, the eigenvector basis is not unique. In fact, in this case, any vector in ℜ^m is an eigenvector of the matrix σ^2 I (the eigenvalue is σ^2). Therefore, any orthonormal basis is a K-L basis for a white process. This concept is evident from the circular scatterplot of Figure 5.

2.5.2 Properties of the KL Expansion

Property 7 The coefficients θ of the KL expansion are uncorrelated.

To prove this, we evaluate the covariance matrix R_θθ of θ, using the definition (54), as follows:

R_θθ = E(θ θ^T) = E(V^T x x^T V) = V^T R_xx V = Λ.    (55)

Since R_θθ is equal to the diagonal eigenvalue matrix Λ of R_xx, the KL coefficients are uncorrelated.

Property 8 The variance of the ith K-L coefficient θ_i is equal to the ith eigenvalue λ_i of R_xx.

The proof follows directly from (55); R_θθ = Λ.

Property 9 The variance of a highly correlated random process x concentrates in the first few KL coefficients.

This property may be justified intuitively from the scatterplot of Figure 4, due to the fact that the length of the first principal axis is greater than that of the second. (This effect becomes more pronounced in higher dimensions.) However, here we wish to formally prove this property.

Let us denote the covariance matrix of the process shown in Fig. 2 as R_2, and that shown in Fig. 3 as R_3. We assume both processes are stationary with equal powers. Let α_i be the eigenvalues of R_2 and β_i be the eigenvalues of R_3. Because R_3 is diagonal with equal diagonal elements, all the β_i are equal. Our assumptions imply that the main diagonal elements of R_2 are equal to the main diagonal elements of R_3, and hence from Property 4, the trace and the eigenvalue sum of each covariance matrix are equal.

To obtain further insight into the behavior of the two sets of eigenvalues, we consider Hadamard's inequality [11], which may be stated as:

Consider a square matrix A ∈ ℜ^{m×m}. Then, det A ≤ \prod_{i=1}^{m} a_ii, with equality if and only if A is diagonal.

From Hadamard's inequality, det R_2 < det R_3, and so also from Property 4, \prod_{i=1}^{n} α_i < \prod_{i=1}^{n} β_i. Under the constraint \sum α_i = \sum β_i, it follows that α_1 > α_n; i.e., the eigenvalues of R_2 are not equal. (We say the eigenvalues become disparate.) Thus, the variance in the first K-L coefficients of a correlated process is larger than that in the later K-L coefficients. Typically, in a highly correlated system, only the first few coefficients have significant variance.

To illustrate this phenomenon further, consider the extreme case where the process becomes so correlated that all elements of its covariance matrix approach the same value. (This will happen if the process x[n] does not vary with time.) Then, all columns of the covariance matrix are equal, the rank of R_xx becomes equal to one, and therefore only one eigenvalue is nonzero. Then all the energy of the process is concentrated into only the first K-L coefficient. In contrast, when the process is white and stationary, all the eigenvalues of R_xx are equal, and the variance of the process is equally distributed amongst all the K-L coefficients. The point of this discussion is to indicate a general behavior of random processes, which is that as they become more highly correlated, the variance in the K-L coefficients concentrates in the first few elements. The variance in the remaining coefficients becomes negligible.

[11] For a proof, refer to Cover and Thomas, Elements of Information Theory.


2.5.3 Applications of the K-L Expansion

Suppose a communications system transmits a stationary, zero-mean, highly correlated sequence x. This means that to transmit the elements of x directly, one sends a particular element x_i of x using as many bits as is necessary to convey the information with the required fidelity. However, in sending the next element x_{i+1}, almost all of the same information is sent over again, due to the fact that x_{i+1} is highly correlated with x_i and its previous few samples. That is, x_{i+1} contains very little new information relative to x_i. It is therefore seen that if x is highly correlated, transmitting the samples directly (i.e., using the conventional coordinate system) is very wasteful in terms of the number of required bits to transmit.

But if x is stationary and R_xx is known at the receiver, [12] then it is possible for both the transmitter and receiver to “know” the eigenvectors of R_xx, the basis set. If the process is sufficiently highly correlated, then, because of the concentration properties of the K-L transform, the variance of the first few coefficients θ dominates that of the remaining ones. The later coefficients on average typically have a small variance and are not required to accurately represent the signal.

To implement this form of signal compression, let us say that an acceptable level of distortion is obtained by retaining only the first j significant coefficients. We form a truncated K-L coefficient vector θ̂ in a similar manner to (54) as

θ̂ = \begin{bmatrix} θ_1 \\ \vdots \\ θ_j \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} v_1^T \\ \vdots \\ v_j^T \\ 0^T \\ \vdots \\ 0^T \end{bmatrix} x,    (56)

where coefficients θ_{j+1}, . . . , θ_m are set to zero and therefore need not be transmitted. This means we can represent the vector sample x_i more compactly without significant loss of quality; i.e., we have achieved signal compression.

[12] This is not necessarily a valid assumption. We discuss this point further later in the section.


An approximation x̂ to the original signal can be reconstructed by:

x̂ = Vθ̂. (57)

From Property 8, the mean-squared error ǫ_j^⋆ in the KL reconstruction x̂ is given as

ǫ_j^⋆ = \sum_{i=j+1}^{m} λ_i,    (58)

which corresponds to the sum of the truncated (smallest) eigenvalues. It is easy to prove that no other basis results in a smaller error. The error ǫ_j in the reconstructed x̂ using any basis [q_1, . . . , q_m] is given by

ǫ_j = \sum_{i=j+1}^{m} E|q_i^T x|^2 = \sum_{i=j+1}^{m} q_i^T R_xx q_i,    (59)

where the last line uses (51) and (52). We have seen previously that the eigenvectors are the stationary points of each term in the sum above. Since each term in the sum is non-negative, ǫ_j is minimized by minimizing each term individually. Therefore, the minimum of (59) is obtained when the q_i are assigned the m − j smallest eigenvectors. Since v_i^T R_xx v_i = λ_i when ||v_i||_2 = 1, ǫ_j = ǫ_j^⋆ only when q_i = v_i. This completes the proof.

In speech applications, for example, fewer than one tenth of the coefficients are needed for reconstruction with imperceptible degradation. Note that since R̂_xx is positive semi-definite, all eigenvalues are non-negative. Hence, the energy measure (58) is always non-negative for any value of j. This type of signal compression is the ultimate form of a type of coding known as transform coding.

Transform coding is now illustrated by an example. A process x[n] was generated by passing a unit-variance, zero-mean white noise sequence w(n) through a 3rd-order digital lowpass Butterworth filter with a relatively low normalized cutoff frequency (0.1 Hz), as shown in Fig. 6. Vector samples x_i are extracted from the sequence x[n] as shown in Fig. 2. The filter removes the high-frequency components from the input, and so the resulting output process x[n] must vary slowly in time. Thus, the K-L expansion is expected to require only a few principal eigenvector components, and significant compression gains can be achieved.

Figure 6: Generation of a highly correlated process x[n]: white noise w(n) is passed through a lowpass discrete-time filter with a low normalized cutoff frequency to produce x(n).
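A sketch reproducing the flavour of this experiment, assuming NumPy and SciPy are available. The filter order, normalized cutoff 0.1, m = 10 and j = 2 follow the text, but the exact eigenvalues obtained will differ from those listed below, since the noise realization and routines are not the original ones:

import numpy as np
from scipy.signal import butter, lfilter

rng = np.random.default_rng(7)
N, m, j = 20000, 10, 2

b, a = butter(3, 0.1)                       # 3rd-order lowpass, normalized cutoff 0.1
x = lfilter(b, a, rng.standard_normal(N))   # highly correlated process

X = np.column_stack([x[i:i + m] for i in range(N - m + 1)])
Rxx = X @ X.T / X.shape[1]

lam, V = np.linalg.eigh(Rxx)
lam, V = lam[::-1], V[:, ::-1]              # largest eigenvalue first
print(lam)                                  # rapidly decaying eigenvalues

# Keep only the first j K-L coefficients (eq. (56)) and reconstruct (eq. (57)).
Theta_hat = V[:, :j].T @ X
X_hat = V[:, :j] @ Theta_hat
print("normalized error (from eigenvalues):", lam[j:].sum() / lam.sum())
print("measured error:                     ", np.mean((X - X_hat) ** 2) * m / lam.sum())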

We show this example for m = 10. Listed below are the 10 eigenvalues corresponding to R̂_xx, the covariance matrix of x, generated from the output of the lowpass filter:

Eigenvalues:

0.5468
0.1975
0.1243 × 10^{-1}
0.5112 × 10^{-3}
0.2617 × 10^{-4}
0.1077 × 10^{-5}
0.6437 × 10^{-7}
0.3895 × 10^{-8}
0.2069 × 10^{-9}
0.5761 × 10^{-11}

The error ǫ_j^⋆ for j = 2 is thus evaluated from the above data as 0.0130, which may be compared to the value 0.7573, the total eigenvalue sum. The normalized error is 0.0130/0.7573 = 0.0171. Because this error may be considered a low enough value, only the first j = 2 K-L components may be considered significant. In this case, we have a compression gain of 10/2 = 5; i.e., the KL expansion requires only one fifth of the bits relative to representing the signal directly.

The corresponding two principal eigenvectors are plotted in Fig. 7. These plots show the value of the kth element v_k of the eigenvector, plotted against its index k for k = 1, . . . , m. These waveforms may be interpreted as functions of time.

Figure 7: First two eigenvector components (v_1 and v_2) as functions of time, for the Butterworth lowpass filtered noise example.

In this case, we would expect that any observation x_i can be expressed accurately as a linear combination of only the first two eigenvector waveforms shown in Fig. 7, whose coefficients θ̂ are given by (56). In Fig. 8 we show samples of the true observation x shown as a waveform in time, compared with the reconstruction x̂_i formed from (57) using only the first j = 2 eigenvectors. It is seen that the difference between the true and reconstructed vector samples is small, as expected.

Figure 8: Original vector samples of x as functions of time (solid), compared with their reconstruction using only the first two eigenvector components (dotted). Three vector samples are shown.

One of the practical difficulties in using the K-L expansion for coding is that the eigenvector set V is not usually known at the receiver in practical cases when the observed signal is mildly or severely nonstationary (e.g. speech or video signals). In this case, the covariance matrix estimate R̂_xx is changing with time; hence so are the eigenvectors. Transmission of the eigenvector set to the receiver is expensive in terms of information and so is undesirable. This fact limits the explicit use of the K-L expansion for coding. However, it has been shown [13] that the discrete cosine transform (DCT), which is another form of orthonormal expansion whose basis consists of cosine-related functions, closely approximates the eigenvector basis for a certain wide class of signals. The DCT uses a fixed basis, independent of the signal, and hence is always known at the receiver. Transform coding using the DCT enjoys widespread practical use and is the fundamental idea behind the so-called JPEG and MPEG international standards for image and video coding. The search for other bases, including particularly wavelet functions, to replace the eigenvector basis is a subject of ongoing research. Thus, even though the K-L expansion by itself is not of much practical value, the theoretical ideas behind it are of significant worth.

[13] K.R. Rao and P. Yip, “Discrete Cosine Transform: Algorithms, Advantages, Applications”.
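The closeness of the DCT basis to the K-L basis can be glimpsed numerically; a sketch assuming NumPy and SciPy, using a first-order Markov (AR(1)-like) Toeplitz covariance with correlation coefficient ρ = 0.95 (an illustrative choice) and the orthonormal DCT-II matrix:

import numpy as np
from scipy.fft import dct

m, rho = 8, 0.95
# Toeplitz covariance of a highly correlated process: R[i, j] = rho^|i-j|
R = rho ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))

lam, V = np.linalg.eigh(R)
V = V[:, ::-1]                                 # K-L basis, largest eigenvalue first

C = dct(np.eye(m), norm='ortho', axis=0)       # orthonormal DCT-II matrix (rows = basis vectors)

# Each DCT basis vector nearly coincides (up to sign) with one eigenvector:
# the diagonal magnitudes of C V come out close to one.
print(np.round(np.abs(C @ V).diagonal(), 2))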

2.6 Example: Array Processing

Here, we present a further example of the concepts we have developed so far. This example is concerned with direction-of-arrival estimation using arrays of sensors.

Figure 9: Physical description of incident signals onto an array of sensors. (The sensors, e.g. antennas, are spaced d apart along a line; K plane wavefronts of wavelength λ arrive at angles θ_1, . . . , θ_K measured from the array normal.)

Consider an array of M sensors (e.g., antennas) as shown in Fig. 9. Let there be K < M plane waves incident onto the array as shown. Assume the amplitudes of the incident waves do not change during the time taken for the wave to traverse the array. Also assume for the moment that the amplitude of the first incident wave at the first sensor is unity. Then, from the physics shown in Fig. 9, the signal vector x received by sampling each element of the array simultaneously, from the first incident wave alone, may be described in vector format by x = [1, e^{jφ}, e^{j2φ}, . . . , e^{j(M−1)φ}]^T, where φ is the electrical phase shift between adjacent elements of the array, due to the first incident wave. [14] When there are K incident signals, with corresponding amplitudes a_k, k = 1, . . . , K, the effects of the K incident signals add linearly together, each weighted by the corresponding amplitude a_k, to form the received signal vector x. The resulting received signal vector, including the noise, can then be written in the form

x_n = S a_n + w_n,    n = 1, . . . , N,    (60)

(here x_n and w_n are M × 1, S is M × K, and a_n is K × 1)

where

w_n = M-length noise vector at time n whose elements are independent random variables with zero mean and variance σ^2, i.e., E(w_i^2) = σ^2.

[14] It may be shown that if d ≤ λ/2, then there is a one-to-one relationship between the electrical angle φ and the corresponding physical angle θ. In fact, φ = (2πd/λ) sin θ. We can only observe the electrical angle φ, not the desired physical angle θ. Thus, we deduce the desired physical angle from the observed electrical angle using this mathematical relationship.

The vector w is assumed uncorrelated with the signal.

S = [s_1 . . . s_K]

s_k = [1, e^{jφ_k}, e^{j2φ_k}, . . . , e^{j(M−1)φ_k}]^T, k = 1, . . . , K, are referred to as steering vectors.

φ_k, k = 1, . . . , K, are the electrical phase–shift angles corresponding to the incident signals. The φ_k are assumed to be distinct.

a_n = [a_1 . . . a_K]_n^T is a vector of independent random variables, describing the amplitudes of each of the incident signals at time n.

In (60) we obtain N vector samples x_n ∈ ℜ^{M×1}, n = 1, . . . , N, by simultaneously sampling all array elements at N distinct points in time. Our objective is to estimate the directions of arrival φ_k of the plane waves relative to the array, by observing only the received signal.
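As a concrete (and entirely hypothetical) instance of the model (60), the sketch below generates N complex snapshots x_n = S a_n + w_n. The number of sensors M, the electrical angles φ_k, the source powers and the noise variance σ^2 are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, N = 8, 2, 200                      # sensors, sources, snapshots (assumed)
phi = np.array([0.6, 1.4])               # electrical angles phi_k in radians (assumed)
sigma2 = 0.1                             # noise variance sigma^2 (assumed)

# Steering matrix S = [s(phi_1) ... s(phi_K)], s_k = [1, e^{j phi_k}, ..., e^{j(M-1) phi_k}]^T
m = np.arange(M)[:, None]
S = np.exp(1j * m * phi[None, :])        # M x K

# Unit-power, independent source amplitudes a_n and white noise w_n for each snapshot
amps = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
W = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

X = S @ amps + W                         # the n-th column of X is the snapshot x_n of (60)
print(X.shape)                           # (M, N)
```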

Note K < M. Let us form the covariance matrix R of the received signal x:

R = E(x x^H) = E[(S a + w)(a^H S^H + w^H)]
  = S E(a a^H) S^H + σ^2 I.        (61)

The last line follows because the noise is uncorrelated with the signal, thus forcing the cross–terms to zero. In the last line of (61) we have also used the fact that the covariance matrix of the noise contribution (second term) is σ^2 I. This follows because the elements of the noise vector w are independent with equal power. We call the first term of (61) R_o; it is the contribution to the covariance matrix due only to the signal.


Let's look at the structure of R_o:

R_o   =     S      E(a a^H)     S^H,
          (M × K)   (K × K)   (K × M)

where the K × K matrix E(a a^H) is non-singular.

From this structure, we may conclude that R_o is rank K. This may be seen as follows. Let us define A ≜ E(a a^H) and B ≜ A S^H. Because the φ_k are distinct, S is full rank (rank K), and because the a_k are independent, A is full rank (K). Therefore the matrix B ∈ ℜ^{K×M} is of full rank K. Then, R_o = S B. From this last relation, we can see that the ith column of R_o, i = 1, . . . , M, is a linear combination of the K columns of S, whose coefficients are given by the ith column of B. Because B is full rank, K linearly independent linear combinations of the columns of S are used to form R_o. Thus R_o is rank K. Because K < M, R_o is rank deficient.
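The rank argument is easy to check numerically. The short sketch below (again with assumed values: M = 8 sensors, K = 2 distinct electrical angles, and E(a a^H) = I for unit-power, uncorrelated sources) confirms that R_o = S E(a a^H) S^H has exactly K non-zero eigenvalues.

```python
import numpy as np

M, K = 8, 2
phi = np.array([0.6, 1.4])                  # distinct electrical angles (assumed)
m = np.arange(M)[:, None]
S = np.exp(1j * m * phi[None, :])           # M x K steering matrix

A = np.eye(K)                               # E(aa^H) = I: unit-power, uncorrelated sources
Ro = S @ A @ S.conj().T                     # signal-only covariance, M x M

print(np.linalg.matrix_rank(Ro))            # K
print(np.round(np.linalg.eigvalsh(Ro), 6))  # K non-zero eigenvalues, M - K zeros
```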

Let us now investigate the eigendecomposition of R_o, where the λ_k are the eigenvalues of R_o:

R_o = V Λ V^H        (62)

or

R_o = V diag(λ_1, . . . , λ_K, 0, . . . , 0) V^H,        (63)

where the columns of V = [v_1 . . . v_M] are the eigenvectors of R_o.

Because R_o ∈ ℜ^{M×M} is rank K, it has K non-zero eigenvalues and M − K zero eigenvalues. We enumerate the eigenvectors v_1, . . . , v_K as those associated with the largest K eigenvalues, and v_{K+1}, . . . , v_M as those associated


with the zero eigenvalues. 15 16

From the definition of an eigenvector, we have

R_o v_i = 0        (68)

or

S A S^H v_i = 0,        i = K + 1, . . . , M.        (69)

Since A = E(a a^H) and S are full rank, the only way (69) can be satisfied is if the v_i, i = K + 1, . . . , M, are orthogonal to all columns of S = [s(φ_1), . . . , s(φ_K)]. Therefore we have

s_k^H v_i = 0,        k = 1, . . . , K,    i = K + 1, . . . , M.        (70)

We define the matrix V_N ≜ [v_{K+1}, . . . , v_M]. Therefore (70) may be written as

S^H V_N = 0.        (71)

We also have

[1, e^{jφ_k}, e^{j2φ_k}, . . . , e^{j(M−1)φ_k}]^H V_N = 0,        k = 1, . . . , K.        (72)

15Note that the eigenvalue zero has multiplicity M − K. Therefore, the eigenvectors v_{K+1}, . . . , v_M are not unique. However, a set of orthonormal eigenvectors which are orthogonal to the remaining eigenvectors exists. Thus we can treat these eigenvectors as if they corresponded to distinct eigenvalues.

16Let us define the so–called signal subspace S_S as

S_S = span [v_1, . . . , v_K]        (64)

and the noise subspace S_N as

S_N = span [v_{K+1}, . . . , v_M].        (65)

We now digress briefly to discuss these two subspaces further. From our discussion above, all columns of R_o are linear combinations of the columns of S. Therefore

span[R_o] = span[S].        (66)

But it is also easy to verify that

span[R_o] ⊆ S_S.        (67)

Comparing (66) and (67), we see that the columns of S lie in S_S. From (60) we see that any received signal vector x, in the absence of noise, is a linear combination of the columns of S. Thus, any noise–free signal resides completely in S_S. This is the origin of the term “signal subspace”. Further, any component of the received signal residing in S_N must be entirely due to the noise. This is the origin of the term “noise subspace”. We note that the signal and noise subspaces are orthogonal complement subspaces of each other.


Up to now, we have considered only the noise–free case. What happens when the noise component σ^2 I is added to R_o to give R in (61)? From Property 3, Lecture 1, we see that if the eigenvalues of R_o are λ_i, then those of R are λ_i + σ^2. The eigenvectors remain unchanged by the noise contribution, and (70) still holds when noise is present. Note that these properties only apply to the true covariance matrix formed using expectations, rather than to the estimated covariance matrix formed using time averages.
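Both facts are easy to verify numerically for the true covariance matrix. The sketch below continues the assumed scenario (E(a a^H) = I, σ^2 = 0.1) and is illustrative only:

```python
import numpy as np

M, K, sigma2 = 8, 2, 0.1                    # assumed scenario
phi = np.array([0.6, 1.4])
m = np.arange(M)[:, None]
S = np.exp(1j * m * phi[None, :])

Ro = S @ S.conj().T                         # signal-only covariance (E(aa^H) = I assumed)
R = Ro + sigma2 * np.eye(M)                 # true covariance with white noise added

lam_o = np.linalg.eigvalsh(Ro)
lam = np.linalg.eigvalsh(R)
print(np.allclose(lam, lam_o + sigma2))     # True: every eigenvalue is shifted by sigma^2

# The noise-subspace eigenvectors of R are still orthogonal to every steering vector,
# i.e. (70) continues to hold when the white-noise term is added.
_, V = np.linalg.eigh(R)                    # eigenvalues in ascending order, so the first
VN = V[:, :M - K]                           #   M - K columns span the noise subspace
print(np.max(np.abs(S.conj().T @ VN)))      # ~ 0 (up to round-off)
```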

With this background in place we can now discuss the MUSIC 17 algorithm for estimating directions of arrival of plane waves incident onto arrays of sensors.

2.6.1 The MUSIC Algorithm 18

We wish to estimate the unknown values [φ_1, . . . , φ_K] which comprise S = [s(φ_1), . . . , s(φ_K)]. The MUSIC algorithm assumes the quantity K is known. In the practical case, where expectations cannot be evaluated because they require infinite data, we form an estimate R̂ of R based on a finite number N of observations as follows:

R̂ = (1/N) Σ_{n=1}^{N} x_n x_n^H.

Only if N → ∞ does R̂ → R.
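A sketch of this estimation step, together with the noise-subspace estimate V̂_N described in the next paragraph, is given below; the scenario (M, K, N, angles, noise power) consists of assumed values carried over from the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, sigma2 = 8, 2, 200, 0.1
phi = np.array([0.6, 1.4])                      # true electrical angles (assumed)
m = np.arange(M)[:, None]
S = np.exp(1j * m * phi[None, :])

# Simulated snapshots x_n = S a_n + w_n, as in (60)
amps = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
W = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = S @ amps + W

# Sample covariance R_hat = (1/N) sum_n x_n x_n^H
R_hat = X @ X.conj().T / N

# Noise-subspace estimate: eigenvectors of the M - K smallest eigenvalues of R_hat
lam, V = np.linalg.eigh(R_hat)                  # eigenvalues in ascending order
VN_hat = V[:, :M - K]

print(np.max(np.abs(S.conj().T @ VN_hat)))      # small, but not exactly zero (finite N)
```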

An estimate V̂_N of V_N may be formed from the eigenvectors associated with the smallest M − K eigenvalues of R̂. Because of the finite N and the presence of noise, (72) only holds approximately when V̂_N is used in place of V_N. Thus, a reasonable estimate of the desired directions of arrival may be obtained by finding values of the variable φ for which the expression on the left of (72) is small instead of exactly zero. Accordingly, we determine K estimates φ̂ which locally satisfy

φ̂ = arg min_φ || s^H(φ) V̂_N ||.        (73)

17This word is an acronym for MUltiple SIgnal Classification.

18R.O. Schmidt, “Multiple emitter location and parameter estimation”, IEEE Trans. Antennas and Propagation, vol. AP-34, Mar. 1986, pp. 276–280.


Figure 10: MUSIC spectrum P(φ) for the case K = 2 signals. (The spectrum is plotted versus φ and exhibits peaks at φ_1 and φ_2.)

By convention, it is desirable to express (73) as a spectrum–like function, where a peak instead of a null represents a desired signal. It is also convenient to use the squared norm instead of the norm itself. Thus, the MUSIC “spectrum” P(φ) is defined as:

P(φ) = 1 / ( s(φ)^H V̂_N V̂_N^H s(φ) ).

It will look something like what is shown in Fig. 10 when there are K = 2 incident signals. Estimates of the directions of arrival φ_k are then taken as the peaks of the MUSIC spectrum.
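Putting the pieces together, here is a self-contained sketch of the MUSIC procedure on simulated data. Everything in it (the scenario, the search grid, the crude peak-picking rule, and the assumption d = λ/2 so that φ = π sin θ as in footnote 14) is an illustrative choice, not something prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, sigma2 = 8, 2, 200, 0.1                # assumed scenario
phi_true = np.array([0.6, 1.4])                 # true electrical angles (assumed)
m = np.arange(M)[:, None]

def steer(p):
    """Steering vectors s(phi) for an array of electrical angles p (M x len(p))."""
    return np.exp(1j * m * np.atleast_1d(p)[None, :])

# Simulate snapshots, form the sample covariance and the noise-subspace estimate
amps = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
W = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = steer(phi_true) @ amps + W
R_hat = X @ X.conj().T / N
_, V = np.linalg.eigh(R_hat)
VN_hat = V[:, :M - K]                           # eigenvectors of the M - K smallest eigenvalues

# MUSIC spectrum P(phi) = 1 / ||VN^H s(phi)||^2 evaluated on a dense grid
grid = np.linspace(-np.pi, np.pi, 2048)
proj = VN_hat.conj().T @ steer(grid)
P = 1.0 / np.sum(np.abs(proj) ** 2, axis=0)

# Crude peak picking: keep the K largest local maxima of the spectrum
peaks = np.where((P[1:-1] > P[:-2]) & (P[1:-1] > P[2:]))[0] + 1
phi_hat = np.sort(grid[peaks[np.argsort(P[peaks])[-K:]]])
print(phi_hat)                                  # should be close to phi_true

# With element spacing d = lambda/2, footnote 14 gives phi = pi sin(theta)
theta_hat = np.degrees(np.arcsin(phi_hat / np.pi))
print(theta_hat)                                # physical angles in degrees
```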

2.7 TO SUMMARIZE

• An eigenvector x of a matrix A is such that Ax points in the same direction as x.

• The covariance matrix R_xx of a random process x is defined as E(x x^H). For stationary processes, R_xx completely characterizes the process, and is closely related to its covariance function. In practice, the expectation operation is replaced by a time average.

• The eigenvectors of R_xx form a natural basis to represent x, since it is only the eigenvectors which diagonalize R_xx. This leads to the coefficients a of the corresponding expansion x = Va being uncorrelated. This has significant application in speech/video encoding.


• The expectation of the square of each coefficient above is the corresponding eigenvalue of R_xx. This gives an idea of the relative power present along each eigenvector.

• If the variables x are Gaussian, then the K–L coefficients are independent. This greatly simplifies receiver design and analysis.

Many of these points are a direct consequence of the fact that it is only the eigenvectors which can diagonalize a matrix. That is basically the only reason why eigenvalues/eigenvectors are so useful. I hope this serves to de-mystify this subject. Once you see that it is only the eigenvectors which diagonalize, the property that they are a natural basis for the process x becomes easy to understand.

An interpretation of an eigenvalue is that it represents the average energy in each coefficient of the K–L expansion.
