
ECE 712 Lecture Notes: Matrix Computations for Signal Processing

James P. Reilly © Department of Electrical and Computer Engineering

McMaster University

October 18, 2019

2 Lecture 2

This lecture discusses eigenvalues and eigenvectors in the context of principal component analysis of a random process. First, we discuss the fundamentals of eigenvalues and eigenvectors, then go on to covariance matrices. These two topics are then combined into PCA analysis. An example from the field of array signal processing is given as an application of algebraic ideas.

A major aim of this presentation is an attempt to de-mystify the concepts of eigenvalues and eigenvectors by showing a very important application in the field of signal processing.

2.1 Eigenvalues and Eigenvectors

We first discuss this subject from the classical mathematical viewpoint, and then, when the requisite background is in place, we will apply eigenvalues and eigenvectors in an engineering context. We investigate the underlying ideas of this topic using the matrix A as an example:

A = [ 4  1
      1  4 ]                                                  (1)


Figure 1. Matrix-vector multiplication for various vectors x1, x2, x3: the products Ax1, Ax2 and Ax3 are respectively rotated counter-clockwise, rotated clockwise, and not rotated relative to the corresponding xi.

The product Ax1, where x1 = [1, 0]T, is shown in Fig. 1. Then,

Ax1 = [4, 1]T.                                                (2)

By comparing the vectors x1 and Ax1 we see that the product vector is scaled and rotated counter-clockwise with respect to x1.

Now consider the case where x2 = [0, 1]T. Then Ax2 = [1, 4]T. Here, we note a clockwise rotation of Ax2 with respect to x2.

We now let x3 = [1, 1]T. Then Ax3 = [5, 5]T. Now the product vector points in the same direction as x3; i.e., Ax3 ∈ span(x3) and Ax3 = λx3. Because of this property, x3 = [1, 1]T is an eigenvector of A. The scale factor (which in this case is 5) is given the symbol λ and is referred to as an eigenvalue.

Note that x = [1, −1]T is also an eigenvector, because in this case, Ax = [3, −3]T = 3x. The corresponding eigenvalue is 3.

Thus if x is an eigenvector of A ∈ Rn×n we have,

Ax = λx (3)

i.e., the vector Ax is in the same direction as x but scaled by a factor λ.
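As a quick numerical check of the discussion above (a sketch that is not part of the original notes; it assumes NumPy is available), we can verify that [1, 1]T and [1, −1]T are eigenvectors of the matrix A in (1), with eigenvalues 5 and 3:

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 4.0]])

    # Direct check of (3): A x is a scalar multiple of x for both eigenvectors.
    for x in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
        Ax = A @ x
        print(x, Ax, Ax[0] / x[0])      # scale factors 5.0 and 3.0

    # Compare with the eigenvalues returned by NumPy (eigh handles symmetric A).
    print(np.linalg.eigh(A)[0])         # [3. 5.]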

Now that we have an understanding of the fundamental idea of an eigenvector, we proceed to develop the idea further. Eq. (3) may be written in the


form Ax = λIx, or

(A − λI)x = 0                                                 (4)

where I is the n × n identity matrix. Thus, to be an eigenvector, x must lie in the nullspace of A − λI. We know that a nontrivial solution to (4) exists if and only if N(A − λI) contains nonzero vectors, which implies that

det(A− λI) = 0 (5)

where det(·) denotes determinant. Eq. (5), when evaluated, becomes a polynomial in λ of degree n. For example, for the matrix A above we have

det( [ 4  1 ]  −  λ [ 1  0 ] )  =  0
     [ 1  4 ]       [ 0  1 ]

det [ 4−λ    1  ]  =  (4 − λ)² − 1
    [ 1    4−λ  ]  =  λ² − 8λ + 15 = 0.                       (6)

It is easily verified that the roots of this polynomial are (5, 3), which correspond to the eigenvalues indicated above.

Eq. (6) is referred to as the characteristic equation of A, and the corresponding polynomial is the characteristic polynomial. The characteristic polynomial is of degree n.

More generally, if A is n × n, then there are n solutions of (5), or n roots of the characteristic polynomial. Thus there are n eigenvalues of A satisfying (3); i.e.,

Axi = λixi, i = 1, . . . , n. (7)

Note that the eigenvalues of a diagonal matrix are the diagonal elements themselves, with corresponding eigenvectors given by the respective elementary vectors e1, . . . , en.

If the eigenvalues are all distinct, there are n associated linearly-independent eigenvectors, whose directions are unique, which span an n-dimensional Euclidean space.

In the case where there are r ≤ n repeated eigenvalues, a linearly independent set of n eigenvectors still exists (provided rank(A − λI) = n − r). However, their directions are not unique in this case. In fact, if [v1 . . . vr]


are a set of r linearly independent eigenvectors associated with a repeated eigenvalue, then it is trivial to show that any vector in span[v1 . . . vr] is also an eigenvector.

Example 1: Consider the matrix given by

[ 1  0  0
  0  0  0
  0  0  0 ].

Here the eigenvalues are [1, 0, 0]T, and a corresponding linearly independent eigenvector set is [e1, e2, e3]. Then it may be verified that any vector in span[e2, e3] is also an eigenvector associated with the zero repeated eigenvalue.

Example 2: Consider the n × n identity matrix. It has n repeated eigenvalues equal to one. In this case, any n-dimensional vector is an eigenvector, and the eigenvectors span an n-dimensional space.

—————–

Eq. (5) gives us a clue how to compute eigenvalues. We can formulate the characteristic polynomial and evaluate its roots to give the λi. Once the eigenvalues are available, it is possible to compute the corresponding eigenvectors vi by evaluating the nullspace of the quantity A − λiI, for i = 1, . . . , n. This approach is adequate for small systems, but for those of appreciable size it is prone to significant numerical error. Later, we consider various orthogonal transformations which lead to much more effective techniques for finding the eigenvalues.
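The following sketch (illustrative only; it assumes NumPy, and the characteristic-polynomial route is used purely because the example matrix is small) mirrors this procedure: form the characteristic polynomial, find its roots, and then obtain each eigenvector from the nullspace of A − λiI:

    import numpy as np

    A = np.array([[4.0, 1.0],
                  [1.0, 4.0]])

    coeffs = np.poly(A)        # characteristic polynomial coefficients: [1, -8, 15]
    lams = np.roots(coeffs)    # its roots are the eigenvalues: 5 and 3

    for lam in lams:
        # The eigenvector spans the nullspace of A - lam*I; numerically, take the
        # right singular vector associated with the smallest singular value.
        _, _, Vt = np.linalg.svd(A - lam * np.eye(2))
        print(lam, Vt[-1])

For matrices of realistic size one would call an eigensolver such as np.linalg.eigh directly rather than working through the characteristic polynomial.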

We now present some very interesting properties of eigenvalues and eigenvectors, to aid in our understanding.

Property 1 If the eigenvalues of a (Hermitian)1 symmetric matrix are distinct, then the eigenvectors are orthogonal.

1 A symmetric matrix is one where A = AT, where the superscript T means transpose, i.e., for a symmetric matrix, an element aij = aji. A Hermitian symmetric (or just Hermitian) matrix is relevant only for the complex case, and is one where A = AH, where the superscript H denotes the Hermitian transpose. This means the matrix is transposed and complex conjugated. Thus for a Hermitian matrix, an element aij = a∗ji. In this course we will generally consider only real matrices. However, when complex matrices are considered, Hermitian symmetric is implied instead of symmetric.

Proof. Let {vi} and {λi}, i = 1, . . . , n be the eigenvectors and corresponding eigenvalues respectively of A ∈ Rn×n. Choose any i, j ∈ [1, . . . , n], i ≠ j. Then

Avi = λivi (8)

and

Avj = λjvj.                                                   (9)

Premultiply (8) by vTj and (9) by vTi :

vTj Avi = λivTj vi (10)

vTi Avj = λjvTi vj (11)

The quantities on the left are equal when A is symmetric. We show this as follows. Since the left-hand side of (10) is a scalar, its transpose is equal to itself. Therefore, we get vTj Avi = vTi ATvj.2 But, since A is symmetric, AT = A. Thus, vTj Avi = vTi ATvj = vTi Avj, which was to be shown.

Subtracting (10) from (11), we have

(λi − λj)vTj vi = 0 (12)

where we have used the fact vTj vi = vTi vj. But by hypothesis, λi − λj ≠ 0.

Therefore, (12) is satisfied only if vTj vi = 0, which means the vectors are orthogonal.

Here we have considered only the case where the eigenvalues are distinct. If an eigenvalue λ̃ is repeated r times, and rank(A − λ̃I) = n − r, then a mutually orthogonal set of n eigenvectors can still be found.

Another useful property of eigenvalues of symmetric matrices is as follows:

2 Here, we have used the property that for matrices or vectors A and B of conformable size, (AB)T = BTAT.


Property 2 The eigenvalues of a (Hermitian) symmetric matrix are real.

Proof:3 (By contradiction): First, we consider the case where A is real. Let λ be a complex (non-real) eigenvalue of a symmetric matrix A. Then, since the elements of A are real, λ∗, the complex conjugate of λ, must also be an eigenvalue of A, because the roots of the characteristic polynomial must occur in complex conjugate pairs. Also, if v is a nonzero eigenvector corresponding to λ, then an eigenvector corresponding to λ∗ must be v∗, the complex conjugate of v. But Property 1 requires that the eigenvectors be orthogonal; therefore, vTv∗ = 0. But vTv∗ = (vHv)∗, which is by definition the complex conjugate of the squared 2-norm of v. The squared norm of a vector is a pure real number; hence, vTv∗ must be greater than zero, since v is by hypothesis nonzero. We therefore have a contradiction. It follows that the eigenvalues of a symmetric matrix cannot be complex; i.e., they are real.

While this proof considers only the real symmetric case, it is easily extended to the case where A is Hermitian symmetric.

3 From Lastman and Sinha, Microcomputer-based Numerical Methods for Science and Engineering.

Property 3 Let A be a matrix with eigenvalues λi, i = 1, . . . , n and eigenvectors vi. Then the eigenvalues of the matrix A + sI are λi + s, with corresponding eigenvectors vi, where s is any real number.

Proof: From the definition of an eigenvector, we have Av = λv. Further, we have sIv = sv. Adding, we have (A + sI)v = (λ + s)v. This new eigenvector relation on the matrix (A + sI) shows the eigenvectors are unchanged, while the eigenvalues are displaced by s. This property has very useful consequences with regard to regularization of unstable systems of equations. We discuss this point further in Ch. 4.

Property 4 Let A be an n × n matrix with eigenvalues λi, i = 1, . . . , n. Then


• The determinant det(A) = ∏_{i=1}^{n} λi.

• The trace4 tr(A) = ∑_{i=1}^{n} λi.

The proof is straightforward, but because it is easier using concepts presented later in the course, it is not given here.
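Although the proof is deferred, Property 4 is easy to spot-check numerically; the following sketch (not from the notes, assuming NumPy) does so for a random 5 × 5 matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    lam = np.linalg.eigvals(A)                            # eigenvalues of A

    print(np.allclose(np.prod(lam), np.linalg.det(A)))    # det(A) = product of eigenvalues
    print(np.allclose(np.sum(lam), np.trace(A)))          # tr(A)  = sum of eigenvalues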

Property 5 If v is an eigenvector of a matrix A, then cv is also an eigenvector, where c is any real or complex constant.

The proof follows directly by substituting cv for v in Av = λv. This means that only the direction of an eigenvector can be unique; its norm is not unique.

2.1.1 Orthonormal Matrices

Before proceeding with the eigendecomposition of a matrix, we must develop the concept of an orthonormal matrix. This form of matrix has mutually orthogonal columns, each of unit 2-norm. This implies that

qTi qj = δij , (13)

where δij is the Kronecker delta, and qi and qj are columns of the orthonormal matrix Q. When i = j, the quantity qTi qi defines the squared 2-norm of qi, which has been defined as unity. When i ≠ j, qTi qj = 0, due to the orthogonality of the qi. We therefore have

QTQ = I. (14)

Thus, for an orthonormal matrix, (14) implies Q−1 = QT, so the inverse may be computed simply by taking the transpose of the matrix, an operation which requires almost no computational effort.

4 The trace, denoted tr(·), of a square matrix is the sum of its elements on the main diagonal (also called the “diagonal” elements).

Eq. (14) follows directly from the fact that Q has orthonormal columns. It is not so clear that the quantity QQT should also equal the identity. We can


resolve this question in the following way. Suppose that A and B are any two square invertible matrices such that AB = I. Then, BAB = B. By parsing this last expression, we have

(BA) · B = B.                                                 (15)

Clearly, if (15) is to hold, then the quantity BA must be the identity5; hence, if AB = I, then BA = I. Therefore, if QTQ = I, then also QQT = I. From this fact, it follows that if a matrix has orthonormal columns, then it also must have orthonormal rows. We now develop a further useful property of orthonormal matrices.

An orthonormal matrix is sometimes referred to as a unitary matrix (strictly, the term unitary is used for the complex case, where QHQ = I, and orthogonal for the real case). Note that the determinant of an orthonormal matrix is ±1.

Property 6 The vector 2-norm is invariant under an orthonormal transformation.

If Q is orthonormal, then for any x we have

||Qx||₂² = xTQTQx = xTx = ||x||₂².

Thus, because the norm does not change, an orthonormal transformation performs a rotation (possibly combined with a reflection) on a vector. We use this norm-invariance property later in our study of the least-squares problem.
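A small sketch of these facts (illustrative only; here the orthonormal Q is obtained from the QR factorization of a random matrix, which is one convenient way to generate one):

    import numpy as np

    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q has orthonormal columns

    print(np.allclose(Q.T @ Q, np.eye(4)))             # Q^T Q = I, eq. (14)
    print(np.allclose(Q @ Q.T, np.eye(4)))             # Q Q^T = I as well

    # Property 6: the 2-norm is unchanged by an orthonormal transformation.
    x = rng.standard_normal(4)
    print(np.linalg.norm(Q @ x), np.linalg.norm(x))    # equal (up to roundoff)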

Orthonormal Matrices as a Basis Set: To represent an m-length vector x in an m × m orthonormal basis represented by Q (as per the discussion in Sect. 1.2.4), we form the coefficients by taking c = QTx. Then x is given as x = Qc. Because of the simplicity of these operations, orthonormal matrices are convenient to use as a basis. Later in this chapter, we use the eigenvector set of a covariance matrix as a basis for principal component analysis.

Consider the case where we have a tall matrix U ∈ Rm×n, where m > n, whose columns are orthonormal. U can be formed by extracting only the first n columns of an arbitrary orthonormal matrix. (We reserve the term

5 This only holds if A and B are square invertible.


orthonormal matrix to refer to a complete m × m matrix.) Because U has orthonormal columns, it follows that the quantity UTU = In×n. However, it is important to realize that the quantity UUT ≠ Im×m in this case, in contrast to the situation when m = n. This fact is easily verified, since rank(UUT) = n, which is less than m, and so UUT cannot be the identity.

However, the matrix UUT has interesting consequences when we interpret U as an (incomplete) basis. For an m-length vector x, we form the coefficients as c = UTx, where c is of length n < m. Then the representation x̃ of x in the new basis is x̃ = Uc = UUTx. Notice however that R(U) is a subspace of dimension n of Rm, and that x̃ is in this subspace, even though x itself is in the full m-dimensional universe. So the operation UUTx projects x into the subspace R(U). The matrix UUT is referred to as a projector. We will study projector matrices in more detail later in the course.
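A sketch of the projector idea (assumptions: U is taken as the first n columns of a randomly generated orthonormal matrix):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 5, 2

    Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    U = Q[:, :n]                                   # tall matrix with orthonormal columns

    print(np.allclose(U.T @ U, np.eye(n)))         # U^T U = I_n
    print(np.allclose(U @ U.T, np.eye(m)))         # False: U U^T is not I_m

    # U U^T projects x onto the n-dimensional subspace R(U); projecting again
    # changes nothing, which is the defining property of a projector.
    x = rng.standard_normal(m)
    x_tilde = U @ (U.T @ x)
    print(np.allclose(U @ (U.T @ x_tilde), x_tilde))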

2.1.2 The Eigendecomposition (ED) of a Square Symmetric Matrix

The eigendecomposition is a very useful tool in many branches of engineering analysis. While the form developed here applies only to square symmetric matrices, that is not too limiting a restriction in practice, since many matrices of interest in signal-processing-related disciplines fit into this category.

Let A ∈ Rn×n be symmetric. Then, for eigenvalues λi and eigenvectors vi, we have

Avi = λivi, i = 1, . . . , n. (16)

Let vi be an eigenvector of A having arbitrary 2-norm. Then, using Property 5, we can normalize vi to unit 2-norm by replacing it with the quantity vi/c, where c = ||vi||2. We therefore assume all the eigenvectors have been normalized in this manner.

Then these n equations can be combined, or stacked side-by-side, and represented in the following compact form:

AV = VΛ (17)


where V = [v1,v2, . . . ,vn] (i.e., each column of V is an eigenvector), and

Λ = [ λ1              0  ]
    [      λ2            ]
    [          . . .     ]
    [ 0               λn ]  =  diag(λ1 . . . λn).             (18)

Eq. (17) may be verified by realizing that corresponding columns from each side of this equation represent one specific value of the index i in (16). Because we have assumed A is symmetric, from Property 1, the vi are orthogonal. Furthermore, since we have chosen ||vi||2 = 1, V is an orthonormal matrix. Thus, post-multiplying both sides of (17) by VT, and using V V T = I, we get

A = VΛVT . (19)

Eq. (19) is called the eigendecomposition (ED) of A. The columns of V are eigenvectors of A, and the diagonal elements of Λ are the corresponding eigenvalues. Any square symmetric matrix may be decomposed in this way. This form of decomposition, with Λ being diagonal, is of extreme interest and has many interesting consequences. It is this decomposition which leads directly to the concepts behind principal component analysis, which we discuss shortly.

Note that from (19), knowledge of the eigenvalues and eigenvectors of A is sufficient to completely specify A. Note further that if the eigenvalues are distinct, then the ED is essentially unique: there is only one diagonal Λ (up to ordering) and, up to the sign of each column, only one orthonormal V which satisfies (19).

Eq. (19) can also be written as

VTAV = Λ.

Since Λ is diagonal, we say that the orthonormal matrix V of eigenvectors diagonalizes A. No other orthonormal matrix (apart from column reorderings and sign changes of V) can diagonalize A. The fact that only V diagonalizes A is a very important property of eigenvectors.
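A short numerical illustration of (17)–(19) (a sketch assuming NumPy; eigh is the eigensolver for symmetric matrices and returns an orthonormal V):

    import numpy as np

    rng = np.random.default_rng(3)
    B = rng.standard_normal((4, 4))
    A = B + B.T                              # a square symmetric matrix

    lam, V = np.linalg.eigh(A)               # A = V Lambda V^T
    Lam = np.diag(lam)

    print(np.allclose(V.T @ V, np.eye(4)))   # V is orthonormal
    print(np.allclose(V @ Lam @ V.T, A))     # eq. (19): A = V Lambda V^T
    print(np.allclose(V.T @ A @ V, Lam))     # V^T A V = Lambda (diagonalization)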

2.1.3 Conventional Notation on Eigenvalue Indexing

Let A ∈ Rn×n be symmetric and have rank r ≤ n. Then, as we see in the next section, there are r non-zero eigenvalues and n − r zero eigenvalues. It


is common convention to order the eigenvalues so that

|λ1| ≥ |λ2| ≥ . . . ≥ |λr| > λr+1 = . . . = λn = 0            (20)

(the first r eigenvalues are nonzero; the remaining n − r eigenvalues are zero)

i.e., we order the columns of eq. (17) so that λ1 is the largest in absolute value, with the remaining nonzero eigenvalues arranged in descending order, followed by n − r zero eigenvalues. Note that if A is full rank, then r = n and there are no zero eigenvalues. The quantity λn is the eigenvalue with the smallest absolute value.

The eigenvectors are reordered to correspond with the ordering of the eigenvalues. For notational convenience, we refer to the eigenvector corresponding to the largest eigenvalue as the “largest eigenvector”. The “smallest eigenvector” is then the eigenvector corresponding to the smallest eigenvalue.

2.2 The Eigendecomposition in Relation to the Fundamental Matrix Subspaces

In this section, we develop relationships between the eigendecomposition of a matrix and its range, null space and rank.

Here we entertain the possibility that the matrix A may be rank deficient; i.e., r ≤ n. Let us partition the eigendecomposition of A in the following block format:

A = V ΛV T

  = [ V1  V2 ] [ Λ1   0  ] [ V1T ]
               [ 0    Λ2 ] [ V2T ]                            (21)

where

V1 = [v1, v2, . . . , vr] ∈ Rn×r,
V2 = [vr+1, . . . , vn] ∈ Rn×(n−r).

We note that the block partitioning in (21) may be reconciled because multiplication of block matrices is achieved by treating the blocks as if they


are elements in regular matrix multiplication, provided the blocks are of conformable dimensions.

The columns of V1 are eigenvectors corresponding to the first r (largest) eigenvalues of A, and the columns of V2 are eigenvectors corresponding to the n − r smallest eigenvalues. We also have

Λ1 = diag[λ1, . . . , λr] ∈ Rr×r,

and

Λ2 = diag[λr+1, . . . , λn] ∈ R(n−r)×(n−r).

In the notation used above, the explicit absence of a matrix element in an off-diagonal position implies that element is zero. As we will see, the partitioning of A in the form of (21) reveals a great deal about its structure.

2.2.1 Range

We look at R(A) in the light of the decomposition of (21). The definition of R(A), repeated here for convenience, is

R(A) = {y | y = Ax, x ∈ Rn},                                  (22)

where x takes on all values in the n-dimensional universe. The vector quantity Ax is therefore given as

Ax = [ V1  V2 ] [ Λ1   0  ] [ V1T ]
                [ 0    Λ2 ] [ V2T ] x.

Let us define the vector c as

c = V Tx = [ c1 ]  =  [ V1T ]
           [ c2 ]     [ V2T ] x,                              (23)

where c1 ∈ Rr and c2 ∈ Rn−r. Then,

y = Ax = [ V1  V2 ] [ Λ1   0  ] [ c1 ]
                    [ 0    Λ2 ] [ c2 ].


We can therefore write:

y = [ V1  V2 ] [ Λ1 c1 ]  =  V1 (Λ1 c1) + V2 (Λ2 c2).         (24)
               [ Λ2 c2 ]

We are given that A has rank r ≤ n. Therefore the subspace spanned by y in (24) as x assumes all values, i.e., R(A), can by definition only span r linearly independent directions (recall r is the number of columns of V1). But since by definition Λ1 ∈ Rr×r contains the eigenvalues with largest absolute value, y must be a linear combination of only the columns of V1, and the eigenvalues in Λ1 must be non-zero. Also we have

1. all eigenvalues in Λ2 must be zero.

2. V 1 is an orthonormal basis for R(A).

We therefore have the important result that a rank r ≤ n matrix must have n − r zero eigenvalues. This property has very useful consequences in engineering, since in many cases it enables us to compress a signal or system of interest using only r components instead of n, where often in practical scenarios we have r ≪ n. We discuss this possibility further in Sect. xx.

2.2.2 Nullspace

In this section, we explore the relationship between the partition of (21) and the nullspace of A. Recall that the nullspace N(A) of A is defined as

N(A) = {x ∈ Rn | Ax = 0}.                                     (25)

From (21), and the fact that Λ2 = 0, we have

Ax = [ V1  V2 ] [ Λ1  0 ] [ V1T ]
                [ 0   0 ] [ V2T ] x


We now define x = V c from (23), and using the fact that V1 ⊥ V2, we have

Ax = [ V1  V2 ] [ Λ1  0 ] [ V1T ] [ V1  V2 ] [ c1 ]
                [ 0   0 ] [ V2T ]            [ c2 ]

   = [ V1  V2 ] [ Λ1  0 ] [ I  0 ] [ c1 ]
                [ 0   0 ] [ 0  I ] [ c2 ]

   = [ V1  V2 ] [ Λ1  0 ] [ c1 ]
                [ 0   0 ] [ c2 ]                              (26)

It is clear that (26) is zero for any nonzero x if and only if c1 = 0, which implies that x ∈ span(V2). In this case, we have

Ax = [ V1  V2 ] [ Λ1  0 ] [ 0  ]  =  [ V1  V2 ] [ 0 ]  =  0.  (27)
                [ 0   0 ] [ c2 ]                [ 0 ]

Thus, V2 is an orthonormal basis for N(A). Since V2 has n − r columns, the dimension of N(A) (i.e., the nullity of A) is n − r.

2.2.3 Diagonalizing a system of equations

We now show that transforming various quantities involved in a linear system of equations into the eigenvector basis diagonalizes the system of equations. We are given a system of equations Ax = b, where A is assumed square and symmetric, and hence can be decomposed using the eigendecomposition of (19), as A = V ΛV T. Thus our system can be written in the form

Λc = d, (28)

where c = V Tx and d = V Tb. Since Λ is diagonal, (28) has a much simpler structure than the original form. Thus, transforming x and b into the eigenvector basis greatly simplifies the analysis.
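A minimal sketch of this diagonalization (assuming a symmetric positive definite A, so that every diagonal entry of Λ is nonzero and the diagonal system is trivially solvable):

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.standard_normal((4, 4))
    A = B @ B.T + np.eye(4)        # symmetric positive definite, hence invertible
    b = rng.standard_normal(4)

    lam, V = np.linalg.eigh(A)     # A = V Lambda V^T

    d = V.T @ b                    # transform b into the eigenvector basis
    c = d / lam                    # solve the diagonal system Lambda c = d
    x = V @ c                      # transform the solution back

    print(np.allclose(A @ x, b))   # x solves the original system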


2.3 Covariance and Covariance Matrices

This is a very important topic in any form of statistical analysis and signal processing. The definitions vary across books and articles, but the fundamental idea remains unchanged. Here, we start with a general definition of covariance, and then give specific examples to aid interpretation.

We are given two scalar random variables x1 and x2. The covariance σ12 between them is defined as

σ12 = E(x1 − µ1)(x2 − µ2), (29)

where E(·) is the expectation operator, and µi, i ∈ [1, 2] are the respective means of the variables. A related quantity is the correlation coefficient ρ, which is a normalized version of the covariance, defined as

ρ12 = E(x1 − µ1)(x2 − µ2) / (σ1σ2),                           (30)

where σi is the standard deviation of the respective random variable. It may be shown (via the Cauchy-Schwarz inequality) that −1 ≤ ρ12 ≤ 1.

Because the presentation is easier in the case where the means are zero, and because most of the real-life signals we deal with have zero mean, from this point onwards we assume either that the means of the variables are zero, or that we have subtracted the means as a pre-processing step; i.e., xi ← xi − µi.

In the following presentation, it is important to understand the relationship between variance, mean-squared value and power, with reference to a random variable or random process x. Here we assume zero mean. The variance σx² of a zero-mean process is defined as σx² = E(x²), where the expectation is taken over all realizations of the process. The same definition applies to the mean-squared value and power. However, the term power is only applied when the random process is specifically a function of time.

We offer an example to help in the interpretation of (29). Let x1 be a person’s height and x2 be their corresponding weight. Then with respect to a particular person, if x1 is above the mean value, then it is more likely that x2 is also above its mean, and vice-versa. Thus it is likely that x1


and x2 both have the same sign (after removal of the mean) and therefore the product x1x2 is most often positive. Occasionally we encounter a short, stout individual, or a tall, slender person, and in this case the product x1x2 would be negative. By taking the expectation over all possible people, on average we will obtain a positive value, i.e., in this case the covariance is positive. We express this idea by saying that these variables are “positively correlated”.

Another example is where we let x1 be the final mark a person receives in this course, and x2 remains as their weight. Then, since the two variables appear to be unrelated, we would expect that the product x1x2 is positive as often as it is negative. In this case, the expectation over all instances of the random variables goes to zero, with the result that the covariance is zero. In this case, the variables are uncorrelated.

The final example is where x1 is the maximum speed at which a person can run, and again x2 remains as the person’s corresponding weight. Then generally, the higher the person’s weight, the slower they run. So in this case, the variables most often have opposite signs, and the covariance is negative. A non-zero covariance between two random variables implies that the two variables are related, whereas a zero covariance implies there is no (first-order) relationship between them.

The situations corresponding to these three examples are depicted in what is referred to as a scatter plot, shown in Fig. 2. Each point in the figures corresponds to a single measurement of the two variables x = [x1, x2]T. The scatterplot is characteristic of the underlying probability density function (pdf), which in this example is a multi-variate Gaussian distribution. The variables have been normalized to zero mean and unit standard deviation. We see that when the covariance is positive as in part (a) (here the covariance has the value +0.8), the effect of the positive covariance is to cluster the samples within an ellipse which is oriented along the direction [1, 1]T. When the covariance is zero (part b), there is no directional discrimination. When the covariance is negative, the samples are again clustered within an ellipse, but oriented along the direction [1, −1]T.



Figure 2. Scatter plots of the random vector [x1, x2]T for different values of covariance: (a) covariance σ12 = +0.8, (b) covariance σ12 = 0, and (c) covariance σ12 = −0.8. The axes are normalized to zero mean and standard deviation = 1. Each point in each figure represents one observation [x1, x2]T of the random vector x. In each figure there are 1000 observations.


In this example, the effect of a positive covariance between the variables is to cause the respective scatterplot to become elongated along the major axis, which in this case is along the direction [1, 1]T, where the elements have the same sign (i.e., where height and weight are simultaneously either larger or smaller than their means). Note that the direction [1, 1]T coincides with that of the first eigenvector – see Sect. 2.3.2. In this case, due to the positive correlation, observations relatively far from the mean along this direction have a relatively high probability, and so the variance of the observations along this direction is relatively high. On the other hand, again due to the positive correlation, observations along the direction [1, −1]T that are far from the mean have a lower probability, with the result that the variance in this direction is smaller (i.e., tall and skinny people occur more rarely than tall and correspondingly heavier people). This behaviour is apparent from Fig. 2a. In cases where the governing distribution is not Gaussian, similar behaviour persists, although the scatterplots will not be elliptical in shape.

As a further example, take the limiting case in Fig. 2a where σ12 → 1. Then, the knowledge of one variable completely specifies the other, and the scatterplot becomes a straight line. In summary, we see that as the value of the covariance increases from zero, the scatterplot transitions from being circular (when the variances of the variables are equal) to becoming elliptical, with the eccentricity increasing with the covariance.

2.3.1 Covariance Matrices

The covariance structure of a vector of random variables can be conveniently encapsulated into a covariance matrix, R ∈ Rm×m. The covariance matrix corresponding to the vector random process x ∈ Rm is defined as:

R ≜ E[(x − µ)(x − µ)T]                                        (31)

  = E [ (x1 − µ1)(x1 − µ1)   (x1 − µ1)(x2 − µ2)   . . .   (x1 − µ1)(xm − µm) ]
      [ (x2 − µ2)(x1 − µ1)   (x2 − µ2)(x2 − µ2)   . . .   (x2 − µ2)(xm − µm) ]
      [         ...                  ...          . . .           ...        ]
      [ (xm − µm)(x1 − µ1)   (xm − µm)(x2 − µ2)   . . .   (xm − µm)(xm − µm) ],

where µ is the vector mean.

We recognize the diagonal elements as the variances σ1², . . . , σm² of the elements


x1 . . . xm respectively, and the (i, j)th off-diagonal element as the covariance between xi and xj. Since multiplication is commutative, σij = σji and so R is symmetric. It is also apparent that covariance matrices are square. It therefore follows that the eigenvectors of R are orthogonal.

Suppose we have available n observations of the vector xi ∈ Rm, i = 1, . . . , n. (In these examples, m = 2 and n = 1000). These samples can be arranged into an m × n matrix X, where each column contains the m variables for the ith observation. In this case, if the process is ergodic, an estimate of the underlying covariance matrix R̂ ∈ Rm×m may be obtained by replacing the expectation in (31) with an arithmetic average, as follows

R̂ = (1/n) ∑_{i=1}^{n} xi xiT = (1/n) X XT.                    (32)

The last equality in the above follows from the outer product rule of matrix multiplication.

We see that the vectors xi involved in forming the outer products in (32) contain a single observation of the variables of interest (i.e., height and weight for the ith person in the present context). These outer products are then averaged across observations to form the covariance matrix. Thus, the form in (32) requires that the columns of X contain the variables from one observation. It is also possible to form the covariance matrix by evaluating (1/n) XTX. In this case, however, the rows of X contain the variables from one observation.
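As an illustrative sketch of (32) (assumed parameters; the covariance value 0.8 matches Fig. 2(a)), we can generate observations and form the estimate directly:

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 2, 1000

    R1 = np.array([[1.0, 0.8],
                   [0.8, 1.0]])
    # n observations of a zero-mean Gaussian vector, arranged as columns of X.
    X = rng.multivariate_normal(np.zeros(m), R1, size=n).T

    R_hat = (X @ X.T) / n          # eq. (32)
    print(R_hat)                   # close to R1 for large n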

The covariance matrices for each of our three examples in Fig. 2 are given as

R1 = [ 1     +0.8 ]     R2 = [ 1   0 ]     R3 = [ 1     −0.8 ]
     [ +0.8   1   ],         [ 0   1 ],         [ −0.8   1   ].        (33)

Note that the 1’s on each diagonal result because the variances of x1 and x2 have been normalized to 1.

2.3.2 Alternate Interpretation of Eigenvectors

Now that we have the requisite background in place, we can present an alternative to the classical interpretation of eigenvectors that we have previously


discussed. This alternate interpretation is very useful in science and engineering contexts, since it sets the stage for principal component analysis, which is a widely used tool in many applications in signal processing.

Consider the m × n matrix X as above. We address the question “What is the vector q with unit 2-norm so that the quantity yi = qTxi, i = 1, . . . , n, has maximum variance (or mean-squared value), when taken over all values of i?” Note that the inner product operation to form the quantity yi is equivalent to the projection of the ith observation xi onto the vector q. This is equivalent to finding the vector q∗ of unit 2-norm which maximizes ||y||₂², where y = qTX. We can state the problem more formally as

q∗ = arg max_q ||qTX||₂²  ≡  arg max_q qTXXTq,  subject to ||q||₂² = 1.       (34)

The constraint ||q||2 = 1 is necessary to prevent ||q|| from going to infinity. We explicitly use the 2-norm squared, since it enables a closed-form solution for q. Heuristically, with respect to Fig. 2a, we can see the solution is a unit vector pointing in the direction of [1, 1]T.

Eq. (34) is a constrained optimization problem which may be solved by the method of Lagrange multipliers6. Briefly, this method may be stated as follows. Given some objective function G(x) which we wish to minimize or maximize with respect to x, subject to an equality constraint function f(x) = 0, we formulate the Lagrangian function L(x) which is defined as

L(x) = G(x) + λf(x),

where λ is referred to as a Lagrange multiplier, whose value is to be determined. The constrained solution x∗ then satisfies

dL(x∗)/dx = 0.                                                (35)

Applying the Lagrange multiplier method to (34), the Lagrangian is given by

L(q) = qTXXTq + λ(1− qTq). (36)

From The Matrix Cookbook7, the derivative of the first term (when the associated matrix is symmetric) is 2XXTq. It is straightforward to show

6 There is a good description of the method of Lagrange multipliers in Wikipedia.
7 There is a link to this document on the course website.


that with respect to the second term,

d/dq (1 − qTq) = −2q.

Therefore (35) in this case becomes

2XXTq − 2λq = 0

or

XXTq = λq.                                                    (37)

We therefore have the important result that the stationary points of the constrained optimization problem (36) are given by the eigenvectors of XXT. Thus, the vector q∗ onto which the observations should be projected for maximum variance is given as the largest eigenvector of the matrix XXT. This direction coincides with the major axis of the scatterplot ellipse. The direction, orthogonal to the first, which results in the second-largest variance, is the second-largest eigenvector, and so on. The direction which results in minimum variance of y is the smallest eigenvector. Each eigenvector aligns itself along one of the principal axes of the corresponding scatterplot ellipse.

The matrix XXT in (37), except for the 1/n term, is the estimated covariance matrix R̂. The 1/n term is irrelevant in this case, since the eigenvectors are constrained to unit 2-norm, regardless of whether the term is present or not. Since it is easily verified that the largest eigenvector of R1 with respect to the example in Fig. 2(a) is (1/√2)[1, 1]T, this result checks with our earlier heuristic analysis that the optimal direction should be along the direction [1, 1]T.
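A quick check of this claim (sketch only, assuming NumPy):

    import numpy as np

    R1 = np.array([[1.0, 0.8],
                   [0.8, 1.0]])

    lam, V = np.linalg.eigh(R1)   # eigenvalues in ascending order
    print(lam)                    # [0.2, 1.8]
    print(V[:, -1])               # largest eigenvector, proportional to [1, 1]/sqrt(2)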

2.3.3 Covariance Matrices of Stationary Time Series

Here, we extend the concept of covariance matrices to a discrete-time, one-dimensional, stationary time series, represented by x[k]. The objective is to characterize the covariance structure of the time series over an interval of m samples. In this vein we form vectors xi ∈ Rm, which consist of m sequential samples of the time series within a window of length m, as shown in Fig. 3. The windows typically overlap; in fact, they are typically displaced from one another by only one sample. Hence, the vector corresponding to each window is a vector sample from the random process x[k].



Figure 3. The received signal x[k] is decomposed into windows of length m. The samples in the ith window comprise the vector xi, i = 1, 2, . . ..


Figure 4. A sample of a white Gaussian discrete-time process, with mean µ = 1 and variance σ² = 1.


The word stationary as used above means the random process is one for which the corresponding joint m-dimensional probability density function describing the distribution of the vector sample x does not change with time. This means that all moments of the distribution (i.e., quantities such as the mean, the variance, and all cross-correlations, as well as all other higher-order statistical characterizations) are invariant with time. Here, however, we deal with a weaker form of stationarity referred to as wide-sense stationarity (WSS). With these processes, only the first two moments (mean, variances and covariances) need be invariant with time. Strictly, the idea of a covariance matrix is only relevant for stationary or WSS processes, since expectations only have meaning if the underlying process is stationary. However, we see later that this condition can be relaxed in the case of a slowly-varying non-stationary signal, if the expectation is replaced with a time average over an interval which has shorter duration than the interval over which the signal may be considered stationary.

Here, as in (31), Rx corresponding to a stationary or WSS process x[k] is defined for the case of zero mean as

Rx = E[xi xiT]  =  E [ x1x1   x1x2   · · ·   x1xm ]
                     [ x2x1   x2x2   · · ·   x2xm ]
                     [  ...    ...   . . .    ... ]
                     [ xmx1   xmx2   · · ·   xmxm ],          (38)

where xi = (x1, x2, . . . , xm)T. Taking the expectation over all windows, eq. (38) tells us that the element r(1, 1) of Rx is by definition E(x1²), which is the variance of the first element x1 of all possible vector samples xi of the process. But because of stationarity, r(1, 1) = r(2, 2) = . . . = r(m, m), which are all equal to σ². Thus all main diagonal elements of Rx are equal to the variance of the process. The element r(1, 2) = E(x1x2) is the cross-correlation between the first element of xi and the second element. Taken over all possible windows, we see this quantity is the cross-correlation of the process and itself delayed by one sample. Because of stationarity, r(1, 2) = r(2, 3) = . . . = r(m − 1, m) and hence all elements on the first upper diagonal are equal to the cross-correlation for a time-lag of one sample. Since multiplication is commutative, r(2, 1) = r(1, 2), and therefore all elements on the first lower diagonal are also all equal to this same cross-correlation value. Using similar reasoning, all elements on the jth upper or lower diagonal are all equal to the cross-correlation value of the process for a time lag of j samples. Thus we see that the matrix Rx is


highly structured.

If we compare the process shown in Fig. 3 with that shown in Fig. 4, we see that in the former case the process is relatively slowly varying. Because we have assumed x[k] to be zero mean, adjacent samples of the process in Fig. 3 will have the same sign most of the time, and hence E(xixi+1) will be a positive number, coming close to the value E(xi²). The same can be said for E(xixi+2), except it is not so close to E(xi²). Thus, we see that for the process of Fig. 3, the diagonals decay fairly slowly away from the main diagonal value.

However, for the process shown in Fig. 4, adjacent samples are uncorrelated with each other. This means that adjacent samples are just as likely to have opposite signs as they are to have the same signs. On average, the terms with positive values have the same magnitude as those with negative values. Thus, when the expectations E(xixi+1), E(xixi+2), . . . are taken, the resulting averages approach zero. In this case then, we see the covariance matrix concentrates around the main diagonal, and becomes equal to σ²I. We note that all the eigenvalues of Rx are equal to the value σ². Because of this property, such processes are referred to as “white”, in analogy to white light, whose spectral components are all of equal magnitude.

The sequence {r(1, 1), r(1, 2), . . . , r(1, m)} is equivalent to the autocorrelation function of the process, for lags 0 to m − 1. The autocorrelation function of the process characterizes the random process x[k] in terms of its variance, and how quickly the process varies over time. In fact, it may be shown8 that the Fourier transform of the autocorrelation function is the power spectral density of the process. Further discussion on this aspect of random processes is beyond the scope of this treatment; the interested reader is referred to the reference.

8 A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd Ed.

In practice, following (32), we evaluate an estimate R̂x of Rx from an observation of finite length n of the process x[k], by replacing the ensemble average (expectation) with a finite temporal average over the n available


data points as follows:

R̂x = (1/(n − m + 1)) ∑_{i=1}^{n−m+1} xi xiT.                   (39)

If (39) is used to evaluate R̂, then the process need only be stationary over the observation length. Thus, by using the covariance estimate given by (39), we can track slow changes in the true covariance matrix of the process with time, provided the change in the process is small over the observation interval n. Further properties and discussion of covariance matrices are given in Haykin.9

Let X ∈ Rm×(n−m+1) be a matrix whose ith column is the vector sample xi, i = 1, . . . , n − m + 1 of x[k]. Then, as in (32), R̂x is also given as

R̂x = (1/(n − m + 1)) X XT,                                     (40)

where the outer–product rule for matrix multiplication has been invoked.
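A sketch of this construction (the lengths n and m are assumed here; a white input is used, so the estimate should be close to σ²I, illustrating the discussion above):

    import numpy as np

    rng = np.random.default_rng(6)
    n, m = 10_000, 4

    x = rng.standard_normal(n)     # a zero-mean white process x[k]

    # Each column of X is one length-m window, displaced by one sample (Fig. 3).
    X = np.column_stack([x[i:i + m] for i in range(n - m + 1)])

    R_hat = (X @ X.T) / (n - m + 1)    # eqs. (39)/(40)
    print(np.round(R_hat, 2))          # approximately the identity (sigma^2 = 1)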

Some Properties of Rx:

1. Rx is (Hermitian) symmetric, i.e., rij = r∗ji, where ∗ denotes complex conjugation.

2. If the process x[k] is stationary or wide-sense stationary, then Rx is Toeplitz. This means that all the elements on a given diagonal of the matrix are equal. If you understand this property, then you have a good understanding of the nature of covariance matrices.

3. If Rx is diagonal, then the elements of x are uncorrelated. If the magnitudes of the off-diagonal elements of Rx are significant with respect to those on the main diagonal, the process is said to be highly correlated.

4. R is positive semi-definite. This implies that all the eigenvalues are greater than or equal to zero. We will discuss positive definiteness and positive semi-definiteness later.

9 Haykin, “Adaptive Filter Theory”, Prentice Hall, 3rd. ed.


2.4 Principal Component Analysis (PCA)

The basic idea behind PCA is that we wish to transform each observation xi into a new basis so that as much variance is concentrated into as few components as possible. The motivation for this objective, as we see shortly, is that it provides the means for data compression.

As we have seen in Sect. 2.3.2, projecting x onto the first eigenvector of R̂x maximizes the variance of the resulting projection y. Likewise, projecting onto the second eigenvector, which is in a direction orthogonal to the first, results in a variable with the second largest variance, etc. It follows that our objective can be met by transforming x into the orthonormal eigenvector basis.

Let the eigendecomposition of R̂x be given as

R̂x = V ΛV T.

Then the coefficients θ of the PCA transformation are given as

θi = V Txi. (41)

where i is the observation (i.e., window) index as in Fig. 3. The useful property of this transformation is that, in the case where the time series is correlated, the variance of the coefficients θ concentrates in the first few values, leaving the remaining elements with negligible value most of the time. This means that these smaller coefficients can be discarded, and the entire m-length sequence x can be accurately represented with only a few (i.e., significantly fewer than m) coefficients. Thus PCA analysis is a means to accomplish data compression. Let’s say that r < m coefficients are sufficient to represent the sequence xi with adequate fidelity. We can then form a compressed version x̂i from xi by first forming a truncated version θ̂ from θ in (41). θ̂ is given as

θ̂ = [ θ1, . . . , θr, 0, . . . , 0 ]T.                          (42)


The compressed version x̂i of xi is then reconstructed as

x̂i = V θ̂, (43)

i.e., the smaller eigenvector components are neglected in the representation of x.
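A compact sketch of the chain (41)–(43) (illustrative only; the correlated process used here is a simple moving average of white noise, an assumption not taken from the notes):

    import numpy as np

    rng = np.random.default_rng(7)
    n, m, r = 10_000, 8, 2

    # A slowly varying (correlated) process: a moving average of white noise.
    x = np.convolve(rng.standard_normal(n), np.ones(8) / 8.0, mode="same")

    X = np.column_stack([x[i:i + m] for i in range(n - m + 1)])
    R_hat = (X @ X.T) / X.shape[1]

    # Eigenvectors ordered so the largest eigenvalue comes first (Sect. 2.1.3).
    lam, V = np.linalg.eigh(R_hat)
    order = np.argsort(lam)[::-1]
    V = V[:, order]

    xi = X[:, 0]                                               # one observation window
    theta = V.T @ xi                                           # eq. (41)
    theta_hat = np.concatenate([theta[:r], np.zeros(m - r)])   # eq. (42)
    xi_hat = V @ theta_hat                                     # eq. (43)

    print(np.linalg.norm(xi - xi_hat) / np.linalg.norm(xi))    # relative reconstruction error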

We can offer a heuristic explanation as to why the θ’s concentrate in the first few values when x[k] is a slowly varying process. A more rigorous explanation follows. If x[k] is slowly varying, then the m elements of each xi are positively correlated. As in the m = 2 case, here it is most likely that the variables all have the same sign, and hence the scatterplot concentrates along the direction [1, 1, . . . , 1]. As per the discussion in Sect. 2.3.2, this is the direction of the largest eigenvector, and it has the largest variance. There will be another direction, orthogonal to the first and coincident with the second eigenvector, where some values are positive and some are negative. Due to the positive correlation among all the variables, it is less likely for variables situated along this direction to be far from the mean, and so this direction will have smaller variance than the first; it has the second-largest variance. Generalizing this argument, as the dimensionality increases to m > 2 dimensions, there are additional directions of the scatterplot along which the observations become less and less likely to lie far from the mean, and hence these directions have successively smaller variances. This is the motivation for truncating the elements of θ with higher index values in (42).

2.4.1 Properties of the PCA Representation

Here we introduce several properties of PCA analysis. These properties lead to a more rigorous framework for the understanding of this topic.

Property 7 The coefficients θ are uncorrelated.


To prove this, we evaluate the covariance matrix Rθθ of θ, using the definition (41), as follows:

Rθθ = E(θθT) = E(V TxxTV) = V TRxV = Λ.                       (44)

Since Rθθ is equal to the diagonal eigenvalue matrix Λ of Rx, the PCA coefficients are uncorrelated.

Property 8 The variance of the ith PCA coefficient θi is equal to the ith eigenvalue λi of Rx.

The proof follows directly from Property 7 and the fact that the ith diagonal element of Rθθ is the variance of θi.

From this property, we can infer that the length of the ith semi-axis of the scatterplot ellipse is proportional to the square root of the ith eigenvalue, i.e., to the standard deviation of θi. The next property shows that the eigenvalues indeed become smaller, and therefore so do the variances of the θi, with increasing index i.

Property 9 The variance of a correlated random process x concentrates in the first few PCA coefficients.

Let us denote the covariance matrix associated with the process shown in Fig. 3 as Rc, and that shown in Fig. 4 as Rw. We assume both processes are stationary with equal powers. Let αi be the eigenvalues of Rc and βi be the eigenvalues of Rw. Because Rw is diagonal with equal diagonal elements, all the βi are equal. Our assumptions imply that the main diagonal elements of Rc are equal to the main diagonal elements of Rw, and hence from Property 4, the trace and the eigenvalue sum of each covariance matrix are equal.

To obtain further insight into the behavior of the two sets of eigenvalues, we consider Hadamard’s inequality10, which may be stated as:

10 For a proof, refer to Cover and Thomas, Elements of Information Theory.


Consider a square, symmetric, positive semi-definite matrix A ∈ Rm×m. Then, det A ≤ ∏_{i=1}^{m} aii, with equality if and only if A is diagonal.

From Hadamard’s inequality, det Rc < det Rw, and so also from Property 4, ∏_{i=1}^{n} αi < ∏_{i=1}^{n} βi. Under the constraint ∑ αi = ∑ βi, it follows that α1 > αn; i.e., the eigenvalues of Rc are not equal. (We say the eigenvalues become disparate.) Thus, according to Property 8, the variances in the first few PCA coefficients of a correlated process are larger than those in the later PCA coefficients. In practice, when x[k] is highly correlated, the variances in the later coefficients become negligible.

Property 10 The mean-squared error ε∗r = Ei ||xi − x̂i||₂² in the PCA representation x̂i of xi using r components is given by

ε∗r = Ei ||xi − x̂i||₂² = ∑_{i=r+1}^{m} λi,                     (45)

which corresponds to the sum of the truncated eigenvalues11.

Proof:

ε∗r = Ei ||xi − x̂i||₂² = Ei ||V θ − V θ̂||₂²
    = E ||V (θ − θ̂)||₂²
    = E ||θ − θ̂||₂²
    = ∑_{i=r+1}^{m} E(θi²)
    = ∑_{i=r+1}^{m} λi,                                        (46)

where in the last line we have used Property 8, and the third line follows from the fact that the 2-norm is invariant to multiplication by an orthonormal matrix.

Here we have the important result that since (46) involves only the later eigenvalues, which according to Property 9 are small, the minimum mean-squared error ε∗r in the reconstruction of x can be small. This is especially so

11 We see later in Ch. 4 that Rx is positive definite and so all the eigenvalues are positive.


in highly correlated systems. This is the underlying principle of PCA encoding: a signal can be transformed into a basis consisting of the eigenvectors of Rx. Doing so results in a significant fraction of the coefficients having small variance, and these can therefore be truncated without causing significant error in the representation of the original signal x[k]. The truncated version x̂ is represented using only r < m coefficients, and so data compression is achieved. PCA has application in machine learning, noise reduction, image compression, voice compression, least-squares analysis, and many others.

The compression ratio we achieve with PCA analysis is m/r. It is then clear that if we reduce r, we achieve higher compression. On the other hand, for smaller r, ε∗r in (46) increases. So the choice of the parameter r is a tradeoff between compression and fidelity. The actual value to be used depends on the degree of correlation structure in the signal x[k] (specifically how smoothly it varies), which in turn governs the rate of eigenvalue decay.

A real-life example of compression with PCA is in speech coding. The typical bit rate from a regular land-line telephone is 64 kbits/sec. Using PCA analysis (aka transform coding in this application), we can reduce the bit rate down to the order of 14.4 kbits/sec. Thus the compression ratio is 64/14.4 = 4.44. Compression schemes similar to this are commonly used in modern cell phones, where most of the time any distortion in the reconstructed voice is imperceptible, thus demonstrating the effectiveness of the transform coding technique.

We can offer an additional example illustrating the compression phenomenon. Consider the extreme case where the process becomes so correlated that all elements of its covariance matrix approach the same value. (This will happen if the process x[k] is non-zero and does not vary with time. In this case, we must consider the fact that the mean is non-zero.) Then all columns of the covariance matrix are equal, the rank of Rx is one, and therefore only one eigenvalue is nonzero. Then all the power of the process is concentrated into only the first PCA coefficient, and therefore all the later coefficients are small (in fact, zero) in comparison to the first coefficient, and the process is highly compressible.

In contrast, in the opposite extreme when x[k] is white, the corresponding Rx = σx²I. Thus all the eigenvalues are equal to σx², and none of them are small compared to the others. In this case there can be no compression


Figure 5. Generation of a highly correlated process x[k]: white noise w[k] is passed through a low-pass discrete-time filter with a low normalized cutoff frequency.

(truncation) without inducing significant distortion in (46). Also, since we have m repeated eigenvalues, the eigenvectors are not unique. In fact, any orthonormal set of m vectors is an eigenvector basis for a white signal.

2.4.2 Examples of PCA Analysis

Here we present a simulation example using PCA analysis, to illustrate the effectiveness of this technique.

A process x[k] was simulated by passing a unit-variance, zero–mean white noise sequence w[k] through a 3rd-order digital lowpass Butterworth filter with a relatively low normalized cutoff frequency (0.1), as shown in Fig. 5. A sample of the white noise and its corresponding filtered version are shown in Fig. 6. One can observe that there is no relationship between successive samples of the white sequence, whereas the filtered version is considerably smoother, so successive samples are correlated. Vector samples xi ∈ Rm are extracted from the sequence x[k] in the manner shown in Fig. 3. The filter removes the high-frequency components from the input, so the resulting output process x[k] varies slowly in time and therefore exhibits a significant covariance structure. As a result, we expect to be able to accurately represent the original signal using only a few principal eigenvector components, and hence to achieve significant compression gains.
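The following MATLAB sketch reproduces this simulation in outline. It is a hedged approximation rather than the script used to generate the figures: the number of samples N, the use of non-overlapping windows, and the interpretation of the cutoff (0.1 relative to the Nyquist frequency, as in matlab's butter) are assumptions.

    N = 20000;                    % number of scalar samples (assumed)
    w = randn(N,1);               % zero-mean, unit-variance white noise w[k]
    [b,a] = butter(3,0.1);        % 3rd-order lowpass Butterworth, normalized cutoff 0.1
    x = filter(b,a,w);            % highly correlated output process x[k]
    m = 10;
    X = reshape(x(1:m*floor(N/m)), m, []);  % m-length vector samples x_i (non-overlapping)
    Rhat = (X*X')/size(X,2);      % sample covariance matrix (zero mean assumed)
    lambda = sort(eig(Rhat),'descend')      % eigenvalues, largest first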

The sample covariance matrix R̂x was computed from the sequence x[k] for the value m = 10. Listed below are the 10 eigenvalues of R̂x:


Figure 6. A sample of a white noise sequence (red) and its corresponding filtered version (blue), for the filter shown in Fig. 5. The white noise sequence is a Gaussian random process with µ = 0 and σ² = 1, generated using the matlab command "randn".

Eigenvalues:

0.5468
0.1975
0.1243 × 10^−1
0.5112 × 10^−3
0.2617 × 10^−4
0.1077 × 10^−5
0.6437 × 10^−7
0.3895 × 10^−8
0.2069 × 10^−9
0.5761 × 10^−11

Inspection of the eigenvalues above indicates that a large part of the total variance is contained in the first two eigenvalues. We therefore choose r = 2.


Figure 7. First two eigenvector components as functions of time, for the Butterworth lowpass filtered noise example.

The error ε∗r for r = 2 is evaluated from the above data as 0.0130, which may be compared to the total eigenvalue sum of 0.7573. The normalized error is therefore 0.0130/0.7573 = 0.0171. Because this error is low enough, only the first r = 2 components need be considered significant. In this case, we have a compression gain of 10/2 = 5; i.e., the PCA expansion requires only one fifth of the bits relative to representing the signal directly.
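These numbers follow directly from the listed eigenvalues; a short MATLAB check (using the eigenvalues exactly as listed above) is:

    lambda = [0.5468 0.1975 0.1243e-1 0.5112e-3 0.2617e-4 ...
              0.1077e-5 0.6437e-7 0.3895e-8 0.2069e-9 0.5761e-11];
    r = 2;
    eps_r  = sum(lambda(r+1:end));        % sum of the discarded eigenvalues: approx 0.0130
    total  = sum(lambda);                 % total eigenvalue sum: approx 0.7573
    relErr = eps_r/total                  % normalized error: approx 0.017
    compressionRatio = length(lambda)/r   % m/r = 10/2 = 5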

The data vector x̂i is a function of time over the interval spanning the ith window. Since x̂i = V θ̂i, i.e., x̂i is a linear combination of the columns of V, the elements vjk of the kth column vk must also represent a discrete function of time over the same interval, where the index j may be interpreted as a time variable. As such, Fig. 7 shows plots of the elements vjk, k ∈ [1, 2], for the two principal eigenvectors plotted against the time index j.

In this case, we would expect that any observation xi can be expressed accurately as a linear combination of only the first two eigenvector waveforms shown in Fig. 7, whose coefficients θ̂ are given by (41). In Fig. 8 we show samples of the true observation xi, shown as a waveform in time, compared with the reconstruction x̂i formed from (43) using only the first r = 2 eigenvectors.


Figure 8. Original vector samples of x as functions of time (solid), compared with their reconstruction using only the first two eigenvector components (dotted). Three vector samples are shown.

It is seen that the difference between the true and reconstructed vector samples is small, as expected.
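A hedged sketch of this reconstruction, continuing from the earlier simulation fragment (it assumes X, Rhat and m are still in the workspace, and is not the original plotting script), is:

    [V,D]   = eig(Rhat);
    [~,idx] = sort(diag(D),'descend');
    Vr      = V(:,idx(1:2));         % first r = 2 principal eigenvectors
    Theta   = Vr'*X;                 % PCA coefficients of every vector sample
    Xhat    = Vr*Theta;              % reconstruction using only two components
    plot(1:m, X(:,1:3), '-', 1:m, Xhat(:,1:3), '--')   % compare three samples, cf. Fig. 8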

One of the practical difficulties in using PCA for compression is that the eigenvector set V is usually not available at the reconstruction stage when the observed signal is mildly or severely nonstationary, as is the case with speech or video signals. In this case, the covariance matrix estimate R̂x changes with time, and hence so do the eigenvectors. Providing the eigenvector set for reconstruction is expensive in terms of information storage and so is undesirable. Wavelet functions, which can be regarded as another form of orthonormal basis, can replace the eigenvector basis in many cases. While not optimal, the wavelet transform still displays an ability to concentrate coefficients, and so performs reasonably well in compression applications. The advantage is that the wavelet basis, unlike the eigenvector basis, is constant and so does not vary with time. Wavelet-based transform coding is used, for example, in the JPEG 2000 image compression standard.


Figure 9. Physical description of incident signals onto an array of sensors.

2.5 Example: Array Processing

Here, we present a further example of the concepts we have developed so far. This example is concerned with direction of arrival estimation using arrays of sensors.

Consider an array of M sensors (e.g., antennas) as shown in Fig. 9. Let there be K < M plane waves incident onto the array as shown. Assume the amplitudes of the incident waves do not change during the time taken for a wave to traverse the array. Also assume for the moment that the amplitude of the first incident wave at the first sensor is unity. Then, from the physics shown in Fig. 9, the signal vector received by sampling each element of the array simultaneously, from the first incident wave alone, may be described in vector form as x = [1, e^{jφ}, e^{j2φ}, . . . , e^{j(M−1)φ}]^T, where φ is the electrical phase–shift between adjacent elements of the array due to the first incident wave.12 When there are K incident signals, with corresponding amplitudes ak, k = 1, . . . , K, the effects of the K incident signals add linearly, each weighted by the corresponding amplitude ak, to form the received signal vector x.

12 It may be shown that φ = (2πd/λ) sin θ, so there is a one–to–one relationship between the electrical angle φ and the corresponding physical angle θ. If d ≤ λ/2, then θ can be inferred unambiguously from φ.


The resulting received signal vector, including the noise, can then be written in the form

xn = S an + wn,     n = 1, . . . , N,     (47)

with xn, wn ∈ CM×1, S ∈ CM×K, and an ∈ CK×1, where

wn = M-length noise vector at time n whose elements are independent random variables with zero mean and variance σ², i.e., E(wi²) = σ². The vector wn is assumed uncorrelated with the signal.

S = [s1 . . . sK ]

sk = [1, e^{jφk}, e^{j2φk}, . . . , e^{j(M−1)φk}]^T, where j = √−1. The vectors sk are referred to as steering vectors.

φk, k = 1, . . . , K are the electrical phase–shift angles corresponding to the incident signals. The φk are assumed to be distinct.

an = [a1, . . . , aK]^T is a vector of independent random variables describing the amplitudes of each of the incident signals at time n.

In (47) we obtain N vector samples xn ∈ CM×1, n = 1, . . . , N, by simultaneously sampling all array elements at N distinct points in time. Our objective is to estimate the directions of arrival φk of the plane waves relative to the array by observing only the received signal.
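A minimal MATLAB sketch of the data model (47) is given below. The array size, electrical angles, noise variance, and the use of complex Gaussian amplitudes are illustrative assumptions, not values from the notes:

    M = 8; K = 2; N = 500;                      % sensors, sources, snapshots (assumed)
    phi = [0.4 -0.9];                           % electrical angles phi_k (assumed)
    S = exp(1j*(0:M-1)'*phi);                   % M x K matrix of steering vectors s_k
    sigma2 = 0.01;                              % noise variance (assumed)
    A = (randn(K,N) + 1j*randn(K,N))/sqrt(2);   % amplitudes a_n, one column per snapshot
    W = sqrt(sigma2/2)*(randn(M,N) + 1j*randn(M,N));   % white noise w_n
    X = S*A + W;                                % columns are the snapshots x_n, n = 1..N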

Note K < M. Let us form the covariance matrix R of the received signal x:

R = E(xx^H) = E[(Sa + w)(a^H S^H + w^H)]
  = S E(aa^H) S^H + σ²I
  = S A S^H + σ²I     (48)

where A = E(aa^H). The second line follows because the noise is uncorrelated with the signal, thus forcing the cross–terms to be zero. In the last line


of (48) we have also used the fact that the covariance matrix of the noise contribution (second term) is σ²I. This follows because the noise is assumed white. We refer to the first term of (48) as Ro, which is the contribution to the covariance matrix due only to the signal. The matrix A ∈ CK×K is full rank if the incident signals are not fully correlated. In this case, Ro ∈ CM×M has rank K < M. Therefore Ro has K non-zero eigenvalues and M − K zero eigenvalues.

From the definition of an eigenvector associated with one of the zero eigenvalues, we have

Ro vi = 0,   or   S A S^H vi = 0,     i = K + 1, . . . , M.

Since A and S are assumed full rank, we must have

S^H VN = 0,     (49)

where VN = [vK+1, . . . , vM]. (These eigenvectors are referred to as the noise eigenvectors. More on this later.) It follows that if φo corresponds to one of the true directions of arrival contained in S, then

[1, e^{jφo}, e^{j2φo}, . . . , e^{j(M−1)φo}] VN = 0.     (50)

Up to now, we have considered only the noise–free case. What happens when the noise component σ²I is added to Ro to give Rx in (48)? From Property 3, Lecture 1, we see that if the eigenvalues of Ro are λi, then those of Rx are λi + σ² (this holds exactly only in the asymptotic case, when the noise is white). The eigenvectors remain unchanged by the noise contribution, and (50) still holds when noise is present.
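This eigenvalue-shift property is easy to verify numerically. Continuing from the array sketch above (the workspace is assumed to contain S, A, N, M and sigma2, and the matrices are symmetrized to suppress round-off), every eigenvalue of Ro + σ²I exceeds the corresponding eigenvalue of Ro by σ²:

    Ao = (A*A')/N;                          % estimate of the source covariance
    Ro = S*Ao*S';                           % signal-only covariance, rank K
    Rx = Ro + sigma2*eye(M);                % add the white-noise contribution
    do = sort(real(eig((Ro+Ro')/2)));       % eigenvalues of Ro, ascending
    dx = sort(real(eig((Rx+Rx')/2)));       % eigenvalues of Rx, ascending
    disp(dx - do)                           % each entry is approximately sigma2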

With this background in place we can now discuss the MUSIC13 algorithm for estimating the directions of arrival φ1 . . . φK of plane waves incident onto arrays of sensors. The basic idea follows from (50), where s(φ) ⊥ VN iff φ = φo.

The MUSIC algorithm assumes the quantity K is known. We form an estimate R̂ of R based on a finite number N of observations as follows:

R̂ = (1/N) Σ_{n=1}^{N} xn xn^H,

13 This word is an acronym for MUltiple SIgnal Classification. See R. O. Schmidt, "Multiple emitter location and signal parameter estimation", IEEE Trans. Antennas and Propagation, vol. AP-34, Mar. 1986, pp. 276–280.


Figure 10. MUSIC spectrum P (φ) for the case K = 2 signals.

and then extract the M − K noise eigenvectors VN, which are those associated with the smallest M − K eigenvalues of R̂. Because of the finite N and the presence of noise, (50) only holds approximately for the true φo. Thus, a reasonable estimate of the desired directions of arrival may be obtained by finding values of the variable φ for which the expression on the left of (50) is small rather than exactly zero. Thus, we determine K estimates φ̂ which locally satisfy

φ̂ = arg min_φ || s^H(φ) V̂N ||₂².     (51)

By convention, it is desirable to express (51) as a spectrum–like function, where a peak instead of a null represents a desired signal. Thus, the MUSIC "spectrum" P(φ) is defined as

P(φ) = 1 / ( s(φ)^H V̂N V̂N^H s(φ) ).

It will look something like what is shown in Fig. 10 for the case of K = 2 incident signals. Estimates of the directions of arrival φk are then taken as the peaks of the MUSIC spectrum.
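A hedged MATLAB sketch of the MUSIC steps described above, continuing from the array simulation fragment (the grid size and plotting details are arbitrary choices), is:

    Rhat    = (X*X')/N;                        % sample covariance estimate
    [V,D]   = eig((Rhat+Rhat')/2);
    [~,idx] = sort(real(diag(D)),'descend');
    VN      = V(:,idx(K+1:end));               % noise eigenvectors (smallest M-K eigenvalues)
    phiGrid = linspace(-pi,pi,1000);           % scan over electrical angle phi
    P = zeros(size(phiGrid));
    for i = 1:numel(phiGrid)
        s = exp(1j*(0:M-1)'*phiGrid(i));       % steering vector s(phi)
        P(i) = 1/real(s'*(VN*VN')*s);          % MUSIC spectrum P(phi)
    end
    plot(phiGrid, 10*log10(P))                 % peaks appear near the true phi_k values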

The MUSIC algorithm opens up some insight into the use of the eigendecomposition that will be of use later in the course. Let us define the so–called


signal subspace SS as

SS = span[v1, . . . , vK]     (52)

and the noise subspace SN as

SN = span[vK+1, . . . , vM].     (53)

From (48), all columns of Ro are linear combinations of the columns of S. Therefore

span[Ro] = span[S]. (54)

It follows directly from the eigendecomposition of Ro that

span[Ro] ⊆ SS. (55)

Comparing (54) and (55), we see that the columns of S lie in SS. From (47) we see that any received signal vector x, in the absence of noise, is a linear combination of the columns of S. Thus, any noise–free signal resides completely in SS. This is the origin of the term "signal subspace". Further, any component of the received signal residing in SN must be entirely due to the noise, although noise can also reside in the signal subspace. This is the origin of the term "noise subspace". We note that the signal and noise subspaces are orthogonal complements of each other. The ideas surrounding this example lead to the ability to de–noise a signal in some situations.

2.6 Further Discussion on Signal and Noise Subspaces

Partitioning the observed signal between the signal and noise subspaces has repercussions well beyond the MUSIC algorithm. We see in this section that this partitioning can be used for denoising a signal. If the signal of interest has significant correlation structure, then we can take advantage of the fact that the eigenvalues and eigenvectors of the signal covariance matrix are typically concentrated in a few principal components (i.e., the signal subspace), just as they are when we are interested in compression with the PCA method. The noise is typically spread over a significantly greater number of eigen–components (the noise subspace), so reconstructing the signal using only the signal subspace components has the effect of suppressing a large part of the noise.

We illustrate the idea using an example which simulates the event related potential (ERP). The ERP is a signal closely resembling a Gaussian pulse–like waveform that is the brain's response to a sensory stimulus.


The stimulus is often a short (e.g., 50 msec) auditory tone, and the response is measured using the EEG (electroencephalogram). Since the ERP pulse is very weak and is corrupted by strong background coloured noise from other brain processes, the SNR is typically enhanced by averaging ERP responses over multiple trials of the stimulus, which are repeated at regular intervals. The difficulty with this technique is that the ERP responses are subject to significant timing jitter, where the peak of the waveform varies significantly from one trial to another. Thus, the averaging procedure has the effect of smearing out the averaged waveform, giving rise to distortion and a reduction of the effective SNR of the resulting waveform. However, the following example shows that we can significantly enhance the SNR in the presence of jitter, without distortion, using PCA, which in effect is equivalent to partitioning the received signal into a signal and a noise subspace.

The simulated duration of the ERP interval is m = 100 samples. We generated n = 1000 ERP responses, where the pulse in each trial was subject to random timing jitter and amplitude variation. The data were then collected into a 100 × 1000 matrix X, where each column represents the 100 samples from a distinct ERP trial. The basic prototype Gaussian pulse is shown in Fig. 11. Also, 50 of the 1000 pulses, which are subject to timing jitter, amplitude variation, and additive coloured noise, are shown superimposed in Fig. 12.
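A hedged MATLAB sketch of this data generation is shown below. The jitter, amplitude, pulse-width, and noise parameters are assumptions chosen only to mimic the description; they are not the values used to produce Figs. 11–14:

    m = 100; n = 1000;
    t = (1:m)';
    X = zeros(m,n);
    [bc,ac] = butter(2,0.2);                       % filter used to colour the noise (assumed)
    for i = 1:n
        centre = 50 + 5*randn;                     % random timing jitter about mid-interval
        amp    = 1 + 0.1*randn;                    % random amplitude variation
        pulse  = amp*exp(-(t-centre).^2/(2*8^2));  % Gaussian pulse (width assumed)
        noise  = filter(bc,ac,0.05*randn(m,1));    % additive coloured noise
        X(:,i) = pulse + noise;
    end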


Figure 11. The basic Gaussian pulse, which simulates the ERP waveform.


Figure 12. 50 superimposed, simulated ERP pulses corrupted by timing jitter, amplitude variation and additive coloured noise.


The covariance matrix estimate R̂ = (1/n) XX^T ∈ Rm×m was then calculated. Note that the columns of X in this case contain the variables of interest (i.e., the samples of the observed waveform), as is appropriate here. The eigendecomposition of R̂ was then computed, and the first r eigenvectors were assembled into the matrix Vr ∈ Rm×r, where it was empirically determined that the best value for r is 3. In this case, Vr is an orthonormal basis for the signal subspace.

Denoised waveforms can then be obtained by projecting each observation (column of X) onto the signal subspace. In this vein, we form a matrix Xr whose columns represent denoised versions of those of X. This operation can be conveniently programmed in matlab as Xr = Vr*Vr'*X. This operation is equivalent to a PCA compression process, although in this case the primary objective is to denoise the signal rather than to compress it.
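A sketch of the complete denoising step, continuing from the data-generation fragment above (the workspace is assumed to contain X, m and n), is:

    Rhat    = (X*X')/n;               % covariance estimate (1/n) X X^T
    [V,D]   = eig(Rhat);
    [~,idx] = sort(diag(D),'descend');
    r       = 3;
    Vr      = V(:,idx(1:r));          % orthonormal basis for the signal subspace
    Xr      = Vr*(Vr'*X);             % denoised versions of the columns of X
    plot(1:m, X(:,1), '-.', 1:m, Xr(:,1), '--')   % one noisy column and its denoised version, cf. Fig. 14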

The first r = 3 principal eigenvectors are shown superimposed in Fig. 13, where the elements of each eigenvector are regarded as functions of time over the ERP interval. A typical denoised waveform is shown in Fig. 14, which shows the original ERP waveform subjected to timing jitter (dotted, red), the same waveform corrupted by additive coloured noise (blue, dash–dot), and the corresponding restored, denoised version (black, dashed). It may be seen that the quality of the recovered signal is quite remarkable, in the presence of substantial timing jitter and noise, using only r = 3 eigenvector components. The denoising effect works most effectively when the signal is highly correlated (thus concentrating the signal variance in the first few eigen-coefficients) and the noise components are concentrated elsewhere.

Note that when n is finite, the covariance matrix estimate R̂ is only an approximation to the true covariance matrix R, which applies only as n → ∞. Therefore, in the finite data case, the principal eigenvectors of R̂ are only an approximation to the true signal subspace basis, with the result that there will be some degree of noise leaking into the estimated signal subspace. Thus the denoising process we have described in this section is not exact; however, in most practical cases the level of noise is suppressed considerably.


Figure 13. The 3 principal eigenvectors, shown as functions of time within an ERP interval. Each successive eigenvector has one additional point where the first derivative is zero.


Figure 14. A comparison between the original (jittered) noise–free waveform (dotted, red), the waveform corrupted by coloured noise (blue, dash–dot), and the denoised version (black, dashed).


2.7 Matrix Norms

Now that we have some understanding of eigenvectors and eigenvalues, we can present the matrix norm. The matrix norm is related to the vector norm: it is a function which maps Rm×n into R. A matrix norm must obey the same properties as a vector norm. Since a norm is only strictly defined for a vector quantity, a matrix norm can be defined by mapping a matrix into a vector. This is accomplished by post-multiplying the matrix by a suitable vector. Some useful matrix norms are now presented.

Matrix p-Norms: A matrix p-norm is defined in terms of a vector p-norm. The matrix p-norm of an arbitrary matrix A, denoted ||A||p, is defined as

||A||p = sup_{x ≠ 0} ( ||Ax||p / ||x||p )     (56)

where "sup" means supremum; i.e., the largest value of the argument over all values of x ≠ 0. Since a property of a vector norm is ||cx||p = |c| ||x||p for any scalar c, we can choose c in (56) so that ||x||p = 1. Then, an equivalent statement to (56) is

||A||p = max_{||x||p = 1} ||Ax||p.     (57)

For the specific case p = 2 with A square and symmetric, it follows from (57) and Sect. 2.3.2 that ||A||2 = λ1, the largest eigenvalue of A. More generally, it is shown in the next lecture for an arbitrary matrix A that

||A||2 = σ1 (58)

where σ1 is the largest singular value of A. This quantity results from the singular value decomposition, to be discussed in the next chapter.

Matrix norms for other values of p, for arbitrary A, are given as

||A||1 = max_{1≤j≤n} Σ_{i=1}^{m} |aij|     (maximum column sum)     (59)

and

||A||∞ = max_{1≤i≤m} Σ_{j=1}^{n} |aij|     (maximum row sum).     (60)


Frobenius Norm: The Frobenius norm is the 2-norm of the vector obtained by concatenating all the rows (or columns) of the matrix A:

||A||F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |aij|² )^{1/2}.

2.7.1 Properties of Matrix Norms

1. Consider the matrix A ∈ Rm×n and the vector x ∈ Rn. Then,

||Ax||p ≤ ||A||p ||x||p.     (61)

This property follows by dividing both sides of the above by ||x||p and applying (56).

2. If Q and Z are orthonormal matrices of appropriate size, then

||QAZ||2 = ||A||2     (62)

and

||QAZ||F = ||A||F.     (63)

Thus, we see that the matrix 2–norm and Frobenius norm are invariant to pre– and post–multiplication by an orthonormal matrix.

3. Further,

||A||F² = tr(A^T A),     (64)

where tr(·) denotes the trace of a matrix, which is the sum of its diagonal elements.
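These definitions and properties are easy to check numerically. The MATLAB fragment below uses an arbitrary random matrix and compares the explicit formulas with the built-in norm function; the specific dimensions have no significance:

    A = randn(4,3);
    n1   = max(sum(abs(A),1));        % ||A||_1, maximum column sum, (59)
    ninf = max(sum(abs(A),2));        % ||A||_inf, maximum row sum, (60)
    n2   = max(svd(A));               % ||A||_2 = sigma_1, (58)
    nF   = sqrt(sum(abs(A(:)).^2));   % Frobenius norm
    disp([n1 ninf n2 nF])
    disp([norm(A,1) norm(A,inf) norm(A,2) norm(A,'fro')])    % built-ins agree
    disp(abs(nF^2 - trace(A'*A)))                            % property (64)
    [Q,~] = qr(randn(4)); [Z,~] = qr(randn(3));              % orthonormal Q and Z
    disp([norm(Q*A*Z,2)-n2, norm(Q*A*Z,'fro')-nF])           % properties (62), (63): both approx 0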


2.8 Problems

1. Consider a real skew–symmetric matrix (i.e., one for which AT = −A). Prove its eigenvalues are pure imaginary and the eigenvectors are mutually orthogonal.

2. Consider 2 matrices A and B. Under what conditions is AB = BA?

3. Prove that no other basis yields smaller error in the 2-norm sense than that given by (46). Hint: We will see later that the eigenvalues of a covariance matrix are all non–negative.

4. We are given a matrix A whose eigendecomposition is A = V ΛV T. Find the eigenvalues and eigenvectors of the matrix C = BAB−1 in terms of those of A, where B is any invertible matrix. C is referred to as a similarity transform of A.

5. Consider an arbitrary random process x[k] of duration m. The sequence f[k] of duration n ≤ m operates on x[k] to give an output sequence y[k] according to

y[k] = Σ_i x[k − i] f[i],

where x[k] = 0 for k > m, k < 1.

(a) Show (in detail) that this operation (convolution) can be expressed as a matrix–vector multiplication.

(b) Using the x-data in file assig2Q5 2019.mat on the website, find f[k] of length n = 10 so that ||y||2 is minimized, subject to ||f||2 = 1.

6. On the course website you will find a file assig2Q6 2019.mat, which contains a matrix X ∈ Rm×n of data corresponding to the example of Sect. 2.6. Each column is a time–jittered Gaussian pulse corrupted by coloured noise. Here, m = 100 and n = 1000, as per the example. Using matlab, produce a denoised version of the signal represented by the first column of X.

7. On the course website you will find a .mat file assig2Q7 2019.mat. It contains a matrix X whose columns contain two superimposed Gaussian pulses with additive noise. Using methods discussed in the course, estimate the positions of the peaks of the Gaussian pulses.
