
Intermediate Linear Algebra

Version 2.0

Christopher Griffin

© 2016-2017

Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License


Contents

List of Figures

About This Document

Chapter 1. Vector Space Essentials
  1. Goals of the Chapter
  2. Fields and Vector Spaces
  3. Matrices, Row and Column Vectors
  4. Linear Combinations, Span, Linear Independence
  5. Basis
  6. Dimension

Chapter 2. More on Matrices and Change of Basis
  1. Goals of the Chapter
  2. More Matrix Operations: A Review
  3. Special Matrices
  4. Matrix Inverse
  5. Linear Equations
  6. Elementary Row Operations
  7. Computing a Matrix Inverse with the Gauss-Jordan Procedure
  8. When the Gauss-Jordan Procedure does not yield a Solution to Ax = b
  9. Change of Basis as a System of Linear Equations
  10. Building New Vector Spaces

Chapter 3. Linear Transformations
  1. Goals of the Chapter
  2. Linear Transformations
  3. Properties of Linear Transforms
  4. Image and Kernel
  5. Matrix of a Linear Map
  6. Applications of Linear Transforms
  7. An Application of Linear Algebra to Control Theory

Chapter 4. Determinants, Eigenvalues and Eigenvectors
  1. Goals of the Chapter
  2. Permutations
  3. Determinant
  4. Properties of the Determinant
  5. Eigenvalues and Eigenvectors
  6. Diagonalization and Jordan's Decomposition Theorem

Chapter 5. Orthogonality
  1. Goals of the Chapter
  2. Some Essential Properties of Complex Numbers
  3. Inner Products
  4. Orthogonality and the Gram-Schmidt Procedure
  5. QR Decomposition
  6. Orthogonal Projection and Orthogonal Complements
  7. Orthogonal Complement
  8. Spectral Theorem for Real Symmetric Matrices
  9. Some Results on ATA

Chapter 6. Principal Components Analysis and Singular Value Decomposition
  1. Goals of the Chapter
  2. Some Elementary Statistics with Matrices
  3. Projection and Dimensional Reduction
  4. An Extended Example
  5. Singular Value Decomposition

Chapter 7. Linear Algebra for Graphs and Markov Chains
  1. Goals of the Chapter
  2. Graphs, Multi-Graphs, Simple Graphs
  3. Directed Graphs
  4. Matrix Representations of Graphs
  5. Properties of the Eigenvalues of the Adjacency Matrix
  6. Eigenvector Centrality
  7. Markov Chains and Random Walks
  8. Page Rank
  9. The Graph Laplacian

Chapter 8. Linear Algebra and Systems of Differential Equations
  1. Goals of the Chapter
  2. Systems of Differential Equations
  3. A Solution to the Linear Homogenous Constant Coefficient Differential Equation
  4. Three Examples
  5. Non-Diagonalizable Matrices

Bibliography


List of Figures

1.1 The subspace R2 is shown within the subspace R3.

2.1 (a) Intersection along a line of 3 planes of interest. (b) Illustration that the planes do not intersect in any common line.

2.2 The vectors for the change of basis example are shown. Note that v is expressed in terms of the standard basis in the problem statement.

2.3 The intersection of two sub-spaces in R3 produces a new sub-space of R3.

2.4 The sum of two sub-spaces of R2 that share only 0 in common recreates R2.

3.1 The image and kernel of fA are illustrated in R2.

3.2 Geometric transformations are shown in the figure above.

3.3 A mass moving on a spring is governed by Hooke's law, translated into the language of Newtonian physics as mẍ + kx = 0.

3.4 A mass moving on a spring given a push on a frictionless surface will oscillate indefinitely, following a sinusoid.

5.1 The orthogonal projection of the vector u onto the vector v.

5.2 The common plane shared by two vectors in R3 is illustrated along with the triangle they create.

5.3 The orthogonal projection of the vector u onto the vector v.

5.4 A vector v generates the linear subspace W = span(v). Its orthogonal complement W⊥ is shown when v ∈ R3.

6.1 An extremely simple data set that lies along the line y − 4 = x − 3, in the direction of 〈1, 1〉, containing the point (3, 4).

6.2 The one dimensional nature of the data is clearly illustrated in this plot of the transformed data z.

6.3 A scatter plot of data drawn from a multivariable Gaussian distribution. The distribution density function contour plot is superimposed.

6.4 Computing Z = WTYT creates a new uncorrelated data set that is centered at 0.

6.5 The data is shown projected onto a linear subspace (line). This is the best projection from 2 dimensions to 1 dimension under a certain measure of best.

6.6 A gray scale version of the image found at http://hanna-barbera.wikia.com/wiki/Scooby-Doo_(character)?file=Scoobydoo.jpg. Copyright Hanna-Barbera, used under the fair use clause of the Copyright Act.

6.7 The singular values of the image matrix corresponding to the image in Figure 6.6. Notice the steep decay of the singular values.

6.8 Reconstructed images from 15 and 50 singular values capture a substantial amount of detail for substantially smaller transmission sizes.

7.1 It is easier for explanation to represent a graph by a diagram in which vertices are represented by points (or squares, circles, triangles etc.) and edges are represented by lines connecting vertices.

7.2 A self-loop is an edge in a graph G that contains exactly one vertex. That is, an edge that is a one element subset of the vertex set. Self-loops are illustrated by loops at the vertex in question.

7.3 (a) A directed graph. (b) A directed graph with a self-loop. In a directed graph, edges are directed; that is, they are ordered pairs of elements drawn from the vertex set. The ordering of the pair gives the direction of the edge.

7.4 A walk (a) and a cycle (b) are illustrated.

7.5 A connected graph (a) and a disconnected graph (b).

7.6 The adjacency matrix of a graph with n vertices is an n × n matrix with a 1 at element (i, j) if and only if there is an edge connecting vertex i to vertex j; otherwise element (i, j) is a zero.

7.7 A graph with 4 vertices and 5 edges. Intuitively, vertices 1 and 4 should have the same eigenvector centrality score as vertices 2 and 3.

7.8 A Markov chain is a directed graph to which we assign edge probabilities so that the sum of the probabilities of the out-edges at any vertex is always 1.

7.9 An induced Markov chain is constructed from a graph by replacing every edge with a pair of directed edges (going in opposite directions) and assigning to every edge leaving a vertex a probability equal to one over the out-degree of that vertex.

7.10 A set of triangle graphs.

7.11 A simple social network.

7.12 A graph partition using positive and negative entries of the Fiedler vector.

8.1 The solution to the differential equation can be thought of as a vector of fixed unit rotation about the origin.

8.2 A plot of representative solutions for x(t) and y(t) for the simple homogeneous linear system in Expression 8.25.

8.3 Representative solution curves for Expression 8.39 showing sinusoidal exponential growth of the system.

8.4 Representative solution curves for Expression 8.42 showing exponential decay of the system.


About This Document

This is a set of lecture notes. They are given away freely to anyone who wants to use them. You know what they say about free things, so you might want to get yourself a book. I like Serge Lang's Linear Algebra, which is part of the Springer Undergraduate Texts in Mathematics series. If you don't like Lang's book, I also like Gilbert Strang's Linear Algebra and its Applications. To be fair, I've only used the third edition of that book. The newer edition seems more like a tome, while the third edition was smaller and to the point.

The lecture notes were intended for SM361: Intermediate Linear Algebra, which is a breadth elective in the Mathematics Department at the United States Naval Academy. Since I use these notes while I teach, there may be typographical errors that I noticed in class but did not fix in the notes. If you see a typo, send me an e-mail and I'll add an acknowledgement. There may be many typos; that's why you should have a real textbook. (Because real textbooks never have typos, right?)

The material in these notes is largely based on Lang's excellent undergraduate linear algebra textbook. However, the applications are drawn from multiple sources outside of Lang. There are a few results that are stated but not proved in these notes:

• The formula det(AB) = det(A) det(B),
• The Jordan Normal Form Theorem, and
• The Perron-Frobenius theorem.

Individuals interested in using these notes as the middle part of a three-part Linear Algebra sequence should seriously consider proving these results in an advanced linear algebra course to complete the theoretical treatment begun here.

In order to use these notes successfully, you should have taken a course in matrices (elementary linear algebra). I review a substantial amount of the material you will need, but it's always good to have covered prerequisites before you get to a class. That being said, I hope you enjoy using these notes!


CHAPTER 1

Vector Space Essentials

1. Goals of the Chapter

(1) Review fields and vector spaces.
(2) Provide examples of vector spaces and sub-spaces.
(3) Introduce matrices and matrix/vector notation.
(4) Discuss linear combinations and linear independence.
(5) Define basis. Prove uniqueness of dimension (finite case).

2. Fields and Vector Spaces

Definition 1.1 (Group). A group is a pair (S, ◦) where S is a set and ◦ : S × S → S is a binary operation so that:

(1) The binary operation ◦ is associative; that is, if s1, s2 and s3 are in S, then (s1 ◦ s2) ◦ s3 = s1 ◦ (s2 ◦ s3).
(2) There is a unique identity element e ∈ S so that for all s ∈ S, e ◦ s = s ◦ e = s.
(3) For every element s ∈ S there is an inverse element s−1 ∈ S so that s ◦ s−1 = s−1 ◦ s = e.

If ◦ is commutative, that is, for all s1, s2 ∈ S we have s1 ◦ s2 = s2 ◦ s1, then (S, ◦) is called a commutative group (or abelian group).

Example 1.2. This course is not about group theory. If you're interested in groups in the more abstract sense, it's worth considering taking Abstract Algebra. One of the simplest examples of a group is the set of integers Z under the binary operation of addition.

Definition 1.3 (Sub-Group). Let (S, ◦) be a group. A subgroup of (S, ◦) is a group (T, ◦) so that T ⊆ S. The subgroup (T, ◦) shares the identity of the group (S, ◦).

Example 1.4. Consider the group (Z,+). If 2Z is the set of even integers, then (2Z,+) is a subgroup of (Z,+) because the even integers are closed under addition.

Definition 1.5 (Field). A field (or number field) is a tuple (S,+, ·, 0, 1) where:

(1) (S,+) is a commutative group with unit 0,
(2) (S \ {0}, ·) is a commutative group with unit 1, and
(3) the operation · distributes over the operation +; that is, if a1, a2, and a3 are elements of S, then a1 · (a2 + a3) = a1 · a2 + a1 · a3.

Example 1.6. The archetypal example of a field is the field of real numbers R with addition and multiplication playing the expected roles. Another common field is the field of complex numbers C (numbers of the form a + bi, with i = √−1 the imaginary unit) with their addition and multiplication rules defined as expected.


Exercise 1. Why is Z not a field under ordinary addition and multiplication? Is Q, the set of rational numbers, a field under the usual addition and multiplication operations?

Definition 1.7 (Vector Space). A vector space is a tuple V = (〈F,+, ·, 0, 1〉, V,+, ·) where:

(1) 〈F,+, ·, 0, 1〉 is a field (with its own addition and multiplication operators defined) called the set of scalars,
(2) V is a set called the set of vectors,
(3) + : V × V → V is an addition operator defined on the set V , and
(4) · : F × V → V is a scalar-vector multiplication operator.

Further, the following properties hold for all vectors v1, v2 and v3 in V and scalars s, s1 and s2 in F:

(1) (V,+) is a commutative group (of vectors) with identity element 0.
(2) Multiplication of vectors by a scalar distributes over vector addition; i.e., s(v1 + v2) = sv1 + sv2.
(3) Multiplication of vectors by a scalar distributes over field addition; i.e., (s1 + s2) · v1 = s1v1 + s2v1.
(4) Multiplication of a vector by a scalar respects the field's multiplication; i.e., (s1 · s2) · v1 = s1 · (s2 · v1).
(5) The scalar identity is respected in the multiplication of vectors by a scalar; i.e., 1 · v1 = v1.

Remark 1.8. In general, the set of vectors V is not distinguished from the vector space V. Thus, we will write v ∈ V to mean v is a vector in the vector space V.

Definition 1.9 (Cartesian Product of a Field). Let F be a field. Then:

Fn = F × F × · · · × F   (n copies of F)

is the set of n-tuples of elements of F. If + is the field addition operation, we can define addition on Fn component-wise by:

(a1, . . . , an) + (b1, . . . , bn) = (a1 + b1, . . . , an + bn)

The zero element for addition is then 0 = (0, 0, . . . , 0) ∈ Fn.

Lemma 1.10. The set Fn forms a group under component-wise addition with zero element 0.

Theorem 1.11. Given a field F, Fn is a vector space over F when vector-scalar multiplication is defined so that:

c · (a1, . . . , an) = (ca1, . . . , can)

Exercise 2. Prove Lemma 1.10 and consequently Theorem 1.11.

Remark 1.12. Generally speaking, we will not explicitly call out all the different operations, vector sets and fields unless it is absolutely necessary. When referring to vector spaces over the field F with vectors Fn (n ≥ 1), we will generally just say the vector space Fn to mean the set of tuples with n elements from F over the field of scalars F.


Example 1.13. The simplest (and most familiar) example of a vector space has as its field R with addition and multiplication defined as expected, and as its set of vectors the n-tuples in Rn (n ≥ 1) with component-wise vector addition defined as one would expect.

Remark 1.14. The vector space Rn is sometimes called Euclidean n-space, because all the rules of Euclidean geometry work in this space.

Remark 1.15. When most people think about an archetypal vector space, they do think of Rn. You can also think of Cn, the vector space of n-tuples of complex numbers. This vector space is very useful in Quantum Mechanics, where everything in sight seems to be a complex number.

Example 1.16 (Function Space). Let F be the set of all functions from R to R. That is, if f ∈ F, then f : R → R. Suppose we define (f + g)(x) = f(x) + g(x) for all f, g ∈ F and, if c ∈ R, then (cf)(x) = cf(x) when f ∈ F. The constant function ϑ(x) = 0 is the zero in the group (F,+, ϑ). Then F is a vector space over the field R and this is an example of a function space. Here, the functions are the vectors and the reals are the scalars.

Remark 1.17. Vector spaces can become very abstract (we'll see some examples as we move along). For example, function spaces can be made much more abstract than the previous example. For now, though, it is easiest to remember that vector spaces behave like vectors of real numbers with some appropriate additions and multiplications defined. In general, all you need to define a vector space is a field (the scalars), a group (the vectors), and a multiplication operation (scalar-vector multiplication) that connects the two and that satisfies all the properties listed in Definition 1.7.

Definition 1.18 (Subspace). If V = (〈F,+, ·, 0, 1〉, V,+, ·) is a vector space and U ⊆ V with U = (〈F,+, ·, 0, 1〉, U,+, ·) also a vector space, then U is called a subspace of V. Note that U must be closed under + and ·.

Example 1.19. If we consider R3 as a vector space over the reals (as usual), then it has as a subspace several copies of R2. The easiest is to consider the subset of vectors:

U = {(x, y, 0) : x, y ∈ R}

Clearly U is closed under the addition and scalar multiplication of the original vector space. This is illustrated in Figure 1.1.

Remark 1.20. Proving a set of vectors W is a subspace of a given vector space V requires checking three things: (i) Is 0 in W? (ii) Is W closed under vector addition? (iii) Is W closed under scalar-vector multiplication? All other properties of a vector space follow automatically.

Example 1.21 (Vector Space of Real Polynomials). Recall the function space example from Example 1.16. Confine attention to functions that are polynomials of a single variable with real coefficients, and denote this set by P. This set of functions is closed under addition and scalar multiplication, and the zero function is in P. Thus, it is a subspace of F.

Exercise 3. Show that the set of all polynomials of a single variable with real coefficients and degree at most k is a subspace of P, which we'll denote P [xk].


Figure 1.1. The subspace R2 is shown within the subspace R3.

Remark 1.22. Just as we do not differentiate the vector space Fn from the Cartesian product group Fn, from now on, we will just say that a vector v is an element of a vector space V, rather than the set V.

2.1. A Finite Field.

Remark 1.23. So far we have discussed fields that should be familiar:

(1) (R,+, ·, 0, 1), the field of real numbers, and
(2) (C,+, ·, 0, 1), the field of complex numbers.

Fields do not necessarily have to have an infinite number of elements. The simplest example is GF(2), the Galois Field with 2 elements. We'll leave a proper introduction to Galois Fields for an Algebra class. We can, however, discuss the two-element field.

Definition 1.24. The field GF(2) consists of the following:

(1) the set of two elements {0, 1},
(2) the addition operation +,
(3) the multiplication operation ·,
(4) the additive unit 0, and
(5) the multiplicative unit 1.

The addition and multiplication tables are:

+ | 0 1        · | 0 1
0 | 0 1        0 | 0 0
1 | 1 0        1 | 0 1

Remark 1.25. You should check that the operations are, in fact, commutative and that multiplication distributes over addition.

Remark 1.26. This particular field has an intimate relation to Computer Science through Boolean Logic. Suppose that 0 means false (off) while 1 means true (on). Then addition plays the role of exclusive or. The exclusive or operator works in English as follows: either it's raining or it's sunny (it cannot be both raining and sunny). Thus:

(1) It's neither sunny nor rainy is not true (false), since it must be either sunny or rainy: 0 + 0 = 0.
(2) It's rainy and not sunny is true (acceptable): 1 + 0 = 1.
(3) It's not rainy and sunny is also true (acceptable): 0 + 1 = 1.
(4) It's raining and sunny is not true: 1 + 1 = 0.

By the same token, multiplication plays the role of and. This works in English as follows: Tom is a boy and Tom is tall means Tom must be both a boy and tall. Thus:

(1) Tom is neither a boy nor tall is not true: 0 · 0 = 0.
(2) Tom is a boy who is not tall is not true: 1 · 0 = 0.
(3) Tom is not a boy but is tall is not true: 0 · 1 = 0.
(4) Tom is a boy and tall is true: 1 · 1 = 1.
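As a quick computational aside, the GF(2) operations can be realized with Boolean operators in any programming language. The short Python sketch below (Python is not assumed anywhere in these notes; it is only for illustration) builds both tables using exclusive or for addition and logical and for multiplication, and spot-checks the properties suggested in Remark 1.25.

```python
# GF(2) arithmetic: addition is exclusive or (XOR), multiplication is logical and.

def gf2_add(a: int, b: int) -> int:
    """Addition in GF(2): note 1 + 1 = 0."""
    return a ^ b

def gf2_mul(a: int, b: int) -> int:
    """Multiplication in GF(2)."""
    return a & b

elements = (0, 1)
print({(a, b): gf2_add(a, b) for a in elements for b in elements})  # the + table
print({(a, b): gf2_mul(a, b) for a in elements for b in elements})  # the . table

# Spot-check commutativity and distributivity, as suggested in Remark 1.25.
assert all(gf2_add(a, b) == gf2_add(b, a) for a in elements for b in elements)
assert all(gf2_mul(a, gf2_add(b, c)) == gf2_add(gf2_mul(a, b), gf2_mul(a, c))
           for a in elements for b in elements for c in elements)
```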

Remark 1.27. As it turns out, there is a finite field with q elements for each prime power q, but there is no finite field whose number of elements is not a prime power. That is, there is no finite field with exactly 6 elements.

3. Matrices, Row and Column Vectors

Remark 1.28. We will cover matrices in depth and, if you're using these notes, you've had a course in matrices. This section is just going to establish some notational consistency for the rest of the notes and enable concrete examples. It is not a complete overview of matrices.

Definition 1.29 (Matrix). An m × n matrix is a rectangular array of values (scalars), drawn from a field. If F is the field, we write Fm×n to denote the set of m × n matrices with entries drawn from F.

Example 1.30. Here is an example of a 2 × 3 matrix drawn from R2×3:

A = [ 3    1   7/2 ]
    [ 2   √2    5  ]

Or a 2 × 2 matrix with entries drawn from C2×2:

B = [ 3 + 2i     7    ]
    [   6     3 − 2i  ]

Remark 1.31. We will denote the element at position (i, j) of matrix A as Aij. Thus, in the example above, when:

A = [ 3    1   7/2 ]
    [ 2   √2    5  ]

then A2,1 = 2.

Definition 1.32 (Matrix Addition). If A and B are both in Fm×n, then C = A + B is the matrix sum of A and B in Fm×n and

(1.1) Cij = Aij + Bij for i = 1, . . . ,m and j = 1, . . . , n

Here + is the field operation addition.

Example 1.33.

(1.2) [ 1  2 ]   [ 5  6 ]   [ 1 + 5  2 + 6 ]   [  6   8 ]
      [ 3  4 ] + [ 7  8 ] = [ 3 + 7  4 + 8 ] = [ 10  12 ]


Definition 1.34 (Scalar-Matrix Multiplication). If A is a matrix from Fm×n and c ∈ F, then B = cA = Ac is the scalar-matrix product of c and A in Fm×n and:

(1.3) Bij = cAij for i = 1, . . . ,m and j = 1, . . . , n

Example 1.35. Let:

B = [ 3 + 2i     7    ]
    [   6     3 − 2i  ]

Then we can multiply the scalar i ∈ C by B to obtain:

i [ 3 + 2i     7    ]   [ i(3 + 2i)     7i     ]   [ −2 + 3i     7i    ]
  [   6     3 − 2i  ] = [    6i      i(3 − 2i) ] = [    6i     2 + 3i  ]

Definition 1.36 (Row/Column Vector). A 1 × n matrix is called a row vector, and an m × 1 matrix is called a column vector. For the remainder of these notes, every vector will be thought of as a column vector unless otherwise noted. A column vector x in Rn×1 (or Rn) is: x = 〈x1, . . . , xn〉.

Remark 1.37. It should be clear that any row of matrix A could be considered a row vector in Rn and any column of A could be considered a column vector in Rm. Also, any row/column vector is nothing more sophisticated than a tuple of numbers. You are free to think of these things however you like. Notationally, column vectors are used throughout these notes.

Example 1.38. We can now think of the vector space R2 over the field R as being composed of vectors in R2×1 (column vectors) with the field of real numbers. Let 0 = 〈0, 0〉. Then (R2×1,+,0) is a group and that's sufficient for a set of vectors in a vector space.

Exercise 4. Show that (R2×1,+,0) is a group.

4. Linear Combinations, Span, Linear Independence

Definition 1.39. Suppose V is a vector space over the field F. Let v1, . . . ,vm be vectors in V and let α1, . . . , αm ∈ F be scalars. Then

(1.4) α1v1 + · · · + αmvm

is a linear combination of the vectors v1, . . . ,vm.

Clearly, any linear combination of vectors in V is also a vector in V.

Definition 1.40 (Span). Let V be a vector space and suppose W = {v1, . . . ,vm} is a set of vectors in V; then the span of W is the set:

(1.5) span(W ) = {y ∈ V : y is a linear combination of vectors in W}

Proposition 1.41. Let V be a vector space and suppose W = {v1, . . . ,vm} is a set of vectors in V. Then span(W ) is a subspace of V.

Proof. We must check 3 things:

(1) The zero vector is in span(W ): We know that:

0 = 0 · v1 + · · · + 0 · vm

is a linear combination of the vectors in W and therefore 0 ∈ span(W ). Here 0 ∈ F is the zero-element in the field and 0 is the zero vector in the vector space.

(2) The set of vectors span(W ) is closed under vector addition: Consider two vectors v and w in span(W ). Then we have:

v = α1v1 + · · · + αmvm
w = β1v1 + · · · + βmvm

Then:

v + w = (α1 + β1)v1 + · · · + (αm + βm)vm

is a linear combination of the vectors in W and therefore v + w ∈ span(W ). We observed that this is true because scalar-vector multiplication must obey a distributive property.

(3) The set span(W ) is closed under scalar-vector multiplication: If:

v = α1v1 + · · · + αmvm

and r ∈ F, then:

rv = r (α1v1 + · · · + αmvm) = (rα1)v1 + · · · + (rαm)vm

This is a linear combination of the vectors in W and so it is in span(W ). We observed this was because scalar-vector multiplication respects scalar multiplication. Therefore, span(W ) is a subspace of V . �

Definition 1.42 (Linear Independence). Let v1, . . . ,vm be vectors in V . The vectors v1, . . . ,vm are linearly dependent if there exist α1, . . . , αm ∈ F, not all zero, such that

(1.6) α1v1 + · · · + αmvm = 0

If the set of vectors v1, . . . ,vm is not linearly dependent, then they are linearly independent and Equation 1.6 holds just in case αi = 0 for all i = 1, . . . ,m. Here 0 is the zero-vector in V and 0 is the zero-element in the field.

Exercise 5. Consider the vectors v1 = 〈0, 0〉 and v2 = 〈1, 0〉. Are these vectors linearly independent? Explain why or why not.

Example 1.43. In R3, consider the vectors:

v1 = [ 1 ]    v2 = [ 1 ]    v3 = [ 0 ]
     [ 1 ]         [ 0 ]         [ 1 ]
     [ 0 ]         [ 1 ]         [ 1 ]

We can show these vectors are linearly independent: Suppose there are values α1, α2, α3 ∈ R such that:

α1v1 + α2v2 + α3v3 = 0

Then:

[ α1 ]   [ α2 ]   [ 0  ]   [ α1 + α2 ]   [ 0 ]
[ α1 ] + [ 0  ] + [ α3 ] = [ α1 + α3 ] = [ 0 ]
[ 0  ]   [ α2 ]   [ α3 ]   [ α2 + α3 ]   [ 0 ]

Thus we have the system of linear equations:

α1 + α2 = 0
α1 + α3 = 0
α2 + α3 = 0


From the third equation, we see α3 = −α2. Substituting this into the second equation, we obtain two equations:

α1 + α2 = 0
α1 − α2 = 0

This implies that α1 = α2 and 2α1 = 0, or α1 = α2 = 0. Therefore, α3 = 0 and thus these vectors are linearly independent.
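As a computational aside (Python with NumPy is used purely for illustration; nothing in these notes assumes it), linear independence of a small set of vectors can also be checked by stacking the vectors as the columns of a matrix and computing its rank: the columns are linearly independent exactly when the rank equals the number of columns.

```python
import numpy as np

# Columns are the vectors v1, v2, v3 from Example 1.43.
V = np.column_stack(([1, 1, 0], [1, 0, 1], [0, 1, 1]))

# Independent exactly when the rank equals the number of columns (here 3).
print(np.linalg.matrix_rank(V))  # expected: 3
```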

Remark 1.44. It is worthwhile to note that the zero vector 0 makes any set of vectors a linearly dependent set.

Exercise 6. Prove the remark above.

Example 1.45. Consider the vectors:

v1 = [ 1 ]    v2 = [ 4 ]
     [ 2 ]         [ 5 ]
     [ 3 ]         [ 6 ]

Determining linear independence requires us to decide whether there is a non-trivial solution to α1v1 + α2v2 = 0; this yields the equation:

   [ 1 ]      [ 4 ]   [ 0 ]
α1 [ 2 ] + α2 [ 5 ] = [ 0 ]
   [ 3 ]      [ 6 ]   [ 0 ]

or the system of equations:

α1 + 4α2 = 0
2α1 + 5α2 = 0
3α1 + 6α2 = 0

Thus α1 = −4α2. Substituting this into the second and third equations yields:

−3α2 = 0

−6α2 = 0

Thus α2 = 0 and consequently α1 = 0. Thus, the vectors are linearly independent.

Example 1.46. Consider the vectors:

v1 = [ 1 ]    v2 = [ 3 ]    v3 = [ 5 ]
     [ 2 ]         [ 4 ]         [ 6 ]

As before, we can derive the system of equations:

α1 + 3α2 + 5α3 = 0
2α1 + 4α2 + 6α3 = 0

We have more unknowns than equations, so we suspect there may be many solutions to this system of equations. From the first equation, we see: α1 = −3α2 − 5α3. Consequently we can substitute this into the second equation to obtain:

−6α2 − 10α3 + 4α2 + 6α3 = −2α2 − 4α3 = 0

Thus, α2 = −2α3 and α1 = 6α3 − 5α3 = α3, which we obtain by substituting the expression for α2 into the expression for α1. It appears that α3 can be anything we like. Let's set


α3 = 1. Then α2 = −2 and α1 = 1. We can now confirm that this set of values creates a linear combination of v1, v2 and v3 equal to 0:

1 · [ 1 ]  −  2 · [ 3 ]  +  [ 5 ]  =  [ 1 − 6 + 5 ]  =  [ 0 ]
    [ 2 ]         [ 4 ]     [ 6 ]     [ 2 − 8 + 6 ]     [ 0 ]

Thus, the vectors are not linearly independent and they must be linearly dependent.
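As another illustrative aside in the same NumPy spirit as before, the dependence just found can be confirmed numerically: the combination with coefficients (1, −2, 1) really is the zero vector, and the matrix whose columns are v1, v2, v3 has rank 2 < 3.

```python
import numpy as np

v1, v2, v3 = np.array([1, 2]), np.array([3, 4]), np.array([5, 6])

# The coefficients found in Example 1.46: 1*v1 - 2*v2 + 1*v3 should be 0.
print(1 * v1 - 2 * v2 + 1 * v3)                               # expected: [0 0]

# Rank 2 with 3 columns, so the columns are linearly dependent.
print(np.linalg.matrix_rank(np.column_stack((v1, v2, v3))))   # expected: 2
```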

Exercise 7. Show that the vectors

v1 = [ 1 ]    v2 = [ 4 ]    v3 = [ 7 ]
     [ 2 ]         [ 5 ]         [ 8 ]
     [ 3 ]         [ 6 ]         [ 9 ]

are not linearly independent. [Hint: Following the examples, create a system of equations and show that there is a solution not equal to 0.]

Example 1.47. Consider the vector space of polynomials P. We can show that the set of vectors {3x, x2 − x, 2x2 − x} is linearly dependent. As before, we write:

α1(3x) + α2(x2 − x) + α3(2x2 − x) = ϑ(x) = 0

We can collect terms on x and x2:

(α2 + 2α3)x2 + (3α1 − α2 − α3)x = 0

This leads (again) to a system of linear equations:

3α1 − α2 − α3 = 0
α2 + 2α3 = 0

We compute α2 = −2α3. Substituting into the first equation yields: 3α1 + α3 = 0, or α1 = −(1/3)α3. Set α3 = 1; then α1 = −1/3 and α2 = −2.

Exercise 8. Consider the vector space of polynomials with degree at most 2, P [x2]. Suppose we change the quadratic polynomial a2x2 + a1x + a0 into the vector 〈a0, a1, a2〉. Recast the previous example in terms of these vectors and show that the resulting system of equations is identical to the one derived.

5. Basis

Definition 1.48 (Basis). Let B = {v1, . . . ,vm} be a set of vectors in V . The set B is called a basis of V if B is a linearly independent set of vectors and every vector in V is in the span of B. That is, for any vector w ∈ V we can find scalar values α1, . . . , αm ∈ F such that

(1.7) w = Σ_{i=1}^{m} αivi

Example 1.49. We can show that the vectors:

v1 = [ 1 ]    v2 = [ 1 ]    v3 = [ 0 ]
     [ 1 ]         [ 0 ]         [ 1 ]
     [ 0 ]         [ 1 ]         [ 1 ]

form a basis of R3. We already know that the vectors are linearly independent. To show that R3 is in their span, choose an arbitrary vector in R3: 〈a, b, c〉. Then we hope to find coefficients α1, α2 and α3 so that:

                     [ a ]
α1v1 + α2v2 + α3v3 = [ b ]
                     [ c ]

Expanding this, we must find α1, α2 and α3 so that:

[ α1 ]   [ α2 ]   [ 0  ]   [ a ]
[ α1 ] + [ 0  ] + [ α3 ] = [ b ]
[ 0  ]   [ α2 ]   [ α3 ]   [ c ]

A little effort (in terms of algebra) will show that:

(1.8) α1 = (1/2)(a + b − c)
      α2 = (1/2)(a − b + c)
      α3 = (1/2)(−a + b + c)

Thus the set {v1,v2,v3} is a basis for R3.
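As an illustrative aside (again using NumPy only as a calculator), finding the coordinates 〈α1, α2, α3〉 of a given vector amounts to solving a 3 × 3 linear system whose coefficient matrix has v1, v2, v3 as its columns. The sketch below does this for 〈a, b, c〉 = 〈1, 0, 0〉 and reproduces the values 〈1/2, 1/2, −1/2〉 that reappear in Example 1.53.

```python
import numpy as np

# Columns of V are the basis vectors v1, v2, v3 from Example 1.49.
V = np.column_stack(([1, 1, 0], [1, 0, 1], [0, 1, 1]))

# Coordinates of e1 = <1, 0, 0> with respect to B: solve V @ alpha = e1.
alpha = np.linalg.solve(V, np.array([1.0, 0.0, 0.0]))
print(alpha)  # expected: [ 0.5  0.5 -0.5 ], matching Equation 1.8 with a=1, b=0, c=0
```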

Exercise 9. Why are the vectors

v1 = [ 1 ]    v2 = [ 4 ]    v3 = [ 7 ]
     [ 2 ]         [ 5 ]         [ 8 ]
     [ 3 ]         [ 6 ]         [ 9 ]

not a basis for R3?

Lemma 1.50. Suppose B = {v1, . . . ,vm} is a basis for a vector space V over a field F. Suppose that v ∈ V and:

α1v1 + · · ·+ αmvm = v = β1v1 + · · ·+ βmvm

Then αi = βi for i = 1, . . . ,m.

Proof. Trivially:

(α1v1 + · · ·+ αmvm)− (β1v1 + · · ·+ βmvm) = 0

This can be rewritten:

(α1 − β1)v1 + · · ·+ (αm − βm)vm = 0

The fact that B is a basis implies that αi − βi = 0 for i = 1, . . . ,m. This completes the proof.

Remark 1.51 (Coordinate Form). Lemma 1.50 shows that given a vector space V and a basis B for that vector space, we can assign to any vector v ∈ V a unique set of coordinates 〈α1, . . . , αm〉.


Remark 1.52. You are most familiar with this in Rn, where the vectors are usually identical to their coordinates when we use a standard basis consisting of {e1, . . . , en}, where ei ∈ Rn is:

ei = 〈0, . . . , 0, 1, 0, . . . , 0〉,

with i − 1 zeros before the 1 and n − i zeros after it. Thus, the coordinates of the vector v = (v1, . . . , vn) are exactly 〈v1, . . . , vn〉. It does not always have to work this way, as we show in the next example.

Example 1.53. Consider the basis

v1 = [ 1 ]    v2 = [ 1 ]    v3 = [ 0 ]
     [ 1 ]         [ 0 ]         [ 1 ]
     [ 0 ]         [ 1 ]         [ 1 ]

for R3. Note first that the coordinates of v1 with respect to the basis B = {v1,v2,v3} are 〈1, 0, 0〉. That is, the coordinates of the first basis vector always look like the coordinates of the first standard basis vector. This can be confusing; what this means is that the vectors (the mathematical objects drawn with arrows) are independent of the basis, which simply gives you a way to assign coordinates to them. To continue the example, let's express the standard basis vector e1 ∈ R3 in coordinates with respect to the basis B.

To do this, we must find α1, α2, and α3 so that:

(1.9)    [ 1 ]      [ 1 ]      [ 0 ]   [ 1 ]
      α1 [ 1 ] + α2 [ 0 ] + α3 [ 1 ] = [ 0 ]
         [ 0 ]      [ 1 ]      [ 1 ]   [ 0 ]

Using Equation 1.8, we can substitute a = 1, b = 0 and c = 0 to obtain:

α1 = 1/2,   α2 = 1/2,   α3 = −1/2

Thus, the coordinate representation for the standard basis vector e1 ∈ R3 with respect to the basis B is 〈1/2, 1/2, −1/2〉.

Exercise 10. Find coordinate representations for e2 and e3 in R3 with respect to the basis B.

6. Dimension

Definition 1.54 (Maximal Linearly Independent Subset). Let W = {v1, . . . ,vn} be a set of vectors from a vector space V . A set {v1, . . . ,vm} is a maximal linearly independent subset of W if for any vr with r > m, the set {v1, . . . ,vm,vr} is linearly dependent.

Lemma 1.55. Let W = {v1, . . . ,vn} be a set of vectors from a vector space V with the property that span(W ) = V (i.e., every vector in V can be expressed as a linear combination of the vectors in W ). If B = {v1, . . . ,vm} is a maximal linearly independent subset of W , then B is a basis.


Proof. Choose an arbitrary vector v ∈ V . There is a set of scalars α1, . . . , αn so that:

(1.10) v = Σ_{i=1}^{n} αivi

If αi = 0 for i = m + 1, . . . , n, then we have expressed v as a linear combination of the elements of B. Therefore, assume this is not the case and suppose that αr ≠ 0 for some r > m. The fact that {v1, . . . ,vm,vr} is linearly dependent means that there are scalars β1, . . . , βm, βr, not all 0, so that:

βrvr + Σ_{i=1}^{m} βivi = 0

We know that βr ≠ 0 (otherwise B would not be linearly independent) and therefore we have:

(1.11) vr + Σ_{i=1}^{m} −(βi/βr) vi = 0

because the scalars are drawn from a field F for which there is a multiplicative inverse for non-zero elements (i.e., βr has a multiplicative inverse 1/βr). We can then replace vr in Expression 1.10 with Expression 1.11 for each αr ≠ 0 with r > m. The result is an expression of v as a linear combination of vectors in B. Thus, span(B) = V . This completes the proof. �

Lemma 1.56. Let {v1, . . . ,vm+1} be a linearly dependent set of vectors in V and let W = {v1, . . . ,vm} be a linearly independent set. Further assume that vm+1 ≠ 0. Assume α1, . . . , αm+1 are a set of scalars, not all zero, so that

(1.12) Σ_{i=1}^{m+1} αivi = 0

For any j ∈ {1, . . . ,m} such that αj ≠ 0, if we replace vj in the set W with vm+1, then this new set of vectors is linearly independent.

Proof. We know that αm+1 cannot be zero, since we assumed that W is linearly independent. Since vm+1 ≠ 0, we know there is at least one other αi (i = 1, . . . ,m) not zero. Without loss of generality, assume that α1 ≠ 0 (if not, rearrange the vectors to make this true).

We can solve for vm+1 using this equation to obtain:

(1.13) vm+1 = Σ_{i=1}^{m} −(αi/αm+1) vi

Suppose, without loss of generality, we replace v1 by vm+1 in W . We now proceed by contradiction. Assume this new set is linearly dependent. Then there exist constants β2, . . . , βm, βm+1, not all zero, such that:

(1.14) β2v2 + · · · + βmvm + βm+1vm+1 = 0.


Again, we know that βm+1 ≠ 0 since the set {v2, . . . ,vm} is linearly independent because W is linearly independent. Then using Equation 1.13 we see that:

(1.15) β2v2 + · · · + βmvm + βm+1 ( Σ_{i=1}^{m} −(αi/αm+1) vi ) = 0.

We can rearrange the terms in this sum as:

(1.16) (β2 − βm+1α2/αm+1) v2 + · · · + (βm − βm+1αm/αm+1) vm − (βm+1α1/αm+1) v1 = 0

The fact that α1 ≠ 0 and βm+1 ≠ 0 and αm+1 ≠ 0 means we have found γ1, . . . , γm, not all zero, such that γ1v1 + · · · + γmvm = 0, contradicting our assumption that W was linearly independent. This contradiction completes the proof. �

Corollary 1.57. If B = {v1, . . . ,vm} is a basis of V and vm+1 is another vector such that:

(1.17) vm+1 = Σ_{i=1}^{m} −(αi/αm+1) vi

with the property that α1 ≠ 0, then B′ = {v2, . . . ,vm+1} is also a basis of V.

Proof. The fact that B′ is linearly independent is established in Lemma 1.56, since clearly the set {v1, . . . ,vm+1} is linearly dependent because B is a basis for V .

Choose any vector v ∈ V ; then there are scalars β1, . . . , βm so that:

v = Σ_{i=1}^{m} βivi

From Equation 1.17 we can solve for v1 and write:

v1 = −(αm+1/α1) vm+1 − Σ_{i=2}^{m} (αi/α1) vi

Then:

v = β1 ( −(αm+1/α1) vm+1 − Σ_{i=2}^{m} (αi/α1) vi ) + Σ_{i=2}^{m} βivi = Σ_{i=2}^{m} (βi − β1αi/α1) vi − (β1αm+1/α1) vm+1

Thus we have expressed an arbitrary vector v as a linear combination of the elements of B′; therefore B′ is a basis of V . �

Remark 1.58. Lemma 1.56 and its corollary are sometimes taken together and called the exchange lemma. It says something interesting. If B is a basis of V with m elements and vm+1 is another, non-zero, vector in V , we can swap vm+1 for any vector vj in B as long as, when we express vm+1 as a linear combination of vectors in B, the coefficient of vj is not zero. That is, since B is a basis of V we can express:

vm+1 = Σ_{i=1}^{m} αivi

As long as αj ≠ 0, then we can replace vj with vm+1 and still have a basis of V .


Exercise 11. Consider the basis:

B = { 〈1, 0, 0〉, 〈0, 1, 0〉, 〈0, 0, 1〉 }

for R3. If v = 〈1, 1, 0〉, which elements of B can be replaced by v to obtain a new basis B′?

Theorem 1.59. Suppose that B = {v1, . . . ,vm} is a basis of V and W = {w1, . . . ,wn} is a set of n vectors from V with n > m. Then W is linearly dependent.

Proof. We proceed by induction. The fact that B is a basis means that we can replace some vector in B by w1 to obtain a new basis B(1). Without loss of generality, assume that it is v1; this can be made true by rearranging the order of B, if needed. Now, assume we can do this up to k < m times to obtain the basis B(k) = {w1, . . . ,wk,vk+1, . . . ,vm}. We show we can continue to k + 1.

The fact that we can replace some vector in B(k) by a vector in W to obtain a new basis B(k+1) is clear from Lemma 1.56 and its corollary, but the nature of this exchange is what we must manage. Suppose we choose wk+1. There are now two possibilities:

Case 1: There is no way to replace vi (for i ∈ {k + 1, . . . ,m}) with wk+1 and maintain linear independence. This means that when we express:

wk+1 = Σ_{i=1}^{k} αiwi + Σ_{i=k+1}^{m} αivi

we find that αi = 0 for i = k + 1, . . . ,m. Thus, wk+1 is a linear combination of w1, . . . ,wk and W is linearly dependent. Induction can stop at this point.

Case 2: There is some vi (for i ∈ {k + 1, . . . ,m}) satisfying the assumptions of the exchange lemma. Then B(k+1) = {w1, . . . ,wk,wk+1,vk+2, . . . ,vm}, assuming (as needed) a reordering of the elements vk+1, . . . ,vm.

By induction, we have shown that either W is linearly dependent or {w1, . . . ,wm} forms a basis for V , which means that W must be linearly dependent since (e.g.) wm+1 can be expressed as a linear combination of {w1, . . . ,wm}. This completes the proof. �

Theorem 1.60. If B and B′ are two bases for the vector space V, then |B| = |B′|.

Exercise 12. Prove Theorem 1.60 under the assumption that the sets have finite size.

Definition 1.61 (Dimension). The dimension of a vector space is the cardinality of any of its bases. If V is a vector space, we write this as dim(V).

Example 1.62. The dimension of Fk is k. This can be seen using the standard basis vectors.

Theorem 1.63. Let V be a vector space with base field F and dimension n. If B = {v1, . . . ,vn} is a set of linearly independent vectors, then B is a basis for V.

Proof. By Theorem 1.59, any set of more than n vectors in V is linearly dependent, so B is a maximal linearly independent set; therefore it must constitute a basis. �

Corollary 1.64. Let V be a vector space with dimension n and suppose that W = {v1, . . . ,vm} with m < n is a set of linearly independent vectors. Then there are vectors vm+1, . . . ,vn so that B = {v1, . . . ,vn} forms a basis for V.


Proof. Clearly, W cannot be a basis for V ; thus there is at least one vector vm+1 ∈ V that cannot be expressed as a linear combination of the vectors in W . Thus, the set W ′ = {v1, . . . ,vm+1} is linearly independent. We can repeat this argument to construct B. By the previous theorem, B must be a basis for V . �

Exercise 13. Show that if W is a subspace of a vector space V and dim(W) = dim(V), then W = V .

Exercise 14. Show that the vector space P [xk] consisting of all polynomials of a single variable with real coefficients and degree at most k has dimension k + 1. [Hint: Look at Exercise 8 and apply the same idea here.]

Remark 1.65. Consider the vector space P of all polynomials in a single variable with real coefficients. This space does not have a finite dimension. Instead, it is an infinite dimensional vector space, a notion that can be defined rigorously if required. We will not define it rigorously in these notes.

Exercise 15. Let F be a field and V be a vector space over F. Show that the set {0} is a subspace of V with dimension 0.

Exercise 16. Consider the set of vectors C (the complex numbers). When C is also used as the scalar field, this vector space clearly has dimension 1. Show that when R is used as the scalar field with vectors C, the dimension of the resulting vector space is 2, thus illustrating that the dimension is affected by the choice of the field. [Hint: A basis is just a set of vectors. When C is both the field and the vector space, the "vector" 1 is the only basis element needed because (e.g.) the pure imaginary vector i can be constructed by multiplying the scalar i by the vector 1. Suppose we don't have any imaginary scalars because we're using R as the scalar field. Find exactly two "vectors" in C that can be used to generate all the complex numbers (vectors).]


CHAPTER 2

More on Matrices and Change of Basis

1. Goals of the Chapter

(1) Review matrix operations
(2) Fundamental Theorem of Linear Algebra
(3) Change of Basis Theorem
(4) Direct sum spaces
(5) Product spaces

2. More Matrix Operations: A Review

Remark 2.1. In the last chapter, we introduced matrices and some rudimentary operations. In a sense, this was simply to have some basic notation to work with column vectors and coordinate representations. In this section we introduce (review) additional matrix operations and notations. Most of these should be familiar to the reader.

Remark 2.2 (Some Matrix Notation). The following notation occurs in various sub-fields of mathematics and it is by no means universal. It is, however, convenient.

Let A ∈ Fm×n for some appropriate base field F. The jth column of A can be written as A·j, where the · is interpreted as ranging over every value of i (from 1 to m). Similarly, the ith row of A can be written as Ai·. Note, these are column and row vectors respectively.

Definition 2.3 (Dot Product). Recall that if x,y ∈ Fn are two n-dimensional vectors, then the dot product (scalar product) is:

(2.1) x · y = Σ_{i=1}^{n} xiyi

where xi is the ith component of the vector x.

Definition 2.4 (Matrix Multiplication). If A ∈ Rm×n and B ∈ Rn×p, then C = AB is the matrix product of A and B and

(2.2) Cij = Ai· · B·j

Note, Ai· ∈ R1×n (an n-dimensional vector) and B·j ∈ Rn×1 (another n-dimensional vector), thus making the dot product meaningful.

Example 2.5.

(2.3) [ 1  2 ] [ 5  6 ]   [ 1(5) + 2(7)  1(6) + 2(8) ]   [ 19  22 ]
      [ 3  4 ] [ 7  8 ] = [ 3(5) + 4(7)  3(6) + 4(8) ] = [ 43  50 ]


Exercise 17. Prove that matrix multiplication distributes over addition. That is, if A ∈ Fm×n and B,C ∈ Fn×p, then:

A (B + C) = AB + AC

We will use this fact repeatedly.

Definition 2.6 (Matrix Transpose). If A ∈ Rm×n is an m × n matrix, then the transpose of A, denoted AT, is an n × m matrix defined as:

(2.4) ATij = Aji

Example 2.7.

(2.5) [ 1  2 ]T   [ 1  3 ]
      [ 3  4 ]  = [ 2  4 ]

Remark 2.8. The matrix transpose is a particularly useful operation and makes it easy to transform column vectors into row vectors, which enables multiplication. For example, suppose x is an n × 1 column vector (i.e., x is a vector in Fn) and suppose y is an n × 1 column vector. Then:

(2.6) x · y = xTy

Exercise 18. Let A,B ∈ Rm×n. Use the definitions of matrix addition and transpose to prove that:

(2.7) (A + B)T = AT + BT

[Hint: If C = A + B, then Cij = Aij + Bij, the element in the (i, j) position of matrix C. This element moves to the (j, i) position in the transpose. The (j, i) position of AT + BT is ATji + BTji, but ATji = Aij. Reason from this point.]

Exercise 19. Let A,B ∈ Rm×n. Prove by example that AB ≠ BA; that is, matrix multiplication is not commutative. [Hint: Almost any pair of matrices you pick (that can be multiplied) will not commute.]

Exercise 20. Let A ∈ Fm×n and let B ∈ Rn×p. Use the definitions of matrix multiplication and transpose to prove that:

(2.8) (AB)T = BTAT

[Hint: Use similar reasoning to the hint in Exercise 18. But this time, note that Cij = Ai· · B·j, which moves to the (j, i) position. Now figure out what is in the (j, i) position of BTAT.]
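As a numerical sanity check, and not a substitute for the proofs requested in Exercises 18-20, the transpose identities are easy to test on random matrices with NumPy (a single random instance is, of course, only evidence, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(3, 4))
B = rng.integers(-5, 5, size=(4, 2))

# (AB)^T == B^T A^T  (Exercise 20).
print(np.array_equal((A @ B).T, B.T @ A.T))   # expected: True

C = rng.integers(-5, 5, size=(3, 4))
# (A + C)^T == A^T + C^T  (Exercise 18).
print(np.array_equal((A + C).T, A.T + C.T))   # expected: True
```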

Definition 2.9. Let A and B be two matrices with the same number of rows (so A ∈ Fm×n and B ∈ Fm×p). Then the augmented matrix [A|B] is:

(2.9) [ a11  a12  . . .  a1n | b11  b12  . . .  b1p ]
      [ a21  a22  . . .  a2n | b21  b22  . . .  b2p ]
      [  .    .   . . .   .  |  .    .   . . .   .  ]
      [ am1  am2  . . .  amn | bm1  bm2  . . .  bmp ]

Thus, [A|B] is a matrix in Fm×(n+p).


Example 2.10. Consider the following matrices:

A = [ 1  2 ]    b = [ 7 ]
    [ 3  4 ]        [ 8 ]

Then [A|b] is:

[A|b] = [ 1  2  7 ]
        [ 3  4  8 ]

Exercise 21. By analogy, define the augmented matrix

[ A ]
[ B ]

Note, this is not a fraction. In your definition, identify the appropriate requirements on the relationship between the number of rows and columns that the matrices must have. [Hint: Unlike [A|B], the number of rows doesn't have to be the same, since you're concatenating on the rows, not columns. There should be a relation between the numbers of columns though.]

3. Special Matrices

Definition 2.11 (Identity Matrix). The n × n identity matrix is:

(2.10) In = [ 1  0  . . .  0 ]
            [ 0  1  . . .  0 ]
            [ .  .  . . .  . ]
            [ 0  0  . . .  1 ]

Here 1 is the multiplicative unit in the field F from which the matrix entries are drawn.

Definition 2.12 (Zero Matrix). The n × n zero matrix is the n × n matrix consisting entirely of 0 (the zero in the field).

Exercise 22. Show that (Fn×n,+,0) is a group with 0 the zero matrix.

Exercise 23. Let A ∈ Fn×n. Show that AIn = InA = A. Hence, In is an identity for the matrix multiplication operation on square matrices. [Hint: Do the multiplication out longhand.]

Definition 2.13 (Symmetric Matrix). Let M ∈ Fn×n be a matrix. The matrix M is symmetric if M = MT.

Definition 2.14 (Diagonal Matrix). A diagonal matrix is a (square) matrix with the property that Dij = 0 for i ≠ j and Dii may take any value in the field on which D is defined.

Remark 2.15. Thus, a diagonal matrix has (usually) non-zero entries only on its main diagonal. These matrices will play a critical role in our analysis.
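As a brief aside, the special matrices of this section all have one-line constructors in NumPy (used here purely for experimentation with the definitions):

```python
import numpy as np

I3 = np.eye(3)                    # the 3x3 identity matrix I_3
Z = np.zeros((3, 3))              # the 3x3 zero matrix
D = np.diag([2.0, -1.0, 5.0])     # a diagonal matrix with the given main diagonal

M = np.array([[1, 7], [7, 3]])
print(np.array_equal(M, M.T))     # True: M is symmetric (Definition 2.13)
```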

4. Matrix Inverse

Definition 2.16 (Invertible Matrix). Let A ∈ Fn×n be a square matrix. If there is a matrix A−1 such that

(2.11) AA−1 = A−1A = In

then matrix A is said to be invertible (or nonsingular) and A−1 is called its inverse. If A is not invertible, it is called a singular matrix.


Proposition 2.17. Suppose that A ∈ Fn×n. If there are matrices B,C ∈ Fn×n such that AB = CA = In, then B = C = A−1.

Proof. We can compute:

AB = In =⇒ CAB = CIn =⇒ InB = C =⇒ B = C

Proposition 2.18. Suppose that A ∈ Fn×n and both B ∈ Fn×n and C ∈ Fn×n are inverses of A. Then B = C.

Exercise 24. Prove Proposition 2.18.

Remark 2.19. Propositions 2.17 and 2.18 show that the inverse of a square matrix is unique and there is no difference between a left inverse and a right inverse. It is worth noting this is not necessarily true for general m × n matrices.

Remark 2.20. The set of n × n invertible matrices over R is denoted GL(n,R). It forms a group called the general linear group under matrix multiplication with In the unit. The general linear group over a field F is defined analogously.

Proposition 2.21. If both A and B are invertible in Fn×n, then AB is invertible and (AB)−1 = B−1A−1.

Proof. Compute:

(AB)(B−1A−1) = ABB−1A−1 = AInA−1 = AA−1 = In

Exercise 25. Prove that if A1, . . . ,Am ∈ Fn×n are invertible, then (A1 · · ·Am)−1 = (Am)−1 · · · (A1)−1 for m ≥ 1.

5. Linear Equations

Remark 2.22. Recall that matrices can be used as a shorthand way to represent linear equations. Consider the following system of equations:

(2.12) a11x1 + a12x2 + · · · + a1nxn = b1
       a21x1 + a22x2 + · · · + a2nxn = b2
       ...
       am1x1 + am2x2 + · · · + amnxn = bm

Then we can write this in matrix notation as:

(2.13) Ax = b

where Aij = aij for i = 1, . . . ,m, j = 1, . . . , n and x is a column vector in Fn with entries xj (j = 1, . . . , n) and b is a column vector in Fm with entries bi (i = 1, . . . ,m).

Proposition 2.23. Suppose that A ∈ Fn×n and b ∈ Fn×1. If A is invertible, then the unique solution to the system of equations:

Ax = b

is x = A−1b.


Proof. If x = A−1b, then Ax = AA−1b = b and thus x is a solution to the system of equations. Suppose that y ∈ Fn×1 is a second solution. Then Ax = b = Ay. This implies that x = A−1b = y and thus x = y. �
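As an illustrative aside, Proposition 2.23 is what numerical linear algebra libraries implement in practice. The NumPy sketch below compares the formula x = A−1b with the dedicated solver (which works by elimination, in the spirit of the next section, rather than by forming A−1 explicitly):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([7.0, 8.0])

x_inv = np.linalg.inv(A) @ b         # x = A^{-1} b, as in Proposition 2.23
x_solve = np.linalg.solve(A, b)      # preferred in practice: solves Ax = b directly

print(x_inv, x_solve)                # both give the same solution
print(np.allclose(A @ x_solve, b))   # True: x really satisfies Ax = b
```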

Remark 2.24. The practical problem of solving linear systems has been considered for thousands of years. Only in the last 200 years has a method (called Gauss-Jordan elimination) been codified in the West. This method was known in China at least by the second century CE. In the next section, we discuss this approach and its theoretical ramifications.

Proposition 2.25. Consider the special set of equations Ax = 0 with A ∈ Fm×n. That is, the right-hand-side is zero for every equation. Then any solution x to this system of equations has the property that Ai· · x = 0 for i = 1, . . . ,m; i.e., each row of A is orthogonal to the solution x.

Exercise 26. Prove Proposition 2.25.

6. Elementary Row Operations

Definition 2.26 (Elementary Row Operation). Let A ∈ Fm×n be a matrix. Recall Ai· is the ith row of A. There are three elementary row operations:

(1) (Scalar Multiplication of a Row) Row Ai· is replaced by αAi·, where α ∈ F and α ≠ 0.
(2) (Row Swap) Row Ai· is swapped with Row Aj· for i ≠ j.
(3) (Scalar Multiplication and Addition) Row Aj· is replaced by αAi· + Aj· for α ∈ F and i ≠ j.

Example 2.27. Consider the matrix:

A = [ 1  2 ]
    [ 3  4 ]

defined over the field of real numbers. In an example of scalar multiplication of a row by a constant, we can multiply the second row by 1/3 to obtain:

B = [ 1   2  ]
    [ 1  4/3 ]

As an example of scalar multiplication and addition, we can multiply the second row by (−1) and add the result to the first row to obtain:

C = [ 0  2 − 4/3 ]   [ 0  2/3 ]
    [ 1    4/3   ] = [ 1  4/3 ]

We can then use scalar multiplication and multiply the first row by (3/2) to obtain:

D = [ 0   1  ]
    [ 1  4/3 ]

We can then use scalar multiplication and addition to multiply the first row by (−4/3) and add it to the second row to obtain:

E = [ 0  1 ]
    [ 1  0 ]

Finally, we can swap row 2 and row 1 to obtain:

I2 = [ 1  0 ]
     [ 0  1 ]

Thus, using elementary row operations, we have transformed the matrix A into the matrix I2.
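As an illustrative aside, the row operations of Example 2.27 are easy to replay in NumPy, since rows of an array can be scaled, combined and swapped in place (this sketch is for experimentation only):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])   # the matrix A of Example 2.27

X[1] = (1.0 / 3.0) * X[1]                # scale row 2 by 1/3            -> B
X[0] = -1.0 * X[1] + X[0]                # add (-1) * row 2 to row 1     -> C
X[0] = (3.0 / 2.0) * X[0]                # scale row 1 by 3/2            -> D
X[1] = (-4.0 / 3.0) * X[0] + X[1]        # add (-4/3) * row 1 to row 2   -> E
X[[0, 1]] = X[[1, 0]]                    # swap rows 1 and 2             -> I2

print(X)                                 # expected: the 2x2 identity matrix
```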

Theorem 2.28. Each elementary row operation can be accomplished by a matrix multiplication.

Sketch of Proof. We'll show that scalar multiplication and row addition can be accomplished by a matrix multiplication. In Exercise 27, you'll be asked to complete the proof for the other two elementary row operations.

Let A ∈ Fm×n. Without loss of generality, suppose we wish to multiply row 1 by α ∈ F and add it to row 2, replacing row 2 with the result. Let:

(2.14) E = [ 1  0  0  . . .  0 ]
           [ α  1  0  . . .  0 ]
           [ .  .  .  . . .  0 ]
           [ 0  0  0  . . .  1 ]

This is simply the identity Im with an α in the (2, 1) position instead of 0. Now consider EA. Let A·j = [a1j, a2j, . . . , amj]T be the jth column of A. Then:

(2.15) [ 1  0  0  . . .  0 ] [ a1j ]   [     a1j      ]
       [ α  1  0  . . .  0 ] [ a2j ]   [ α(a1j) + a2j ]
       [ .  .  .  . . .  0 ] [  .  ] = [      .       ]
       [ 0  0  0  . . .  1 ] [ amj ]   [     amj      ]

That is, we have taken the first element of A·j and multiplied it by α and added it to the second element of A·j to obtain the new second element of the product. All other elements of A·j are unchanged. Since we chose an arbitrary column of A, it's clear this will occur in each case. Thus EA will be the new matrix with rows the same as A except for the second row, which will be replaced by the first row of A multiplied by the constant α and added to the second row of A. To multiply the ith row of A and add it to the jth row, we would simply make a matrix E by starting with Im and replacing the ith element of row j with α. �
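As a quick computational aside, the construction in this proof sketch can be checked directly in NumPy: plant α in the (2, 1) position of an identity matrix and confirm that left-multiplication performs the row operation.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
alpha = 10.0

E = np.eye(2)
E[1, 0] = alpha                # identity with alpha in the (2, 1) position

# Left-multiplying by E adds alpha * (row 1) to row 2, leaving row 1 alone.
print(E @ A)                   # expected: [[ 1.  2.], [13. 24.]]
```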

Exercise 27. Complete the proof by showing that scalar multiplication and row swapping can be accomplished by a matrix multiplication. [Hint: Scalar multiplication should be easy, given the proof above. For row swap, try multiplying matrix A from Example 2.27 by:

[ 0  1 ]
[ 1  0 ]

and see what comes out. Can you generalize this idea for arbitrary row swaps?]

Remark 2.29. Matrices of the kind we've just discussed are called elementary matrices. Theorem 2.28 will be important when we study efficient methods for solving linear programming problems. It tells us that any set of elementary row operations can be performed by finding the right matrix. That is, suppose I list 4 elementary row operations to perform on matrix A. These elementary row operations correspond to four matrices E1, . . . ,E4. Thus the transformation of A under these row operations can be written using only matrix multiplication as B = E4 · · ·E1A. This representation is much simpler for a computer to keep track of in algorithms that require the transformation of matrices by elementary row operations.

Definition 2.30 (Row Equivalence). Let A ∈ Fm×n and let B ∈ Fm×n. If there is a sequence of elementary matrices E1, . . . ,Ek so that:

B = Ek · · ·E1A

then A and B are said to be row equivalent.

Proposition 2.31. Every elementary matrix is invertible and its inverse is an elemen-tary matrix.

Sketch of Proof. As before, we’ll only consider a single case. Consider the matrix;

E =

1 0 0 . . . 0α 1 0 . . . 0...

.... . . 0

0 0 0 . . . 1

,

which multiplies row 1 by α and adds it to row 2. Then we can compute the inverse as:

E−1 =

1 0 0 . . . 0−α 1 0 . . . 0

......

. . . 00 0 0 . . . 1

,

Multiplying the two matrices shows they yield the identity. The resulting inverse is anelementary matrix by inspection. �

Remark 2.32. The fact that the elementary matrices are invertible is intuitively clear.An elementary matrices perform an action on a matrix and this action can be readily undone.That’s exactly what the inverse is doing.

Exercise 28. Compute the inverses for the other two kinds of elementary matrices.

Remark 2.33. The process we’ve illustrated in Example 2.27 is an instance of Gauss-Jordan elimination and can be used to find the to solve systems a system linear equations.This process is summarized in Algorithm 1.

Definition 2.34 (Pivoting). In Algorithm 1 when Aii 6= 0, the process performed inSteps 4 and 5 is called pivoting on element (i, i).

Lemma 2.35 (Correctness). Suppose that Algorithm 1 terminates with [In|b], where bis the result of the elementary row operations. Then b = A−1b. Therefore, Algorithm 1terminates with the unique solution to the system of equations Ax = b, if it exists.

Proof. The elementary row operations can be written as the product of elementarymatrices on the left-hand-side of matrix X. Thus, the operation algorithm yields:

(2.16) En · · ·E1X = [En · · ·E1A|En · · ·E1b]

23

Page 34: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Gauss-Jordan EliminationSolving Ax = b

(1) Let A ∈ Fn×n, b ∈ Fn×1. Let X = [A|b].(2) Let i := 1(3) If Xii = 0, then use row-swapping on X to replace row i with a row j (j > i) so that

Xii 6= 0. If this is not possible, then A, there is not a unique solution to Ax = b.(4) Replace Xi· by (1/Xii)Xi·. Element (i, i) of X should now be 1.

(5) For each j 6= i, replace Xj· by−Xji

XiiXi· + Xj·.

(6) Set i := i+ 1.(7) If i > n, then A has been replaced by In and b has been replaced by A−1b in X. If

i ≤ n, then goto Line 3.

Algorithm 1. Gauss-Jordan Elimination for Systems

If En · · ·E1A = In, then by the uniqueness of the inverse, A−1 = En · · ·E1 and b = A−1b.This must be a solution to the original system of equations. �

Remark 2.36. We have now, effectively, proved the fundamental theorem of MatrixAlgebra, which we state, but do not prove in detail (because we’ve already done it).

Theorem 2.37. Suppose that A ∈ Fn×n. The following are equivalent:

(1) A is invertible.(2) The system of equations Ax = b has a unique solution for any b ∈ Fn×1.(3) The matrix A is row-equivalent to In.(4) Both the matrix A and A−1 can be written as the product of elementary matrices.

Exercise 29. Use Exercise 25 and Proposition 25 to explicitly show that if A−1 is theproduct of elementary matrices, then so is A.

Remark 2.38. The previous results can be obtained using elementary column operations,instead of elementary row operations.

Definition 2.39 (Elementary Column Operation). Let A ∈ Fm×n be a matrix. RecallA·j is the jth column of A. There are three elementary column operations :

(1) (Scalar Multiplication of a Column) Column A·j is replaced by αA·j, where α ∈ Fand α 6= 0.

(2) (Column Swap) Column A·j is swapped with Column A·k for j 6= k.(3) (Scalar Multiplication and Addition) Column A·k is replaced by αA·j + A·k for

α ∈ F and j 6= k.

Proposition 2.40. For each elementary column operation, there is an invertible elemen-tary matrix E ∈ Fn×n so that AE has the same effect as performing the column operation.

Exercise 30. Prove Proposition 2.40.

24

Page 35: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

7. Computing a Matrix Inverse with the Gauss-Jordan Procedure

Remark 2.41. The Gauss-Jordan algorithm can be modified easily to compute the in-verse of a matrix A, if it exists. Simply replace [A|b] with X = [A|In]. If the algorithmterminates normally, then X will be replaced with [In|A−1].

Example 2.42. Again consider the matrix A from Example 2.27. We can follow thesteps in Algorithm 1 to compute A−1.

Step 1:

X :=

[1 2 1 03 4 0 1

]

Step 2: i := 1Step 3 and 4 (i = 1): A11 = 1, so no swapping is required. Furthermore, replacing

X1· by (1/1)X1· will not change X.Step 5 (i = 1): We multiply row 1 of X by −3 and add the result to row 2 of X to

obtain:

X :=

[1 2 1 00 −2 −3 1

]

Step 6: i := 1 + 1 = 2 and i = n so we return to Step 3.Steps 3 (i = 2): The new element A22 = −2 6= 0. Therefore, no swapping is required.Step 4 (i = 2): We replace row 2 of X with row 2 of X multiplied by −1/2.

X :=

[1 2 1 00 1 3

2−1

2

]

Step 5 (i = 2): We multiply row 2 of X by −2 and add the result to row 1 of X toobtain:

X :=

[1 0 −2 10 1 3

2−1

2

]

Step 6 (i = 2): i := 2 + 1 = 3. We now have i > n and the algorithm terminates.

Thus using Algorithm 1 we have computed:

A−1 =

[−2 132−1

2

]

Exercise 31. Does the matrix:

A =

1 2 34 5 67 8 9

have an inverse? [Hint: Use Gauss-Jordan elimination to find the answer.]

Exercise 32 (Bonus). Implement Gauss-Jordan elimination in the programming lan-guage of your choice. Illustrate your implementation by using it to solve the previous exercise.[Hint: Implement sub-routines for each matrix operation. You don’t have to write them asmatrix multiplication, though in Matlab, it might speed up your execution. Then use thesesubroutines to implement each step of the Gauss-Jordan elimination.]

25

Page 36: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

7.1. Equations over GF(2).

Remark 2.43. We have mentioned that all matrices are defined over a field F. It istherefore logical to note that linear equations Ax = b can be over an arbitrary field, notjust R or C.

Example 2.44. Consider the system of equations over GF(2):

x1 + x2 + x3 = 1

x1 + x3 = 0

x1 + x2 = 0

This set of equations has a single solution x1 = x2 = x3 = 1. We can show this by computingthe inverse of the coefficient matrix in GF(2). Remember, all arithmetic is done in GF(2).

Step 1:

X :=

1 1 1 1 0 01 1 0 0 1 01 0 1 0 0 1

Step 2: Add the first row to the second row:

X :=

1 1 1 1 0 00 0 1 1 1 01 0 1 0 0 1

Step 3: Add the first row to the third row:

X :=

1 1 1 1 0 00 0 1 1 1 00 1 0 1 0 1

Step 4: Swap Row 2 and 3

X :=

1 1 1 1 0 00 1 0 1 0 10 0 1 1 1 0

Step 5: Add Row 2 and Row 3 to Row 1

X :=

1 0 0 1 1 10 1 0 1 0 10 0 1 1 1 0

Thus the coefficient matrix A has an inverse and multiplying:

x = A−1b =

1 1 11 0 11 1 0

100

=

111

as expected. It is worth noting that if we remove the third equation to obtain the system oftwo equations:

x1 + x2 + x3 = 1

x1 + x3 = 0

26

Page 37: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

We have at least two solutions: x1 = x2 = x3 = 1 and x1 = x3 = 0 and x2 = 1. We discusscases like this in the next section.

8. When the Gauss-Jordan Procedure does not yield a Solution to Ax = b

Remark 2.45. If Algorithm 1 does not terminate with X := [In|A−1b], then supposethe algorithm terminates with X := [A′|b′]. There is at least one row in A′ with all zeros.That is A′ has the form:

(2.17) A′ =

1 0 . . . 00 1 . . . 0...

. . ....

0 0 . . . 0...

. . ....

0 0 . . . 0

In this case, there are two possibilities:

(1) For every zero row in A′, the corresponding element in b′ is 0. In this case, thereare an infinite number of alternative solutions to the system of equations Ax = b.

(2) There is at least one zero row in A′ whose corresponding element in b′ is not zero.In this case, there are no solutions to the system of equations Ax = b.

Example 2.46. Consider the system of equations in R:

x1 + 2x2 + 3x3 = 7

4x1 + 5x2 + 6x3 = 8

7x1 + 8x2 + 9x3 = 9

This yields matrix:

A =

1 2 34 5 67 8 9

and right hand side vector b = 〈7, 8, 9〉. Applying Gauss-Jordan elimination in this caseyields:

(2.18) X :=

1 0 −1 −193

0 1 2 203

0 0 0 0

Since the third row is all zeros, there are an infinite number of solutions. An easy way tosolve for this set of equations is to let x3 = t, where t may take on any value in R. Variablex3 is called a free variable. Then, row 2 of Expression 2.18 tells us that:

(2.19) x2 + 2x3 =20

3=⇒ x2 + 2t =

20

3=⇒ x2 =

20

3− 2t

We then solve for x1 in terms of t. From row 1 of Expression 2.18 we have:

(2.20) x1 − x3 = −19

3=⇒ x1 − t = −19

3=⇒ x1 = t− 19

327

Page 38: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Thus every vector in the set:

(2.21) X =

{(t− 19

3,20

3− 2t, t

): t ∈ R

}

is a solution to Ax = b. This is illustrated in Figure 2.1(a).

Intersection

(a) (b)

Figure 2.1. (a) Intersection along a line of 3 planes of interest. (b) Illustrationthat the planes do not intersect in any common line.

Conversely, suppose we have the problem:

x1 + 2x2 + 3x3 = 7

4x1 + 5x2 + 6x3 = 8

7x1 + 8x2 + 9x3 = 10

The new right hand side vector is b = [7, 8, 20]T . Applying Gauss-Jordan elimination in thiscase yields:

(2.22) X :=

1 0 −1 00 1 2 00 0 0 1

Since row 3 of X has a non-zero element in the b′ column, we know this problem has nosolution, since there is no way that we can find values for x1, x2 and x3 satisfying:

(2.23) 0x1 + 0x2 + 0x3 = 1

This is illustrated in Figure 2.1(b).

Remark 2.47. In general, the number of free variables will be equal to the number ofrows of the augmented matrix that are all zero. The following theorem now follows directlyfrom analysis of the Gauss-Jordan procedure.

28

Page 39: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Theorem 2.48. Let A ∈ Fn×n and b ∈ Fn×1. Then exactly one of the following hold:

(1) The system of linear equations Ax = b has exactly one solution given by x = A−1b.(2) The system of linear equations Ax = b has more than one solution and the number

of free variables defining those solutions is equal to the number of zero rows in theaugmented matrix after Gauss-Jordan elimination.

(3) The system of linear equations Ax = b has no solution.

Remark 2.49. We will return to these systems of linear equations in the next chapter,but characterize them in a more general way in the language of vector spaces.

Exercise 33. Solve the problem

x1 + 2x2 = 7

3x1 + 4x2 = 8

using Gauss-Jordan elimination.

9. Change of Basis as a System of Linear Equations

Remark 2.50. It may seem a little murky how all this matrix arithmetic is going to playinto our rather theoretical discussion of bases from the previous section. Consider Remark1.51 and Example 1.53 in which we expressed the standard basis vector e1 ∈ R3 in terms ofa new basis B = {〈1, 1, 0〉, 〈1, 0, 1〉, 〈0, 1, 1〉}. We observed that the new coordinates of thestandard basis vector e1 = 〈1, 0, 0〉 were

⟨12, 12,−1

2

⟩.

To make that computation, we solved a system of linear equations (Equation 1.9), whichcan be written in matrix form as:

1 1 01 0 10 1 1

α1

α2

α3

=

100

Notice we had:

v1 =

110

, v2 =

101

, v3 =

011

for the basis vectors and these have become the columns of the matrix. Thus, we can writethis more compactly as:

[v1 v2 v3

]α = e1

Remark 2.51. We can now tackle this problem in more generality. Let V be an n-dimensional vector space over a field F. Let B = {v1, . . . ,vn} be a basis for V and letB′ = {w1, . . . ,wn} be a second basis for V . Suppose we are given a vector v ∈ V and weknow the coordinates for v in basis B are 〈β1, . . . , βn〉. That is:

v =n∑

j=1

βjvj

29

Page 40: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

We will write:

[v]B =

β1...βn

to mean that 〈β1, . . . , βn〉 are the coordinates of v expressed in basis B.We seek to compute coordinates for v with respect to B′; i.e., we want to find [v]B′ . One

way to do this, is simply to solve explicitly for the necessary coordinates for v using B′.However, this is not a general solution. We’d have to repeat this process each time we wantto convert from one coordinate system (basis) to another. Instead, suppose that we executethis procedure for just the basis elements in B (just as we did in Example 1.53). Then foreach i ∈ {1, . . . , n} we compute:

n∑

i=1

αijwi = vj

That is, the coordinates of basis vector vj ∈ B written in terms of the basis B′ are 〈α1j, . . . , αnj〉.WARNING: Here is where it gets a little complicated with notation! For each basisvector vj, we’ve generated a column vector 〈α1j, . . . , αnj〉 because we’re solving the linearequations:

[w1 w2 · · · vn

]

α1j

α2j...αnj

= vj

We can now expression the original vector v in terms of the basis B′ by noting that:

v =n∑

j=1

βjvj =n∑

j=1

βj

(n∑

i=1

αijwi

)

For a fixed i, the coefficient of wi is∑

j βjαij. Thus, the vector of coordinates for v in termsof B′ are:

[v]B′ =

∑nj=1 βjα1j∑nj=1 βjα2j

...∑nj=1 βjαnj

There is a simpler way to write this (using matrices)! Let:

(2.24) ABB′ =

α11 α12 · · · α1n

......

. . ....

αn1 αn2 · · · αnn

=

[α1 · · · αn

],

where αj = 〈α1j, . . . , αnj〉 is the column vector of the coordinates of vj expressed in basisB′. We can now prove the change of basis theorem.

30

Page 41: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Figure 2.2. The vectors for the change of basis example are shown. Note that vis expressed in terms of the standard basis in the problem statement.

Theorem 2.52. Let B = {v1, . . . ,vn} and B′ = {w1, . . . ,wn} be two bases for an n-dimensional vector space V over a field F . Suppose that:

n∑

i=1

αijwi = vj

for j = 1, . . . , n and let ABB′ be defined as in Equation 2.24. Then if [v]B = 〈β1, . . . , βn〉 = β,then:

[v]B′ = ABB′β

Proof. The proof is by computation:α11 α12 · · · α1n

......

. . ....

αn1 αn2 · · · αnn

β1...βn

=

∑nj=1 βjα1j

...∑nj=1 βjαnj

= [v]B′ ,

as required. �

Corollary 2.53. The matrix ABB′ is invertible and furthermore AB′B = A−1BB′

Exercise 34. Sketch a proof of Corollary 2.53. [Hint: Remember that ABB′ [wi]B =[wi]B′ . Therefore, [wi]B = A−1BB′ [wi]B′ . Assert this is sufficient based on the argument in thepreceding remark.]

Example 2.54. If this is confusing, don’t worry. Change of basis is very confusingbecause it’s notationally intense. Let’s try a simple example in R2. Let B = {〈1, 1〉, 〈1,−1〉}and let B′ = {〈−1, 0〉, 〈0,−1〉}. We are intentionally not using the standard basis here.Assume we have the vector 〈2, 0〉, expressed in the standard basis. This is shown in Figure2.2. Assume we are transforming from B to B′. We must express v in terms of B first. It’seasy to check that:

1 ·[11

]+ 1 ·

[1−1

]=

[20

]

Thus:

[v]B =

[11

],

31

Page 42: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

with color added for emphasis. Now, to express 〈1, 1〉 in B′ we are seeking 〈α1,1, α2,1〉 satis-fying:

[−1 00 −1

] [α1,1

α2,1

]=

[11

]

The solution is (clearly) α1,1 = α2,1 = −1. Thus, [〈1, 1〉]B′ = 〈−1,−1〉. For the second basisvector in B, we must compute:

[−1 00 −1

] [α1,2

α2,2

]=

[1−1

]

The solution is (clearly) α1,2 = −1, α2,2 = 1. Thus, [〈1,−1〉]B′ = 〈−1, 1〉. Notice each time,we are forming a matrix whose columns are the basis vectors of B′ and the right hand sideis the basis vector from B in question.

We can now form ABB′ using α1,1, α2,1, α1,2, and α2,2:

ABB′ =

[−1 −1−1 1

]

We can use ABB′ to compute [v]B′ using [v]B:

[v]B′ = ABB′ [v]B =

[−1 −1−1 1

] [11

]=

[−20

]

Fortunately, this is exactly what we’d expect. To express the vector 〈2, 0〉 in terms of B′we’d have:

−2 ·[−10

]+ 0 ·

[0−1

]=

[20

].

10. Building New Vector Spaces1

Remark 2.55. The proof of the next proposition is left as an exercise in checking thatthe requirements of a vector space are met.

Proposition 2.56. Let V be a vector space over a field F and suppose that W and Uare both subspaces of V. Then W ∩ U = {v ∈ V : v ∈ W and v ∈ W} is a subspace of V.

Exercise 35. Prove Proposition 2.56.

Remark 2.57. It may seem that Proposition 2.56 is non sequitur after our discussions,but it is consistent with our work on linear equations. Any linear equation:

j

aijxj = 0

defines a (hyper)plane in Fn. Here, aij, bj ∈ F. That is:

Hi =

{x ∈ Fn :

j

aijxj = 0

}

1This material can be delayed until discussion of Orthognality.

32

Page 43: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

This is a subset of vectors in Fn and it can be shown that it is in fact a subspace of Fn overthe field Fn. A system of linear equations Ax = b simply yields the set of vectors in theintersection of the subspaces. That is, any solution to Ax = 0 is in the space:

H =⋂

i

Hi

Thus by Proposition 2.56 it must be a subspace of Fn.

Figure 2.3. The intersection of two sub-spaces in R3 produces a new sub-space of R3.

Exercise 36. Suppose

Hi =

{x ∈ Fn :

j

aijxj = 0

}

Show that Hi is a subspace of Fn.

Remark 2.58. There are other ways to build new subspaces of a vector space fromexisting subspaces than simple intersection.

Definition 2.59 (Sum). Let V be a vector space over a field F. Suppose W1 and W2

are two subspaces of V . The sum of W1 and W2, written W1 +W2 is the set of all vectorsin V with form v1 + v2 for v1 ∈ W1 and v2 ∈ W2.

Lemma 2.60. The set of vectors W1 +W2 from Definition 2.59 is a subspace of V.

Proof. The fact that 0 ∈ W1,W2 and 0 = 0 + 0 implies that 0 ∈ W1 + W2. Ifv1,v2 ∈ W1 and w1,w2 ∈ W2, then:

(v1 + w1) + (v2 + w2) = (v1 + v2) + (w1 + w2) ∈ W1 +W2

thus showing W1 +W2 is closed under vector addition. Finally, choose any a ∈ F. We knowthat if v ∈ W1 and w ∈ W2, then av ∈ W1 and aw ∈ W2. Thus a(v + w) ∈ W1 +W2. ThusW1 +W2 is a subspace of V . �

Definition 2.61. A vector space V is a direct sum of subspaces W1 and W2, writtenW1 ⊕W2 = V if for each v ∈ V there are two unique elements u ∈ W1 and w ∈ W2 suchthat v = u + w.

33

Page 44: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

(a) (b)

Figure 2.4. The sum of two sub-spaces of R2 that share only 0 in common recreate R2.

Theorem 2.62. Let V be a vector space with two subspaces W1 and W2 such that V =W1 +W2. If W1 ∩W2 = {0}, then W1 ⊕W2 = V.

Proof. Choose any vector v ∈ V . Since V = W1 +W2 there are two elements u ∈ W1

and w ∈ W2 so that v = u + w. Now suppose there are two other elements u′W1 andw′ ∈ W2 so that v = u′ + w′. Then:

(u + w)− (u′ + w′) = (u− u′) + (w −w′) = 0.

Thus:

u− u′ = w′ −w

But, u − u′ ∈ W1 and w′ − w ∈ W2. The fact that u − u′ = w′ − w holds implies thatw′ − w ∈ W1 and u − u′ ∈ W2. The only element these subspaces share in common is 0,the: u− u′ = 0 = w′ −w. Therefore, u = u′ and w = w′. This completes the proof. �

Example 2.63. Theorem 2.62 is illustrated in Figure 2.4.

Theorem 2.64. Let V be a vector space with a subspace W1. Then there is a subspaceW2 such that V =W1 ⊕W2.

Proof. Assume dim(V) = n. If W1 = V , let W2 = {0}. Otherwise, et B1 ={v1, . . . ,vm} be any basis for W1. By the same reasoning as in Corollary 1.64, we canbuild vectors vm+1, . . . ,vn not in B1 so that B = {v1, . . . ,vn} form a basis for V . LetW2 = span({vm+1, . . . ,vn}). At once we see that W1 ∩ W2 = {0}. The result follows atonce from Theorem 2.62. �

Theorem 2.65. Suppose that V is a vector space with subspaces W1 and W2 and V =W1 ⊕W2. Then dim(V) = dim(W1) + dim(W2).

34

Page 45: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proof. Let B1 = {v1, . . . ,vm} be a basis for W1 and B2 = {w1, . . . ,wp} be a basis forW2. Every vector in v ∈ V can be uniquely expressed as:

v =m∑

i=1

αivi +

p∑

j=1

βjwj

Thus, {v1, . . . ,vm,w1, . . . ,wp} must form a basis for V . The fact that dim(V) = dim(W1)+dim(W2) follows at once. �

Proposition 2.66. Let V1 and V2 be (any) two vector spaces over a common field F.Let V = V1 × V2 = {(v1,v2) : v1 ∈ V1 and v2 ∈ V2}. Suppose 0i ∈ Vi is the zero vector inVi, (i = 1, 2). Let 0 = (01,02) ∈ V. If we define

(2.25) (v1,v2) + (u1,u2) = (v1 + u1,v2 + u2)

and for any a ∈ F we have: a(v1,v2) = (av1, av2), then V is a vector space over F. �

Corollary 2.67. If V = V1 × V2, then dim(V) = dim(V1) + dim(V2).

Exercise 37. Verify that V = V1 × V2 is a vector space over F. Prove Corollary 2.67.

Definition 2.68. The vector space V = V1×V2 is called the product space of V1 and V2.

Example 2.69. We’ve already seen a number of examples. Clearly, C2 is the productspace of C and C over the field C. However, you could easily have the product space ofC× R over the field R.

Remark 2.70. Product spaces and direct sums generalize to any number of vector spacesor vector subspaces. In general we’d write:

V =n∏

i=1

Vi = V1 × · · · × Vn.

Similarly we’d have:

V =⊕

i = 1nWi =W1 ⊕ · · · ⊕Wn.

to mean that every element of V is a unique sum of elements of the subspaces W1, . . . ,Wn.

35

Page 46: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition
Page 47: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

CHAPTER 3

Linear Transformations

1. Goals of the Chapter

(1) Introduce Linear Maps (or Linear Transforms)(2) Discuss Image and Kernel(3) Prove the dimension theorem(4) Matrix of a Linear Transform(5) Applications of Linear Transformations

2. Linear Transformations

Definition 3.1 (Linear Transformation). Let V and W be two vector spaces over thesame base field F. A linear transformation is a function f : V → W such that if v1 and v2

are vectors in V and α is a scalar in F, then:

(3.1) f(αv1 + v2) = αf(v1) + f(v2)

Remark 3.2. A linear transformation is sometimes called a linear map.

Exercise 38. Show that Equation 3.1 can be written as two equations:

f(v1 + v2) = f(v1) + f(v2)

f(αv1) = αf(v1)

Exercise 39. Show that of f : V → W is a linear transformation, then f(0) = 0 ∈ W .

Example 3.3. Let a be any real number and consider the vector space R (over itself).Then f(v) = av is a linear transformation.

Example 3.4. Generalizing from the previous example, let F be an arbitrary field andconsider the space Fn. If A ∈ Fm×n, then the function fA(v) = Av for v ∈ Fn (written asa column vector) is a linear transformation from Fn to Fm. Put more simply, matrix multi-plication by an m× n matrix transforms n-dimensional (column) vectors to m dimensional(column) vectors.

To see this, note that if v1,v2 ∈ Fn×1 and α ∈ F, then:

fA(αv1 + v2) = A (αv1 + v2) = αAv1 + Av2 = αfA(v1) + fA(v2),

by simple matrix multiplication.

Remark 3.5 (Notation). Throughout the remainder of these notes, if A ∈ Fm×n, we willuse fA to denote the linear transform fA(v) = Av for some v ∈ Fn.

37

Page 48: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 3.6. Consider the vector space P [x2] of polynomials with real coefficients anddegree at most 2 over the field of real numbers. We can show that differentiation is a lineartransformation from P [x2] to P [x]. Let v be a vector in P [x2]. Then we know:

v = ax2 + bx+ c

for some appropriately chosen a, b, c ∈ R. The derivative transformation is defined as:

(3.2) D(v) = 2ax+ b

and we see at once that D(v) ∈ P [x]. To see that it’s a linear map note that if v1 =a1x

2 + b1x+ c and v2 = a2x2 + b2x+ c2 are both polynomials in P [x2] and α ∈ R, then:

αv1 + v2 = (αa1 + a2)x2 + (αb1 + b2)x+ (αc1 + c2)

after factorization. Then:

D(αv1+v2) = 2(αa1+a2)x+(αb1+b2) = α(2a1x+b1)+(2a2x+b2) = αD(v1)+D(v2).

Thus, polynomial differentiation is a linear transformation.

Example 3.7. Continuing the previous example, differentiation can even be accom-plished as a matrix multiplication. Recall the standard basis for P [x2] is {x2, x, 1}, so thata coordinate vector 〈a, b, c〉 is equivalent to the polynomial ax2 + bx+ c. Let:

D =

[2 0 00 1 0

]

Then writing an arbitrary polynomial (vector) in coordinate form: v = 〈a, b, c〉 we see:

Dv =

[2 0 00 1 0

]abc

=

[2ab

]

This new vector is equivalent to the polynomial 2ax + b in P [x] when the standard basis{x, 1} is used for that space. Thus, polynomial differentiation can be expressed as matrixmultiplication.

Exercise 40. Find the matrix that computes the derivative mapping from P [xn] toP [xn−1] where n ≥ 1.

Definition 3.8 (One-to-One or Injective). Let V and W be two vector spaces over thesame field F. Let f : V → W be a linear transformation from V to W . The function f iscalled one-to-one (or injective) if for any vectors v1 and v2 in V : if f(v1) = f(v2), thenv1 = v2.

Definition 3.9 (Onto or Surjective). Let V and W be two vector spaces over the samefield F. Let f : V → W be a linear transformation from V to W . The function f is calledonto (or surjective) if for any vectors w ∈ W there is a vector v ∈ V such that f(v) = w.

Example 3.10. The derivative map from P [x2] to P [x] is surjective but not injective.To see this, note that the polynomials ax2 + bx+ c1 and ax2 + bx+ c2 both map to 2ax+ bin P [x] under D, even when c1 6= c2. Thus, D is not injective.

On the hand, given any ax+b in P [x], the quadratic polynomial 12ax2+bx maps to ax+b

under D. Thus D is surjective.

38

Page 49: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Exercise 41. Show that if A ∈ Fn×n is invertible, then the linear mapping fA : Fn → Fndefined by f(v) = Av is injective and surjective.

Definition 3.11. A linear map that is both injective and surjective is bijective or iscalled a bijection.

Definition 3.12 (Inverse). Suppose f : V → W is a bijection from vector space V tovector space W , both defined over common base field F. The inverse map, f−1 :W → V isdefined as:

(3.3) f−1(w) = v ⇐⇒ f(v) = w

The fact that f is a bijection means that for each w in W there is a unique v ∈ V so thatf(v) = w, thus f−1(w) is uniquely defined for each w in W .

Proposition 3.13. If f is a linear transformation from vector space V to vector spaceW, both defined over common base field F and f is a bijection then f−1 is also a bijectivelinear transform of W to V.

Exercise 42. Prove Proposition 3.13.

Remark 3.14. Functions can be injective, surjective or bijective without being lineartransformations. We focus on linear transformations because they are the most useful to us.

Definition 3.15 (Isomorphism). Two vector spaces V and W over a common base fieldF are isomorphic if there is a bijective linear transform from V to W . The function is thencalled an isomorphism.

Example 3.16. Consider the subspace V of R3 composed of all vectors of the form〈t, 2t, 3t〉 where t ∈ R. This defines a line passing through the origin in 3-space (R3). Letf : 〈t, 2t, 3t〉 7→ t, so: f : V → R.

(1) This function is onto, since for each t ∈ R we have f(〈t, 2t, 3t) = t.(2) This function is one-to-one, since if v1,v2 ∈ V and f(v1) = f(v2) = t, then v1 =

v2 = 〈t, 2t, 3t〉.Verifying that f is a linear transformation is left as an exercise. Therefore, V is isomorphicto R.

Exercise 43. Verify that the function f in Example 3.16 is a linear transformation.

Exercise 44. Prove that P [x2] is isomorphic to R3.

3. Properties of Linear Transforms

Definition 3.17 (Composition). Let U , V andW be three vector spaces over a commonbase field F. If f : U → V and g : V → W , then g ◦ f : U → W is defined as:

(3.4) (g ◦ f)(u) = g(f(u)) = w ∈ Wwhere u ∈ U .

Proposition 3.18. The composition operation is associative. That is: if U , V, W andX are four vector spaces over a common base field F. If f : U → V, g : V → W, andh :W → X , then:

h ◦ g ◦ f = (h ◦ g) ◦ f = h ◦ (g ◦ f)

39

Page 50: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Exercise 45. Prove Proposition 3.18.

Theorem 3.19. Let U , V and W be three vector spaces over a common base field F. Letf : U → V and g : V → W be linear transformations. Then (f ◦ g) : U → W is a lineartransformation.

Proof. Let u1,u2 ∈ U and let α ∈ F. Then:

(g ◦ f)(αu1 + u2) = g(f(αu1 + u2)) = g(αf(u1) + f(u2)) =

αg(f(u1)) + g(f(u2)) = α(g ◦ f)(u1) + (g ◦ f)(u2)

This completes the proof. �

Definition 3.20 (Identity/Zero Map). Let V be a vector space over a field F. Thefunction ι : V → V defined by ι(v) = v is the identity function. Let W be a second vectorspace over the field F. The function ϑ : V → W defined by ϑ(v) = 0 ∈ W (the zero vector)is the zero function.

Lemma 3.21. Both the identity function and the zero function are linear transformations.

Definition 3.22 (Automorphism). Let f : V → V be an isomorphism from a vectorspace V over field F to itself. Then f is called an automorphism.

Theorem 3.23. Let AutF(V) be the set of all automorphisms of V over F. Then (AutF(V), ◦, ι)forms a group.

Proof. Function composition is associative, by Proposition 3.18. Furthermore, we knowfrom Theorem 3.19 that AutF(V) must be closed under composition. The identity map ιacts like a unit since:

(ι ◦ f)(v) = ι(f(v)) = f(v) = f(ι(v)) = (f ◦ ι)(v)

for any v ∈ V . Finally, since each element of AutF(V) is a bijection, it must be invertibleand thus every element of AutF(V) has an inverse. Thus AutF(V) is a group. �

Definition 3.24 (Addition). Let V and W be vector spaces over the base field F iff, g : V → W , then:

(3.5) (f + g)(v) = f(v) + g(v)

where v ∈ V .

Lemma 3.25. Let V and W be vector spaces over the base field F if f, g : V → W arelinear transformations, then (f + g) is a linear transformation of V to W.

Proof. Let v1,v2 ∈ V and let α ∈ F. Then:

(f + g)(αv1 + v2) = f(αv1 + v2) + g(αv1 + v2) =

αf(v1) + f(v2) + αg(v1) + g(v2) = α (f(v1) + g(v1)) + (f(v2) + g(v2)) =

α(f + g)(v1) + (f + g)(v2)

40

Page 51: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Exercise 46. Prove Lemma 3.21

Lemma 3.26. Let L(V ,W) be the set of all linear transformations from V to W, twovector spaces that share base field F. Then (L(V ,W),+, ϑ) is a group.

Proof. Lemma 3.25 shows that L(V ,W) is closed under addition. Addition is clearlyassociative and the additive identity is obviously ϑ, the zero function, which is itself a lineartransformation by Lemma 3.21. The additive inverse of f ∈ L(V ,W) is obviousy −f , definedby (−f)(v) = −f(v) for any v ∈ V . This is a linear transformation if and only if f is alinear transformation. Thus, L(V ,W) is a group. �

Remark 3.27. The following theorem is easily verified.

Theorem 3.28. Let L(V ,W) be the set of all linear transformations from V to W, twovector spaces that share base field F. Then L(V ,W) is a vector space over the basefield F,with scalar-vector multiplication defined as (αf)(v) = αf(v) for any f ∈ L(V ,W). �

4. Image and Kernel

Definition 3.29 (Kernel). Let V andW be two vector spaces over the same field F. Letf : V → W be a linear transformation from V to W . Then the kernel of f , denoted Ker(f)is defined as:

(3.6) Ker(f) = {v ∈ V : f(v) = 0 ∈ W}Example 3.30. Let A ∈ Fm×n for some positive integers m and n and a field F. The

kernel of the linear map fA(v) = Av is the solution set of the system of linear equations:

Ax = 0

To make this more concrete, consider the matrix:

A =

[1 22 4

]

with elements from R. In this case, we have the matrix equation for the kernel of fA:[1 22 4

] [x1x2

]=

[00

]

The set of solutions is:

Ker(fA) = {〈−2t, t〉 : t ∈ R}This is illustrated in Figure 3.1.

Exercise 47. Show that if A ∈ Fn×n is invertible, then Ker(fA) = {0}.Exercise 48. Show that the kernel of D on P [xn] is the set of all constant real polyno-

mials.

Definition 3.31 (Image). Let V andW be two vector spaces over the same field F. Letf : V → W be a linear transformation from V to W . Then the image of f , denoted Im(f)is defined as:

(3.7) Im(f) = {w ∈ W : ∃v ∈ V (w = f(v))}That is, the set of all w ∈ W so that there is some v ∈ V so that f(v) = w.

41

Page 52: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Remark 3.32. Clearly if f is a surjective linear mapping from V toW , then Im(f) =W .

Example 3.33. Let A ∈ Fm×n be a matrix and fA be the linear transformation sendingv ∈ Fn to Av in Fm; i.e., fA(v) = Av. If:

A =[a1 · · · an

]

where aj = A·j, then Im(fA) = span({a1, . . . , an}). To be more concrete, let

A =

[1 22 4

]

We already know that the columns of A are linearly dependent, so we’ll focus exclusively onthe first column. We have:

Im(fA) = span (〈1, 2〉, 〈2, 4〉) = {〈t, 2t〉 : t ∈ R}We illustrate both the image and the kernel of fA in Figure 3.1. Note: since fA : R2 → R2,

�������-10 -5 5 10

-10

-5

5

10

ImageKernel

Figure 3.1. The image and kernel of fA are illustrated in R2.

we are showing the image and kernel in the same plot. In general, the image is a subset ofthe W when the linear transform maps vector space V toW ; i.e., W does not have to be thesame as V .

Remark 3.34. For the remainder of this section, we will assume that V andW are vectorspaces over a common base field F. Furthermore, we will assume that f : V → W is a lineartransformation.

Theorem 3.35. The kernel of f is a subspace of V.

Proof. In Exercise 39 it was shown that f(0) = 0 ∈ W , thus 0 ∈ Ker(f). Suppose thatv1,v2 ∈ Ker(f). Then:

f(v1 + v2) = f(v1) + f(v2) = 0.

Thus v1 + v2 ∈ Ker(f). Finally, if v ∈ Ker(f) and α ∈ F, then:

f(αv) = αf(v) = 0

42

Page 53: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Consequently Ker(f) is closed under vector addition and scalar-vector multiplication andcontains 0 ∈ V Thus it is a subspace. �

Theorem 3.36. The image of f is a subspace of W. �

Exercise 49. Prove Theorem 3.36. [Hint: The proof is almost identical to the proof ofTheorem 3.35.]

Theorem 3.37. Let f : V → W be a linear transformation. The function f is a injectiveif and only if Ker(f) = {0}.

Proof. (⇐) Suppose that Ker(f) = {0}. Let v1,v2 ∈ V have the property that f(v1) =f(v2). Then:

0 = f(v1)− f(v2) = f(v1 − v2)

It follows that v1 − v2 ∈ Ker(f) and thus: v1 − v2 = 0. Thus v1 = v2. Therefore f isinjective.

(⇒) Suppose f is injective. We know from Exercise 39 that 0 ∈ Ker(f). Therefore,Ker(f) = {0}. �

Corollary 3.38. If f is surjective and Ker(f) = {0}, then f is an isomorphism. �

Exercise 50. Suppose A ∈ Fn×n is invertible. Prove that fA is an isomorphism fromFn to itself.

Lemma 3.39. Suppose Ker(f) = {0}, then dim(Im(f)) = dim(V) and if {v1, . . . ,vn} isa basis for V, then {f(v1), . . . , f(vn)} is a basis for Im(f).

Proof. We know from Theorem 3.37 f is injective. Let w ∈ Im(f). Then there is aunique v ∈ V such that f(v) = w. But there are scalars α1, . . . , αn such that:

v =n∑

i=1

αivi

Then:

w = f(v) =n∑

i=1

αf(vi)

Thus {v1, . . . ,vn} spans Im(f). Suppose that:n∑

i=1

αif(vi) = 0,

then:

f(α1v1 + · · ·+ αnvn) = 0

The fact that f is injective implies that:

α1v1 + · · ·+ αnvn = 0

But, {v1, . . . ,vn} is a basis and thus α1 = · · · = αn = 0. Thus, {f(v1), . . . , f(vn)} is linearlyindependent and therefore a basis. �

43

Page 54: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Exercise 51. Conclude a fortiori that if f is injective and {v1, . . . ,vn} is linearly inde-pendent, then so is {f(v1), . . . , f(vn)}.

Theorem 3.40. The following relationship between subspace dimensions holds:

(3.8) dim(V) = dim(Ker(f)) + dim(Im(f))

Proof. Suppose that dim(V) = n. If Ker(f) = {0}, then dim(Ker(f)) = 0 and byLemma 3.39, dim(Im(f)) = n and the result follows at once.

Assume Ker(f) 6= {0} and suppose that dim(Ker(f)) = m < n. Let {v1, . . . ,vm}be a basis for Ker(f) in V . By Corollary 1.64, there are vectors {vm+1, . . . ,vn} so that{v1, . . . ,vn} form a basis for V . We will show that {f(vm+1), . . . , f(vn)} forms a basis forIm(f). Let w ∈ Im(f). Then there is some v ∈ V so that f(v) = w. We can write:

v =n∑

i=1

αivi

for some α1, . . . , αn ∈ F. Taking the linear transform of both sides yields:

w = f(v) =n∑

i=1

αif(vi).

However, f(vi) = 0 for i = 1, . . . ,m and thus we can write:

w =n∑

i=m+1

αif(vi).

Thus, {f(vm+1), . . . , f(vn)} spans Im(f). Now suppose that:n∑

i=m+1

αif(vi) = 0.

Then:

f(αm+1vm+1 + · · ·+ αnvn) = 0

But, vm+1, . . . ,vn are in a basis for V with v1, . . . ,vm, the basis for Ker(f). Thus:

(1) αm+1vm+1 + · · ·+ αnvn is not in Ker(f), which means that:(2) αm+1vm+1 + · · · + αnvn = 0 ∈ V . In this case, αm+1 = · · · = αn = 0, otherwise{v1, . . . ,vn} could not form a basis for V .

Thus {f(vm+1), . . . , f(vn)} is linearly independent and so it must be a basis for Im(f). There-fore we have shown that when dim(V) = n and dim(Ker(f)) = m < n, then dim(Im(f)) =m− n. Thus n = m+ (n−m) and Equation 3.8 is proved. �

5. Matrix of a Linear Map

Proposition 3.41. The set of m× n matrix Fm×n with ordinary matrix addition formsa vector space over the field F.

Exercise 52. Prove Proposition 3.41.

44

Page 55: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Lemma 3.42. Let V be a vector space with dimension n over field F and let B ={v1, . . . ,vn} be a basis for V. Let πB : V → Fn be the coordinate mapping function sothat if v ∈ V and:

v =n∑

i=1

αivi

for α1, . . . , αn ∈ F, then πB(v) = 〈α1, . . . , αn〉. Then πB is an isomorphism of V to Fn.

Proof. The fact that πB is injective is proved in Lemma 1.50. Surjectivity is clear fromconstruction. The fact that it is a linear transformation is clear from ordinary arithmetic inFn. �

Theorem 3.43. Let V be a vector space with dimension n over field F and let W bea vector space with dimension m over F. Let B = {v1, . . . ,vn} be a basis for V and letB′ = {w1, . . . ,wm} be a basis for W. Suppose that g : V → W is a linear transformation.Then the matrix Ag

BB′ ∈ Fm×n has jth column defined by:

(3.9) AgBB′·j = πB′(g(vj))

defines the linear transformation fAg

BB′from Fn to Fm with the property that if w = g(v),

then:

(3.10) AgBB′πB(v) = πB′(w)

Remark 3.44. Before proving this theorem, let us discuss what’s going on here. First,recall that if f(v) = Av for A ∈ Fm×n and v ∈ Fn, we denote this linear transformation asfA, which is why we use g in the statement of the theorem.

Secondly, the mappings can be visualized using a commutative diagram:

V πB- FnfAgBB′- Fm

W

π−1B′

g

-

Given a vector space V and basis B, V can be mapped to Fn. This space can be linearlytransformed to a subspace of Fm by matrix multiplication with Ag

BB′ . The resulting subspacemaps to a subspace of W , which (we claim) is identical to the subspace generated by theaction of the linear transform g on V , which produces vectors in W .

Proof of Theorem 3.43. For any basis vector vj ∈ B, we know that πB(vj) = ej =〈0, . . . , 1, 0 . . .〉; here the 1 appears in the jth position. Then:

(3.11) AgBB′πB(vj) = Ag

BB′ej = AgBB′·j = πB′(g(vj))

by definition, thus this transformation works for each basis vector in V . Let v be an arbitraryvector in V , then:

v =n∑

j=1

αjvj

45

Page 56: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

for some α1, . . . , αn ∈ F. In particular, πB(v) = 〈α1, . . . , αn〉 = α. But then:

AgBB′πB(v) = Ag

BB′α =n∑

j=1

αjAgBB′·j =

n∑

j=1

αjπB′(g(vj)) =

πB′ (f(α1v1 + · · ·+ αnvn)) = πB′(g(v))

by Equation 3.11 and the properties of linear transformations. �

Example 3.45. We will construct the matrix D corresponding to the linear transforma-tion D already discussed on P [x2]. Let x2, x, 1 be B the basis we will use for P [x2]. We knowthat D : P [x2]→ P [x]. Thus, we B′ = {x, 1} be the basis to be used for P [x].

We will consider each basis element of P [x2] in turn:

x2D−→ 2x

πB′−−→[20

]

xD−→ 1

πB′−−→[01

]

1D−→ 0

πB′−−→[00

]

Arranging these into the columns of a matrix, we obtain:

D =

[2 0 00 1 0

]

Now, we can use matrix multiplication to differentiate quadratic polynomials:

ax2 + bx+ cπB−→

abc

fD−→

[2ab

]π−1B′−−→ 2a+ b

Remark 3.46. The following corollaries are stated without proof.

Corollary 3.47 (First Corollary of Theorem 3.43). Given two vector spaces V and Wof dimension n and m (respectively) over a common base field F and two bases B and B′ of

these spaces (respectively), the mapping ϕBB′ : L(V ,W)→ Fm×n defined by ϕBB′ : f 7→ AfBB′

is a vector space isomorphism.

Corollary 3.48 (Second Corollary of Theorem 3.43). Let V be a vector space overF with dimension n. Then AutF(V) is isomorphic to GL(n,F) the set of invertible n × nmatrices with elements from F.

Exercise 53. Prove the previous two corollaries.

Remark 3.49. It should be relatively that each linear transform maps to a unique ma-trix assuming a given set of bases for the respectively vector spaces and every matrix canrepresent some linear transform. Thus, there is a one-to-one and onto mapping of matrices tolinear transforms and this mapping must respect matrix addition and scalar multiplication.Therefore, it is an isomorphism.

Definition 3.50 (Rank). The rank of a matrix A ∈ Fm×n is equal to the number oflinearly independent columns of A and is denoted rank(A).

46

Page 57: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proposition 3.51. Let A ∈ Fm×n. Then ϕ : A 7→ fA ∈ L(Fn,×,Fm) where fA(x) =Ax is an isomorphism between Fm×n and L(Fn,Fm). Furthermore, rank(A) = dim(Im(A)).

Proof. The isomorphism between Fm×n and L(Fn,Fm) is a consequence of Corollary3.47. Let A be composed of columns a1, . . . , an. Suppose that A has k linearly independentcolumns and without loss of generality suppose they are the columns 1, . . . k. Then for everyother index K > k we can write:

aK =k∑

i=1

αiai

Let x = 〈x1, . . . , xn〉. Then:

(3.12) fA(x) =k∑

i=1

βiaixi

Thus, if y ∈ Im(fA) is a linear combination of the k linearly independent vectors a1, . . . , ak,which form a basis for Im(fA). This completes the proof. �

Definition 3.52 (Nullity). If A ∈ Fm×n, then the nullity of A is dim(Ker(fA)). Thatis, it is the dimension of the vector space of solutions to the linear system Ax = 0.

Remark 3.53. Consequently, Theorem 3.40 is sometimes called the rank-nullity theorembecause it relates the rank of a matrix to its nullity. Restricting to matrices is now sensi-ble in the context of Theorem 3.43 because every linear transformation is simply a matrixmultiplication in an appropriate coordinate space.

Remark 3.54. If A ∈ Rm×n, then space spanned by the columns of A is called thecolumn space. The space spanned by the rows is called the row space and the kernel of thecorresponding linear transform fA is called the null space.

6. Applications of Linear Transforms

Remark 3.55. Linear transforms on Rn are particularly useful in graphics and mechan-ical modeling. We’ll focus on a few transforms that have physical meaning in two and threedimensional Euclidean space.

Definition 3.56 (Scaling). Any vector in Rn can be scaled by a factor of α by multiplyingby the matrix A = αIn.

Example 3.57. Consider the vector 〈1, 1〉. If scaled by a factor of 2 we have:[2 00 2

] [11

]=

[22

]

This is illustrated in Figure 3.2.

Definition 3.58 (Shear). A shearing matrix is an elementary matrix that correspondsto the addition of a multiple of one row to another. In two dimensional space, it is most

47

Page 58: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

easily modeled by the matrices:

A1 =

[1 s0 1

]

A2 =

[1 0s 1

]

Example 3.59. Shearing is visually similar to a slant operation. For example, take thematrix A1 with s = 2 and apply it to 〈1, 1〉 to obtain:

[1 20 1

] [11

]=

[31

]

See Figure 3.2.

Definition 3.60 (Rotation 2D). The two dimensional rotation matrix defined by:

Rθ =

[cos θ − sin θsin θ cos θ

]

Rotates any vector v ∈ R2 by θ radians in the counter-clockwise direction.

Example 3.61. Rotating the vector 〈1, 1〉 by counter-clockwise rotation by π/12 radianscan be accomplished with the multiplication:[

cos π12− sin π

12sin π

12cos π

12

] [11

]=

[cos π

12− sin π

12cos π

12+ sin π

12

]

0.5 1.0 1.5 2.0 2.5 3.0

0.5

1.0

1.5

2.0

Original Vector Shear

Scale

Rotate

Figure 3.2. Geometric transformations are shown in the figure above.

Remark 3.62. Rotation in 3D can be accomplished using three separate matrices, onefor roll, pitch and yaw. In particular, if we define the coordinates in the standard x, yand z directions and agree on the right-hand-rule definition of counter-clockwise, then the

48

Page 59: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

following three matrices rotate vectors counter-clockwise by θ radians around the x, y andz axes respectively:

Rxθ =

1 0 00 cos θ − sin θ0 sin θ cos θ

Ry

θ =

cos θ 0 − sin θ0 1 0

sin θ 0 cos θ

Rzθ =

cos θ − sin θ 0sin θ cos θ 0

0 0 1

Combinations of these matrices will produce different effects. These matrices are useful inmodeling aerodynamic motion in computers.

Remark 3.63. Other transformations are possible, for example reflection can be accom-plished in a relatively straight-forward manner as well as projection onto a lower-dimensionalspace. All of these are used to manipulate polygons for the creation of high-end computergraphics including video game design and simulations used in high-fidelity flight simulators.

7. An Application of Linear Algebra to Control Theory

Differential equations are frequently used to model the continuous motion of an objectin space. For example, the spring equation describes the behavior of a mass on a spring:

(3.13) mx+ kx = 0

This is illustrated in Figure 3.3. We can translate this differential equation (Equation 3.13)

x

m

Figure 3.3. A mass moving on a spring is governed by Hooke’s law, translated intothe language of Newtonian physics as mx− kx = 0.

into a system of differential equations by setting v = x. Then v = x and we have:

x = v

v = − kmx

This can be written as a matrix differential equation:

(3.14)

[xv

]=

[0 1− km

0

] [xv

]

49

Page 60: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

If we write x = 〈x, v〉, then we can write:

x = Qx,

where

Q =

[0 1− km

0

]

Suppose we apply the Euler’s Method to solve this differential equation numerically, thenwe would have:

x(t+ ε)− x(t)

ε= Qx,

which leads to the difference equation:

(3.15) x(t+ ε) = x(t) + εQx = Ix + εQx

We can factor this equation so that:

(3.16) x(t+ ε) = Ax(t)a

where A = I + εQ.

Example 3.64. Suppose m = k = 1 and ε = 0.001 and let x0 = 0, v0 = 1. The resultingdifference equations in long form are:

x(t+ 0.001) = x(t) + 0.001v(t)

v(t+ 0.001) = v(t)− 0.001x(t)

We can compare the resulting difference equation solution curve (which is comparativelyeasy to compute) with the solution to the differential equation. This is shown in Figure3.4. Notice with ε small, for simple systems (like the one given) there is little error inapproximating the continuous time system with a discrete time system. The error may growin time, so care must be taken.

7.1. Controllability. Suppose at each time step, we may exert a control on the massso that its motion is governed by the equation:

x = Qx + Bu

Here B ∈ R2,p where without loss of generality we may assume p ≤ 2 and u = 〈u1, . . . , up〉 isa vector of time-varying control signals. This problem is difficult to analyze in the continuoustime case, but simpler in the discrete time case.

(3.17) x(t+ ε) = Ax(t) + Bu(t)

Suppose the goal is to drive x(t) to some value x∗. For example, suppose we wish to stopmass at x = 0, so then v = 0 as well.

Definition 3.65 (Complete State Controllable). The state x is complete state control-lable in Equation 3.17 if given any initial condition x(0) = x0 and any desired final conditionx∗ there is a control function (policy) u(t) defined on a finite time interval [0, tf ] so thatx(tf ) = x∗.

Remark 3.66. Put simply, control means we can drive the system to any state we wantand do it in finite time.

50

Page 61: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

��������

0 1 2 3 4 5 6

-1.0

-0.5

0.0

0.5

1.0

t

x Exact Solution

Difference Equation Solution

(a) Position

��������

0 1 2 3 4 5 6

-1.0

-0.5

0.0

0.5

1.0

t

v Exact Solution

Difference Equation Solution

(b) Velocity

Figure 3.4. A mass moving on a spring given a push on a frictionless surface willoscillate indefinitely, following a sinusoid.

Theorem 3.67. Suppose that x ∈ Rn, u ∈ Rp, A ∈ Rn×n and B ∈ Rn×p. Define theaugmented matrix:

(3.18) C =[B AB · · · An−1B

]

and for a vector v ∈ Rnp let:

fC(v) = Cv

be the linear transformation from Rnp → Rn defined by matrix multiplication (as usual).Then the state dynamics in Equation 3.17 are completely controllable if (and only if):

(3.19) dim (Im(fC)) = n

Or equivalently, rank(C) = n.

Proof. Suppose that x(0) = x0. Then:

(3.20) x(ε) = Ax0 + Bu(0)

By induction,

(3.21) x(kε) = Akx0 +k−1∑

i=0

AiBu((k − 1− i)ε)

In particular, when k = n, then:

(3.22) x(nε) = Anx0 + An−1Bu(0) + An−2Bu(ε) + An−3u(2ε) + · · ·+ Bu((n− 1)ε)

51

Page 62: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Rearranging, we obtain:

(3.23) x(nε)−Anx0 =[B AB · · · An−1B

]

u((n− 1)ε)u((n− 2)ε)

...u(ε)u(0)

Here:

u =

u((n− 1)ε)u((n− 2)ε)

...u(ε)u(0)

is a np× 1 vector of unknowns Rnp. Setting x(nε) = x∗ ∈ Rn. There is at least one solutionto this system of equations if and only if dim (Im(fC)) = n or alternatively rank(C) = nsince:

(3.24) C ∼[In N

]

by elementary row operations. Here ∼ denotes row equivalence. This completes the proof.�

Example 3.68. Suppose we can adjust the velocity of the moving block on the springso that:

x(t+ ε) = x(t) + εv(t)

v(t+ ε) = v(t)− ε kmx(t) + u(t)

Written in matrix form this is:

(3.25)

[x(t+ ε)v(t+ ε)

]=

[1 ε−ε k

m1

] [x(t)v(t)

]+

[01

]u(t)

Notice, B ∈ R1×2 and n = 2, thus:

AB =

[1 ε−ε k

m1

] [01

]=

[ε1

]

Consequently:

(3.26) C =

[0 ε1 1

]

This is non-singular just in case ε > 0, which is is necessarily. We can compute:

A2x0 =

[1 ε−ε k

m1

]2 [01

]=

[2ε

1− ε2 km

]

We know that x∗ = 〈0, 0〉. Therefore, we can compute:

(3.27) x∗ −A2x0 =

[00

]−[

2ε1− ε2 k

m

]=

[−2ε

ε2 km− 1

]=

[0 ε1 1

] [u(1)u(0)

]= Cu

52

Page 63: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Computing C−1 and multiplying yields the solution:

(3.28) u =

[1 + ε k

m−2

]

Meaning:

u(0) = −2

u(1) = 1 + εk

mUsing these values, we can compute:

x(1) = ε v(1) = −1

x(2) = 0 v(2) = 0

as required.

7.2. Observability. Suppose the system in question is now only observed through animperfect measurement system so that we see:

(3.29) y = Cx

The observed output is y ∈ Rp, while the actual output is x ∈ Rn. We do not know x(0).Can we still control the system?

Definition 3.69 (Observability). A system is observable if the state at time t can becomputed even in the presence of imperfect observability; i.e., C 6= n.

Assume u = 0. Then in discrete time:

y(0) = Cx(0)(3.30)

y(ε) = CAx(0)(3.31)

...(3.32)

y((n− 1)ε) = Cx((n− 1)ε) = CA(n−1)x(0)(3.33)

This can be written as the matrix equation:

(3.34)

y(0)y(ε)

...y((n− 1)ε)

=

CCA

...CA(n−1)

x(0)

Let:

(3.35) O =

CCA

...CA(n−1)

∈ Rnp×n

This is the observability matrix containing np rows and n columns. Notice that Equation3.34 has a solution if and only if:

(3.36) n = dim(fO) = rank(O)

53

Page 64: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

To see this note, this simply says we can express the left hand side using all n columns ofthe right-hand-side. Thus we have the theorem:

Theorem 3.70. Suppose that x ∈ Rn, A ∈ Rn×n and C ∈ Rn×p. Define the augmentedmatrix:

(3.37) O =

CCA

...CA(n−1)

and for a vector v ∈ Rn let:

fO(v) = Ov

be the linear transformation from Rn → Rnp defined by matrix multiplication (as usual).Then the state dynamics in Equation 3.17 are completely observable if (and only if):

(3.38) dim (Im(fO)) = n

Or equivalently, rank(O) = n.

Remark 3.71. Controllability and observability can be used together to derive a controlwhen x0 is unknown.

(1) Take n− 1 observations of the uncontrolled system.(2) Infer x0

(3) Compute x((n− 1)ε).(4) Translate time so that x0 = x((n− 1)ε).(5) Compute the control function as before.

Example 3.72. Suppose that:

(3.39) C =

[1 22 1

]

and consider our example. From before we have:[

1 ε−ε k

m1

]

Thus we compute:

CA =

[1− 2kε

mε+ 2

2− kεm

2ε+ 1

]

Consequently:

(3.40) O =

1 22 1

1− 2kεm

ε+ 22− kε

m2ε+ 1

54

Page 65: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Applying Gauss-Jordan elimination, we see that:

O ∼

1 00 10 00 0

Thus it has rank 2. If we started with x0 = 0 and v0 = 1, then we would have: x(ε) = ε andv(ε) = 1. Our observed values would be y(0) = 〈2, 1〉 and y(ε) = 〈ε+ 2, 2ε+ 1〉. This leadsto the equation:

(3.41)

21

ε+ 22ε+ 1

=

1 22 1

1− 2kεm

ε+ 22− kε

m2ε+ 1

[x0v0

]

In reality, it is sufficient to solve the first two equations:

2 = x0 + 2v0

1 = 2x0 + v0

to obtain x0 = 0 and v0 = 1. Thus, control starts at time t = ε at the point where x = εand v = 1.

Exercise 54. Compute the control from the point x0 = ε and v0 = 1.

Remark 3.73. It turns out that continuous time systems work in precisely the sameway; the proofs are just much harder and computing a control function is non-trivial; i.e., itis more complex than solving a system of linear equations.

55

Page 66: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition
Page 67: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

CHAPTER 4

Determinants, Eigenvalues and Eigenvectors

1. Goals of the Chapter

(1) Discuss Permutations(2) Introduce Determinants(3) Introduce Eigenvalues and Eigenvectors(4) Discuss Diagonalization(5) Discuss the Jordan Normal Form

2. Permutations1

Definition 4.1 (Permutation / Permutation Group). A permutation on a set V ={1, . . . , n} of n elements is a bijective mapping f from V to itself. A permutation group ona set V is a set of permutations with the binary operation of functional composition.

Example 4.2. Consider the set V = {1, 2, 3, 4}. A permutation on this set that maps1 to 2 and 2 to 3 and 3 to 1 can be written as: (1, 2, 3)(4) indicating the cyclic behaviorthat 1 → 2 → 3 → 1 and 4 is fixed. In general, we write (1, 2, 3) instead of (1, 2, 3)(4) andsuppress any elements that do not move under the permutation.

For the permutation taking 1 to 3 and 3 to 1 and 2 to 4 and 4 to 2 we write (1, 3)(2, 4)and say that this is the product of (1, 3) and (2, 4). When determining the impact of apermutation on a number, we read the permutation from right to left. Thus, if we want todetermine the impact on 2, we read from right to left and see that 2 goes to 4. By contrast,if we had the permutation: (1, 3)(1, 2) then this permutation would take 2 to 1 first and then1 to 3 thus 2 would be mapped to 3. The number 1 would be first mapped to 2 and thenstop. The number 3 would be mapped to 1. Thus we can see that (1, 3)(1, 2) has the sameaction as the permutation (1, 2, 3).

Definition 4.3 (Symmetric Group). Consider a set V with n elements in it. Thepermutation group Sn contains every possible permutation of the set with n elements.

Example 4.4. Consider the set V = {1, 2, 3}. The symmetric group on V is the set S3

and it contains the permutations:

(1) The identity: (1)(2)(3)(2) (1,2)(3)(3) (1,3)(2)(4) (2,3)(1)(5) (1,2,3)(6) (1,3,2)

1This section is used purely to understand the general definition of the determinant of a matrix.

57

Page 68: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proposition 4.5. For each n, |Sn| = n!.

Exercise 55. Prove Proposition 4.5

Definition 4.6 (Transposition). A permutation of the form (a1, a2) is called a transpo-sition.

Theorem 4.7. Every permutation can be expressed as the product of transpositions.

Proof. Consider the permutation (a1, a2, . . . , an). We may write:

(4.1) (a1, a2, . . . , an) = (a1, an)(a1, an−1) · · · (a1, a2)Observe the effect of these two permutations on ai. For i 6= 1 and i 6= n, then readingfrom right to left (as the permutation is applied) we see that ai maps to a1, which readingfurther right to left is mapped to ai+1 as we expect. If i = 1, then a1 maps to a2 and thereis no further mapping. Finally, if i = n, then we read left to right to the only transpositioncontaining an and see that an maps to a1. Thus Equation 4.1 holds. This completes theproof. �

Remark 4.8. The following theorem is useful for our work on matrices in the secondpart of this chapter, but its proof is outside the scope of these notes. The interested readercan see Chapter 2.2 of [Fra99].

Theorem 4.9. No permutation can be expressed as both a product of an even and an oddnumber of transpositions. �

Definition 4.10 (Even/Odd Permutation). Let σ ∈ Sn be a permutation. If σ can beexpressed as an even number of transpositions, then it is even, otherwise σ is odd. Thesignature of the permutation is:

(4.2) sgn(σ) =

{−1 σ is odd

1 σ is even

3. Determinant

Definition 4.11 (Determinant). Let A ∈ Fn×n. The determinant of A is:

(4.3) det(A) =∑

σ∈Sn

sgn(σ)n∏

i=1

Aiσ(i)

Here σ ∈ Sn represents a permutation over the set {1, . . . , n} and σ(i) represents the valueto which i is mapped under σ.

Example 4.12. Consider an arbitrary 2× 2 matrix:

A =

[a bc d

]

There are only two permutations in the set S2: the identity permutation (which is even) andthe transposition (1, 2) which is odd. Thus, we have:

det(A) =

∣∣∣∣a bc d

∣∣∣∣ = A11A22 −A12A21 = ad− bc

This is the formula that one would expect from a course in matrices.

58

Page 69: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 4.13 (Triangular Matrix). A matrix A ∈ Fn×n is upper-triangular if Aij = 0when i > j and lower-triangular if Aij = 0 when i < j.

Lemma 4.14. The determinant of a triangular matrix is the product of the diagonalelements.

Proof. Only the identity permutation can lead to a non-zero product in Equation 4.3.For suppose that σ(i) 6= i. Then for at least one i σ(i) > i and at least one i for whichσ(i) < i and thus,

n∏

i=1

Aiσ(i) = 0

Example 4.15. It is easy to verify using Example 4.12 that the determinant of:

A =

[1 20 3

]

is 1 · 3 = 3.

Proposition 4.16. The determinant of any identity matrix is 1. �

Exercise 56. Prove the Proposition 4.16.

Remark 4.17. Like many other definitions in mathematics, Definition 4.11 can be usefulfor proving things, but not very useful for computing determinants. Fortunately there is arecursive formula for computing the determinant, which we provide.

Definition 4.18 ((i, j) Sub-Matrix). Consider the square matrix:

A =

a11 · · · a1j · · · a1na21 · · · a2j · · · a2n...

. . ....

. . ....

an1 · · · anj · · · ann

The (i, j) sub-matrix obtained from Row 1, Column j is derived by crossing out Row 1 andColumn j as illustrated,

a11 · · · a1j · · · a1na21 · · · a2j · · · a2n...

. . ....

. . ....

an1 · · · anj · · · ann

and forming a new matrix (n− 1)× (n− 1) with the remaining elements:

A(1j) =

a21 · · · a2,j−1 a2,j+1 · · · a2n...

. . ....

.... . .

...an1 · · · an,j−1 an,j+1 · · · ann

The sub-matrix A(i,j) is defined analogously.

59

Page 70: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 4.19 (Minor). Let A ∈ Fn×n. Then the (i, j) minor is:

(4.4) A(i,j) = det(A(i,j))

Definition 4.20 (Cofactor). Let A ∈ Fn×n as in Definition 4.18. Then the (i, j) co-factor is:

(4.5) C(i,j) = (−1)i+jA(i,j) = (−1)i+jdet(A(i,j))

Example 4.21. Consider the following matrix:

A =

1 2 34 5 67 8 9

We can compute the (1, 2) minor as:∣∣∣∣∣∣

1 2 34 5 67 8 9

∣∣∣∣∣∣

∣∣∣∣4 67 9

∣∣∣∣ = 4 · 9− 6 · 7 = −6

In this case i = 1 and j = 2 so the co-factor is:

C(1,2) = −1 ·∣∣∣∣4 67 9

∣∣∣∣ = (−1)(−6) = 6

Theorem 4.22 (Laplace Expansion Formula). Let A ∈ Fn×n as in Definition 4.19.Then:

(4.6) det(A) =n∑

j=1

a1jC(1j)

Example 4.23. Before proving this result, we consider an example. We can use Laplace’sFormula to compute:

∣∣∣∣∣∣

1 2 34 5 67 8 9

∣∣∣∣∣∣= 1 · (−1)1+1 ·

∣∣∣∣5 68 9

∣∣∣∣+ 2 · (−1)1+2 ·∣∣∣∣4 67 9

∣∣∣∣+ 3 · (−1)1+3 ·∣∣∣∣4 57 8

∣∣∣∣ =

1 · (−3) + 2 · (−1) · (−6) + 3 · (−3) = −3 + 12 − 9 = 0

Remark 4.24. Laplace’s Formula can be applied to other rows/columns to yield thesame result. It is traditionally stated along the first row.

Proof of Laplace’s Formula. Let σ ∈ Sn map i to j. From Definition 4.11, the σterm of the determinant sum is:

sgn(σ)Aij

i 6=j

Aiσ(i) = sgn(σ)aij∏

i 6=j

Aiσ(i)

where aij is the (i, j) element of A, usually denoted Aij. The elements of the product:∏

i 6=j

Aiσ(i)

60

Page 71: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

are in the sub-matrix of A(i,j) (removing Row i and Column j from A). Let Sn(i, j) denotethe sub-group of Sn consisting of permutations that always map i to j. This group can beput in bijection with Sn−1. Therefore the minor A(i,j) is:

(4.7) A(i,j) =∑

σ∈Sn(i,j)

sgn(σ)∏

i 6=j

Aiσ(i)

The bijection between Sn(i, j) and Sn−1 can be written explicitly in the following way. Sup-pose that τ ∈ Sn−1 and σ ∈ Sn(i, j). Define: τ ′ ∈ Sn so that τ ′(i) = τ(i) for i = 1, . . . , n− 1(recall τ is defined on {1, . . . , n− 1}) and τ ′(n) = n. Thus, τ ′ ∈ Sn. Then:

(4.8) σ = (j, j + 1, . . . , n)τ ′(n, n− 1, . . . , i)

in permutation notation. To see this, note that from right to left, i maps to n, which mapsto n under τ ′, which maps to j, as required. The remaining n−1 elements of Sn are mappedto distinct elements depending entirely on τ . Note that the permutations (j, j + 1, . . . , n)and (n, n− 1, . . . , i) can both be written n− j and n− i transpositions respectively. Thus,

sgn(σ) = (−1)n−i+n−jsgn(τ ′) = (−1)2n−i−jsgn(τ) = (−1)−i−jsgn(τ) = (−1)i+jsgn(τ)

But then:

det(A) =n∑

j=1

σ∈Sn(1,j)

sgn(σ)a1j∏

i 6=j

A1σ(i) =

n∑

j=1

a1j∑

σ∈Sn(1,j)

sgn(σ)∏

i 6=j

A1σ(i) =

n∑

j=1

a1j∑

τ∈Sn−1

(−1)1+jsgn(τ)n−1∏

i=1

A(1,j)i,τ(i) =

n∑

j=1

a1j(−1)1+j∑

τ∈Sn−1

sgn(τ)n−1∏

i=1

A(1,j)i,τ(i) =

n∑

j=1

a1j(−1)1+jdet(A(1,j)) =

n∑

j=1

a1jC(1,j)

This completes the proof. �

Exercise 57. Use Laplace’s Formula to show that the determinant of the followingmatrix:

A =

a b cd e fg h i

61

Page 72: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

is

det(A) = aei+ bfg + cdh− afh− bdi− ceg

Exercise 58 (Project). A formula like Laplace’s formula can be used to invert a matrix.When used to solve a system of equations, this is called Cramer’s rule. Let A ∈ Fn×n be amatrix and let:

C =

C(1,1) C(1,2) · · · C(1,n)

......

. . ....

C(n,1) C(n,2) · · · C(n,n)

Then:

A−1 =1

det(A)CT

Prove that this formula really does invert the matrix.

Example 4.25. We can use the rule given in Exercise 58 to invert the matrix:

A =

10 12 1412 8 614 6 4

First we’ll use the rule in Exercise 57 to compute the determinant of the matrix as a whole:

det(A) = 10 · 8 · 4 + 12 · 6 · 14 + 14 · 12 · 6− 14 · 8 · 14− 12 · 12 · 4− 10 · 6 · 6 =

10(8 · 4− 6 · 6) + 14(12 · 6 + 12 · 6− 8 · 14)− 12 · 12 · 4 =

20(4 · 4− 6 · 3) + 28(6 · 6 + 6 · 6− 8 · 7)− 144 · 4 =

20 · (−2) + 28(16)− 144 · (4) =

4 (5 · (−2) + 7 · 16− 144) =

4 (−10− 144 + 112) = 4 (−42) = −168

Now compute the matrix co-factors (from Laplace’s Rule):Cross Out First Row

C(1,1) = −11+1 ·∣∣∣∣8 66 4

∣∣∣∣ = 32− 36 = −4

C(1,2) = −11+2 ·∣∣∣∣12 614 4

∣∣∣∣ = 48− 84 = 36

C(1,3) = −11+3 ·∣∣∣∣12 814 6

∣∣∣∣ = 72− 112 = −40

62

Page 73: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Cross Out Second Row

C(2,1) = −12+1 ·∣∣∣∣12 146 4

∣∣∣∣ = 36

C(2,2) = −12+2 ·∣∣∣∣10 1414 4

∣∣∣∣ = −156

C(2,3) = −12+3 ·∣∣∣∣10 1214 6

∣∣∣∣ = 108

Cross Out Third Row

C(3,1) = −13+1 ·∣∣∣∣12 148 6

∣∣∣∣ = −40

C(3,2) = −13+2 ·∣∣∣∣10 1412 6

∣∣∣∣ = 108

C(3,3) = −13+3 ·∣∣∣∣10 1212 8

∣∣∣∣ = −64

The matrix of cofactors is:

C =

C(1,1) C(1,2) C(1,3)

C(2,1) C(2,2) C(2,3)

C(3,1) C(3,2) C(3,3)

Cramer’s rule says:

A−1 =1

det(A)CT =

−1

168

−4 36 −4036 −156 108−40 108 −64

Notice the symmetry in the solution, which you can exploit in your own solution, rather thanworking out every term.

Exercise 59. Prove that if A ∈ Fn×n is symmetric and invertible, then A−1 is symmet-ric.

4. Properties of the Determinant

Remark 4.26. Laplace Expansion can be done on any column or row, with minor com-puted appropriately. The choice of row 1 is for convenience only.

Proposition 4.27. If A ∈ Fn×, then det(A) = det(AT ). �

Exercise 60. Prove Corollary 4.27.

Proposition 4.28. Let A ∈ Fn×n. If A has two adjacent identical columns, thendet(A) = 0.

Proof. We proceed by induction on n. In the case when n = 2, the proof is by calcula-tion. Suppose the statement is true up to n− 1 and suppose that columns k and k + 1 areequal in A. Then in the Laplace expansion, the sub-matrix A(1,j) has two identical adjacent

63

Page 74: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

columns just in case j 6= k and j 6= k + 1. Thus C(1,j) = 0. Thus the Laplace expansionreduces to the k and k + 1 terms:

det(A) = a1,k(−1)1+kdet(A(1,k)) + a1,k+1(−1)1+k+1det(A(1,k+1))

but A(1,k) = A(1,k+1) and a1,k = a1,k+1 because column k is identical to column k + 1 but−(−1)1+k = (−1)1+k+1. Thus, det(A) = 0 and the result follows by induction. �

Proposition 4.29. Suppose that A ∈ Fn×n with:

A =[a1 · · · aj + a′j · · · an

]

where each ai ∈ Fn×1 is a column vector. Let:

A1 =[a1 · · · aj · · · an

]

A2 =[a1 · · · a′j · · · an

]

Then det(A) = det(A1) + det(A2).

Proof. Rewrite the Laplace Expansion around the column j. Then we have:

(4.9) det(A) =n∑

i=1

AijC(ij) =n∑

i=1

(aij + a′ij)C(ij) =

n∑

i=1

aijC(ij) +n∑

i=1

a′ijC(ij) = det(A1) + det(A2).

This completes the proof. �

Remark 4.30. The same logic holds for rows.

Proposition 4.31. If A ∈ Fn×n and matrix A′ ∈ Fn×n is constructed from A by multi-plying a column (or row) of A by α ∈ F, then:

(4.10) det(A′) = αdet(A)

Exercise 61. Prove the Proposition 4.31. [Hint: Use the Laplace expansion with thealtered column or row and factor.]

Remark 4.32 (Multi-Linear Function). The two previous results taken together showthat the determinant is what’s known as a multi-linear function when considered as:

det : Fn × Fn × · · ·Fn︸ ︷︷ ︸n

→ F

mapping the n vectors in Fn (the columns of the matrix) to F. That is:

det(a1, . . . , αai + a′i, ai+1, . . . , an) = αdet(a1, . . . , an) + det(a1, . . . , a′i, ai+1, . . . , an)

Proposition 4.33. If A ∈ Fn×n and matrix A′ ∈ Fn×n is constructed from A by ex-changing any pair of columns (row), then:

(4.11) det(A′) = −1det(A)

Proof. Exchanging two columns is equivalent to introducing a transpose into the per-mutation σ in Equation 4.3, which swaps the sign of all sgn(σ) computations. �

64

Page 75: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proposition 4.34. If A ∈ Fn×n and matrix A′ ∈ Fn×n is constructed from A by addinga multiple of column (row) i to any other column (row) j 6= i, then det(A′) = det(A).

Proof. Without loss of generality, assume column i is added to column j with nomultiple. Then A′ has form:

A′ =[a1 · · · aj + ai · · · an

]

Applying Propositions 4.29 and 4.28 we see that det(A′) = det(A) + 0, since the one of thetwo matrices in the multi-linear expansion of the determinant will have a repeated column.In the general case suppose we replace column j with column j plus α ∈ F times column i.Then first replace ai by αai; the determinant is multiplied by α. Applying the same logic asabove. Then replace αai by ai again and the determinant is divided by α. This completesthe proof. �

Remark 4.35. Thus we have shown that elementary column/row operations can onlymodify the determinant of a matrix by either multiplying it by a constant or changing its sign.In particular, elementary row operations cannot make a matrix have non-zero determinant.

Exercise 62. Use Laplace’s Formula to show that if any column or row of a matrix isall zeros, then the determinant must be zero.

Exercise 63. Show that an n × n matrix has non-zero determinant if and only if it iscolumn or row equivalent to the identity matrix. Thus prove the following corollary:

Corollary 4.36. A square matrix A is invertible if and only if det(A) 6= 0.

Remark 4.37. We state but do not prove the final theorem, which is useful in generaland can also be used to prove the preceding corollary. The proof is in [Lan87], for theinterested reader. It is a direct result of the multilinearity of the determinant function.

Theorem 4.38 (Matrix Product). Suppose A,B ∈ Fn×n. Then

det(AB) = det(A)det(B).

Exercise 64. Prove Corollary 4.36 using this theorem.

5. Eigenvalues and Eigenvectors

Definition 4.39 (Algebraic Closure). Let F be a field. The algebraic closure of F,denoted F is an extension of F that is (i) also a field and (ii) has every possible root to anypolynomial with coefficients drawn from F.

Remark 4.40. A field F is called algebraically closed if F = F.

Remark 4.41. The following theorem is outside the scope of the course, but is usefulfor our work with eigenvalues.

Theorem 4.42. The algebraic closure of R is C. The field C is algebraically closed.

65

Page 76: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 4.43 (Eigenvalue and (Right) Eigenvector). Let A ∈ Fn×n. An eigenvalue,eigenvector pair (λ,x) is a scalar and n× 1 vector such that:

(4.12) Ax = λx

and x 6= 0. The eigenvalue may be drawn from F and x from Fn.

Lemma 4.44. A value λ ∈ F is an eigenvalue of A ∈ Fn×n if and only if λIn −A is notinvertible.

Proof. Suppose that λ is an eigenvalue of A. Then there is some x ∈ Fn such thatEquation 4.12 holds. Then 0 = (λIn −A)x and x is in the kernel of the linear transform:

fλIn−A(x) = (λIn −A)x

The fact that x 6= 0 implies that fλIn−A is not one-to-one and thus λIn−A is not invertible.Conversely suppose that λIn−A is not invertible. Then there is an x 6= 0 in Ker(fλIn−A).

Thus:

(λIn −A)x = 0

which implies Equation 4.12. Thus λ is an eigenvalue of A. �

Remark 4.45. A left eigenvector is defined analogously with xTA = λxT , when x isconsidered a column vector. We will deal exclusively with right eigenvectors and hence whenwe say “eigenvector” we mean a right eigenvector.

Definition 4.46 (Characteristic Polynomial). If A ∈ Fn×n then its characteristic poly-nomial is the degree n polynomial:

(4.13) det (λIn −A)

Theorem 4.47. A value λ is an eigenvalue for A ∈ Fn×n if and only if it satisfies thecharacteristic equation:

det (λIn −A) = 0

That is, λ is a root of the characteristic polynomial.

Proof. Assume λ is an eigenvalue of A. Then the matrix λIn − A is singular andconsequently the characteristic polynomial is zero by Corollary 4.36.

Conversely, assume that λ is a root of the characteristic polynomial. Then λIn − A issingular by Corollary 4.36 and thus by Lemma 4.44 it is an eigenvalue. �

Remark 4.48. We now see why λ may be in F, rather than F. It is possible the char-acteristic polynomial of a matrix does not have all (or any) of its roots in the field F; thedefinition of algebraic closure ensures that all eigenvalues are contained in the the algebraicclosure of F.

Corollary 4.49. If A ∈ Fn×n, then A and AT share eigenvalues.

Exercise 65. Prove Corollary 4.49.

66

Page 77: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 4.50. Consider the matrix:

A =

[1 00 2

]

The characteristic polynomial is computed as:

det (λIn −A) =

∣∣∣∣λ− 1 0

0 λ− 2

∣∣∣∣ = (λ− 1)(λ− 2)− 0 = 0

Thus the characteristic polynomial for this matrix is:

(4.14) λ2 − 3λ+ 2

The roots of this polynomial are λ1 = 1 and λ2 = 2. Using these eigenvalues, we can computeeigenvectors:

x1 =

[10

](4.15)

x2 =

[01

](4.16)

and observe that:

(4.17) Ax1 =

[1 00 2

] [10

]= 1

[10

]= λ1x1

and

(4.18) Ax2 =

[1 00 2

] [01

]= 2

[01

]λ2x2

as required. Computation of eigenvalues and eigenvectors is usually accomplished by com-puter and several algorithms have been developed. Those interested readers should consult(e.g.) [Dat95].

Remark 4.51. You can use your calculator to return the eigenvalues and eigenvectorsof a matrix, as well as several software packages, like Matlab and Mathematica.

Remark 4.52. It is important to remember that eigenvectors are unique up to scale.That is, if A is a square matrix and (λ,x) is an eigenvalue eigenvector pair for A, then so is(λ, αx) for α 6= 0. This is because:

(4.19) Ax = λx =⇒ A(αx) = λ(αx)

Definition 4.53 (Algebraic Multiplicity of an Eigenvalue). An eigenvalue has algebraicmultiplicity greater than 1 if it is a multiple root of the characteristic polynomial. Themultiplicity of the root is the multiplicity of the eigenvalue.

Example 4.54. Consider the identify matrix I2. It has characteristic polynomial (λ −1)2, which has one multiple root 1 of multiplicity 2. However, this matrix does have twoeigenvectors [1 0]T and [0 1]T .

Exercise 66. Show that every vector in F2 is an eigenvector of I2.

67

Page 78: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 4.55. Consider the matrix

A =

[1 52 4

]

The characteristic polynomial is computed as:∣∣∣∣λ− 1 −5−2 λ− 4

∣∣∣∣ = (λ− 1)(λ− 4)− 10 = λ2 − 5λ− 6

Thus there are two distinct eigenvalues: λ = −1 and λ = 6, the two roots of the characteristicpolynomial. We can compute the two eigenvectors in turn. Consider λ = −1. We solve:[

λ− 1 −5−2 λ− 4

] [x1x2

]=

[−2 −5−2 −5

] [x1x2

]=

[00

]

Thus:

−2x1 − 5x2 = 0

We can set x2 = t, a free variable. Consequently the solution is:[x1x2

]=

[52tt

]= t

[521

]

Thus, any eigenvector of λ = −1 is a multiple of the vector 〈5/2, 1〉. For the eigenvalueλ = −6, we have:[

λ− 1 −5−2 λ− 4

] [x1x2

]=

[5 −5−2 2

] [x1x2

]=

[00

]

From this we see that:

−2x1 + 2x2 = 0

or x1 = x2. Thus setting x2 = t, we have the solution:[x1x2

]=

[tt

]= t

[11

]

Thus, any eigenvector of λ = 6 is a multiple of the vector 〈1, 1〉.

Theorem 4.56. Suppose that A ∈ Fn×n with eigenvalues λ1, . . . , λn all distinct (i.e.,λi 6= λj if i 6= j). Then the corresponding eigenvectors {v1, . . . ,vn} are linearly independent.

Proof. We proceed by induction on m to show that {v1, . . . , vm} is linearly independentfor m = 1, . . . , n. In the base case, the set {v1} is clearly linearly independent. Now supposethis is true up to m < n, we’ll show it is true for m+ 1.

Suppose:

(4.20) α1v1 + · · ·+ αm+1vm+1 = 0

Multiplying by λm+1 on the left and right implies that:

(4.21) α1λm+1v1 + · · ·+ αm+1λm+1vm+1 = 0

On the other hand, multiplying by A on the left and the right (and applying the fact thatAvi = λivi, yields:

(4.22) α1λ1v1 + · · ·+ αm+1λm+1vn = 0

68

Page 79: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Subtracting Expression 4.21 from Expression 4.22 yields:m∑

i=1

αi(λi − λm+1)vi = 0

Since λi 6= λm+1 and by induction we know that {v1, . . . ,vm} are linearly independent itfollows that α1, . . . , αm = 0. But then Expression 4.20 reduces to:

αm+1vm+1 = 0

Thus, αm+1 = 0 and the result follows by induction. �

Definition 4.57. Let A ∈ Fn×n with eigenvectors {v1, . . . ,vn}. Then the vectors spaceE = span({v1, . . . ,vn}) is called the eigenspace of A. When the eigenvectors are linearlyindependent, they are an eigenbasis for the space they span.

Remark 4.58. It is worth noting that if vi is an eigenvector of A, then span(vi) is calledthe eigenspace associated with vi.

Corollary 4.59. The eigenvectors of A ∈ Fn×n form a basis for Fn when the eigenvaluesof A are distinct.

Exercise 67. Prove Corollary 4.59.

6. Diagonalization and Jordan’s Decomposition Theorem

Remark 4.60. In this section, we state but not prove a number of results on eigenvaluesand eigenvectors, with a specialization to matrices with real entries. Many of these resultsextend with minor modifications to complex matrices. Readers who are interested in theproofs can (and should) see e.g., [Lan87].

Definition 4.61 (Diagonalization). Let A be an n×n matrix with entries from field R.The matrix A can be diagonalized if there exists an n × n diagonal matrix D and anothern× n matrix P so that:

(4.23) P−1AP = D

In this case, P−1AP is the diagonalization of A.

Remark 4.62. Clearly if A is diagonalizable, then:

(4.24) A = PDP−1

Theorem 4.63. A matrix A ∈ Rn×n is diagonalizable, if and only if the matrix has nlinearly independent eigenvectors.

Proof. Suppose that A has a set of linearly independent eigenvectors p1, . . . ,pn. Thenfor each pi, (i = 1, . . . , n) there is an eigenvalue λi so that Api = λipi. Let P ∈ Fn×n havecolumns p1, . . . ,pn. Then we can see that:

AP =[λ1p1| · · · |λnpn

]=[p1| · · · |pn

]

λ1 0 · · · 00 λ2 · · · 0...

.... . . · · ·

0 0 · · · λn

= PD,

69

Page 80: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

where:

(4.25) D =

λ1 0 · · · 00 λ2 · · · 0...

.... . . · · ·

0 0 · · · λn

Since p1, . . . ,pn are linearly independent, it follows P is invertible and thus A = PDP−1.Conversely, suppose that A is invertible and let D be as in Equation 4.25. Then:

AP = DP

and reversing the reasoning above, each column of P must be an eigenvector of A withcorresponding eigenvalue on the diagonal of D. �

Example 4.64. Consider the following matrix:

(4.26) A =

[0 −11 0

]

To diagonalize A, we compute its eigenvalues and eigenvectors yielding:

λ1 = i

λ2 = −i

for the eigenvalues and:

v1 =

[i1

]v2 =

[−i1

]

where i =√−1 is the imaginary number. Notice this shows that the eigenvalues and

eigenvectors have entries drawn from R = C.We can now compute P and D as:

D =

[−i 00 i

]P =

[−i i1 1

]

It is helpful to note that:

P−1 =

[i2

12−i

212

]

Arithmetic manipulation shows us that:

PD =

[−1 −1−i i

]

Thus:

PDP−1 =

[−1 −1−i i

] [i2

12−i

212

]=

[0 −11 0

]= A

as required. (Remember that i2 = −1.)

70

Page 81: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Remark 4.65. Matrix diagonalization is a basis transform in disguise. Suppose A ∈Rn×n is diagonalizable. Let v1, . . . ,vn be the eigenvectors A and consider the linear trans-form fA(x) = Ax. If x is expressed in the eigenbasis E = {v1, . . . ,vn}, then:

(4.27) fA(x) = Ax = A

(n∑

i=1

αivi

)=

n∑

i=1

αiAvi =n∑

i=1

αiλivi

Thus, computing the linear transformation fA is especially easy in the basis E . Each coor-dinate 〈α1, . . . , αn〉 is simply multiplied by the corresponding eigenvalue. That is:

(4.28) fA(〈α1, . . . , αn〉) = 〈λ1α1, . . . , λnαn〉Now, consider the diagonalization. A = PDP−1. In E , [vi]E = ei; that is, each eigenvectorhas as its coordinates the corresponding standard basis vector. By construction:

(4.29) Pei = vi

for all i = 1, . . . , n. This means that P is a basis transform matrix from E to the standardbasis B. That is: P = AEB. As a consequence:

ABE = A−1EB = P−1

Thus if [x]B ∈ Rn is expressed in the standard basis, we have:

fA(x) = A[x]B = PDP−1[x]B = PD[x]E = Pn∑

i=1

λi[x]Ei =

[n∑

i=1

λi[x]Ei

]

B

Notice the sum∑n

i=1 λi[x]Ei is just Equation 4.28. Thus, the diagonalization is a recipe forfirst transforming a vector in the standard basis into the eigenbasis, then taking the lineartransform fA and then transforming back to the standard basis.

Exercise 68. Use the remark above to prove Theorem 4.63. [Hint: The proof of onedirection is essentially given. Think about the opposite direction.]

Definition 4.66 (Nilpotent Matrix). A matrix N is nilpotent if there is some integerk > 0 so that Nk = 0

Remark 4.67. We generalize the notion of diagonalization in a concept called the JordanNormal Form. The proof of the Jordan Normal Form theorem is outside the scope of theclass, but it can be summarized in the following theorem.

Theorem 4.68. Let A be a square matrix with complex entries (i.e., A ∈ Cn×n). Thenthere exists matrices P, Λ and N so that: (1) Λ is a diagonal matrix with the eigenvaluesof A appearing on the diagonal. (2) N is a nilpotent matrix and (3) P is a matrix whosecolumns are composed of pseudo-eigenvectors and (4):

(4.30) A = P(Λ + N)P−1,

When A is diagonalizable, then N = 0 and P is a matrix whose columns are composed ofeigenvectors.

71

Page 82: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition
Page 83: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

CHAPTER 5

Orthogonality

1. Goals of the Chapter

(1) Introduce general inner products.(2) Define Orthogonality.(3) Introduce the Gram-Schmidt Orthogonalization Procedure.(4) Demonstrate the QR decomposition.(5) Discuss orthogonal projection and orthogonal complement.(6) Prove equality of row and column rank.(7) Prove the Spectral Theorem for Real Symmetric Matrices and related results.

2. Some Essential Properties of Complex Numbers

Remark 5.1. In this chapter and the ensuring chapters, we restrict our the base field Fto be either R or C. We review a few critical facts about complex numbers, which heretoforehave been used in a very cursory way.

Definition 5.2 (Complex Conjugate). Let z = a + ib be a complex number. Theconjugate of z, denoted z = a− ib.

Proposition 5.3. If w, z ∈ C, then w + z = w + z.

Definition 5.4. The magnitude of a complex number z = a+ ib is |z| =√a2 + b2.

Remark 5.5. Note the magnitude of a complex number is both real and non-negative.

Proposition 5.6. If z ∈ C, then |z|2 = zz.

Exercise 69. Prove Propositions 5.3 and 5.6.

3. Inner Products

Definition 5.7 (Inner Product). An inner product on a vector space V over F is amapping: 〈·, ·〉 : V × V → F such that:

(1) Conjugate Symmetry: 〈x,y〉 = 〈y,x〉(2) Linearity in Argument 1: 〈αx1 + x2,y〉 = α〈x1,y〉+ 〈x2,y〉.(3) Positive definiteness: 〈x,x〉 ≥ 0 and 〈x,x〉 = 0 if and only if x = 0.

Remark 5.8 (Notional Point). There is now a possibility for confusion. This set of notesuses the notation a = 〈a1, . . . , an〉 for the vector:

a =

a1...an

73

Page 84: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

However, now 〈a1, a2〉 might be the inner product of the vectors a1 and a2. It is up to thereader to know the difference. However to help remember vectors and matrices are alwaysbold, while numbers and scalars are always not bold.

Remark 5.9. Inner products are sometimes called scalar products.

Example 5.10. The standard dot product defined in Chapter 2 meets the criteria of aninner product over Rn. Here 〈x,y〉 = x · y.

Exercise 70. Let z,w ∈ Cn. If x = 〈z1, . . . , zn〉 and w = 〈w1, . . . , wn〉, where zi, wi ∈ C,note the standard dot product:

z ·w =n∑

i=1

ziwi

does not satisfy the conjugate symmetry rule. Find a variation for the dot product thatworks for complex vector spaces. [Hint: If z = a+ bi and w = c+ di then:

zw = (a+ ib)(c− di) =

(ac+ bd) + i(bc− ad) = (ac+ bd)− i(ad− bc) =

(ac+ bd) + i(ad− bc) = (a− ib)(c+ id) = zw = wz

Notice the last equality holds because C is a field and multiplication commutes. Use thisfact along with Proposition 5.3.]

Definition 5.11 (Norm). Any inner product can be used to induce a norm (length) onthe vector space in the following way:

(5.1) ‖x‖ =√〈x,x〉

This in turn can be used to induce a distance (metric) on the space in which the distancebetween two vectors x and y is simply ‖x− y‖.

Exercise 71. Using the standard dot product on Rn, verify that Equation 5.1 returnsthe n-dimensional analog of the Pythagorean theorem.

Definition 5.12 (Unit Vector). A vector v in a vector space V with inner product 〈·, ·〉is a unit vector if the induced norm of v is 1. That is, ‖v‖ = 1.

Proposition 5.13. If v is a vector in V with an inner product and hence a norm, then:u = v/ ‖v‖ is a unit vector.

Exercise 72. Prove Proposition 5.13.

Example 5.14. The inner product need not be a simple function. Consider P2[x], thespace of polynomials of degree at most 2 with real coefficients over the field R. We are freeto define an inner product on P2[x] in the following way. Let f, g ∈ P2[x]:

〈f, g〉 =

∫ 1

−1f(x)g(x) dx

We could vary the integral bounds, if we like.

74

Page 85: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 5.15 (Bilinear). An inner product 〈·, ·〉 on a vector space V is bilinear if:

〈x, αy1 + y2〉 = α〈x,y1〉+ 〈x,y2〉Exercise 73. Show that the standard dot product is bilinear on Rn.

Definition 5.16. A matrix A ∈ Rn×n is postive definite if for all x ∈ Rn we havexTAx ≥ 0 and xTAx = 0 if and only if x = 0.

Theorem 5.17. Suppose 〈·, ·〉 is a bilinear inner product on a real vector space V (i.e.,

F = R) with basis B. Then there is a positive definite matrix A〈·,·〉B with the property that:

〈x,y〉 = [x]TBA〈·,·〉B [y]B

Recall [x]B is the coordinate representation of x in B.

Proof. The proof of this theorem is similar to the proof used to show how to build amatrix for an arbitrary linear transform. Let B = {v1, . . . ,vn}. Write:

x =n∑

i=1

αivi

y =n∑

i=1

βivi

Then:

〈x,y〉 =

⟨n∑

i=1

αivi,n∑

i=j

βjvj

⟩=

n∑

i=1

αi

⟨vi,

n∑

i=j

βjvj

⟩=

n∑

i=1

n∑

j=1

αiβj〈vi, vj〉

Define:

aij = 〈vi, vj〉 = 〈vj, vi〉 = aji,

by the symmetry of the inner product. Then:

〈x,y〉 =n∑

i=1

n∑

j=1

αiβjaij

In matrix notation, this can be written as:

n∑

i=1

n∑

j=1

αiβjaij =[α1 α2 · · · αn

]

a11 a12 · · · a1na21 a22 · · · a2n...

.... . .

...an1 an2 · · · ann

β1β2...βn

Since [x]B = 〈α1, . . . , αn〉 and [y]B = 〈β, . . . , β〉, it follows at once that if we define:

A〈·,·〉B =

a11 a12 · · · a1na21 a22 · · · a2n...

.... . .

...an1 an2 · · · ann

then:

〈x,y〉 = [x]TBA〈·,·〉B [y]B

75

Page 86: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

The fact that A〈·,·〉B is positive definite follows at once from the positive definiteness of the

inner product. �

Remark 5.18. We can relax the positive definiteness assumption, while retaining thebilinearity condition to obtain a generalized linear product. The previous theorem still holdsin this case, without the positive definiteness result.

4. Orthogonality and the Gram-Schmidt Procedure

Definition 5.19 (Orthogonal Vectors). In a vector space V over field F equipped withinner product 〈·, ·〉, two vectors x and y are called orthogonal if 〈x,y〉 = 0.

Example 5.20. Under the ordinary dot product, the standard basis of Rn is alwayscomposed of orthogonal vectors. For instance, in R2, we have: 〈1, 0〉 · 〈0, 1〉 = 0.

Definition 5.21 (Orthonormal Basis). A basis B = {v1, . . . ,vn} is called orthogonalif for all i 6= j vi is orthogonal to vj and orthonormal if the basis is orthogonal and everyvector in B is a unit vector.

Example 5.22. The standard basis of Rn is orthonormal in the standard dot product,but that is not the only orthonormal basis as we’ll see.

Remark 5.23. Any orthogonal basis B = {v1, . . . ,vn} can be converted to an orthonor-mal basis by letting ui = vi/ ‖vi‖. The new basis B′ = {u1, . . . ,un} is orthonormal.

Theorem 5.24. Let V be a non-trivial n-dimensional vector space over field F with innerproduct 〈·, ·〉. Then V contains an orthogonal basis.

Proof. We’ll proceed by induction. Choose a non-zero vector v1 ∈ V and let W1 =span(v1). That is:

W1 = {v ∈ V : v = αv1, α ∈ F}A basis for W1 is u1 = v1. We can extendW1 to an m-dimensional subspaceWm of V usingCorollary 1.64. Now assume that there is an orthogonal basis {u1, . . . ,um} for Wm. ExtendWm to Wm+1 with vector vm+1. We will show how to build an orthonormal basis for Wm+1.

Define:

um+1 = vm+1 −〈u1,vm+1〉〈u1,u1〉

u1 − · · · −〈um,vm+1〉〈um,um〉

um

It now suffices to prove that {u1, . . . ,um+1} is an orthogonal basis forWm+1. To see this noteby its construction um+1 cannot be a linear combination of {u1, . . . ,um} since vm+1 is nota linear combination of {u1, . . . ,um}. Therefore, the fact that Wm+1 is m + 1 dimensionalimplies that {u1, . . . ,um+1} must be a basis for it. It remains to prove orthogonality. Forany i = 1, . . . ,m, compute:

〈ui,um+1〉 =

⟨ui,vm+1 −

m∑

j=1

〈uj,vm+1〉〈uj,uj〉

uj

⟩=

〈ui,vm+1〉 −〈ui,vm+1〉〈ui,ui〉

〈ui,ui〉 −∑

j 6=i

〈uj,vm+1〉〈uj,uj〉

〈ui,uj〉

76

Page 87: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

By assumption 〈ui,uj〉 = 0 for i 6= j. Therefore:

〈ui,um+1〉 = 〈ui,vm+1〉 − 〈ui,vm+1〉 = 0,

as required. The result follows by induction. �

Corollary 5.25. Let V be a non-trivial n-dimensional vector space over field F withinner product 〈·, ·〉. Then V contains an orthonormal basis.

Proof. Normalize the basis identified in the theorem. �

Remark 5.26 (Gram-Schmidt Procedure). The process identified in the induction proofof Theorem 5.24 can be distilled into the Gram-Schmidt Procedure for finding an orthogonalbasis. We summarize the procedure as follows:

Gram-Schmidt ProcedureInput: {v1, . . . ,vn} a basis for V, a vector space with an inner product 〈·, ·〉.

(1) Define:

u1 = v1

(2) For each i ∈ {2, . . . , n} define:

ui = vi −i−1∑

j=1

〈uj ,vi〉〈uj ,uj〉

uj

Output: An orthogonal basis B = {u1, . . . ,un} for V.

Algorithm 2. Gram-Schmidt Procedure

Example 5.27. We can use the Gram-Schmidt procedure to find an orthonormal basisfor R2 assuming we start with the vectors 〈1, 2〉 and 〈2, 1〉 and use the standard dot product.Let v1 = 〈1, 2〉 and v2 = 〈2, 1〉.

Step 1: Define:

u1 = v1 =

[12

]

Step 2: Use the dot product to compute:

〈u1,v2〉 = 〈1, 2〉 · 〈2, 1〉 = 4

〈u1,u1〉 = 〈1, 2〉 · 〈1, 2〉 = 5

Step 3: Compute:

u2 = v2 −〈u1,v2〉〈u1,u1〉

u1 =

[21

]− 4

5

[12

]=

[65−35

]

Exercise 74. Check that the basis found in the previous example is orthogonal.

Example 5.28. Consider P2[x], the space of polynomials of degree at most 2 with realcoefficients over the field R. Let f, g ∈ P2[x] and define:

〈f, g〉 =

∫ 1

−1f(x)g(x) dx

77

Page 88: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

We can use this information to find an orthogonal basis for P2[x]. Recall the standard basisfor P2[x] is {x2, x, 1}. This is not orthogonal, since:

〈x2, 1〉 =

∫ 1

−1x2 dx =

x3

3

∣∣∣∣1

−1=

1

3− −1

3=

2

36= 0

On the other hand, we know that:

〈x, 1〉 =

∫ 1

−1x dx =

x2

2

∣∣∣∣1

−1=

1

2− 1

2= 0

and

〈x2, x〉 =

∫ 1

−1x3 dx =

x4

4

∣∣∣∣1

−1=

1

4− 1

4= 0

We will need the following piece of information:

〈x2, x2〉 =

∫ 1

−1x4 dx =

x5

5

∣∣∣∣1

−1=

2

5

Applying the Gram-Schmidt procedure with v1 = x2, v2 = x and v3 = 1 we have:

u1 = v1 = x2

u2 = v2 −〈u1,v2〉〈u1,u1〉

u1 = x− 0 · x2 = x

u3 = v3 −〈u1,v3〉〈u1,u1〉

u1 −〈u2,v3〉〈u2,u2〉

u2 = 1−2325

x2 = 1− 5

3x2

Exercise 75. Find an orthogonal basis for P2[x] assuming:

〈f, g〉 =

∫ 1

0

f(x)g(x) dx

5. QR Decomposition

Remark 5.29. We will use the following lemma as we apply the Gram-Schmidt procedureto derive a new matrix decomposition

Lemma 5.30. Let P ∈ Rn×n be composed of columns of orthonormal vectors. ThenPTP = In.

Remark 5.31. A matrix like the kind in Lemma 5.30 is called an orthogonal matrix.These matrices are generalized by unitary matrices over the complex numbers.

Exercise 76. Prove Lemma 5.30. [Hint: Use the dot product defintion of matrix mul-tiplication and the fact that the columns are orthonormal.]

Definition 5.32 (QR decomposition). If A ∈ Rm×n with m ≥ n, then the QR-decomposition consists of an orthogonal matrix Q ∈ Rm×n and a upper triangular matrixR ∈ Rn×n so that A = QR.

Remark 5.33. Let A ∈ Rm×n with m ≥ n. Algorithm 3 illustrates how to compute theQR decomposition.

78

Page 89: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Gram-Schmidt ProcedureInput: A ∈ Rm×n with m ≥ n.

(1) Use Gram-Schmidt to orthogonalize the columns of A.(2) Construct an orthonormal basis for the column space of A (i.e., normalize the orthogonal

basis you found).(3) Denote Q as the matrix with these orthonormal columns. Notice QTQ = In.(4) If A = QR, then: QTA = QTQR = R.

Output: The QR decomposition of A.

Algorithm 3. QR decomposition

Exercise 77. Prove that R must be upper-triangular. [Hint: The columns of Q areconstructed from the columns of A by the Gram-Schmidt procedure. Use this fact to obtainthe structure of R = QTA.]

Example 5.34. We can find the QR decomposition of:

A =

1 02 20 1

Applying the Gram-Schmidt procedure yields the orthogonal basis:

u1 = 〈1, 2, 0〉u2 =

⟨−4

5, 25, 1⟩

Normalizing those vectors yields:

q1 =

⟨1√5,

2√5, 0

q2 =⟨− 4

3√5, 23√5,√53

Then:

Q =

1√5− 4

3√5

2√5

23√5

0√53

Now compute:

R = QTA =

[ √5 4√

5

0 3√5

]

Remark 5.35. Since R is upper triangular and you can verify that QR = A. EfficientQR decomposition can be used to solve systems of equations. If we wish to solve Ax = band we know A = QR. Then:

Ax = QRx = b

=⇒ Rx = QTb

79

Page 90: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Now the problem Rx = QTb can be back-solved efficiently; i.e., since R is upper-triangular,it’s easy to solve the the last variable, then use that to solve for the next to last variable etc.

6. Orthogonal Projection and Orthogonal Complements

Definition 5.36. The operation:

projv(u) =〈u,v〉〈v,v〉

v

is the orthogonal projection of u onto v. We illustrate this in Figure 5.1.

Projv(u)

u

v

Figure 5.1. The orthogonal projection of the vector u onto the vector v.

Remark 5.37. This makes the most geometric sense when the scalar product is the dotproduct and the space is Rn. To see this, we can show that if x,y ∈ Rn, then x · y =‖x‖ ‖y‖ cos(θ), there cos(θ) is the cosine of the angle between x and y in the common planethey share. We can prove this result, if we assume the law of cosines.

Lemma 5.38 (Law of Cosines). Consider a triangle with side lengths a, b, and c and letθ be the angle between the sides of length a and b. Then:

c2 = a2 + b2 − 2ab cos(θ)

Theorem 5.39. If x,y ∈ Rn and θ is the angle between x and y in the common planethey share, then

x · y = ‖x‖ ‖y‖ cos(θ)

Proof. The geometry is shown in Figure 5.2 for two vectors in R3. R3. Note:

‖x‖2 =n∑

i=1

x2i ,

where x = 〈x1, . . . , xn〉. A similar result holds for y. On the other hand:

‖x− y‖2 =n∑

i=1

(xi − yi)2 =n∑

i=1

(x2i − 2xiyi + y2i

)=

n∑

i=1

x2i − 2n∑

i=1

xiyi +n∑

i=1

y2i

Simplifying we have:

‖x− y‖2 = ‖x‖2 − 2x · y + ‖y‖2

80

Page 91: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Figure 5.2. The common plane shared by two vectors in R3 is illustrated alongwith the triangle they create.

Using the law of cosines we see:

‖x− y‖ = ‖x‖2 + ‖y‖2 − 2 ‖x‖ ‖y‖ cos(θ)

Therefore:

‖x‖2 − 2x · y + ‖y‖2 = ‖x‖2 + ‖y‖2 − 2 ‖x‖ ‖y‖ cos(θ)

Simplifying this expression yields the familiar fact that:

x · y = ‖x‖ ‖y‖ cos(θ)

Remark 5.40. We can now make sense of the term orthogonal projection, at least inRn. Consider the Figure 5.3: We know that: v/‖v‖ is a unit vector. The length of the

Projv(u)

u

Figure 5.3. The orthogonal projection of the vector u onto the vector v.

constructed hypotenuse in Figure 5.3 is ‖u‖. By trigonometry the length of the base of thetriangle is:

‖u‖ cos(θ) =u · v‖v‖

81

Page 92: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

But projv(u) is a vector that points in the direction of v with length the result of projectingu “down” onto v (as illustrated). Therefore:

projv(u) =u · v‖v‖

· v

‖v‖=

u · v‖v‖2

v =u · vv · v

v

The general orthogonal decomposition formula with inner products is simply an extensionof this formula.

Theorem 5.41. Let V be a vector space with inner product 〈·, ·〉 and let W be a subspacewith B = {v1, . . . ,vm} an orthonormal basis for W. If v is a vector in V, then:

projW(v) =m∑

i=1

〈v,vi〉vi

is the orthogonal projection of v onto the subspace W.

Proof. That projW(v) is an element of W is clear by its construction; it is a linearcombination of the basis elements. It now suffices to show that the vector u = v−projW(v)is orthogonal to projW(v). To see this, compute:

〈u,vj〉 = 〈v − projW(v),vj〉 = 〈v,vj〉 − 〈projW(v),vj〉 =

〈v,vj〉 −

⟨m∑

i=1

〈v,vi〉vi,vj

⟩= 〈v,vj〉 −

m∑

i=1

〈v,vi〉〈vi,vj〉

But 〈vi,vj〉 = 0 if i 6= j by basis orthogonality. So we have:

〈u,vj〉 = 〈v,vj〉 − 〈v,vj〉〈vj,vj〉 = 〈v,vj〉 − 〈v,vj〉 ‖vj‖2 = 0,

since every basis vector is a unit vector. Therefore, u is orthogonal to every element in thebasis B and consequently it must be orthogonal to projW(v). �

Exercise 78. Assuming the inner product:

〈f, g〉 =

∫ 1

0

f(x)g(x) dx

for P2[x], compute the orthogonal projection of x2 onto the subspace spanned by the vectorx. Compare this to the Gram-Schmidt procedure.

7. Orthogonal Complement

Definition 5.42. Let W be subspace of a vector space V with inner product 〈·, ·〉. Theorthogonal complement of W is the set:

W⊥ = {v ∈ V : 〈v,w〉 = 0 for all w ∈ W}

Example 5.43. We can illustrate a space and its orthogonal complement in low dimen-sions. Let v be an arbitrary vector in R3. LetW = span(v). ThenW⊥ consists of the planeto which v is normal. This is illustrated in Figure 5.4.

Proposition 5.44. If W is a subspace of a vector space V with inner product 〈·, ·〉, thenW⊥ is a subspace.

Corollary 5.45. W ∩W⊥ = {0}.82

Page 93: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Figure 5.4. A vector v generates the linear subspaceW = span(v). It’s orthogonalcomplement W⊥ is shown when v ∈ R3.

Exercise 79. Prove Proposition 5.44 and its corollary.

Theorem 5.46. If W is a subspace of V a vector space, then any vector v ∈ V can beuniquely written as v = w + u where w ∈ W and u ∈ W⊥. Consequently V =W ⊕W⊥.

Proof. It suffices to show that any vector v ∈ V can be written as the sum of a vector inw ∈ W and a vector u ∈ W⊥. The result will then follow from Theorem 2.62 and Corollary5.45.

Let B = {w1, . . . ,wm} be an orthonormal basis for W . We know one must exist. UsingB, let:

w = projW(v) =m∑

i=1

αiwi,

where the αi are constructed by the inner product as in Theorem 5.41. The vector w isentirely in the subspace W . The vector:

u = v −w

is orthogonal to w as proved in Theorem 5.41. Therefore, it must lie in the space W⊥. It isclear that:

v = w + u.

We may now apply Theorem 2.62 and Corollary 5.45 to see V = W ⊕W⊥. This completesthe proof. �

Corollary 5.47. If W is a subspace of V a (finite dimensional) vector space V, thendim(V) = dim(W) + dim(W⊥).

Proof. This is an application of Theorem 2.65. �

Corollary 5.48. If W is a subspace of V a (finite dimensional) vector space V, then(W⊥)⊥ =W.

83

Page 94: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Exercise 80. Prove Corollary 5.48.

Remark 5.49. Consider Rn and Rm with the standard inner product (e.g., the dotproduct for Rn). Suppose that A ∈ Rm×n and consider the system of equations:

Ax = 0

Any solution x to this is in the null space of the matrix. More specifically, if fA : Rn → Rm

is defined as usual as fA(x) = Ax then we know the null space is the kernel of fA. Let therows of A be a1, . . . , am. If x ∈ Ker(fA) then:

〈ai,x〉 = 0,

because that is how we defined matrix multiplication in Definition 2.4. Then this impliesthat each row of A is orthogonal to any element in the kernel of fA (the null space of A).

Remark 5.50. For notational simplicity, for the remainder of this chapter, let Ker(A)denote the kernel of fA, the corresponding linear transform; i.e., Ker(A) is the null space ofthe matrix. Also let Im(A) denote the corresponding image of fA.

Theorem 5.51. Suppose that A ∈ Rm×n. If A has rows a1, . . . , am, then

span({a1, . . . , am}) = Ker(A)⊥.

Proof. The discussion in Remark 5.49 is sufficient to show that span({a1, . . . , am}) ⊇Ker(A)⊥. To prove opposite containment, suppose that y ∈ span({a1, . . . , am}). Then:

y = β1a1 + · · ·+ βmam

for some β1, . . . , βm ∈ R. If x ∈ Ker(A), then by (bi)linearity:

〈y,x〉 = β1〈a1,x〉+ · · ·+ βm〈am,x〉 = 0,

since 〈am,x〉 = 0 for i = 1, . . . ,m becasuse Ax = 0. Thus span({a1, . . . , am}) ⊆ Ker(A)⊥.This completes the proof. �

Definition 5.52 (Row Rank). The row rank of a matrix A ∈ Rm×n is the number oflinearly independent rows of A.

Corollary 5.53. The (column) rank of a matrix A ∈ Rm×n is equal to its row rank.

Exercise 81. Use the rank-nullity theorem along with Theorem 5.51 to prove Corollary5.53. [Hint: The dimension of Ker(A) and its orthogonal complement must add up to n.]

8. Spectral Theorem for Real Symmetric Matrices

Remark 5.54. The goal of this section is to prove the Spectral Theorem for Real Sym-metric Matrices.

Theorem 5.55 (Spectral Theorem for Real Symmetric Matrices). Suppose A ∈ Rn×n isa real, symmetric matrix. Then A is diagonalizable. Furthermore, if the diagonalization ofA is:

A = PDP−1

84

Page 95: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

then P−1 = PT . Thus:

A = PDPT

As in Theorem 4.68, the columns of P are eigenvectors of A. Furthermore, they form anorthonormal basis for Rn.

Remark 5.56. We will build the result in a series of lemmas and definitions. We requireone theorem that is outside the scope of this class. We present that first.

Theorem 5.57 (Fundamental Theorem of Algebra). : Suppose that an 6= 0 and p(x) =anx

n + an−1xn−1 + · · · + a0 is a polynomial with coefficients in C. Then p(x) has exactly n

(possibly non-distinct) roots in C.

Lemma 5.58. If A ∈ Rn×n is symmetric, it has at least one eigenvector and eigenvalue.

Proof. This is a result of the fundamental theorem of algebra. Every matrix has acharacteristic polynomial and that polynomial has at least one solution. (Note: You can alsoprove this with an optimization argument and Weierstraß Extreme Value Theorem.) �

Lemma 5.59. If A ∈ Rn×n is symmetric, then all its eigenvalues are real.

Proof. Suppose that λ is a complex eigenvalue with a complex eigenvector z. ThenAz = λz. If λ = a + bi, then λ = a − bi. Note λλ = a2 + b2 ∈ R. Furthermore, if λ and µare two complex numbers, it’s easy to show that λµ = λµ. Since A is real (and symmetric)we can conclude that:

Az = Az = λz

Now (Az)T = zTA = λzT . We have:

Az = λz =⇒ zTAz = λzTz

zTA = λzT =⇒ zTAz = λzTz

But then:

λzTz = λzTz

and this implies λ = λ, which means λ must be real. �

Corollary 5.60. If A ∈ Rn×n is symmetric and z ∈ Cn is a complex eigenvector sothat z = x + iy, then either y = 0 or both x and y are real eigenvectors. Consequently, Ahas real eigenvectors.

Proof. We can write:

Az = Az + iAy = λx + iλy = λz

Since λ is real, it follows that either y = 0 and z = x is real or y is a second real eigenvectorof A with eigenvalue λ (i.e., λ may have geometric multiplicity at least 2). In either case, Ahas only real eigenvectors and all complex eigenvectors can be composed of them. �

Definition 5.61. Let A ∈ Rn×n and suppose W is a subspace of Rn with the propertythat if v ∈ W , then Av ∈ W . Then W is called A-invariant.

Lemma 5.62. If A ∈ Rn×n is symmetric and W is an A-invariant subspace of Rn, thenso is W⊥.

85

Page 96: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proof. Suppose that v ∈ W⊥. For every w ∈ W , w · v = vTw = 0. Let w = Au forsome u ∈ W . Then:

uTAv = vTATu = vTAu = vTw = 0,

by the symmetry of A. Therefore, Av is orthogonal to any arbitrary vector u ∈ W and soAv ∈ W⊥ and W⊥ is A-invariant. �

Lemma 5.63. If A ∈ Rn×n is symmetric and W is an A-invariant subspace of Rn, thenW has an eigenvector of A.

Proof. Using the Gram-Schmidt procedure, we know thatW has an orthonormal basisP = {p1, . . . ,pm}. Let B be the matrix composed of these vectors. In particular for eachpi:

(5.2) Api = β1p1 + · · · βmpm

for some scalars β1, . . . , βm since Api ∈ W because W is A-invariant. In particular, if P isthe matrix whose columns are the basis elements in P , then by Equation 5.2, there is somematrix B (composed of scalars like β1, . . . , βm) so that1:

AP = PB

Taking the transpose and multiplying by P yields:

PTAP = BTAP = BTPTP

But we know that P is a matrix composed of orthonormal vectors. So PTP = Im. Therefore:

PTAP = BT

On the other hand taking the transpose again (and exploiting the symmetry of A) means:

PTAP = B = BT

Therefore B is symmetric. As such, B has a real eigenvector/eigenvalue pair (λ,u). Then:Bu = λu. Then PBu = λPu is necessarily inW because P is a matrix whose columns forma basis for W and Pu is just a linear combination of those columns. Let v = Pu. Then:

Av = APu = PBu = λPu = λv

Therefore v = Pu is an eigenvector of A inW with eigenvalue λ. This proves the claim. �

Remark 5.64. We now can complete the proof of the spectral theorem for real symmetricmatrices.

Proof. (Spectral Theorem for Real Symmetric Matrices) Suppose A ∈ Rn×n is a realsymmetric matrix. Then it has at least one real eigenvalue/eigenvector pair (λ1,v1). LetW1 = span(v1). If W1 = Rn, we are done. Otherwise, assume we have constructed morthonormal eigenvectors v1, . . . ,vm. Let W = span(v1, . . . ,vm). This space is necessarilyA-invariant since every vector can be expressed in a basis of orthonormal eigenvectors (whichare A-invariant). We have proved that W⊥ is A-invariant and it must have an eigenvectorvm that is orthogonal to v1, . . . ,vm. Normalizing it, we have constructed a new larger set oforthonormal vectors v1, . . . ,vm,vm+1. The result now follows by induction: the eigenvectors

1It’s an excellent exercise to convince yourself of this.

86

Page 97: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

of A form an orthogonal basis of Rn and this basis can be normalized to an orthonormalbasis. �

9. Some Results on ATA

Proposition 5.65. Let A ∈ Rm×n. Then: Ker(A) = Ker(ATA)

Proof. Let x ∈ Ker(A), then Ax = 0 and consequently ATAx = 0. Therefore x ∈Ker(ATA).

Conversely, let x ∈ Ker(ATA). We know that:

Ker(B) = Im(BT )⊥

for any matrix B. Let y = Ax. Then y ∈ Im(A). We know ATy = 0. Therefore:

y ∈ Ker(AT ) = Im(A)⊥

Thus y ∈ Im(A) ∩ Im(A)⊥, which implies y = 0. Thus, x ∈ Ker(A). This completes theproof. �

Exercise 82. Use the rank-nullity theorem to prove that rank(A) = rank(ATA).

Remark 5.66. The following results will be useful when we study the Singular ValueDecomposition in the next chapter.

Lemma 5.67. If A ∈ Rm×n, then ATA ∈ Rn×n is a symmetric matrix.

Exercise 83. Prove Proposition 5.67.

Theorem 5.68. Let A ∈ Rm×n. Every eigenvalue of ATA is non-negative.

Proof. By Lemma 5.67, ATA is a real symmetric n×n matrix and therefore its eigen-vectors form an orthonormal basis of Rn. Suppose (λ,v) is an eigenvalue/eigenvector pairof ATA. Without loss of generality, assume ‖v‖ = 1. Then:

‖Av‖2 = 〈Av,Av〉

Here the inner product is the regular dot product and so we may write:

〈Av,Av〉 = (Av)T Av = vTATAv

Notice:

ATAv = λv

Therefore:

‖Av‖2 = vTλv = λ〈v,v〉 = λ ‖v‖2 = λ

Therefore, λ ≥ 0 since ‖Av‖2 ≥ 0. �

Theorem 5.69. Let A ∈ Rm×n and suppose that ATA has eigenvectors v1,×,vn,with corresponding eigenvalues λ1 ≥ · · · ≥ λn. If λ1, . . . , λr > 0, then rank(A) = r andAv1, . . . ,Avr for an orthogonal basis for Im(A).

87

Page 98: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proof. Choose two eigenvectors vi and vj. Then:

〈Avi,Avj〉 = vTi ATAvj = vTi λjvj = λ〈vi,vj〉 = 0,

because vi and vj are orthogonal by assumption. Therefore Av1, . . . ,Avr are orthogonaland by extension linearly independent.

Now, suppose that y = Ax for some x ∈ Rn. The vectors v1,×,vn form a basis for Rn

and hence there are α1, . . . , αn such that:

x =n∑

i=1

αivi

Compute:

y = A

(n∑

i=1

αivi

)=

n∑

i=1

αiAvi =n∑

i=1

αλivi =r∑

i=1

αλivi,

because λr+1, . . . , λn = 0. Thus, Av1, . . . ,Avr must form a basis for Im(A) and consequentlyrank(A) = r. This completes the proof. �

Remark 5.70.

Example 5.71. The matrix:

A =

[1 2 34 5 6

]

has rank 2. The matrix:

ATA =

17 22 2722 29 3627 36 45

also has rank 2. This must be the case by

88

Page 99: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

CHAPTER 6

Principal Components Analysis and Singular ValueDecomposition1

1. Goals of the Chapter

(1) Discuss covariance matrices(2) Show the meaning of the eigenvalues/eigenvectors of the covariance matrix.(3) Introduce Principal Components Analysis (PCA).(4) Illustrate the link between PCA and regression.(5) Introduce Singular Value Decomposition (SVD).(6) Illustrate how SVD can be used in image analysis.

Remark 6.1. Techniques that rely on matrix theory underly a substantial part of moderndata analysis. In this section we discuss a technique for performing dimensional reduction;i.e., taking data with many dimensions and simplifying it so that it has fewer dimensions.This approach can frequently help scientists and engineers understand critical factors gov-erning a system.

2. Some Elementary Statistics with Matrices

Definition 6.2 (Mean Vector). Let x1, . . . ,xn be n observations of a vector valuedprocess (e.g., samples from several sensors) so that for each i, xi ∈ R1×m that is xi is an mdimensional row vector. Then the mean (vector) of the observations is the vector:

(6.1) µ =1

n

n∑

i=1

xi

Where µ ∈ R1×m.

Definition 6.3 (Covariance Matrix). Let X ∈ Rn×m be the matrix formed from theobservations x1, . . . ,xn where the ith row of X is xi. Let M ∈ Rn×m be the matrix of n rowseach equal to µ. The covariance matrix of X is:

(6.2) C =1

n− 1(X−M)T (X−M) .

The matrix C ∈ Rm,m.

Lemma 6.4. The mean of X−M is 0.

Exercise 84. Prove Lemma 6.4

1This chapter assumes some familiarity with concepts from statistics.

89

Page 100: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 6.5. Suppose:

X =

1 23 45 6

Then:

µ =[3 4

]

which is just the vector of column means and

M =

3 43 43 4

Consequently:

X−M =

−2 −20 02 2

and the covariance matrix is:

C =1

2

[−2 0 2−2 0 2

]−2 −20 02 2

=

[4 44 4

]

Remark 6.6. The covariance matrix will not always consist of a single number. However,the following theorem is true, which we will not prove.

Exercise 85. Show that element Cii (i = 1, . . . ,m) is a (maximum likelihood) estimatorsfor the variance of the data from column i (taken from Sensor i).

Lemma 6.7. The covariance matrix C is a real-valued, symmetric matrix.

Remark 6.8. The covariance matrix essentially expresses the way the dimensions (sen-sor) are co-variate (i.e., vary with) each other. Higher covariance means there is a greatercorrelation between the numbers from one sensor and the numbers from a different sensor.

As we will see, the relationships between the dimensions (and their strengths) can becaptured by the eigenvalues and eigenvectors of the matrix C.

Remark 6.9. The eigenvectors and their corresponding eigenvalues provide informationabout the linear relationships that can be found within the covariance matrix and as a resultwithin the original data itself.

To see this, it is helpful to think of a matrix A ∈ Rm×m as a mathematical objectthat transforms any vector in x ∈ Rm×1 into a new vector in Rm×1 by multiplying Ax.This operation, rotates and stretches x in m-dimensional space. However, when x is aneigenvector, there is no rotation, only stretching. In a very powerful sense, eigenvectorsalready point in a preferred direction of the transformation. The magnitude of the eigenvaluecorresponding to that eigenvector provides information about the power of that direction.

The m eigenvectors, provide a new coordinate system that is, in a certain sense, properfor the matrix.

90

Page 101: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 6.10. Consider the covariance matrix:

C =

[4 44 4

]

Given a vector x = 〈x1, x2〉, we have:

Cx =

[4 44 4

] [x1x2

]=

[4x1 + 4x24x1 + 4x2

]

So this transformation pushes vectors in the direction of the vector 〈1, 1〉 (because the firstand second elements of Cx are identical). Clearly 〈1, 1〉 is an eigenvector of C with eigenvalue8. This matrix has a second (non-obvious) eigenvector 〈−1, 1〉 with eigenvalue 0. Thus allthe power in C lies in the direction of 〈1, 1〉.

Incidentally, 〈−1, 1〉 is the second eigenvalue precisely because it is an orthogonal vectorto 〈1, 1〉. Thus the two eigenvectors form an orthogonal pair of vectors. If we scale them sothey each have length 1, we obtain the orthonormal pair of eigenvectors:

w1 =

⟨1√2,

1√2

⟩w2 =

⟨− 1√

2,

1√2

Let us now look at a plot of the data in X from Example 6.5 that was used to compute C(see Figure 6.1). The data is shown plotted with the line point in the direction of 〈1, 1〉 (and

Figure 6.1. An extremely simple data set that lies along a line y − 4 = x − 3, inthe direction of 〈1, 1〉 containing point (3, 4).

its negation) and containing point (3, 4), the column means. Thus, the data lies preciselyalong the direction of most power for the covariance matrix.

3. Projection and Dimensional Reduction

Proposition 6.11. Let Y = X−M be modified data matrix with mean 0. Let yi = Yi.

be the ith row of modified data. Then yTi can be expressed as a vector with basis w1, . . . ,wm,the eigenvectors of C making up the columns of W by solving:

(6.3) Wz = yTi

so that:

(6.4) z = WTyTi

91

Page 102: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proof. If we express yTi in the basis {w1, . . . ,wm} we write:

(6.5) yTi = z1w1 + z2w2 + · · ·+ zmwm

By the Principal Axis Theorem, wi ∈ Rm×1 for i = 1, . . .m. Furthermore, w1, . . . ,wm forma basis for |Rm. Since yTi ∈ Rm×1 it follows that z1, . . . , zm must be real. Equation 6.5 canbe re-written in matrix form as:

yTi = Wz

We know W−1 = WT . Thus:

z = WTyTi

Example 6.12. Recall from our example we had:

Y = X−M =

−2 −20 02 2

Our eigenvectors for C are 〈1, 1〉 and 〈−1, 1〉. We can expression the rows of Y in terms ofthese eigenvectors. Notice:

〈−2,−2〉 = −2√

2 ·⟨

1√2,

1√2

⟩+ 0 ·

⟨−1√

2,

1√2

〈0, 0〉 = 0 ·⟨

1√2,

1√2

⟩+ 0 ·

⟨−1√

2,

1√2

〈2, 2〉 = 2√

2 ·⟨

1√2,

1√2

⟩+ 0 ·

⟨−1√

2,

1√2

Thus, the transformed points are (−2√

2, 0), (0, 0), and (2√

2, 0) and now we can truly seethe 1-dimensionality of the data (see Figure 6.2). Notice that we do not really need the

Figure 6.2. The one dimensional nature of the data is clearly illustrated in thisplot of the transformed data z.

second eigenvector at all (it has eigenvalue 0). We could simplify the problem by using onlythe coefficients of the first eigenvector: {−2

√2, 0, 2

√2}. This projects the transformed data

zi onto the x-axis.

92

Page 103: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

This projection process can be accomplished by using only the first column of W inEquation 6.4 (the first row of WT ). Note:

W =

[1/√

2 −1/√

2

1/√

2 1/√

2

]WT =

[1/√

2 1/√

2

−1/√

2 1/√

2

]

The first row of WT is:

WT1 =

[1/√

2 1/√

2]

Thus we could write:

(6.6) z′1 = WT1 y1 =

[1/√

2 1/√

2]·[−2−2

]= −4/

√2 = −2

(2/√

2)

= −2√

2

as expected. The remaining values 0 and 2√

2 can be constructed with the other two rowsof Y.

Notice, the 1 in y1 refers to the first data point. The 1 in WT1 refers to the fact we used

only 1 of the 2 eigenvectors. If we kept k out of m, we would right WTk .

We can transform back to the original data by taking the 1-dimensional data {−2√

2, 0, 2√

2}and multiplying it by W1 (this undoes multiplying by WT

1 because WTW = I2). To seethis note:

yT1 = W1z′1 =

[1/√

2

1/√

2

]·(

2√

2)

=

[−2−2

]

We can get back to X by adding M again.

Remark 6.13. In the simple example, we could reconstruct the original data exactlyfrom the 1-dimensional data because the original data itself sat on a line (a 1-dimensionalthing). In more complex cases, this will not be the case. However, we can use exactlythis approach to reduce the dimension and complexity of a given data set. Doing this iscalled principal components analysis (PCA). Steps for executing this process is provided inAlgorithm 4.

Exercise 86. Show that WTYT will result in a matrix that contains n column vectorsthat are z1, . . . , zn. Thus the expression:

Y′ =(WkW

Tk YT

)T

is correct.

4. An Extended Example

Example 6.14. We give a more complex example for a randomly generated sample ofdata in 2-dimensional space. We will project the data onto a 1-dimensional sub-space (line).The data are generated using a two-dimensional Gaussian distribution with mean 〈2, 2〉 andcovariance matrix:

(6.7) Σ =

[0.2 −0.7−0.7 4

]

The data are shown below for completeness:

93

Page 104: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Principal Components Analysis

Input: X ∈ Rn×m a data matrix where each column is a different dimension (sensor) and eachrow is a sample (replicants). There are m dimensions (sensors) and n samples.

(1) Compute the column mean vector µ ∈ R1×m.(2) Compute the mean matrix M ∈ Rn×m where each row of M is a copy of µ.(3) Compute Y = X−M. This matrix has column mean 0.(4) Compute the covariance matrix C = 1

(n−1)YTY

(5) Compute (λ1,w1), . . . , (λm,wm) the eigenvalue/eigenvector pairs of C in descendingorder of eigenvalue; i.e., λ1 ≥ λ2 ≥ · · · ≥ λm. Let W be the matrix whose columns arethe eigenvectors in order w1,w2, . . . ,wm.

(6) If a k < m dimensional representation (projection) of the data is desired, computeWk ∈ Rm×k consisting of the first k columns of W.

(7) Compute Yk =(WkW

TkY

T)T

. This operation will compute transformations andprojects for all rows of Y simultaneously.

(8) Compute Xk = Yk + M.

Output: Xk the reduced data set that lies entirely in a k-dimensional hyer-plane of Rm.

Algorithm 4. Principal Components Analysis with Back Projection

{{1.1913, 4.05873}, {1.53076, 5.27513}, {1.85309, 3.23638},

{1.99963, 1.10533}, {1.79767, 3.86304}, {2.25872, 2.10381},

{1.50469, 4.12347}, {2.13699, 1.84288}, {1.20712, 5.59894},

{1.43594, 2.86787}, {1.50379, 4.92747}, {1.50437, 3.88367},

{2.22937, -0.0357163}, {1.60838, 4.80397}, {2.52479, 0.287635},

{1.84461, 3.33785}, {2.75705, -1.566}, {1.58677, 3.91416},

{2.10225, 0.689372}, {1.99164, 2.03114}, {2.3703, 1.48555},

{2.25813, -0.2236}, {2.76285, 0.0886777}, {2.16664, 2.36102},

{1.87554, -0.133408}, {2.52679, -0.492959}, {2.27623, -0.130207},

{2.7388, -1.36069}, {2.15687, 1.29411}, {1.90101, 0.671318},

{2.02191, 2.60927}, {1.46282, 1.63502}, {2.13333, 0.958677},

{1.86464, 3.07403}, {1.84389, 3.45468}, {1.60883, 3.33228},

{2.51706, 1.44357}, {1.05347, 7.59858}, {1.898, 1.00438},

{1.50151, 3.41193}, {2.05665, 2.41876}, {1.79544, 1.48661},

{2.23181, 1.63454}, {1.2492, 3.67311}, {1.82897, 1.41699},

{1.72701, 4.46551}, {1.64191, 6.38833}, {2.47254, -0.427444},

{2.15246, 4.79382}, {2.16991, 1.48283}, {2.2715, 2.54674},

{2.08859, 2.58774}, {1.98126, 1.38378}, {1.69199, 2.68088},

{1.25897, 5.48203}, {1.69802, 2.20615}, {2.4989, -2.0593},

{2.11843, 0.643992}, {1.96406, -0.664882}, {2.16071, 1.09063},

{1.83942, 3.84346}, {1.35287, 5.54837}, {1.32731, 3.55062},

{2.08264, 2.49115}, {2.12898, 0.818264}, {1.61345, 3.2065},

{2.11461, 2.96489}, {2.15123, 2.82889}, {2.07051, 1.76971},

{1.77957, 1.3183}, {2.2917, 1.90551}, {1.75408, 4.31078},

{2.25497, 0.88574}, {1.91065, 2.12505}, {2.11302, -0.318024},

{0.974176, 4.73707}, {1.84714, 1.75565}, {1.73322, 2.78468},

94

Page 105: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

{2.40627, 0.140563}, {2.17967, 1.2649}, {1.43098, 3.16606},

{1.91726, 1.74352}, {2.31406, 0.9825}, {1.693, 1.69997},

{2.09722, 2.70155}, {2.31961, 0.120007}, {1.89179, -0.463541},

{1.35839, 3.59431}, {2.29766, 0.141463}, {1.504, 2.51007},

{1.65115, 3.86479}, {1.4336, 6.26426}, {1.488, 4.14824},

{1.66953, 3.85173}, {2.82247, 1.30438}, {1.66312, 2.1701},

{1.8374, 2.87124}, {2.39617, -2.14277}, {2.2254, 0.222176},

{2.73372, 0.194578}}

A scatter plot of the data with a contour plot of the probability density function for themultivariable Gaussian distribution is shown in Figure 6.3.

Figure 6.3. A scatter plot of data drawn from a multivariable Gaussian distribu-tion. The distribution density function contour plot is superimposed.

The mean vector can be computed for this data set as:

µ = 〈1.93236, 2.18539〉Note it is close to the true mean of the distribution function used to generate the data, whichwas 〈2, 2〉. We compute Y from X and µ and construct the covariance matrix:

C =1

99YTY =

[0.161395 −0.592299−0.592299 3.69488

]

The eigenvalues for this matrix are: λ1 = 3.79152 and λ2 = 0.0647539. Notice λ1 is muchlarger than λ2 because the data is largely stretched out along a line with negative slope. Thecorresponding eigenvectors are:

w1 =

[−0.1610330.986949

]w2 =

[−0.986949−0.161033

]

This yields:

W =

[−0.161033 −0.9869490.986949 −0.161033

]

95

Page 106: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

We can compute the transformed data Z = WTYT and we see that it has been adjustedso that the data X is now uncorrelated and centered at (0, 0). This is shown in Figure 6.4.Since there is so much more power in the first eigenvector, we can reduce the dimension of

Figure 6.4. Computing Z = WTYT creates a new uncorrelated data set that iscentered at 0.

the data set without losing a substantial amount of information. This will project the dataonto a line. We use:

W1 =

[−0.1610330.986949

]

Then compute:

X1 = W1WT1 YT + M

The result is shown in Figure 6.5.

Figure 6.5. The data is shown projected onto a linear subspace (line). This is thebest projection from 2 dimensions to 1 dimension under a certain measure of best.

96

Page 107: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Remark 6.15. It is worth noting that this projection in many dimensions (k > 1) createsdata that exists on a hyperplane, the multi-dimensional analogy to a line. Furthermore, thereis a specific measurement in which this projection is the best projection. The discussion ofthis is outside the scope of these notes, but it is worth noting that this projection is notarbitrary. (See Theorem 6.30 for a formal analogous statement.)

Remark 6.16. It is also worth noting that the projection Xk =(WkW

Tk YT

)Tis some-

times called the Kosambi-Karhunen-Loeve transform.

5. Singular Value Decomposition

Remark 6.17. Let A ∈ Rm×n. Recall from Theorem 5.68, every eigenvalue of ATA isnon-negative. We use this theorem and its proof in the following discussion.

Definition 6.18 (Singular Value). Let A ∈ Rm×n. Let λ1 ≥ λ2 ≥ · · · ≥ λn be theeigenvalues of ATA. If λ1, . . . , λr > 0, then the singular values of A are σ1 =

√λ1, . . . , σr =√

λr.

Lemma 6.19. Let A ∈ Rm×n and v1, . . . ,vn be the orthonormal eigenvectors of ATAwith corresponding eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn. If λ1, . . . , λr > 0, then for i = 1, . . . , r:

‖Avi‖ = σi

Proof. Recall from the proof of Theorem 5.68 that if v is an eigenvector of ATA witheigenvalue λ¡ then:

‖Av‖2 = λ

Use this result with λi and vi. It follows that:

‖Avi‖2 = λi = σ2i .

Therefore:

‖Avi‖ = σi

Proposition 6.20. Let A ∈ Rm×n. Then dim(Im(fA)) is the number of (non-zero)singular values.

Exercise 87. Prove Proposition 6.20. [Hint: See Theorem 5.69].

Remark 6.21. Recall we proved the the eigenvalues of ATA are non-negative in Theorem5.68. Therefore the singular values of A always exist.

Definition 6.22 (Singular Value Decomposition). Let A ∈ Rm×n. The singular valuedecomposition consists of a orthogonal matrix U ∈ Rm×m, a positive matrix Σ ∈ Rm×n

entries Σij = 0 if i 6= j and a orthogonal matrix V ∈ Rn×n so that:

(6.8) A = UΣVT

Theorem 6.23. For any matrix A ∈ Rm×n, the singular value decomposition exists.

97

Page 108: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Proof. Recall from Theorem 5.69 that the eigenvectors v1, . . . ,vn of ATA form anorthogonal basis of Rn. Without loss of generality, assume this basis is orthonormal. Fur-thermore, from Theorem 5.69, the vectors Av1, . . . ,Avr form an orthogonal basis of Im(A),where r is the number of non-zero eigenvalues (singular values) of ATA. Scale the vectorsAv1, . . . ,Avr to form the orthonormal set {u1, . . . ,ur}, where from Lemma 6.19:

ui =1

‖Avi‖Avi =

1

σiAvi

Therefore: Avi = σiui. Using the Gram-Schmidt procedure we can extend this to anorthonormal basis {u1, . . . ,um}. Do this by choosing any vectors not in the image of A andapplying Gram-Schmidt and normalizing.

Let V be the matrix whose columns are v1, . . . ,vn and let U be the matrix whosecolumns are u1, . . . ,um. Both are orthonormal matrices. Let D be the r×r diagonal matrixof (non-zero) singular values so that by construction:

ATA = V

[D2 00 0

]VT

is the diagonalization of the n× n symmetric matrix ATA. Notice we have D2 because fori = 1, . . . , r, the ith singular value σi =

√λi. Define Σ as the m× n matrix:

Σ =

[D 00 0

],

where D is surrounded by a sufficient number of 0’s to make Σ m× n.Let U1 be the m × r matrix composed of columns u1, . . . ,ur and let U2 be composed

of the remaining columns ur+1, . . . ,um. Likewise let V1 be the n × r matrix composed ofv1, . . . ,vr and let V2 be composed of the remaining columns. We know that:

ATAvi = 0

for i = r+1, . . . , n. Therefore vi ∈ Ker(ATA) for i = r+1, . . . , n. It follows from Proposition5.65 that vi ∈ Ker(A) for i = r + 1, . . . , n. Therefore:

AV2 = 0

Finally note that:

UΣ =[U1 U2

] [D 00 0

]=[U1D 0

]

But:

U1D =[u1 · · · ur

]

σ1 0 · · · 00 σ2 · · · 0...

.... . .

...0 0 · · · σr

=

[Av1

σ1· · · Av1

σr

]

σ1 0 · · · 00 σ2 · · · 0...

.... . .

...0 0 · · · σr

=

[Av1 · · ·Avr

]= AV1

Therefore:

UΣ =[AV1 0

]=[AV1 AV2

]= AV

98

Page 109: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

By construction V is a unitary matrix, therefore we can multiply by VT on the right to seethat:

UΣVT = A

This completes the proof. �

Remark 6.24. Notice we used results on ATA in constructing this. We could (however)have used AAT instead. It is therefore easy to see that ATA and AAT must share non-zeroeigenvalues and thus the singular values of A are unique. The singular value decompositionhowever, is not unique.

Exercise 88. Show that the matrix AAT could be used in this process instead. As aconsequence, show that AAT and ATA share non-zero eigenvalues.

Theorem 6.25. Let A ∈ Rm×n. Then:

AAT = UD1UT(6.9)

ATA = VD2VT(6.10)

where the columns of U and V are orthornormal eigenvectors. Furthermore, D1 and D2

share non-zero eigenvalues and Σ has non-zero elements corresponding to the singular valuesof A.

Proof. We argue in Remark 6.24 that the non-zero eigenvalues of AAT and AAT areidentical. By Theorem 6.23 the singular value of A exists and thus:

A = UΣVT

for some orthogonal matrices U and VT . Therefore:

AAT = UΣVTVΣTUT = UΣΣTUT = UD1UT

ATA = VΣTUTUΣVT = VΣTΣVT = VD2VT

It is easy to check that D1 and D2 are diagonal matrices with the eigenvalues of ATA onthe diagonal. This completes the proof. �

Exercise 89. Check that D1 and D2 are diagonal matrices with the eigenvalues of ATA

Example 6.26. Let:

A =

[3 2 11 2 3

]

we can compute:

AAT =

[14 1010 14

]=

[ 1√2− 1√

21√2

1√2

] [24 00 4

] [ 1√2

1√2

− 1√2

1√2

]

ATA =

10 8 68 8 86 8 10

=

1√3− 1√

21√6

1√3

0 −√

23

1√3

1√2

1√6

24 0 00 4 00 0 0

1√3

1√3

1√3

− 1√2

0 1√2

1√6−√

23

1√6

99

Page 110: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Thus we have:

U =

[ 1√2− 1√

21√2

1√2

]

V =

1√3− 1√

21√6

1√3

0 −√

23

1√3

1√2

1√6

We construct Σ as:

Σ =

[√24 0 0

0√

4 0

]=

[2√

6 0 00 2 0

]

Thus we have:

A = UΣVT =

[ 1√2− 1√

21√2

1√2

] [2√

6 0 00 2 0

]

1√3

1√3

1√3

− 1√2

0 1√2

1√6−√

23

1√6

Remark 6.27. One of the more useful elements of the singular value decomposition isits ability to project vectors into a lower dimensional space, just like in principal componentsanalysis and to provide a reasonable approximation to the matrix A using less data. Anotheris to increase the sparsity of A by finding a matrix Ak that approximate A. To formalizethis, we need a notion of a distance on among matrices.

Definition 6.28 (Frobenius Norm). Let A ∈ Rm×n. The Frobenius Norm of A is thereal value:

(6.11) ‖A‖F =

√√√√m∑

i=1

n∑

j=1

|Aij|2

Remark 6.29. The proof of the following theorem is outside the scope of these notes,but is accessible to a motivated student. Our interested in it is purely as an application,especially to image processing.

Theorem 6.30. Let A ∈ Rm×n with singular value decomposition A = UΣVT . Supposethe singular values are organized in descending order. Let Σk be the k × k diagonal matrixthat retains the largest k singular values. Let Uk ∈ Rm×k retain the first k columns of Uand let Vk ∈ Rn×k retain the first k columns of V. Then:

(6.12) Ak = UkΣkVk

has the property that it is the m×n rank k matrix minimizing the matrix distance ‖A−Ak‖F .

Remark 6.31. The previous theorem gives a recipe for approximating an arbitrary ma-trix with a matrix with less data in it. This can be particularly useful for image processingas we’ll illustrate below.

100

Page 111: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 6.32. Suppose we use only the largest singular value 2√

6 to recover the matrixA from Example 6.26. Using Theorem 6.30 we keep only the first columns of U and V andcompute:

(6.13) A1 =

[ 1√21√2

]2√

6[ 1√

31√3

1√3

]=

[2 2 22 2 2

]

Obviously this matrix has rank 1 (only one linearly independent column). It is also anapproximation of A (it’s actually a matrix of column means).

Example 6.33 (Application). A grayscale image of width n pixels and height m pixelscan be stored as an m × n matrix of grayscale values. If each grayscale value occupies 8bits, then transmitting this image with no compression requires transmission of 8mn bits.Transmission from space is complex; early Viking missions to Mars required a way to reducetransmission size and so used a Singular Value Decomposition compression method2, whichwe illustrate below. Suppose we have the grayscale image shown in Figure 6.6: The original

Figure 6.6. A gray scale version of the image found at http://hanna-barbera.

wikia.com/wiki/Scooby-Doo_(character)?file=Scoobydoo.jpg. CopyrightHannah-Barbara used under the fair use clause of the Copyright Act.

image is 391×272 pixels. Transmission of this image (uncompressed) would require 850, 816bits or approximately 106 kB (on the computer on which this document was written, itoccupies 111 kB due to various technical aspects of storing data on a hard drive). Using asingular value decomposition, we see the singular values show a clear decay in their value.This is illustrated in Figure 6.7. The fact that the singular values exhibit such steep declinemeans we can choose only a few (say between 15 and 50) and these (along with Uk and Vk

can be used to construct the image. In particular if we send just the vectors in Uk and Vk

and the singular values, then:

(1) If we choose 15 singular values, then we need only transmit 391 ·15 + 15 + 272 ·15 =9960 values to reconstruct the image. As compared to the original 106, 352 =391 · 272 values. This is a savings about 90%.

(2) If we choose 50 singular values, then we need only transmit 33, 200, a savings ofabout 67%.

The two reconstructed images using 15 and 50 singular values are shown in Figure 6.8.

2This may be apocryphal I cannot find a reference for this.

101

Page 112: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

0 50 100 150 200 250 300Index

0

1

2

3

4

5

6

7

Sing

ular

Val

ue

×104 Singular Values by Index

Singular Values

(a) Singular Values

0 50 100 150 200 250 300Index

0

2

4

6

8

10

12

Log(σi)

Log-Plot of Singular Values

Log(σ i)

(b) Log Singular Values

Figure 6.7. The singular values of the image matrix corresponding to the imagein Figure 6.6. Notice the steep decay of the singular values.

(a) 15 Singular Val-ues

(b) 50 Singular Val-ues

Figure 6.8. Reconstructed images from 15 and 50 singular values capture a sub-stantial amount of detail for substantially smaller transmission sizes.

Remark 6.34. It is worth noting that there has been recent work on using singularvalue decomposition as a part of image steganography; i.e., hiding messages inside images[Wen06].

102

Page 113: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

CHAPTER 7

Linear Algebra for Graphs and Markov Chains

1. Goals of the Chapter

(1) Introduce graphs.(2) Define the adjacency matrix and its properties.(3) Discuss the eigenvalue/eigenvector properties of the adjacency matrix.(4) Define eigenvector centrality.(5) Introduce Markov chains and the stochastic matrix and stationary probabilities as

an eigenvector.(6) Introduce the Page Rank Algorithm and relate it to eigenvector centrality.(7) Discuss Graph Laplacian and its uses.

2. Graphs, Multi-Graphs, Simple Graphs

Definition 7.1 (Graph). A graph is a tuple G = (V,E) where V is a (finite) set ofvertices and E is a finite collection of edges. The set E contains elements from the unionof the one and two element subsets of V . That is, each edge is either a one or two elementsubset of V .

Example 7.2. Consider the set of vertices V = {1, 2, 3, 4}. The set of edges

E = {{1, 2}, {2, 3}, {3, 4}, {4, 1}}Then the graph G = (V,E) has four vertices and four edges. It is usually easier to representthis graphically. See Figure 7.1 for the visual representation of G. These visualizations

1 2

4 3

Figure 7.1. It is easier for explanation to represent a graph by a diagram in whichvertices are represented by points (or squares, circles, triangles etc.) and edges arerepresented by lines connecting vertices.

are constructed by representing each vertex as a point (or square, circle, triangle etc.) andeach edge as a line connecting the vertex representations that make up the edge. That is, letv1, v2 ∈ V . Then there is a line connecting the points for v1 and v2 if and only if {v1, v2} ∈ E.

103

Page 114: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 7.3 (Self-Loop). If G = (V,E) is a graph and v ∈ V and e = {v}, thenedge e is called a self-loop. That is, any edge that is a single element subset of V is called aself-loop.

Exercise 90. Graphs occur in every day life, but often behind the scenes. Provide anexample of a graph (or something that can be modeled as a graph) that appears in everydaylife.

Definition 7.4 (Vertex Adjacency). Let G = (V,E) be a graph. Two vertices v1 andv2 are said to be adjacent if there exists an edge e ∈ E so that e = {v1, v2}. A vertex v isself-adjacent if e = {v} is an element of E.

Definition 7.5 (Edge Adjacency). Let G = (V,E) be a graph. Two edges e1 and e2 aresaid to be adjacent if there exists a vertex v so that v is an element of both e1 and e2 (assets). An edge e is said to be adjacent to a vertex v if v is an element of e as a set.

Definition 7.6 (Neighborhood). Let G = (V,E) be a graph and let v ∈ V . Theneighbors of v are the set of vertices that are adjacent to v. Formally:

(7.1) N(v) = {u ∈ V : ∃e ∈ E (e = {u, v} or u = v and e = {v})}

In some texts, N(v) is called the open neighborhood of v while N [v] = N(v) ∪ {v} is calledthe closed neighborhood of v. This notation is somewhat rare in practice. When v is anelement of more than one graph, we write NG(v) as the neighborhood of v in graph G.

Exercise 91. Find the neighborhood of Vertex 1 in the graph in Figure 7.2.

Remark 7.7. Expression 7.1 is read

N(v) is the set of vertices u in (the set) V such that there exists an edge ein (the set) E so that e = {u, v} or u = v and e = {v}.

The logical expression ∃x (R(x)) is always read in this way; that is, there exists x so thatsome statement R(x) holds. Similarly, the logical expression ∀y (R(y)) is read:

For all y the statement R(y) holds.

Admittedly this sort of thing is very pedantic, but logical notation can help immensely insimplifying complex mathematical expressions1.

Remark 7.8. The difference between the open and closed neighborhood of a vertex canget a bit odd when you have a graph with self-loops. Since this is a highly specialized case,usually the author (of the paper, book etc.) will specify a behavior.

Example 7.9. In the graph from Example 7.2, the neighborhood of Vertex 1 is Vertices2 and 4 and Vertex 1 is adjacent to these vertices.

1When I was in graduate school, I always found Real Analysis to be somewhat mysterious until I gotused to all the ε’s and δ’s. Then I took a bunch of logic courses and learned to manipulate complex logicalexpressions, how they were classified and how mathematics could be built up out of Set Theory. Suddenly,Real Analysis (as I understood it) became very easy. It was all about manipulating logical sentences aboutthose ε’s and δ’s and determining when certain logical statements were equivalent. The moral of the story:if you want to learn mathematics, take a course or two in logic.

104

Page 115: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 7.10 (Degree). Let G = (V,E) be a graph and let v ∈ V . The degree of v,written deg(v) is the number of non-self-loop edges adjacent to v plus two times the numberof self-loops defined at v. More formally:

deg(v) = |{e ∈ E : ∃u ∈ V (e = {u, v})}|+ 2 |{e ∈ E : e = {v}}|Here if S is a set, then |S| is the cardinality of that set.

Remark 7.11. Note that each vertex in the graph in Figure 7.1 has degree 2.

Example 7.12. If we replace the edge set in Example 7.9 with:

E = {{1, 2}, {2, 3}, {3, 4}, {4, 1}, {1}}then the visual representation of the graph includes a loop that starts and ends at Vertex 1.This is illustrated in Figure 7.2. In this example the degree of Vertex 1 is now 4. We obtain

1 2

4 3

Self-Loop

Figure 7.2. A self-loop is an edge in a graph G that contains exactly one vertex.That is, an edge that is a one element subset of the vertex set. Self-loops areillustrated by loops at the vertex in question.

this by counting the number of non self-loop edges adjacent to Vertex 1 (there are 2) andadding two times the number of self-loops at Vertex 1 (there is 1) to obtain 2 + 2× 1 = 4.

Definition 7.13 (Simple Graph). A graph G = (V,E) is a simple graph if G has noedges that are self-loops and if E is a subset of two element subsets of V ; i.e., G is not amulti-graph.

Remark 7.14. We will assume that every graph we discuss is a simple graph and we willuse the term graph to mean simple graph. When a particular result holds in a more generalsetting, we will state it explicitly.

3. Directed Graphs

Definition 7.15 (Directed Graph). A directed graph (digraph) is a tuple G = (V,E)where V is a (finite) set of vertices and E is a collection of elements contained in V × V .That is, E is a collection of ordered pairs of vertices. The edges in E are called directededges to distinguish them from those edges in Definition 7.1

Definition 7.16 (Source / Destination). Let G = (V,E) be a directed graph. Thesource (or tail) of the (directed) edge e = (v1, v2) is v1 while the destination (or sink orhead) of the edge is v2.

105

Page 116: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Remark 7.17. A directed graph (digraph) differs from a graph only insofar as we replacethe concept of an edge as a set with the idea that an edge as an ordered pair in which theordering gives some notion of direction of flow. In the context of a digraph, a self-loop is anordered pair with form (v, v). We can define a multi-digraph if we allow the set E to be atrue collection (rather than a set) that contains multiple copies of an ordered pair.

Remark 7.18. It is worth noting that the ordered pair (v1, v2) is distinct from the pair(v2, v1). Thus if a digraph G = (V,E) has both (v1, v2) and (v2, v1) in its edge set, it is nota multi-digraph.

Example 7.19. We can modify the figures in Example 7.9 to make it directed. Supposewe have the directed graph with vertex set V = {1, 2, 3, 4} and edge set:

E = {(1, 2), (2, 3), (3, 4), (4, 1)}This digraph is visualized in Figure 7.3(a). In drawing a digraph, we simply append arrow-heads to the destination associated with a directed edge.

We can likewise modify our self-loop example to make it directed. In this case, our edgeset becomes:

E = {(1, 2), (2, 3), (3, 4), (4, 1), (1, 1)}This is shown in Figure 7.3(b).

1 2

4 3

(a)

1 2

4 3

(b)

Figure 7.3. (a) A directed graph. (b) A directed graph with a self-loop. In adirected graph, edges are directed; that is they are ordered pairs of elements drawnfrom the vertex set. The ordering of the pair gives the direction of the edge.

Definition 7.20 (In-Degree, Out-Degree). Let G = (V,E) be a digraph. The in-degreeof a vertex v in G is the total number of edges in E with destination v. The out-degree ofv is the total number of edges in E with source v. We will denote the in-degree of v bydegin(v) and the out-degree by degout(v).

Remark 7.21. Notions like edge and vertex adjacency and neighborhood can be extendedto digraphs by simply defining them with respect to the underlying graph of a digraph. Thusthe neighborhood of a vertex v in a digraph G is N(v) computed in the underlying graph.

Definition 7.22 (Walk). A walk on a directed graph G = (V,E) is a sequence w =(v1, e1, v2, . . . , vn−1, en−1, vn) with vi ∈ V for i = 1, . . . , n, ei ∈ E and and ei = (vi, vi+1)for i = 1, . . . , n − 1. A walk on an undirected graph is defined in the same way exceptei = {vi, vi+1}.

106

Page 117: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Definition 7.23 (Walk Length). The length of a walk w is the number of edge itcontains.

Definition 7.24 (Path / Cycle). Let G = (V,E) be a (directed) graph. A walk w =(v1, e1, v2, . . . , vn−1, en−1, vn) is a path if for each i = 1, . . . , n, vi occurs only once in thesequence w. A walk is a cycle if the walk w′ = (v1, e1, v2, . . . , vn−1) is a path and v1 = vn.

Example 7.25. We illustrate a walk that is also a path and a different cycle in Figure7.4. The walk has length 3, the path has length 4.

12

3

4

5

(a) Walk

12

3

4

5

(b) Cycle

Figure 7.4. A walk (a) and a cycle (b) are illustrated.

Definition 7.26 (Connected Graph). A graph G = (V,E) is connected if for every pairof vertices v, u ∈ V there is at least one walk w that begins with v and ends with u. In thecase of a directed graph, we say the graph is strongly connected when there is a (directed)walk from v to u.

Example 7.27. Figure 7.5 we illustrate a connected graph, a disconnected graph and aconnected digraph that is not strongly connected.

4. Matrix Representations of Graphs

Definition 7.28 (Adjacency Matrix). Let G = (V,E) be a graph and assume thatV = {v1, . . . , vn}. The adjacency matrix of G is an n× n matrix M defined as:

Mij =

{1 {vi, vj} ∈ E0 else

12

3

4

5

(a) Connected

12

3

4

5

(b) Disconnected

Figure 7.5. A connected graph (a) and a disconnected graph (b).

107

Page 118: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

1 2

3 4

Figure 7.6. The adjacency matrix of a graph with n vertices is an n × n matrixwith a 1 at element (i, j) if and only if there is an edge connecting vertex i to vertexj; otherwise element (i, j) is a zero.

Proposition 7.29. The adjacency matrix of a (simple) graph is symmetric.

Exercise 92. Prove Proposition 7.29.

Theorem 7.30. Let G = (V,E) be a graph with V = {v1, . . . , vn} and let M be itsadjacency matrix. For k ≥ 0, the (i, j) entry of Mk is the number of walks of length k fromvi to vj.

Proof. We will proceed by induction. By definition, M0 is the n × n identity matrixand the number of walks of length 0 between vi and vj is 0 if i 6= j and 1 otherwise, thusthe base case is established.

Now suppose that the (i, j) entry of Mk is the number of walks of length k from vi to vj.We will show this is true for k + 1. We know that:

(7.2) Mk+1 = MkM

Consider vertices vi and vj. The (i, j) element of Mk+1 is:

(7.3) Mk+1ij =

(Mk

i·)M·j

Let:

(7.4) Mki· =

[r1 . . . rn

]

where rl, (l = 1, . . . , n), is the number of walks of length k from vi to vl by the inductionhypothesis. Let:

(7.5) M·j =

b1...bn

where bl, (l = 1, . . . , n), is a 1 if and only there is an edge {vl, vj} ∈ E and 0 otherwise.Then the (i, j) term of Mk+1 is:

(7.6) Mk+1ij = Mk

i·M·j =n∑

l=1

rlbl

This is the total number of walks of length k leading to a vertex vl, (l = 1, . . . , n), fromvertex vi such that there is also an edge connecting vl to vj. Thus Mk+1

ij is the number ofwalks of length k + 1 from vi to vj. The result follows by induction. �

108

Page 119: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Example 7.31. Consider the graph in Figure 7.6. The adjacency matrix for this graphis:

(7.7) M =

0 1 1 11 0 0 11 0 0 11 1 1 0

Consider M2:

(7.8) M2 =

3 1 1 21 2 2 11 2 2 12 1 1 3

This tells us that there are three distinct walks of length 2 from vertex v1 to itself. Thesewalks are obvious:

(1) (v1, {v1, v2}, v2, {v1, v2}, v1)(2) (v1, {v1, v2}, v3, {v1, v3}, v1)(3) (v1, {v1, v4}, v4, {v1, v4}, v1)

We also see there is 1 path of length 2 from v1 to v2: (v1, {v1, v4}, v4, {v2, v4}, v2). We canverify each of the other numbers of paths in M2.

Definition 7.32 (Directed Adjacency Matrix). Let G = (V,E) be a directed graph andassume that V = {v1, . . . , vn}. The adjacency matrix of G is an n× n matrix M defined as:

Mij =

{1 (vi, vj) ∈ E0 else

Theorem 7.33. Let G = (V,E) be a digraph with V = {v1, . . . , vn} and let M be itsadjacency matrix. For k ≥ 0, the (i, j) entry of Mk is the number of directed walks of lengthk from vi to vj.

Exercise 93. Prove Theorem 7.33. [Hint: Use the approach in the proof of Theorem7.30.]

5. Properties of the Eigenvalues of the Adjacency Matrix

Lemma 7.34 (Rational Root Theorem). Let anxn + · · ·+ a1x+ a0 = 0 for x = p/q with

gcd(p, q) = 1 and an, . . . , a0 ∈ Z. Then p is an integer factor by a0 and q is an integer factorof an. �

Theorem 7.35. Let G = (V,E) be a graph with adjacency matrix M. Then:

(1) Every eigenvalue of M is real and(2) If λ is a rational eigenvalue of M, then it is integer.

Proof. The first part follows at once from the Spectral Theorem for Real Matrices.The second observation follows from the fact that the characteristic polynomial will consistof all integer coefficients because the adjacency matrix consists only of ones and zeros.

109

Page 120: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Consequently, any rational eigenvalue (root of the characteristic equation) x = p/q musthave q a factor of 1 (the coefficient of λn, where n is the number of vertices). Therefore anyrational eigenvalue is an integer. �

Definition 7.36 (Irreducible Matrix). A matrix M ∈ Rn×n is irreducible if for each(i, j) pair, there is some k ∈ Z with k > 0 so that Mk

ij > 0.

Lemma 7.37. If G = (V,E) is a connected graph with adjacency matrix M, then M isirreducible.

Exercise 94. Prove Lemma 7.37.

Theorem 7.38 (Perron-Frobenius Theorem). If M is an irreducible matrix, then M hasan eigenvalue λ0 with the following properties:

(1) The eigenvalue λ0 is positive and if λ is an alternative eigenvalue of M, then λ0 ≥|λ|,

(2) The matrix M has an eigenvectors v0 corresponding to λ0 with only positive entrieswhen properly scaled,

(3) The eigenvalue λ0 is a simple root of the characteristic equation for M and thereforehas a unique (up to scale) eigenvector v0.

(4) The eigenvector v0 is the only eigenvector of M that can have all positive entrieswhen properly scaled.

Remark 7.39. The Perron-Frobenius theorem is a classical result in Linear Algebra withseveral proofs (see [Mey01]). Also, note the quote from Meyer that starts this chapter.

Corollary 7.40. If G = (V,E) is a connected graph with adjacency matrix M, thenit has a unique largest eigenvalue which corresponds to an eigenvector that is positive whenproperly scaled.

Proof. Applying Lemma 7.37 we see that M is irreducible. Further, we know thatthere is an eigenvalue λ0 of M that is (i) greater than or equal to in absolute value all othereigenvalues of M and (ii) a simple root. From Theorem 7.35, we know that all eigenvaluesof M are real. But for (i) and (ii) to hold, no other (real) eigenvalue can have value equalto λ0 (otherwise it would not be a simple root). Thus, λ0 is the unique largest eigenvalue ofM. This completes the proof. �

6. Eigenvector Centrality

Remark 7.41. This approach to justifying eigenvector centrality comes from Leo Spizzirri[Spi11]. It is reasonably nice, and fairly rigorous. This is not meant to be anymore than ajustification. It is not a proof of correctness. Before proceeding, we recall the principal axistheorem:

Theorem 7.42 (Principal Axis Theorem). Let M ∈ Rn×n be a symmetric matrix. ThenRn has a basis consisting of the eigenvectors of M. �

Remark 7.43 (Eigenvector Centrality). We can assign to each vertex of a graph G =(V,E) a score (called its eigenvector centrality) that will determine its relative importance

110

Page 121: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

in the graph. Here importance it measured in a self-referential way: important verticesare important precisely because they are adjacent to other important vertices. This self-referential definition can be resolved in the following way.

Let xi be the (unknown) score of vertex vi ∈ V and let xi = κ(vi) with κ being thefunction returning the score of each vertex in V . We may define xi as a pseudo-average ofthe scores of its neighbors. That is, we may write:

(7.9) xi =1

λ

v∈N(vi)

κ(v)

Here λ will be chosen endogenously during computation.Recall that Mi· is the ith row of the adjacency matrix M and contains a 1 in position j

if and only if vi is adjacent to vj; that is to say vj ∈ N(vi). Thus we can rewrite Equation7.9 as:

xi =1

λ

n∑

j=1

Mijxj

This leads to n equations, one for vertex in V (or each row of M). Written as a matrixexpression we have:

(7.10) x =1

λMx =⇒ λx = Mx

Thus x is an eigenvector of M and λ is its eigenvalue.Clearly, there may be several eigenvectors and eigenvalues for M. The question is, which

eigenvalue / eigenvector pair should be chosen? The answer is to choose the eigenvector withall positive entries corresponding to the largest eigenvalue. We know such an eigenvalue /eigenvector pair exists and is unique as a result of the Perron-Frobenius Theorem and Lemma7.37.

Theorem 7.44. Let G = (V,E) be a connected graph with adjacency matrix M ∈ Rn×n.Suppose that λ0 is the largest real eigenvalue of M and has corresponding eigenvalue v0. Ifx ∈ Rn×1 is a column vector so that x · v0 6= 0, then

(7.11) limk→∞

Mkx

λk0= α0v0

Proof. Applying Theorem 7.42 we see that the eigenvectors of M must form a basis forRn. Thus, we can express:

(7.12) x = α0v0 + α1v1 + · · ·+ αn−1vn−1

Multiplying both sides by Mk yields:

(7.13) Mkx = α0Mkv0+α1M

kv1+· · ·+αn−1Mkvn−1 = α0λk0v0+α1λ

k1v1+· · ·+αn−1λknvn−1

because Mkvi = λki vi for any eigenvalue vi. Dividing by λk0 yields:

(7.14)Mkx

λk0= α0v0 + α1

λk1λk0

v1 + · · ·+ αn−1λkn−1λk0

vn−1

111

Page 122: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Applying the Perron-Frobenius Theorem (and Lemma 7.37) we see that λ0 is greater thanthe absolute value of any other eigenvalue and thus we have:

(7.15) limk→∞

λkiλk0

= 0

for i 6= 0. Thus:

(7.16) limk→∞

Mkx

λk0= α0v0

Remark 7.45. We can use Theorem 7.44 to justify our definition of eigenvector centralityas the eigenvector corresponding to the largest eigenvalue. Let x be a vector with a 1 atindex i and 0 everywhere else. This vector corresponds to beginning at vertex vi in graphG with n vertices. If M is the adjacency matrix, then Mx is the ith column of M whosejth index tells us the number of walks of length 1 leading from vertex vj to vertex vi and bysymmetry the number of walks leading from vertex vi to vertex vj. We can repeat this logicto see that Mkx gives us a vector of whose jth element is the number of walks of length kfrom vi to vj. Note for the remainder of this discussion, we will exploit the symmetry thatthe (i, j) element of Mk is both the number of walks from i to j and the number of walksfrom j to i.

From Theorem 7.44 we know that no matter which vertex we choose in creating x that:

(7.17) limk→∞

Mkx

λ0= α0v0

Reinterpreting Equation 7.17 we observe that as k →∞, Mkx will converge to some multipleof the eigenvector corresponding to the eigenvalue λ0. That is, the eigenvector correspondingto the largest eigenvalue is a multiple of the number of walks of length k leading from someinitial vertex i, since the Perron-Frobeinus eigenvector is unique (up to a scale).

Now, suppose that we were going to choose one of these (many) walks of length k atrandom, we might ask what is the probability of arriving at vertex j given that we startedat vertex i. If we let y = Mkx then the total number of walks leading from i is:

(7.18) T =n∑

j=1

yj

The probability of ending up at vertex j then is the jth element of the vector 1TMkx. Put

more intuitively, if we start at a vertex and begin wandering along the edges of the graph atrandom, the probability of ending up at vertex j after k steps is the jth element of 1

TMkx.

If we let: |x|1 be the sum of the components of a vector x, then from Equation 7.17 wemay deduce:

(7.19) limk→∞

Mkx

|Mkx|1=

v0

|v0|1Thus, eigenvector centrality tells us a kind of probability of ending up at a specific vertexj assuming we are allowed to wander along the edges of a graph (from any starting vertex)for an infinite amount of time. Thus, the more central a vertex is, the more likely we will

112

Page 123: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

arrive at it as we move through the edges of the graph. More importantly, we see theself-referential nature of eigenvector centrality: the more likely we are to arrive at a givenvertex after walking along the edges of a graph, the more likely we are to arrive at one of itsneighbors. Thus, eigenvector centrality is a legitimate measure of vertex importance if onewishes to measure the chances of ending up at a vertex when wondering around a graph.We will discuss this type of model further when we investigate random walks on graph.

Example 7.46. Consider the graph shown in Figure 7.7. Recall from Example 7.31 this

1 2

3 4

Figure 7.7. A matrix with 4 vertices and 5 edges. Intuitively, vertices 1 and 4should have the same eigenvector centrality score as vertices 2 and 3.

graph had adjacency matrix:

M =

0 1 1 11 0 0 11 0 0 11 1 1 0

We can use a computer to determine the eigenvalues and eigenvectors of M. The eigenvaluesare:

{0,−1,

1

2+

1

2

√17,

1

2− 1

2

√17,

}

while the corresponding floating point approximations of the eigenvalues the columns of thematrix:

0.0 −1.0 1.0 1.000000001

−1.0 0.0 0.7807764064 −1.280776407

1.0 0.0 0.7807764069 −1.280776408

0.0 1.0 1.0 1.0

The largest eigenvalue is λ0 = 12

+ 12

√17 which has corresponding eigenvector:

v0 =

1.00.78077640640.7807764064

1.0

113

Page 124: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

We can normalize this vector to be:

v0 =

0.2807764065

0.2192235937

0.2192235937

0.2807764065

Illustrating that vertices 1 and 4 have identical (larger) eigenvector centrality scores andvertices 2 and 3 have identical (smaller) eigenvector centrality scores. By way of comparison,consider the vector:

x =

1000

We consider Mkx/|Mkx|1 for various values of k:

M1x

|M1x|1=

0.0

0.3333333333

0.3333333333

0.3333333333

M10x

|M10x|1=

0.2822190823

0.2178181007

0.2178181007

0.2821447163

M20x

|M20x|1=

0.2807863651

0.2192136380

0.2192136380

0.2807863590

M40x

|M40x|1=

0.2807764069

0.2192235931

0.2192235931

0.2807764069

It’s easy to see that as k → ∞, Mkx/|Mkx|1 approaches the normalized eigenvector cen-trality scores as we expected.

7. Markov Chains and Random Walks

Remark 7.47. Markov Chains are a type of directed graph in which we assign to eachedge a probability of walking along that edge given we imagine ourselves standing in a specificvertex adjacent to the edge. Our goal is to define Markov chains, and random walks on agraph in reference to a Markov chain and show that some of the properties of graphs canbe used to derive interesting properties of Markov chains. We’ll then discuss another wayof ranking vertices; this one is used (more-or-less) by Google for ranking webpages in theirsearch.

Definition 7.48 (Markov Chain). A discrete time Markov Chain is a tupleM = (G, p)where G = (V,E) is a directed graph and the set of vertices is usually referred to as thestates, the set of edges are called the transitions and p : E → [0, 1] is a probability assignmentfunction satisfying:

(7.20)∑

v′∈No(v)

p(v, v′) = 1

114

Page 125: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

for all v ∈ V . Here, No(v) is the neighborhood reachable by out-edge from v. If there is noedge (v, v′) ∈ E then p(v, v′) = 0.

Remark 7.49. There are continuous time Markov chains, but these are not in the scopeof these notes. When we say Markov chain, we mean discrete time Markov chain.

Example 7.50. A simple Markov chain is shown in Figure 7.8. We can think of aMarkov chain as governing the evolution of state as follows. Think of the states as citieswith airports. If there is an out-edge connecting the current city to another city, then wecan fly from our current city to this next city and we do so with some probability. Whenwe do fly (or perhaps don’t fly and remain at the current location) our state updates to thenext city. In this case, time is treated discretely.

1 21

2

1

7

6

7

1

2

Figure 7.8. A Markov chain is a directed graph to which we assign edge proba-bilities so that the sum of the probabilities of the out-edges at any vertex is always1.

A walk along the vertices of a Markov chain governed by the probability function is calleda random walk.

Definition 7.51 (Stochastic Matrix). Let M = (G, p) be a Markov chain. Then thestochastic matrix (or probability transition matrix) of M is:

(7.21) Mij = p(vi, vj)

Example 7.52. The stochastic matrix for the Markov chain in Figure 7.8 is:

M =

[12

12

17

67

]

Thus a stochastic matrix is very much like an adjacency matrix where the 0’s and 1’s indi-cating the presence or absence of an edge are replaced by the probabilities associated to theedges in the Markov chain.

Definition 7.53 (State Probability Vector). If M = (G, p) be a Markov chain with nstates (vertices) then a state probability vector is a vector x ∈ Rn×1 such that x1 +x2 + · · ·+xn = 1 and xi ≥ 0 for i = 1, . . . , n and xi represents the probability that we are in state i(at vertex i).

Remark 7.54. The next theorem can be proved in exactly the same way that Theorem7.30 is proved.

Theorem 7.55. Let M = (G, p) be a Markov chain with n states (vertices). Let x(0) ∈Rn×1 be an (initial) state probability vector. Then assuming we take a random walk of lengthk in M using initial state probability vector x(0), the final state probability vector is:

(7.22) x(k) =(MT

)kx(0)

115

Page 126: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

Remark 7.56. If you prefer to remove the transpose, you can write x(0) ∈ R1×n; that is,x(0) is a row vector. Then:

(7.23) x(k) = x(0)Mk

with x(k) ∈ R1×n.

Exercise 95. Prove Theorem 7.55. [Hint: Use the same inductive argument from theproof of Theorem 7.30.]

Example 7.57. Consider the Markov chain in Figure 7.8. The state vector:

x(0) =

[10

]

states that we will start in State 1 with probability 1. From Example 7.52 we know whatM is. Then it is easy to see that:

x(1) =(MT

)kx(0) =

[12

12

]

Which is precisely the state probability vector we would expect after a random walk of length1 in M.

Definition 7.58 (Stationary Vector). Let M = (G, p) be a Markov chain. Then avector x∗ is stationary for M if

(7.24) x∗ = MTx∗

Remark 7.59. Expression 7.24 should look familiar. It says that MT has an eigenvalueof 1 and a corresponding eigenvector whose entries are all non-negative (so that the vectorcan be scaled so its components sum to 1). Furthermore, this looks very similar to theequation we used for eigenvector centrality.

Lemma 7.60. LetM = (G, p) be a Markov chain with n states and with stochastic matrixM. Then:

(7.25)∑

j

Mij = 1

for all i = 1, . . . , n.

Exercise 96. Prove Lemma 7.60.

Lemma 7.61. M = (G, p) be a Markov chain with n states and with stochastic matrixM. If G is strongly connected, then M and MT are irreducible.

Proof. If G is strongly connected, then there is a directed walk from any vertex vi toany other vertex vj in V , the vertex set of G. Consider any length k walk connecting vi tovj (such a walk exists for some k). Let ei be the vector with 1 in its ith component and 0everywhere else. Then (MT )kei is the final state probability vector associated with a walkof length k starting at vertex vi. Since there is a walk of length k from vi to vj, we knowthat the jth element of this vector must be non-zero. That is:

eTj (MT )kei > 0

116

Page 127: Intermediate Linear Algebra · If you don’t like Lang’s book, I also like Gilbert Strang’s Linear Algebra and its Applications. To be fair, I’ve only used the third edition

where ej is defined just as ei is but with the 1 at the jth position. Thus, (MT )kij > 0 for

some k for every (i, j) pair and thus MT is irreducible. The fact that M is irreducible followsimmediately from the fact that (MT )k = (Mk)T . This completes the proof. �

Theorem 7.62 (Perron-Frobenius Theorem Redux). If M is an irreducible matrix, then M has an eigenvalue λ_0 with the following properties:

(1) The eigenvalue λ_0 is positive and if λ is any other eigenvalue of M, then λ_0 ≥ |λ|.
(2) The matrix M has an eigenvector v_0 corresponding to λ_0 with only positive entries.
(3) The eigenvalue λ_0 is a simple root of the characteristic equation for M and therefore has a unique (up to scale) eigenvector v_0.
(4) The eigenvector v_0 is the only eigenvector of M that can have all positive entries when properly scaled.
(5) The following inequalities hold:

\min_i \sum_j M_{ij} ≤ λ_0 ≤ \max_i \sum_j M_{ij}

Theorem 7.63. Let M = (G, p) be a Markov chain with stochastic matrix M. If M^T is irreducible, then M has a unique stationary probability distribution.

Proof. From Theorem 4.47 we know that M and M^T have identical eigenvalues. By the Perron-Frobenius theorem, M has a largest positive eigenvalue λ_0 that satisfies:

\min_i \sum_j M_{ij} ≤ λ_0 ≤ \max_i \sum_j M_{ij}

By Lemma 7.60, we know that:

\min_i \sum_j M_{ij} = \max_i \sum_j M_{ij} = 1

Therefore, by the squeezing lemma, λ_0 = 1. The fact that M^T has exactly one strictly positive eigenvector v_0 corresponding to λ_0 = 1 means that:

(7.26) M^T v_0 = v_0

Thus v_0 is the unique stationary state probability vector for M = (G, p). This completes the proof. □
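Theorem 7.63 also suggests a direct numerical recipe: find the eigenvector of M^T associated with the eigenvalue 1 and rescale it so its entries sum to 1. A minimal sketch assuming NumPy; the function name and the small test matrix (the hypothetical chain from the earlier sketch) are ours.

import numpy as np

def stationary_distribution(M):
    """Eigenvector of M^T for the eigenvalue 1, scaled so its entries sum to 1."""
    eigvals, eigvecs = np.linalg.eig(M.T)
    idx = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue closest to 1
    v = np.real(eigvecs[:, idx])
    return v / v.sum()

M = np.array([[0.50, 0.50],
              [0.25, 0.75]])                 # hypothetical row-stochastic matrix
print(stationary_distribution(M))            # approximately [1/3, 2/3]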

8. Page Rank

Definition 7.64 (Induced Markov Chain). Let G = (V, E) be a graph. Then the induced Markov chain from G is the one obtained by defining a new directed graph G′ = (V, E′) with each edge {v, v′} ∈ E replaced by two directional edges (v, v′) and (v′, v) in E′ and defining the probability function p so that:

(7.27) p(v, v′) = \frac{1}{\deg^{\text{out}}_{G′}(v)}


Figure 7.9. An induced Markov chain is constructed from a graph by replacing every edge with a pair of directed edges (going in opposite directions) and assigning to every edge leaving a vertex a probability equal to one over the out-degree of that vertex. (Left: the original graph; right: the induced Markov chain.)

Example 7.65. An induced Markov chain is shown in Figure 7.9. The Markov chain in the figure has the stationary state probability vector:

x^* = \begin{bmatrix} \tfrac{3}{8} \\ \tfrac{2}{8} \\ \tfrac{2}{8} \\ \tfrac{1}{8} \end{bmatrix}

which is the eigenvector corresponding to the eigenvalue 1 of the matrix M^T. Arguing as we did in the proof of Theorem 7.44 and Example 7.46, we could expect that for any state probability vector x we would have:

\lim_{k→∞} \left(M^T\right)^k x = x^*

and we would be correct. When this convergence happens quickly (where we leave "quickly" poorly defined) the graph is said to have a fast mixing property.

If we used the stationary probability of a vertex in the induced Markov chain as a measure of importance, then clearly vertex 1 would be most important, followed by vertices 2 and 3 and lastly vertex 4. We can compare this with the eigenvector centrality measure, which assigns a rank vector of:

x^+ = \begin{bmatrix} 0.3154488065 \\ 0.2695944375 \\ 0.2695944375 \\ 0.1453623195 \end{bmatrix}

Thus eigenvector centrality gives the same ordinal ranking as the stationary state probability vector, but there are subtle differences in the values produced by these two ranking schemes. This leads us to PageRank [BP98].
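A quick numerical check of Example 7.65 is possible, assuming NumPy is available. The adjacency matrix below is our reading of the graph in Figure 7.9 (vertex 1 adjacent to vertices 2, 3 and 4, and vertices 2 and 3 adjacent to each other); under that assumption the stationary vector and the eigenvector centrality scores quoted above are reproduced.

import numpy as np

# Adjacency matrix of the undirected graph in Figure 7.9, as we read it.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

# Stochastic matrix of the induced Markov chain: divide each row by its out-degree.
M = A / A.sum(axis=1, keepdims=True)

def leading_unit_eigvec(B):
    """Eigenvector for the largest-modulus eigenvalue of B, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(B)
    v = np.real(vecs[:, np.argmax(np.abs(vals))])
    return v / v.sum()

print(leading_unit_eigvec(M.T))  # stationary vector: approx [3/8, 1/4, 1/4, 1/8]
print(leading_unit_eigvec(A))    # eigenvector centrality: approx [0.3154, 0.2696, 0.2696, 0.1454]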

Remark 7.66. Consider a collection of web pages each with links. We can construct a directed graph G with the vertex set V consisting of the web pages and E consisting of the directed links among the pages. Imagine a random web surfer who will click among these web pages following links until a dead-end is reached (a page with no outbound links). In this case, the web surfer will type in a new URL (chosen from the set of web pages available) and the process will continue.


From this model, we can induce a Markov chain in which we define a new graph G′ with edge set E′ so that if v ∈ V has out-degree 0, then we create an edge in E′ to every other vertex in V, and we then define:

(7.28) p(v, v′) = \frac{1}{\deg^{\text{out}}_{G′}(v)}

exactly as before. In the absence of any further insight, the PageRank algorithm simply assigns to each web page a score equal to the stationary probability of its state in the induced Markov chain. For the remainder of this remark, let M be the stochastic matrix of the induced Markov chain.

In general, however, PageRank assumes that surfers will get bored after some number of clicks (or new URLs) and will stop (and move to a new page) with some probability d ∈ [0, 1] called the damping factor. This factor is usually estimated. Assuming there are n web pages, let r ∈ R^{n×1} be the PageRank score for each page. Taking boredom into account leads to a new expression for rank (similar to Equation 7.9 for eigenvector centrality):

(7.29) r_i = \frac{1-d}{n} + d \left( \sum_{j=1}^{n} M_{ji} r_j \right) \quad \text{for } i = 1, \ldots, n

Here the d term acts like a damping factor on walks through the Markov chain. In essence, it stalls people as they walk, making it less likely a searcher will keep walking forever. The original System of Equations 7.29 can be written in matrix form as:

(7.30) r = \left(\frac{1-d}{n}\right) 1 + d M^T r

where 1 is an n × 1 vector consisting of all 1's. It is easy to see that when d = 1, r is precisely the stationary state probability vector for the induced Markov chain. When d ≠ 1, r is usually computed iteratively by starting with an initial value of r_i^{(0)} = 1/n for all i = 1, . . . , n and computing:

r^{(k)} = \left(\frac{1-d}{n}\right) 1 + d M^T r^{(k-1)}

The reason is that for large n, the analytic solution:

(7.31) r = \left(I_n - d M^T\right)^{-1} \left(\frac{1-d}{n}\right) 1

is not computationally tractable.²

²Note: \left(I_n - d M^T\right)^{-1} requires computing a matrix inverse, which we reviewed briefly in Chapter 4. We should note that for stochastic matrices and d < 1, this inverse is guaranteed to exist. For those interested, please consult any of [Dat95, Lan87, Mey01].

Example 7.67. Consider the induced Markov chain in Figure 7.9 and suppose we wish to compute PageRank on these vertices with d = 0.85 (which is a common assumption).


We might begin with:

r^{(0)} = \begin{bmatrix} \tfrac{1}{4} \\ \tfrac{1}{4} \\ \tfrac{1}{4} \\ \tfrac{1}{4} \end{bmatrix}

We would then compute:

r^{(1)} = \left(\frac{1-d}{n}\right) 1 + d M^T r^{(0)} ≈ \begin{bmatrix} 0.462500 \\ 0.214583 \\ 0.214583 \\ 0.108333 \end{bmatrix}

We would repeat this again to obtain:

r^{(2)} = \left(\frac{1-d}{n}\right) 1 + d M^T r^{(1)} ≈ \begin{bmatrix} 0.311979 \\ 0.259740 \\ 0.259740 \\ 0.168542 \end{bmatrix}

This would continue until the difference between the values of r^{(k)} and r^{(k-1)} was small. The final solution would be close to the exact solution:

r^* ≈ \begin{bmatrix} 0.366736 \\ 0.245928 \\ 0.245928 \\ 0.141408 \end{bmatrix}

Note this is (again) very close to the stationary probabilities and the eigenvector centralities we observed earlier. This vector is normalized so that all the entries sum to 1.
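The iteration in Example 7.67 takes only a few lines to code. The sketch below assumes NumPy and reuses our reading of the Figure 7.9 stochastic matrix from the earlier check; with those assumptions it reproduces r^(1), r^(2) and, after enough iterations, the exact solution quoted above.

import numpy as np

# Stochastic matrix of the induced Markov chain of Figure 7.9 (our reconstruction).
M = np.array([[0,   1/3, 1/3, 1/3],
              [1/2, 0,   1/2, 0  ],
              [1/2, 1/2, 0,   0  ],
              [1,   0,   0,   0  ]])

def pagerank(M, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate r^(k) = ((1-d)/n) 1 + d M^T r^(k-1) until successive iterates agree."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                  # r^(0) = (1/n, ..., 1/n)
    for _ in range(max_iter):
        r_next = (1 - d) / n + d * (M.T @ r)
        if np.linalg.norm(r_next - r, 1) < tol:
            break
        r = r_next
    return r_next

print(pagerank(M))   # approx [0.3667, 0.2459, 0.2459, 0.1414]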

Exercise 97. Consider the Markov chain shown below:

[Diagram: a Markov chain on four vertices; each edge is labeled with a transition probability of 1/3 or 1/2.]

Suppose this is the induced Markov chain from 4 web pages. Compute the PageRank of these web pages using d = 0.85.

Exercise 98. Find an expression for r^{(2)} in terms of r^{(0)}. Explain how the damping factor occurs and how it decreases the chance of taking long walks through the induced Markov chain. Can you generalize your expression for r^{(2)} to an expression for r^{(k)} in terms of r^{(0)}?


9. The Graph Laplacian

Remark 7.68. In this last section, we return to simple graphs and discuss the Graph Laplacian matrix, which can be used to partition the vertices of a graph in a sensible way.

Definition 7.69 (Degree Matrix). Let G = (V, E) be a simple graph with V = {v_1, . . . , v_n}. The degree matrix is the diagonal matrix D with the degree of each vertex on the diagonal. That is, D_{ii} = deg(v_i) and D_{ij} = 0 if i ≠ j.

Example 7.70. Consider the graph in Figure 7.10. It has degree matrix:

D = \begin{bmatrix} 2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 2 \end{bmatrix},

because each of its vertices has degree 2.

Figure 7.10. A set of triangle graphs.

Definition 7.71 (Graph Laplacian). Let G = (V, E) be a simple graph with V = {v_1, . . . , v_n}, adjacency matrix M and degree matrix D. The Graph Laplacian Matrix is the matrix L = D - M.

Example 7.72. The graph shown in Figure 7.10 has adjacency matrix:

M = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}

Therefore, it has Laplacian:

L = \begin{bmatrix} 2 & -1 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ -1 & -1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & -1 & -1 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & -1 & 2 \end{bmatrix}
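Building the Laplacian from an adjacency matrix is a one-line computation once the degree matrix is in hand. A minimal sketch assuming NumPy, using the adjacency matrix of Figure 7.10 from Example 7.72; it also verifies the row-sum observation made in the next remark.

import numpy as np

# Adjacency matrix of the two-triangle graph in Figure 7.10 (Example 7.72).
M = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

D = np.diag(M.sum(axis=1))   # degree matrix: vertex degrees on the diagonal
L = D - M                    # graph Laplacian

print(L)
print(L.sum(axis=1))         # every row sums to zero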


Remark 7.73. Notice the row-sum of each row in the Laplacian matrix is zero. The Laplacian matrix is also symmetric. This is not an accident; it will always be the case.

Proposition 7.74. If G is a graph with Laplacian matrix L, then L is symmetric.

Proof. Let D and M be the (diagonal) degree matrix and the adjacency matrix, respectively. Both D and M are symmetric. Therefore L = D - M is symmetric, since

L^T = (D - M)^T = D^T - M^T = D - M = L. □

Lemma 7.75. The row-sum of the adjacency matrix of a simple graph is the degree of the corresponding vertex.

Exercise 99. Prove Lemma 7.75.

Corollary 7.76. The row-sum of each row of the Laplacian matrix of a simple graph is zero. □

Theorem 7.77. If L ∈ R^{n×n} is the Laplacian matrix of a graph, then 1 = 〈1, 1, . . . , 1〉 ∈ R^n is an eigenvector of L with eigenvalue 0.

Proof. Let:

(7.32) L = \begin{bmatrix} d_{11} & -a_{12} & -a_{13} & \cdots & -a_{1n} \\ -a_{21} & d_{22} & -a_{23} & \cdots & -a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -a_{n1} & -a_{n2} & -a_{n3} & \cdots & d_{nn} \end{bmatrix}

Let v = L1. Then from Equation 2.2, we know that:

(7.33) v_i = L_{i·} · 1 = \begin{bmatrix} d_{i1} & -a_{i2} & -a_{i3} & \cdots & -a_{in} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = d_{i1} - a_{i2} - a_{i3} - \cdots - a_{in} = 0

Thus v_i = 0 for i = 1, . . . , n and v = 0. Thus:

L1 = 0 = 0 · 1

Thus 1 is an eigenvector with eigenvalue 0. This completes the proof. □

Remark 7.78. It is worth noting that 0 can be an eigenvalue, but the zero vector 0 cannot be an eigenvector.

Definition 7.79 (Subgraph). Let G = (V, E) be a graph. If H = (V′, E′) is a graph with V′ ⊆ V and E′ ⊆ E, then H is a subgraph of G.

Remark 7.80. We know from the Principal Axis Theorem (Theorem 5.55) that L must have n linearly independent (and orthogonal) eigenvectors that form a basis for R^n, since it is a real symmetric matrix. We'll use that fact shortly.

Example 7.81. Consider the graph shown in Figure 7.10. One of the two triangles is a (proper) subgraph of this graph. The graph is a subgraph of itself (an improper subgraph).


Definition 7.82 (Component). Suppose G = (V, E) is not a connected graph. If H is a subgraph of G, H is connected, and for any vertex v not in H there is no path from v to any vertex in H, then H is a component of G.

Example 7.83. The graph in Figure 7.10 has two components. Each triangle is a component. For example, let H be the left triangle. For any vertex v from the right triangle, there is no path from v to any vertex in the left triangle. Therefore, H is a component.

Theorem 7.84. Let G = (V, E) be a graph with V = {v_1, . . . , v_n} and with Laplacian L. Then the (algebraic) multiplicity of the eigenvalue 0 is equal to the number of components of G.

Proof. Assume G has more than one component; order the components H_1, . . . , H_k and suppose that component H_i has n_i vertices. Then n_1 + n_2 + · · · + n_k = n. Each component has its own Laplacian matrix L_i for i = 1, . . . , k and the Laplacian matrix of G is the block matrix:

L = \begin{bmatrix} L_1 & 0 & \cdots & 0 \\ 0 & L_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & L_k \end{bmatrix}

The fact that 1_i (a vector of 1's with dimension appropriate to L_i) is an eigenvector for L_i with eigenvalue 0 implies that v_i = 〈0, · · · , 1_i, 0, · · · , 0〉 is an eigenvector for L with eigenvalue 0. Thus, L has eigenvalue 0 with multiplicity at least k.

Now suppose v is an eigenvector with eigenvalue 0. Then:

Lv = 0

That is, v ∈ Ker(f_L); that is, v is in the kernel of the linear transform f_L(x) = Lx. We have so far proved:

dim(Ker(f_L)) ≥ k

since each eigenvector v_i is linearly independent of any other eigenvector v_j for i ≠ j, so the basis of Ker(f_L) contains at least k vectors. On the other hand, it is clear by construction that the rank of the Laplacian matrix L_i is exactly n_i - 1. The structure of L ensures that the rank of L is:

(n_1 - 1) + (n_2 - 1) + · · · + (n_k - 1) = n - k

Thus rank(L) = dim(Im(f_L)) = n - k, and from the rank-nullity theorem:

n = dim(Im(f_L)) + dim(Ker(f_L)) = (n - k) + y

with y ≥ k. But then y must be exactly k. Therefore, the multiplicity of the eigenvalue 0 is precisely the number of components. □
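Theorem 7.84 is easy to check numerically on the two-triangle graph of Figure 7.10, whose Laplacian should have the eigenvalue 0 with multiplicity 2. A minimal sketch assuming NumPy and the Laplacian computed in Example 7.72.

import numpy as np

L = np.array([[ 2, -1, -1,  0,  0,  0],
              [-1,  2, -1,  0,  0,  0],
              [-1, -1,  2,  0,  0,  0],
              [ 0,  0,  0,  2, -1, -1],
              [ 0,  0,  0, -1,  2, -1],
              [ 0,  0,  0, -1, -1,  2]])

eigvals = np.linalg.eigvalsh(L)              # L is symmetric, so use eigvalsh
print(np.round(eigvals, 10))                 # [0, 0, 3, 3, 3, 3]
print(int(np.sum(np.isclose(eigvals, 0))))   # 2: one zero eigenvalue per component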


Remark 7.85. We state the following fact without proof. Its proof can be found in [GR01] (Lemma 13.1.1). It is a consequence of the fact that the Laplacian matrix is positive semi-definite, meaning that for any v ∈ R^n, the (scalar) quantity:

v^T L v ≥ 0

Lemma 7.86. Let G be a graph with Laplacian matrix L. The eigenvalues of L are all non-negative. □

Definition 7.87 (Fiedler Value/Vector). Let G be a graph with n vertices, Laplacian L and eigenvalues {λ_n, . . . , λ_1} ordered from largest to smallest (i.e., so that λ_n ≥ λ_{n-1} ≥ · · · ≥ λ_1). The second smallest eigenvalue λ_2 is called the Fiedler value and its corresponding eigenvector is called the Fiedler vector.

Proposition 7.88. Let G be a graph with Laplacian matrix L. The Fiedler value λ_2 > 0 if and only if G is connected.

Proof. If G is connected, it has one component, so by Theorem 7.84 the multiplicity of the eigenvalue 0 is 1; by Lemma 7.86 all other eigenvalues are positive, and therefore λ_2 > 0. On the other hand, suppose that λ_2 > 0. Since 0 is always an eigenvalue (Theorem 7.77) and the eigenvalues are non-negative (Lemma 7.86), necessarily λ_1 = 0 and it has multiplicity 1. By Theorem 7.84, G has exactly one component; that is, G is connected. □

Remark 7.89. We state a remarkable fact about the Fiedler vector, whose proof can be found in [Fie73].

Theorem 7.90. Let G = (V, E) be a graph with V = {v_1, . . . , v_n} and with Laplacian matrix L. If v is the eigenvector corresponding to the Fiedler value λ_2, then the set of vertices:

V(v, c) = {v_i ∈ V : v_i ≥ c}

(where v_i in the condition denotes the ith entry of v) and the edges between these vertices form a connected subgraph. □

Remark 7.91. In particular, this means that if c = 0, then the vertices whose indices correspond to the positive entries in v allow for a natural bipartition of the vertices of G. This bipartition is called a spectral cluster and it is useful in many areas of modern life. In particular, it can be useful for finding groupings of individuals in social networks.

Example 7.92. Consider the social network shown in Figure 7.11.

Figure 7.11. A simple social network (vertices: Alice, Bob, Cheryl, David, Edward, Finn).


If we compute the Fiedler value for this graph, we see it is λ_2 = 3 - √5 > 0, since the graph is connected. The corresponding Fiedler vector is:

v = \left\{ \tfrac{1}{2}\left(-1-\sqrt{5}\right), \tfrac{1}{2}\left(-1-\sqrt{5}\right), \tfrac{1}{2}\left(\sqrt{5}-3\right), 1, \tfrac{1}{2}\left(1+\sqrt{5}\right), 1 \right\} ≈ \{-1.61803, -1.61803, -0.381966, 1, 1.61803, 1\}

Thus, setting c = 0 and assuming the vertices are in alphabetical order, a natural partition of this social network is:

V_1 = {Alice, Bob, Cheryl}
V_2 = {David, Edward, Finn}

That is, we have grouped together the vertices with negative entries in the Fiedler vector and grouped together the vertices with positive entries in the Fiedler vector. This is illustrated in Figure 7.12. It is worth noting that if an entry is 0 (i.e., on the border), that vertex can be placed in either partition or placed in a partition of its own. It usually bridges two distinct vertex groups together within the graph structure.

Figure 7.12. A graph partition using positive and negative entries of the Fiedler vector.
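The spectral bipartition used in Example 7.92 can be automated: form the Laplacian, take the eigenvector of the second smallest eigenvalue, and split the vertices by the sign of their entries. A minimal sketch assuming NumPy; the function and variable names are ours, and since the adjacency matrix of Figure 7.11 is not reproduced here, the call at the bottom is left as a hypothetical usage note.

import numpy as np

def fiedler_partition(A, labels):
    """Split the vertices of a connected graph (adjacency matrix A) by the sign of
    their entries in the Fiedler vector."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    fiedler_vec = eigvecs[:, 1]            # eigenvector of the second smallest eigenvalue
    group1 = [lab for lab, x in zip(labels, fiedler_vec) if x < 0]
    group2 = [lab for lab, x in zip(labels, fiedler_vec) if x >= 0]
    return group1, group2

# Hypothetical usage: with A the 6 x 6 adjacency matrix of Figure 7.11 (vertices in
# alphabetical order), the result should recover the groups V1 and V2 above, possibly
# with the two groups swapped, since the sign of an eigenvector is arbitrary.
# labels = ["Alice", "Bob", "Cheryl", "David", "Edward", "Finn"]
# print(fiedler_partition(A, labels))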


CHAPTER 8

Linear Algebra and Systems of Differential Equations

1. Goals of the Chapter

(1) Introduce differential equations and systems of differential equations.
(2) Introduce linear homogeneous systems of equations.
(3) Show how Taylor series can be used to compute a matrix exponential.
(4) Show how the Jordan Decomposition can be used to solve a system of linear homogeneous equations.
(5) Explain the origin of multiples of t in systems with repeated eigenvalues.

2. Systems of Differential Equations

Definition 8.1 (System of Ordinary Differential Equations). A system of ordinary differential equations is an equation system involving (unknown) functions of a single common (independent) variable and their derivatives.

Remark 8.2. The notion of order for a system of ordinary differential equations is simply the order of the highest derivative (i.e., second derivative, third derivative, etc.).

Remark 8.3. It is worth noting that any order n system of differential equations can be transformed to an equivalent order n - 1 system of differential equations. We illustrate using the equation:

(8.1) \ddot{y} - α\dot{y} = Ay

Define v = \dot{y}. Then we may rewrite Equation 8.1 as the system of differential equations:

(8.2) \dot{v} = Ay + αv
      \dot{y} = v

Thus any order n system of differential equations can be reduced to a first order system of differential equations.

Definition 8.4 (Initial Value Problem). Consider a system of first order differential equations with unknown functions y_1, . . . , y_n. If we are provided with information of the form y_1(a) = r_1, . . . , y_n(a) = r_n, for some a ∈ R and constants r_1, . . . , r_n, then the problem is called an initial value problem.

Definition 8.5 (Linear Differential Equation). A system of differential equations is linear if the unknown functions and their derivatives appear only as degree-one monomials, possibly multiplied by known functions of the independent variable.

Example 8.6. While looking awfully non-linear, the following is a linear differential equation for y(x):

(8.3) y′ + sin(x) y = cos(x)


while the following system of differential equations for u(t) and v(t) is nonlinear:

(8.4) \dot{u} = αu - βuv
      \dot{v} = γuv - δv

for α, β, γ, δ ∈ R.

In general, in keeping with Remark 8.3 (and following Strogatz [Str01]), we will focus on differential equation systems with a special form. Let x_1(t), . . . , x_n(t) be n unknown functions with independent time variable t. We focus on the system:

(8.5) \dot{x}_1 = f_1(x_1, . . . , x_n)
      ⋮
      \dot{x}_n = f_n(x_1, . . . , x_n)

Here we assume that for i = 1, . . . , n, f_i(x_1, . . . , x_n) has derivatives of all orders. In this case we say that f_i(x_1, . . . , x_n) is smooth. Let:

(8.6) x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}

If we let F : R^n → R^n be a (smooth) vector valued function given by:

(8.7) F(x) = \begin{bmatrix} f_1(x_1, . . . , x_n) \\ \vdots \\ f_n(x_1, . . . , x_n) \end{bmatrix}

then we can write System 8.5 as:

(8.8) \dot{x} = F(x)

Notice that t does not appear explicitly in these equations. Thus, Equation 8.3 cannot be described in this way, but the nonlinear equations given in System 8.4 most certainly fit this pattern.

Definition 8.7 (System of Autonomous Differential Equations). The system of differential equations defined by Equation 8.5 (or Equation 8.8) is called a system of autonomous differential equations. Notice that time (t) does not appear explicitly anywhere.

Definition 8.8 (Orbit). Consider the initial value problem:

\dot{x} = F(x)
x(0) = x_0

Any solution x(t; x_0) to this problem is called an orbit.¹

¹Some authors make a distinction between the parametric curve x(t; x_0) and the set of points in R^n that are defined by this curve. The latter is called an orbit while the former is called a trajectory. We will not be that precise in these notes.


Example 8.9. Linear systems with constant coefficients fit the pattern of System 8.5. Consider the following differential equation system:

(8.9) \dot{x} = αx - βy
      \dot{y} = γy - δx

In this case, we can rewrite F(x) using matrix notation. System 8.9 can be expressed as:

(8.10) \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} α & -β \\ -δ & γ \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}

As we will see, many of the properties of the solutions of this system depend on the properties of the coefficient matrix.

Definition 8.10 (Linear System with Constant Coefficients). Let A be an n × n matrix and x be a vector of n unknown functions of an independent variable (in our case t). Then:

(8.11) \dot{x} = Ax

is a linear system of differential equations with constant coefficients.

Remark 8.11. For our purposes, we will always assume entries in A are drawn from R.

Exercise 100. Consider the differential equation:

(8.12) \ddot{y}(t) + \dot{y}(t) + \cos(t) y = \sin(t)

Use the technique of converting a second order differential equation into a first order differential equation to write this as a system of first order equations. Then show that the resulting system can be written in the form:

(8.13) \dot{y} = A(t) y + b(t)

where y is a vector of unknown functions, A(t) is a matrix with non-constant entries (functions of t) and b(t) is a vector with non-constant entries (functions of t).

Remark 8.12 (Linear Homogeneous System). Let A(t) be an n × n matrix of time-varying functions (that do not depend on x_1, . . . , x_n). Then the system:

(8.14) \dot{x} = A(t) x

is a linear homogeneous system of differential equations. If b(t) is a vector of time-varying functions (that do not depend on x_1, . . . , x_n), then:

(8.15) \dot{x} = A(t) x + b(t)

is just a (first order) linear system of differential equations. It turns out that there is a known form for the solution of equations of this type, and they are extremely useful in the field of control theory [Son98]. The interested reader can consult [Arn06].


3. A Solution to the Linear Homogeneous Constant Coefficient Differential Equation

Consider the differential equation:

(8.16) \dot{x} = ax

Equation 8.16 can be easily solved:

\dot{x} = ax \implies \frac{dx}{dt} = ax \implies \frac{dx}{x} = a\,dt \implies \int \frac{1}{x}\,dx = \int a\,dt \implies \log(x) = at + C \implies x(t) = \exp(at + C) = A\exp(at)

where A = exp(C). If we are given the initial value x(0) = x_0, then the exact solution is:

(8.17) x(t) = x_0 \exp(at)

It is easy to see that when a > 0, the solution explodes as t → ∞ and if a < 0, the solution collapses to 0 as t → ∞.

Consider now the natural generalization of Equation 8.16, given by the linear homogeneous differential equation in Expression 8.11,

\dot{x} = Ax

and the associated initial value problem x(0) = x_0. In this section, we are interested in a general solution to this problem, and what this solution can tell us about non-linear differential equations. Given what we know about exponential growth already, we intuitively wish to write the solution:

(8.18) x(t) = e^{At} · x_0 = \exp(At) · x_0,

but it's not entirely clear what such an expression means. Certainly, we could use the Taylor series expansion for the exponential function and argue that:

(8.19) \exp(At) = (At)^0 + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \cdots = I_n + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots

Given this assumption, \exp(At) · x_0 is a matrix/vector product. The following theorem can be proved using formal differentiation of the Taylor series expansion:

Theorem 8.13. The function:

(8.20) x(t) = e^{At} · x_0 = \exp(At) · x_0 = \left( I_n + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots \right) · x_0

is a solution to the initial value problem:

\dot{x} = Ax
x(0) = x_0


Exercise 101. Use formal differentiation on the power series to prove the previous theorem.

Remark 8.14. Theorem 8.13 does not tell us when this solution exists (i.e., when the series converges) or in fact how to compute exp(At). Nor does it tell us what these solutions “look” like in so far as their long-term behavior. For example, in the case of Equation 8.16, we know that if a > 0 the solution tends to ∞ as t → ∞.

However, it turns out that for many matrices, exp(At) can be computed rather easily using diagonalization. More importantly, the properties we'll see for diagonalizable matrices carry over for any matrix A and, in fact, can be used to some extent in studying non-linear systems through linearization, as we'll discuss in the next sections.

Remark 8.15. The following theorem helps explain why we've expended so much energy discussing matrix diagonalization. It is, essentially, the key to understanding linear homogeneous systems.

Theorem 8.16. Suppose A is diagonalizable and

A = PDP^{-1}

Then:

(8.21) \exp(A) = P \exp(D) P^{-1}

and if:

(8.22) D = \begin{bmatrix} λ_1 & 0 & \cdots & 0 \\ 0 & λ_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & λ_n \end{bmatrix}

then:

(8.23) \exp(D) = \begin{bmatrix} e^{λ_1} & 0 & \cdots & 0 \\ 0 & e^{λ_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{λ_n} \end{bmatrix}

This gives us a very nice meaning to the seemingly impossible idea of raising a number to the power of a matrix. However, it is only for a diagonal matrix, such as D, that exp(D) can be assigned this meaning.

Sketch of Proof. Note that:

(8.24) A^n = \left(PDP^{-1}\right)^n = (PDP^{-1})(PDP^{-1}) \cdots (PDP^{-1}) = P D^n P^{-1}

Applying this fact and Equation 8.20 (when t = 1) allows us to deduce the theorem. □
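Theorem 8.16 can be sanity-checked numerically: diagonalize A, exponentiate the eigenvalues, and compare against a truncated Taylor series for exp(A). A minimal sketch assuming NumPy; the 2 × 2 matrix A is our own illustrative choice.

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])               # an illustrative diagonalizable matrix

# exp(A) via diagonalization: A = P D P^{-1}  =>  exp(A) = P exp(D) P^{-1}
eigvals, P = np.linalg.eig(A)
expA_diag = P @ np.diag(np.exp(eigvals)) @ np.linalg.inv(P)

# exp(A) via a truncated Taylor series I + A + A^2/2! + ...
expA_taylor = np.zeros_like(A)
term = np.eye(2)
for k in range(1, 30):
    expA_taylor += term
    term = term @ A / k

print(np.allclose(expA_diag, expA_taylor))   # True: the two computations agree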

Exercise 102. Finish the details of the proof of Theorem 8.16.


4. Three Examples

Example 8.17. We can now use the results of Theorem 8.16 to find a solution to:

(8.25) \dot{x} = -y
       \dot{y} = x

with initial value x(0) = x_0 and y(0) = y_0. Note that the system of differential equations can be written in matrix form as:

(8.26) \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}

Thus we know:

(8.27) A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}

From Example 4.64 and Theorem 8.16 we know the solution is:

(8.28) \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \begin{bmatrix} -i & i \\ 1 & 1 \end{bmatrix} \begin{bmatrix} e^{-it} & 0 \\ 0 & e^{it} \end{bmatrix} \begin{bmatrix} \tfrac{i}{2} & \tfrac{1}{2} \\ -\tfrac{i}{2} & \tfrac{1}{2} \end{bmatrix} · \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

Unfortunately, this is not such a convenient expression. To simplify it, we first expand the matrix product to obtain:

(8.29) \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2}e^{-it} + \tfrac{1}{2}e^{it} & -\tfrac{1}{2}ie^{-it} + \tfrac{1}{2}ie^{it} \\ \tfrac{1}{2}ie^{-it} - \tfrac{1}{2}ie^{it} & \tfrac{1}{2}e^{-it} + \tfrac{1}{2}e^{it} \end{bmatrix} · \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

This simplifies to:

(8.30) \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} e^{-it} + e^{it} & i\left(e^{it} - e^{-it}\right) \\ i\left(e^{-it} - e^{it}\right) & e^{-it} + e^{it} \end{bmatrix} · \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

Next, we remember Euler's Formula:

(8.31) e^{it} = \cos(t) + i\sin(t)

and note that:

(8.32) e^{-it} + e^{it} = (\cos(t) - i\sin(t)) + (\cos(t) + i\sin(t)) = 2\cos(t)

while

(8.33) e^{it} - e^{-it} = (\cos(t) + i\sin(t)) - (\cos(t) - i\sin(t)) = 2i\sin(t)

Thus:

(8.34) i\left(e^{it} - e^{-it}\right) = 2i^2\sin(t) = -2\sin(t)

Thus we conclude that:

(8.35) \begin{bmatrix} x(t) \\ y(t) \end{bmatrix} = \begin{bmatrix} \cos(t) & -\sin(t) \\ \sin(t) & \cos(t) \end{bmatrix} · \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

But this matrix is none other than a (counter-clockwise) rotation matrix that, when multiplied by the vector [x_0, y_0]^T, will rotate it by an angle t, so any specific solution to the initial value problem is a vector of constant length rotating around the origin. The initial vector is the vector of initial conditions. This is illustrated in Figure 8.1. It is worth noting that for arbitrary x_0 and y_0, any parametric solution curve is an orbit.


Figure 8.1. The solution to the differential equation can be thought of as a vector of fixed length rotating about the origin (the figure shows the solution curve, the initial vector (x_0, y_0), and the rotation direction).

One can expand the expression for the orbits of this problem to obtain exact functional representations for x(t) and y(t) as needed:

(8.36) x(t) = x_0\cos(t) - y_0\sin(t)
(8.37) y(t) = x_0\sin(t) + y_0\cos(t)

For representative initial values x_0 = 1 and y_0 = 1, we plot the resulting functions in Figure 8.2.

Figure 8.2. A plot of representative solutions for x(t) and y(t) for the simple homogeneous linear system in Expression 8.25.
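Equation 8.35 can be verified numerically by comparing the matrix exponential of At with the rotation matrix directly. A minimal sketch assuming SciPy is available for scipy.linalg.expm.

import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

t = 0.7                                   # any sample time
R = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])   # rotation matrix from Equation 8.35

print(np.allclose(expm(A * t), R))        # True: exp(At) rotates by angle t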

Exercise 103. Show that the second order differential equation \ddot{x} + x = 0 yields the system of differential equations just analyzed. This second order ODE is called a harmonic oscillator.

Exercise 104. Show that the total square velocity (x′)^2 + (y′)^2 is constant for the differential system just analyzed.

Example 8.18. Consider the second order differential equation:

(8.38) \ddot{x} - \dot{x} + x = 0


Letting y = \dot{x}, we can rewrite the previous differential equation (Equation 8.38) as \dot{y} = y - x and obtain the first order system:

(8.39) \dot{x} = y
       \dot{y} = -x + y

Using the same methods as in Example 8.17, one can show that the eigenvalues of the matrix corresponding to System 8.39,

\begin{bmatrix} 0 & 1 \\ -1 & 1 \end{bmatrix}

are:

\frac{1}{2} ± \frac{i\sqrt{3}}{2}

Notice these eigenvalues are the roots of the characteristic equation, thus explaining the importance of the characteristic equation. This is true in general. Now, using the same approach as the one in Example 8.17, one can show that for initial values x(0) = x_0 and y(0) = y_0 the solution to this differential equation is:

(8.40) x(t) = \frac{1}{3} e^{t/2} \left( 3x_0 \cos\left(\frac{\sqrt{3}\,t}{2}\right) - \sqrt{3}\,(x_0 - 2y_0)\sin\left(\frac{\sqrt{3}\,t}{2}\right) \right)

(8.41) y(t) = \frac{1}{3} e^{t/2} \left( \sqrt{3}\,(y_0 - 2x_0)\sin\left(\frac{\sqrt{3}\,t}{2}\right) + 3y_0 \cos\left(\frac{\sqrt{3}\,t}{2}\right) \right)

Figure 8.3 illustrates the solution curves to this problem when x_0 = y_0 = 1.

Figure 8.3. Representative solution curves for Expression 8.39 showing sinusoidal exponential growth of the system.
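The closed form in Equations 8.40 and 8.41 can likewise be checked against the matrix exponential. A minimal sketch assuming SciPy; it compares the formulas with expm(At) applied to the initial condition at one sample time.

import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 1.0]])
x0, y0 = 1.0, 1.0
t = 1.2                                   # any sample time

s = np.sqrt(3) * t / 2
xt = np.exp(t / 2) / 3 * (3 * x0 * np.cos(s) - np.sqrt(3) * (x0 - 2 * y0) * np.sin(s))
yt = np.exp(t / 2) / 3 * (np.sqrt(3) * (y0 - 2 * x0) * np.sin(s) + 3 * y0 * np.cos(s))

print(np.allclose(expm(A * t) @ np.array([x0, y0]), [xt, yt]))   # True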

Example 8.19. Consider the following linear system of differential equations:

(8.42) \dot{x} = -x - y
       \dot{y} = x - y

Using the same methods as in Example 8.17, one can show that for initial values x(0) = x_0 and y(0) = y_0 the solution to this differential equation is:

(8.43) x(t) = e^{-t}\cos(t)\,x_0 - e^{-t}\sin(t)\,y_0
(8.44) y(t) = e^{-t}\sin(t)\,x_0 + e^{-t}\cos(t)\,y_0


We notice in this case that the eigenvalues of the matrix for Expression 8.42 are:

λ_1 = -1 + i,  λ_2 = -1 - i

In identifying the solution, it is important to remember that:

(8.45) e^{(-1+i)t} = e^{-t} e^{it} = e^{-t}\left(\cos(t) + i\sin(t)\right)

explaining the origin of the e^{-t} factors. Figure 8.4 illustrates the solution curves to this problem when x_0 = y_0 = 1. Notice the factor e^{-t} causes the solution curves to decay to 0 exponentially.

Figure 8.4. Representative solution curves for Expression 8.42 showing exponential decay of the system.

Exercise 105. Use matrix diagonalization to show that the solution given to System 8.42 is correct.

5. Non-Diagonalizable Matrices

Remark 8.20. In this final section, we discuss an odd result in ordinary differential equations. When the matrix of a linear system with constant coefficients has a repeated eigenvalue, we are often taught to imagine an extra solution by multiplying t by e^{λt} to obtain some linear combination of e^{λt} and te^{λt}. In this section, we discover from where that t comes and dispel the nonsense idea that this is some kind of lucky guess.

Remark 8.21 (Non-Diagonalizable A). When A ∈ R^{n×n} is not diagonalizable, we fall back to Theorem 4.68. From this we deduce:

(1) \exp(At) = P \exp(Λt) \exp(Nt) P^{-1}. Here N is a nilpotent matrix.
(2) The expression \exp(Nt) is a polynomial in t. (This may help explain some results you've seen in a class on Differential Equations.)

In particular, suppose that N^k = 0. Then:

\exp(Nt) = \sum_{j=0}^{k} \frac{N^j t^j}{j!}

where N^0 = I_n. Thus, we conclude that:

x(t) = P \exp(Λt) \left( \sum_{j=0}^{k} \frac{N^j t^j}{j!} \right) P^{-1} · x_0


Thus we see that the (Jordan) decomposition of the matrix A is still very important. Furthermore, the eigenvalues of A are still very important and, as we'll see in the next chapter, these drive the long-term behavior of the solutions of the differential equation.

Example 8.22. Consider the following linear system of differential equations:²

\dot{x} = 7x + y
\dot{y} = -4x + 3y

We can see that this has the form \dot{x} = Ax when:

A = \begin{bmatrix} 7 & 1 \\ -4 & 3 \end{bmatrix}

The Jordan Decomposition for this matrix yields:

P = \begin{bmatrix} -1 & -\tfrac{1}{2} \\ 2 & 0 \end{bmatrix} \qquad Λ = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix} \qquad N = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}

It is easy to verify that A has two identical eigenvalues, both equal to 5, and that:

\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}^2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

We can compute:

P^{-1} = \begin{bmatrix} 0 & \tfrac{1}{2} \\ -2 & -1 \end{bmatrix}

Thus, we can write the solution of the differential equation as:

\begin{bmatrix} x \\ y \end{bmatrix} = \underbrace{\begin{bmatrix} -1 & -\tfrac{1}{2} \\ 2 & 0 \end{bmatrix}}_{P} \cdot \underbrace{\begin{bmatrix} e^{5t} & 0 \\ 0 & e^{5t} \end{bmatrix}}_{\exp(Λt)} \cdot \underbrace{\left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} 0 & t \\ 0 & 0 \end{bmatrix} \right)}_{\exp(Nt)} \cdot \underbrace{\begin{bmatrix} 0 & \tfrac{1}{2} \\ -2 & -1 \end{bmatrix}}_{P^{-1}} \cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

Expanding and simplifying yields:

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} e^{5t}(2t + 1) & e^{5t} t \\ -4e^{5t} t & e^{5t}(1 - 2t) \end{bmatrix} \cdot \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}

Suppose x_0 = 2 and y_0 = -5; then expanding we obtain:

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2e^{5t}(2t + 1) - 5e^{5t} t \\ -5e^{5t}(1 - 2t) - 8e^{5t} t \end{bmatrix} = e^{5t} \begin{bmatrix} -t + 2 \\ 2t - 5 \end{bmatrix}

Exercise 106. Verify the Jordan Decomposition given in the previous example and confirm that the matrix A does have a repeated eigenvalue equal to 5. Confirm (by differentiation) that the proposed solution does satisfy the differential equation.

²This example is taken from http://tutorial.math.lamar.edu/Classes/DE/RepeatedEigenvalues.aspx, where it is presented differently.


Bibliography

[Arn06] V. I. Arnold, Ordinary differential equations, Universitext, Springer, 2006.
[BP98] S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, Seventh International World-Wide Web Conference (WWW 1998), 1998.
[Dat95] B. N. Datta, Numerical linear algebra, Brooks/Cole, 1995.
[Fie73] M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Math. J. 23 (1973), no. 98, 298–305.
[Fra99] J. B. Fraleigh, A First Course in Abstract Algebra, 6 ed., Addison-Wesley, 1999.
[GR01] C. Godsil and G. Royle, Algebraic graph theory, Springer, 2001.
[Lan87] S. Lang, Linear Algebra, Springer-Verlag, 1987.
[Mey01] C. D. Meyer, Matrix analysis and applied linear algebra, SIAM Publishing, 2001.
[Son98] E. D. Sontag, Mathematical Control Theory: Deterministic Finite Dimensional Systems, Springer-Verlag, 1998.
[Spi11] L. Spizzirri, Justification and application of eigenvector centrality, http://www.math.washington.edu/~morrow/336_11/papers/leo.pdf, March 6, 2011 (Last Checked: July 20, 2011).
[Str01] S. Strogatz, Nonlinear dynamics and chaos, Westview Press, 2001.
[Wen06] D. Wengerhoff, Using the singular value decomposition for image steganography, Master's thesis, Iowa State University, 2006.
