4 lectures, Michaelmas Term 2010 Prof. Steve Roberts sjrob ...4 The function f(x) = ax satisfies both requirements, whereas functions like x , log( x) and x + 2 clearly do not: they

1

University of Oxford

Department of Engineering Science

A1 Mathematics

Linear Algebra

4 lectures, Michaelmas Term 2010

Prof. Steve Roberts

[email protected]

2

Recommended Texts

These are some books that I found useful while preparing these notes.

• Allenby, R.B.J.T. (1995). Linear algebra. Elsevier.

• Anton, H. (1984). Elementary linear algebra (4th edn). Wiley.

• Anton, H. & Rorres, C. (2000). Elementary linear algebra – applications version (8th edn). Wiley.

• Kreyszig, E. (2006). Advanced engineering mathematics (9th edn). Wiley.

• Larson, R. & Falvo, D.C. (2010). Elementary linear algebra (6th edn). Brooks/Cole.

• Strang, G. (2006). Linear algebra and its applications (4th edn). Thomson Brooks/Cole

3

Topic 1

Systems of linear equations

1.1 Introduction

Let us begin by discussing the simple equation

ax b= (1.1)

This is a linear equation in the unknown x. It is linear because x appears on the left-hand side in a very simple

form: raised to the power one, and multiplied by a constant. There are no terms such as x , log(x) or x + 2 on

the left-hand side. The only other term in the equation – the b on the right-hand side – is a constant.

We can think of the left-hand side of eqn (1.1) as the result of a function that maps x to ax:

( )f x ax=

The adjective ‘linear’ arises from the fact that the graph of this function is a straight line (and in particular, one

that goes through the origin). More formally, we say that f is a linear function if it has the properties

(i) f(x + y) = f(x) + f(y)

(ii) f(αx) = αf(x) [where α = an arbitrary scalar]

4

The function f(x) = ax satisfies both requirements, whereas functions like x , log(x) and x + 2 clearly do not:

they are non-linear.

More generally, a linear equation in n unknowns x1, x2, …, xn is defined as one that can be expressed in the form

1 1 2 2 n na x a x a x b+ + + =K

where the coefficients a1, a2, … an and the right-hand side b are all constants. We will assume that the constants

and the unknowns are numbers in the real domain RRRR, but it should be pointed out that some applications (e.g.

quantum mechanics) give rise to linear equations involving numbers in the complex domain CCCC.

For much of this course, we will be studying sets of simultaneous linear equations that we aim to solve for the

unknowns. We will describe a set of m linear equations in n unknowns as an m × n linear system. As you know

from P1, such a system can be expressed in the compact form

=Ax b (1.2)

where A is an m × n matrix, and b and x are column vectors with m and n components respectively. Note the

similarity with eqn (1.1).

We can think of the left-hand side of eqn (1.2) as the result of a function that maps x (a vector in RRRRn) to Ax (a

5

vector in RRRRm):

( )f =x Ax

This function is a linear transformation from RRRRn to RRRRm because

(i) f(x + y) = A(x + y) = Ax + Ay = f(x) + f(y)

(ii) f(αx) = A(αx) = α(Ax) = αf(x)

Essentially, the origin is mapped to the origin (i.e. there is no translation), and straight lines in RRRRn are mapped to

straight lines in RRRRm.

RRRRn

RRRRm

O

R

P

Q

O

R′ P′ Q′

6

1.2 Two ways of thinking about Ax = b

Consider the 2 × 2 system

1

2

1 1 3

4 1 2

x

x

= −

We need to find values for x1 and x2 such that the matrix-vector product on the left-hand side is equal to the

vector on the right-hand side.

The familiar ‘row-oriented’ view interprets the system as

1 2

1 2

3

4 2

x x

x x

+ =

− =

Our task is to find a point in RRRR2 with coordinates (x1, x2) such that both equations are satisfied simultaneously.

Geometrically, we can visualize plotting two lines:

7

The lines intersect at the point x1 = 1, x2 = 2. This is the (unique) solution of the original system.

The alternative ‘column-oriented’ view interprets the system as

1 2

1 1 3

4 1 2x x

+ = −

This time we have two vectors in RRRR2, namely 4+i j and −i j , and our task is to combine them (by choosing

4x1 – x2 = 2

x1 + x2 = 3

x1

x2

1 2 3 4

-1

-2

-3

-1 -2 -3

-4

-4

1

2

3

4

(1, 2)

8

suitable scalar multipliers x1 and x2) in such a way that the resultant vector is 3i + 2j. Geometrically, we can

visualize constructing a parallelogram:

The ‘right’ linear combination of the column vectors is 1(i + 4j) + 2(i - j). Hence x1 = 1, x2 = 2 is the (unique)

solution of the original system.

The row- and column-oriented pictures look very different, but they lead, of course, to the same solution. They

1 2 3 4

-1

-2

-3

-1 -2 -3

-4

-4

1

2

3

4

3i + 2j

i + 4j

2(i – j)

i – j

9

are two sides of the same coin. The column-oriented view may seem less natural, but we will see that it can be

very helpful in understanding and solving general m × n linear systems, whether homogeneous (Ax = 0) or non-

homogeneous (Ax = b). In particular, the column-oriented view provides us with an alternative way of posing

(and then answering!) the two fundamental questions that arise in relation to any linear system:

• Is there any solution at all?

• If there is a solution, is it unique, or are there infinitely many?

Row-oriented view Column-oriented view

Solution

existence

Do the m hyperplanes in

n-dimensional space have

any point in common?

Can b be expressed as a

linear combination of the

columns of A?

Solution

uniqueness

Do the hyperplanes

intersect at just one point,

or infinitely many points?

Can the linear combination

be formed in just one way,

or infinitely many ways?

10

1.3 Echelon form

Regardless of which geometric interpretation we have in mind, we need a reliable way of answering the basic

questions of solution existence and solution uniqueness for a general linear system Ax = b. In P1 you learnt how

to apply Gaussian elimination to the augmented coefficient matrix

[ ]A b

Having reduced this matrix (and thus the system of equations) to a simpler form, you were generally able to

solve for the unknowns by a systematic process of back-substitution, leading to a unique solution x. Sometimes,

however, elimination revealed that there was no solution, implying that the original equations were inconsistent.

In other cases, elimination revealed that there were infinitely many solutions – one or more of the unknowns

could be assigned an arbitrary value.

All of the systems you studied in P1 involved n equations in n unknowns. We now want to relax that restriction,

and see what can happen when the coefficient matrix A is rectangular (m × n). There are three scenarios:

11

m = n m < n m > n

Square Underdetermined Overdetermined

Same number of

equations as

unknowns

Fewer equations than

unknowns

More equations than

unknowns

Example: 3 ×3 Example: 2 ×3 Example: 4 ×3

1

2

3

x

x

x

∗ ∗ ∗ ∗ ∗ ∗ ∗ = ∗ ∗ ∗ ∗ ∗

1

2

3

x

x

x

∗ ∗ ∗ ∗ = ∗ ∗ ∗ ∗

1

2

3

x

x

x

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ = ∗ ∗ ∗ ∗

∗ ∗ ∗ ∗

‘Usual’ expectation:

unique solution


infinitely many solns


no solution

The good news is that Gaussian elimination – with a few tweaks – continues to play a central role. The key is to

reduce the (augmented) coefficient matrix to row-echelon form, or just echelon form for short. The word is

borrowed directly from French (échelon = “a rung on a ladder”).

12

To illustrate the meaning, the following matrices are in echelon form:

2 2 1 4

0 1 0 3

0 0 3 2

−

−

2 3 1 4 1

0 0 1 3 2

0 0 0 2 3

0 0 0 0 4

−

−

0 3 2 5

0 0 2 3

0 0 0 0

−

In any row that does not consist entirely of zeros, the leftmost nonzero entry is given a special name: the leading

entry. We can now proceed to a formal definition.

A matrix is in row-echelon form (≡ echelon form) if

• in every pair of successive nonzero rows, the leading entry in the

lower row lies strictly to the right of the leading entry in the upper row;

• any zero rows are grouped together at the bottom of the matrix.

Some authors add a third condition, that each of the leading entries must be 1. This complicates the elimination

process slightly (each row has to be divided through by its leading entry) but it simplifies back-substitution. We

will not insist on this; the main thing is to get the pattern right.

13

Here are some examples of matrices that are not in echelon form:

2 2 1 4

0 1 0 3

3 0 3 2

−

−

2 3 1 4 1

0 0 1 3 2

0 0 2 2 3

0 0 0 0 4

−

−

0 3 2 5

0 0 0 0

0 0 2 3

−

Any matrix can be reduced to echelon form by performing a sequence of elementary row operations. These

operations allow the corresponding system of equations to be simplified, without altering the solution.

The three available elementary row operations are:

ERO1 – add a multiple of one row to another row.

ERO2 – swap two rows.

ERO3 – multiply a row by a nonzero constant.

Note that if a square matrix is reduced to echelon form, we get a bonus because the determinant is simply the

product of the entries on the diagonal. Be careful, however. Although ERO1 (remarkably) does not affect the det,

ERO2 and ERO3 both do. Swapping two rows negatives the det, while scaling a row scales the det by the same

factor.

14

Example

Reduce the following matrix to echelon form.

1 2 3 2 1

2 4 6 7 8

1 4 2 2 1

4 8 12 2 8

− − − −

−

15

Solution

1 2 3 2 1

0 0 0 3 6 2 2 2 1

0 2 1 4 0 3 3 1

0 0 0 6 12 4 4 4 1

1 2 3 2 1

0 2 1 4 0 2 3

0 0 0 3 6

0 0 0 6 12

1 2 3 2 1

0 2 1 4 0

0 0 0 3 6

0 0 0 0 0 4 4 2 3

R R R

R R R

R R R

R R

R R R

= −

− = +

− − = −

− ↔

− −

−

= +

16

Optional extra #1: if we wanted all the leading entries to be 1, we could have scaled the rows en route, but we

can also scale them now:

1 1

2 2

1

3

1 2 3 2 1

0 1 2 0 2 2

0 0 0 1 2 3 3

0 0 0 0 0

R R

R R

− − = × −

= ×

Optional extra #2: having done this, we could carry on and clear out the columns vertically above each leading 1,

working from right to left. This eventually leads to the reduced row-echelon form:

1

2

1

2

1 2 3 0 3 1 1 2 3

0 1 0 4 2 2 2 3

0 0 0 1 2

0 0 0 0 0

1 0 4 0 11 1 1 2 2

0 1 0 4

0 0 0 1 2

0 0 0 0 0

R R R

R R R

R R R

− = − − = +

− = − −

17

Reduced row-echelon form is the simplest form that a matrix (or really, the underlying system of equations) can

take. It makes back-substitution a breeze, and furthermore it is unique, whereas an ordinary row-echelon form

(even with leading 1’s) is not. Then again, we need to work harder to obtain the reduced row-echelon form in the

first place.

For Matlab enthusiasts, there is a function called rref for computing the reduced row echelon form of a matrix.

18

1.4 Rank and nullity

Let us first refresh the notion of linear independence.

A set of vectors is linearly independent if no member of the set can be

expressed as a linear combination of the other members. Otherwise, the

set of vectors is linearly dependent.

When discussing matrices, the terms row rank ( = number of linearly independent rows) and column rank (=

number of linearly independent columns) are commonly used. Somewhat surprisingly, it turns out that the row

rank and the column rank of any matrix are the same, so there is no ambiguity in talking about ‘the rank’ of a

matrix. In summary:

The rank of a matrix A can be defined as either

• the number of linearly independent rows of A, or

• the number of linearly independent columns of A.

The two definitions are equivalent.

19

Sometimes the rank of a matrix is obvious from inspection, e.g.

1 0

0 2

3 4

1 0 0

0 2 0

3 4 5

1 0 2 0

0 3 0 4

2 3 4 4

have ranks of 2, 3 and 2. The failsafe technique, however, is to reduce the matrix to echelon form, then count the

number of nonzero rows (or equally well, count the number of leading entries on the echelon – both counts will

be same because of the strict definition of echelon form).

We now need to introduce the idea of the kernel (or null space) of a matrix. In Section 1.1 we discussed how an

m × n matrix A defines a linear transformation from RRRRn to RRRRm. Essentially, the kernel of A is the set of all vectors

in RRRRn that are mapped to the zero vector in RRRRm. It is denoted ker(A). The dimension of this set, i.e. the number of

linearly independent vectors it contains, is called the nullity of A. More formally,

The kernel of a matrix A is the set of all solutions of the homogeneous

system Ax = 0.

The nullity of a matrix A is the number of linearly independent solutions

of the homogeneous system Ax = 0.

20

It is clear that ker(A) always contains the zero vector, 0. If this is the only vector in ker(A), then by convention,

nullity(A) = 0. If ker(A) contains vectors other than 0, then nullity(A) > 0. One way to determine nullity(A) is to

solve the homogeneous system Ax = 0, then count the number of linearly independent solutions. We will do this

in the next section. There is, however, a neat shortcut based on the rank-nullity theorem:

Let A be a matrix with n columns. Then

rank(A) + nullity(A) = n

This means that we can determine the nullity of a matrix as soon as it has been reduced to echelon form.

21

Example

Find the rank and nullity of the matrix

1 2 3 5

2 3 4 7

3 4 5 9

=

A

Solution

1 2 3 5

0 1 2 3 2 2 2 1

0 2 4 6 3 3 3 1

1 2 3 5

0 1 2 3

0 0 0 0 3 3 2 2

R R R

R R R

R R R

− − − = −

− − − = −

− − −

= −

There are two nonzero rows, so rank(A) = 2. Since the matrix has n = 4 columns, nullity(A) = n – rank(A) = 4 – 2

= 2.

22

1.5 Solving Ax = 0

A linear system in which the right-hand side vector is filled with zeros is said to be homogeneous. Suppose we

are given a matrix A and asked to find the general solution of Ax = 0. This is exactly the same as being asked to

find ker(A). To start with, we can quickly determine nullity(A) as outlined above. If it turns out to be zero, ker(A)

contains only the zero vector, and we can declare immediately that Ax = 0 has only the trivial solution x = 0. If

nullity(A) is not zero, then ker(A) contains vectors other than 0, and Ax = 0 has infinitely many solutions. To find

them, we use reduction to echelon form and a ‘special’ back-substitution process.

Example

Find the general solution of Ax = 0 where

2 3 2 1 4

4 6 1 1 1

8 12 1 5 11

4 6 5 7 19

− − =

− − − − −

A

23

Solution

2 3 2 1 4

0 0 3 3 9 2 2 2 1

0 0 9 9 27 3 3 4 1

0 0 9 9 27 4 4 2 1

2 3 2 1 4

0 0 3 3 9

0 0 0 0 0 3 3 3 2

0 0 0 0 0 4 4 3 2

R R R

R R R

R R R

R R R

R R R

− − − = −

− − = −

− = +

− − −

= −

= +

We see that rank(A) = 2 and nullity(A) = 5 – 2 = 3, so the system Ax = 0 has infinitely many solutions. In fact, we

could see at the outset that the system was underdetermined (4 × 5), so we knew that nullity(A) had to be at

least 5 – 4 = 1. In general, an underdetermined homogeneous system always has infinitely many solutions.

24

Writing the echelon form of Ax = 0 out in full, we have:

1

2

3

4

5

2 3 2 1 4 0

0 0 3 3 9 0

0 0 0 0 0 0

0 0 0 0 0 0

x

x

x

x

x

−

− − =

The variables are now split into two groups. Columns C1 and C3 contain the leading entries on the echelon

(circled). The corresponding variables x1 and x3 are variously referred to as the leading variables, key variables,

bound variables or pivot variables. The remaining columns (C2, C4, C5) are associated with the variables x2, x4

and x5. These are called the free variables (or sometimes, the non-leading variables). To find the general

solution, we first assign arbitrary parameters to the free variables, say

2 4 5, , x x x= α = β = γ

Next we solve for the leading variables in terms of these parameters, working up through the nonzero rows, and

from right to left along the echelon. (Note how each of the zero rows automatically has a zero on the right-hand

side – it is impossible for a homogeneous system to be inconsistent.) We first solve row R2 for x3,

3 4 5

3

3

3 3 9 0

3 3 9 0

3

x x x

x

x

− + − =

− + β − γ =

= β − γ

25

then solve row R1 for x1:

( )1 2 3 4 5

1

3 11 2 2

2 3 2 4 0

2 3 2 3 4 0

x x x x x

x

x

+ + − + =

+ α + β − γ − β + γ =

= − α − β + γ

The general solution can now be assembled, then (optionally) expressed as a linear combination of vectors, one

for each arbitrary parameter:

3 3 111 22 2 2

2

3

4

5

1

001

313 0

010

100

x

x

x

x

x

−− α − β + γ −

α = = α + β + γ − β − γ

β γ

(1.3)

This is the desired general solution of Ax = 0; it is also ker(A). We can easily read off a set of linearly

independent basis vectors for the kernel:

3 1

221

001

, , 310

010

100

−− −

26

In this instance, ker(A) is seen to be a 3-dimensional subspace of RRRR5. Hence nullity(A) – which is just the

dimension of ker(A) – is equal to 3. This agrees with the initial ‘prediction’ of nullity(A) that was made using the

rank-nullity theorem. It is good practice to check that each of the kernel basis vectors does actually satisfy Ax =

0, and this can readily be verified. It follows that any linear combination of the basis vectors, as in eqn (1.3), must

also satisfy Ax = 0.

1.6 Solving Ax = b

With a homogeneous linear system Ax = 0, the answer to the question of solution existence is always affirmative,

since (as discussed above) there will always be at least the trivial solution x = 0. With a non-homogenous system

Ax = b, the answer to this question is no longer automatic: there is now a possibility that the system will have no

solution.

As usual, reduction to echelon form (this time performing elementary row operations on the augmented

coefficient matrix [A | b]) provides clarity. Suppose we have a 4 × 5 system and end up with something like

0 0

0 0 0 0 0

0 0 0 0 0 0

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗

∗

27

We can immediately infer that the original system Ax = b is inconsistent, i.e. it has no solution. We can interpret

the circled entry in two different ways, using the terminology introduced in Section 1.2. Taking the row-oriented

view, the emergence of a contradictory equation indicates that the 4 hyperplanes in RRRR5 have no point in common.

Taking the column-oriented view, the fact that [A | b] and A have different column ranks (3 and 2 respectively)

indicates that the vector b in RRRR4 cannot be expressed as a linear combination of the 5 columns of A.

On the other hand, suppose we reduce a general m × n system Ax = b to echelon form, and find that the trailing

rows (if any) all look like

[ ]0 0 0 0L

We can then infer that the original system Ax = b is consistent, i.e. it has at least one solution. Taking the row-

oriented view, the absence of a contradictory equation indicates that the m hyperplanes in RRRRn have at least one

point in common. Taking the column-oriented view, the fact that [A | b] and A have the same column rank

indicates that the vector b in RRRRm can be expressed – in at least one way – as a linear combination of the n

columns of A.

28

If the above analysis shows that there is at least one solution, we turn to the question of uniqueness: is there just

one solution, or a whole family of solutions?

The answer for Ax = b, just as for Ax = 0, is determined by the nullity of A. If nullity(A) = 0, the solution is unique,

and it can be found by ‘standard’ back-substitution (leading variables only, no free variables). If nullity(A) > 0, the

solution involves a corresponding number of arbitrary parameters, and its general form has to be worked out by

‘special’ back-substitution (separate treatment of leading and free variables).

Our last example will show how to solve a non-homogeneous system Ax = b that has infinitely many solutions.

To illustrate some important points, we will retain the same A matrix that was used to demonstrate the solution of

Ax = 0 in Section 1.5.

29

Example

Find the general solution of Ax = b where

2 3 2 1 4

4 6 1 1 1

8 12 1 5 11

4 6 5 7 19

− − =

− − − − −

A ,

7

2

8

22

=−

b

30

Solution

2 3 2 1 4 7

4 6 1 1 1 2

8 12 1 5 11 8

4 6 5 7 19 22

2 3 2 1 4 7

0 0 3 3 9 12 2 2 2 1

0 0 9 9 27 36 3 3 4 1

0 0 9 9 27 36 4 4 2 1

2 3 2 1 4 7

0 0 3 3 9 12

0 0 0 0 0 0 3 3 3 2

0 0 0 0 0 0 4 4 3 2

R R R

R R R

R R R

R R R

R R R

− −

− − − − − −

− − − − = −

− − − = −

− = +

− − − −

= −

= +

Since rank([A | b]) = rank(A) = 2, the system is consistent: there is at least one solution. Furthermore, since

nullity(A) = 5 – 2 = 3, we know that the general solution will involve 3 arbitrary parameters, so in fact the system

has infinitely many solutions.

31

Proceeding just as in Section 1.5, we first assign arbitrary parameters to the free variables, say

2 4 5, , x x x= α = β = γ

Next we solve for the leading variables, remembering to incorporate the right-hand side entries (this time they

are not zero!). Solving row R2 for x3 gives

3 4 5

3

3

3 3 9 12

3 3 9 12

4 3

x x x

x

x

− + − = −

− + β − γ = −

= + β − γ

and solving row R1 for x1 gives

( )1 2 3 4 5

1

31 11 2 2 2

2 3 2 4 7

2 3 2 4 3 4 7

x x x x x

x

x

+ + − + =

+ α + + β − γ − β + γ =

= − − α − β + γ

32

The general solution can now be assembled, then ‘unpacked’ as before:

3 31 11 11 2 22 2 2 2

2

3

4

5

p

1

00 01

34 14 3 0

00 10

10 00

ker( )

x

x

x

x

x

− −− − α − β + γ −

α = = + α + β + γ − + β − γ

β γ

x x A

123 14243 1444442444443

(1.4)

We recognize the second part of the solution as ker(A): it is identical to the solution of the homogeneous system

Ax = 0 that we obtained in eqn (1.3). But we now have an additional term: a particular solution xp that (we hope!)

satisfies Ax = b. It is easy to check that it does:

1

22 3 2 1 4 70

4 6 1 1 1 24

8 12 1 5 11 80

4 6 5 7 19 220

−−

− =

− − − − − −

�

Of course this is not the only possible choice for xp. The general solution contains three arbitrary parameters,

and we can pick any values we like for those.

33

For example, setting α = 1, β = 0 and γ = 1 in eqn (1.4), we get another solution

[ ]1 1 1 0 1T

= −x

that also satisfies Ax = b,

12 3 2 1 4 7

14 6 1 1 1 2

18 12 1 5 11 8

04 6 5 7 19 22

1

− −

− =

− − − − − −

�

and could therefore serve just as well as the particular solution xp. An infinite number of other (equally valid)

choices could be made. Note, however, that any expression for the general solution of Ax = b must always

include the whole of the kernel:

p ker( )= +x x A

34

In closing it is worth recalling the ‘first’ form of the general solution in eqn (1.4), and reflecting on the significance

of the rank and nullity of the coefficient matrix A.

31 11 2 2 2

2

3

4

5

4 3

x

x

x

x

x

− − α − β + γ

α = + β − γ

β γ

In this example nullity(A) is 3, which corresponds to the number of free variables (x2, x4, x5) that can be assigned

arbitrary values. Also, rank(A) is 2, which corresponds to the number of leading variables (x1, x3) that are ‘bound’

to the free variables in certain prescribed ratios.

In general, the significance of the nullity of A is that it tells us the number of degrees of freedom in the general

solution of a linear system Ax = b (or indeed Ax = 0). The rank of A is a complementary measure – it tells us the

number of constraints imposed on the general solution. If the nullity is zero, the solution has no degrees of

freedom, therefore that solution is unique. If the nullity is greater than zero, the solution has a corresponding

number of degrees of freedom, and consequently the system has infinitely many solutions.

35

Flowchart for Ax = b (matrix A is m × n)

36

Example (A1 2004 Q7)

A set of linear equations is given by Ax = b where

1 1 2

1 2 3

2 3 5

1 3 4

=

A ,

1

2

3

x

x

x

=

x and

3

4

7

5

=

b

(a) Define the rank, nullity and kernel of a matrix.

(b) Reduce the augmented matrix [A | b] to echelon form and hence find the rank, nullity and kernel of the

matrix A.

(c) Confirm that solutions to Ax = b exist, and find the general solution for x.

37

Topic 2

Norms and conditioning

2.1 Vector norms

When we talk about the ‘size’ or ‘magnitude’ of a real number, we mean its absolute value. Although the number

–6.2 is algebraically smaller than the number 2.8, most of us would agree that –6.2 is ‘bigger’ than 2.8, in the

sense that it lies further away from zero. The idea can be expressed more formally: the length of the vector

running from the origin to –6.2 is greater than that of the vector running from the origin to 2.8.

What about higher dimensions? We know how to calculate the length of vectors in RRRR2 and RRRR3 using Pythagoras’s

theorem, and there is a very natural extension to RRRRn:

[ ]

[ ]

[ ]

2 21 2 1 2

2 2 21 2 3 1 2 3

2 2 21 2 1 2

length of

length of

length of

T

T

T

n n

x x x x

x x x x x x

x x x x x x

= +

= + +

= + + +

M

L K

38

The general definition works also perfectly well when n = 1, since

[ ] 21 1 1length of x x x= =

Going back to the example in the opening paragraph, we get the desired outcome: the length of the 1-component

vector [ ]6.2− is greater than that of [ ]2.8 .

We are very used to calculating the length of a vector in this way, but in fact a whole family of ‘length’ measures

can be defined as follows:

[ ] ( )

[ ] ( )

[ ] ( )

1

1 2 1 2

1

1 2 3 1 2 3

1

1 2 1 2

length of

length of

length of

pT p p

pT p p p

pT p p pn n

x x x x

x x x x x x

x x x x x x

= +

= + +

= + + +

M

L K

This generalized measure of length is called the p-norm (or Lp norm). To denote the p-norm of a vector x, we use

the special notation p

x . We never write p

x , though the similarity is of course intentional; taking the norm of a

vector gives us a scalar measure of its magnitude, and is thus analogous to taking the absolute value of a real

(or complex) number.

39

The p-norm can be defined compactly as follows.

Let x be a vector and p (≥ 1) be a scalar. The p-norm of x is

1 pp

ipi

x

= ∑x

It is not essential that p be an integer, but in practice it invariably is. In fact, there are only a few norms in

common use, namely the ones that are easy to calculate: those for p = 1, p = 2 and p = ∞.

The Pythagorean measure of length that we are so accustomed to is just the p-norm with p = 2. It is commonly

referred to as the Euclidean norm, or the Euclidean metric (note that in this context, metric means “measure of

distance” – nothing to do with SI units). The 2-norm can also be expressed in a compact vectorial form, since it is

the square root of the dot product of x with itself:

2

T= =x x x x x�

It is often simpler (e.g. when minimizing the length of a vector, see Q7 of the tutorial sheet) to work with square

of the 2-norm:

2

2

T= =x x x x x�

40

If we set p = 1 in the general definition, we get

1 i

i

x=∑x

i.e. the sum of the absolute values of the components. The 1-norm is sometimes referred to as the ‘taxicab’ or

‘Manhattan’ norm, since a hypothetical taxi driving from [ ]0 0T

to a destination with coordinates [ ]1 2

Tx x has to

cover a distance of at least 1 2x x+ if it is constrained to drive on a rectangular grid of streets in RRRR2.

Finally, if we let p approach infinity in the general definition, we get

( )max ii

x∞

=x

i.e. the supremum of the absolute values of the components. You are asked to justify this in Q3 of the tutorial

sheet.

It is also possible to come up with vector norms other than the p-norm, but any proposed norm must possess

certain common-sense properties in order to qualify as a meaningful measure of length. These are similar to the

properties of the absolute value function.

41

Specifically,

(i) 0 if and only if = =x x 0

(ii) α = αx x

(iii) + ≤ +x y x y (the ‘triangle inequality’)

It can be shown that these basic properties imply another common-sense property that we would expect any

norm to exhibit:

0 for all ≥x x

For example, the 1-norm ii

x∑ is a valid vector norm, but ii

x∑ is not, since the sum of the terms in a vector

could be negative.

A good way to understand vector norms is to study how the unit circle in RRRR2 looks under the three common p-

norms:

42

Example

Find the 1-, 2-, 5-, and ∞-norms of the vector [ ]3 5 4 1T

= − −x .

Solution

13 5 4 1 13= + − + + − =x

( ) ( )1 2

2 22 2

23 5 4 1 51 7.14 = + − + + − = =

x

1 55 5 5 5 5

53 5 4 1 4393 5.35 = + − + + − = =

x

( )max 3 , 5 , 4 , 1 5∞

= − − =x

21=x

1

1

11=x

1

1

1∞

=x

1

1

43

2.2 Matrix norms

Can we extend the notion of a norm from vectors to matrices? Intuition suggests that a matrix such as

100 200

75 400

−

must somehow be ‘bigger’ than a matrix such as

1 3

1 2

−

One idea comes quickly to mind. We could place all the entries of the matrix into a single vector, then apply the

vector p-norm introduced above. If we took p = 2, this would be equivalent to squaring each entry of the matrix,

summing and taking the square root. This is indeed a commonly used matrix norm known as the Frobenius

norm:

2

Fro iji j

A= ∑∑A

Another approach, and one that turns out to be much more useful, is to think about the action of a matrix on an

arbitrary vector. As we noted in the first lecture, an m × n matrix A can be viewed as a linear operator that maps

a vector x in RRRRn to a vector Ax in RRRRm. If we take the norm of the ‘output’ vector Ax and divide it by the norm of the

44

‘input’ vector x, we should get some clue as to the ‘magnifying power’ of the matrix A (and thus its inherent ‘size’

or ‘magnitude’). We are therefore motivated to calculate the ratio

p

p

Ax

x

assuming of course that x is not the zero vector. The trouble is that, unless A happens to be the identity matrix

(or some multiple thereof), this ratio will depend on the direction of the input vector x. To use a very simple

example, consider the diagonal matrix

5 0

0 2

=

A

The vector [ ]1 0T

=x is mapped to [ ]5 0T

=Ax (magnification = 5), but the vector [ ]0 1T

is mapped to

[ ]0 2T

(magnification = 2). Vectors in other directions undergo intermediate levels of magnification. This can be

clarified by looking at the image of the unit circle under the action of A. It is mapped to an ellipse with semi-axes

of length 5 and 2:

45

-3

-2

-1

0

1

2

3

-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6

Given that a general matrix A is likely to exert widely varying ‘magnifying powers’ on different input vectors x, we

had better be conservative and define the matrix p-norm of A as

maxp

p

p≠

=x 0

AxA

x

With this definition, the matrix p-norm provides an upper bound on the ‘magnifying power’ of A (as measured by

applying the vector p-norm to the input vector x and output vector Ax). It means that we are assured of satisfying

the following important inequality for any vector x, regardless of its direction.

46

Let A be a matrix with n columns, and let x be a vector in R R R Rn. If the

inequality

≤Ax A x

is satisfied for all vectors x in RRRRn, then the matrix norm A is said to be

compatible with, or induced from, the vector norm used as a metric for x

and Ax.

There is no general formula for calculating the p-norm of a matrix A in terms of p and the entries Aij. For selected

values of p, however, it is possible to work out formulae for p

A that guarantee the satisfaction of the above

‘compatibility inequality’ for an arbitrary vector x.

For p = 1, the compatible matrix norm is

1

max ijj

i

A

= ∑A

which is the maximum absolute column sum of A (remember that i counts down the rows, and j counts across

the columns). We could just take this on trust, but it would be nice to verify that the inequality

47

1 1 1

≤Ax A x

is actually satisfied for any vector x. The proof requires some messy fiddling with summations and absolute

values. The indices i and j are understood to run from 1 to m and 1 to n respectively.

1

1 1

1 1

*

max

ij ji j

ij ji j

ij ji j

ij jj i

j ijj i

j ijj

j i

A x

A x

A x

A x

x A

x A

=

≤

=

=

=

≤ ⋅

=

=

∑∑

∑∑

∑∑

∑∑

∑ ∑

∑ ∑

Ax

x A

A x

48

The justification for the step marked * is the scalar version of the triangle inequality mentioned above:

α + β ≤ α + β .

For p = ∞, the compatible matrix norm is

max iji

j

A∞

=

∑A

which is the maximum absolute row sum of A. To verify the validity of this definition, we need to show that the

inequality

∞ ∞ ∞

≤Ax A x

is certain to be satisfied for any vector x. This time the proof is a little shorter, but still requires careful study to

follow the reasoning. You might find it helpful to make up a few small but concrete ‘test cases’ (e.g. with A = 2 ×

3 and x = 3 × 1) and trace what happens at each step. Make sure you include a few negative terms in both A and

x!

49

( )

max

max *

max

max max

ij ji

j

ij ji

j

ij ji

j

ij ji j

j

A x

A x

A x

A x

∞

∞ ∞

=

≤

=

≤ ⋅

=

∑

∑

∑

∑

Ax

A x

Again, note the use of the triangle inequality to reach line *.

The matrix 2-norm is widely used, but can be difficult to calculate. If A happens to be square and symmetric, the

2-norm is the supremum of the absolute values of the eigenvalues:

( )2

max ii

= λA

(You might like to think about the unit circle to ellipse mapping discussed above.)

50

This is compatible with the 2-norm for vectors, i.e. it ensures

2 2 2

≤Ax A x

for any vector x. More generally, the 2-norm of A turns out to be its largest singular value (this is beyond the

scope of the present course).

Example

Find the 1-norm, ∞-norm and Frobenius norm of the matrix

3 2 6

5 8 4

4 1 3

− = − − −

A

Solution

The 1-norm is the maximum absolute column sum: ( )1

max 12, 11, 13 13= =A

The ∞-norm is the maximum absolute row sum: ( )max 11, 17, 8 17∞

= =A

The Frobenius norm is the square root of the sum of squares:

( ) ( )1 2

2 22

Fro3 2 3 180 13.42 = + − + + − = =

A K

51

Example

Find the 2-norm of the matrix

1 2

2 3

= −

A

Solution

The matrix is square and symmetric, so its 2-norm can be calculated from the eigenvalues.

( )( )2

1 2det 0

2 3

1 3 4 0

2 7 0

− λ = − − λ

− λ − − λ − =

λ + λ − =

Solving the quadratic gives 1 8 1.83, 3.83λ = − ± = − . Hence

( )2

max 1.83 , 3.83 3.83= − =A

52

Example

Given the matrix and vector

1.8 1.4

0.1 0.6

= −

A , 0.3

0.2

=

x

investigate the satisfaction of the ‘compatibility inequality’

p q p

≤Ax A x

when p and q take values 1 and ∞ in all combinations.

Solution

1.8 1.4 0.3 0.82

0.1 0.6 0.2 0.09

= = −

Ax

We therefore have

1

0.5, 0.3∞

= =x x (vector norms of x)

1

2.0, 3.2∞

= =A A (matrix norms of A)

1

0.91, 0.82∞

= =Ax Ax (vector norms of Ax)

53

Combination p q p≤Ax A x Inequality satisfied?

p = 1, q = 1 0.91 ≤ 2.0 × 0.5 YES (expected)

p = ∞, q = ∞ 0.82 ≤ 3.2 × 0.3 YES (expected)

p = 1, q = ∞ 0.91 ≤ 3.2 × 0.5 YES (lucky)

p = ∞, q = 1 0.82 ≤ 2.0 × 0.3 NO

Moral: don’t mix and match incompatible matrix and vector norms!

54

2.3 Conditioning of linear systems

Consider the 2 × 2 system

1

2

1 1 2

1 1 0

x

x

= −

which has the solution

1

2

1

1

x

x

=

If we perturb one of the terms in the matrix by 1%,

1

2

1 1.01 2

1 1 0

x

x

= −

the solution changes to

1

2

0.995

0.995

x

x

=

This seems reasonable – both components of x have changed by 0.5%, which is of the same order of magnitude

as the perturbation.

55

Similarly, if we perturb the first term of the right hand side vector by 1%,

1

2

1 1 2.02

1 1 0

x

x

= −

the solution becomes

1

2

1.01

1.01

x

x

=

which is again in line with our intuitive expectation.

This type of ‘nicely behaved’ linear system is said to be well-conditioned.

Now let’s look at another 2 × 2 system

1

2

1 1 2

0.99 1.01 2

x

x

=

which also happens to have the solution

1

2

1

1

x

x

=

56

When we perturb one of the terms in the matrix by 1%,

1

2

1 1.01 2

0.99 1.01 2

x

x

=

we now get a drastic change in the solution:

1

2

0

1.98

x

x

=

When we perturb the first term of the right hand side vector by 1%,

1

2

1 1 2.02

0.99 1.01 2

x

x

=

the solution is wildly different again:

1

2

2.01

0.01

x

x

=

57

This behaviour is surprising, and clearly undesirable: the solution x is highly sensitive to small perturbations in

the problem data A and b (such as those which might arise from ‘noisy’ data, imperfect measurements, or

numerical roundoff errors during a computation). This type of ‘badly behaved’ linear system is said to be ill-

conditioned. We can gain some insight into what is happening if we plot the two ‘base case’ situations (well-

conditioned on the left, ill-conditioned on the right):

0

1

2

0 1 2

0

1

2

0 1 2

The reason for the ill-conditioned behaviour is now clear. The lines are so close to parallel that even small

changes in slope and/or intercept cause the intersection point to move far away from the ‘base case’ solution

point (1, 1).

58

If the lines were parallel, the determinant of the coefficient matrix would be zero, which might prompt the idea of

using det(A) as an indicator of possible conditioning trouble. The ill-conditioned system above has

1 1

0.99 1.01

=

A

such that det(A) = 0.02, which seems close to zero. Unfortunately, the determinant is useless for predicting ill-

conditioning because a simple rescaling of the problem leads to a change in the determinant. If we multiply the

equations of the ill-conditioned system by 10, we get

1

2

10 10 20

9.9 10.1 20

x

x

=

such that det(A) = 2. The matrix appears to be much ‘less singular’, but the solution remains just as sensitive to

1% (or other small) perturbations of the entries in A and/or b.

Fortunately, there is a better approach for quantifying the sensitivity of a linear system, thus allowing ill-

conditioned behaviour to be predicted in advance (i.e. without resorting to trial and error perturbations of the sort

we imposed above). The perturbations to A and b are characterized by the ‘relative error’ ratios

δA

A and

δb

b

59

We would like to be able to predict the effect on the solution, via the corresponding ‘relative error’ ratio

δx

x

Any convenient norm can be used, but it is essential that the matrix norm and vector norm be ‘compatible’ in the

sense defined previously.

Suppose we have a linear system defined by a matrix A, a right-hand side vector b, and an exact solution x,

such that

=Ax b

The matrix A is assumed to be square and invertible.

If the matrix is perturbed from A to A + δA, the solution will be perturbed from x to x + δx, such that

( )( )+ δ + δ =A A x x b

Subtracting Ax from the left and b from the right,

( )δ + δ + δ =A x A x x 0

60

Premultiplying by 1−A and rearranging,

( )

( )

1

1

−

−

δ + δ + δ =

δ = − δ + δ

x A A x x 0

x A A x x

Taking norms and applying the ‘compatibility inequality’ twice,

( )

( )

1

1

1

−

−

−

δ = δ + δ

≤ δ + δ

≤ δ + δ

x A A x x

A A x x

A A x x

Hence

1−δ

≤ δ+ δ

xA A

x x

or equivalently

( )1−δ δ≤

+ δ

x AA A

x x A (2.1)

61

Similarly, if the right-hand side vector is perturbed from b to b + δb, the solution will be perturbed from x to x +

δx, such that

( )+ δ = + δA x x b b

Subtracting Ax from the left and b from the right,

δ = δA x b

Premultiplying by 1−A ,

1−δ = δx A b

Taking norms and applying the ‘compatibility inequality’,

1

1

−

−

δ = δ

≤ δ

x A b

A b

Multiplying by b on the left and Ax on the right,

1−δ ≤ δx b A b Ax

Applying the ‘compatibility inequality’ once more,

1−δ ≤ δx b A b A x

62

Rearranging,

( )1−δ δ

≤x b

A Ax b

(2.2)

Remarkably, the bracketed term 1−A A appears in both eqn (2.1) and eqn (2.2). It is called the condition

number, and is usually denoted κ.

Let A be a square, invertible matrix. The condition number of A is

( ) 1−κ =A A A

For a linear system Ax = b, the condition number provides an upper

bound on the relative error in x caused by relative errors in A and b:

( )δ δ

≤ κ+ δ

x AA

x x A

( )δ δ

≤ κx b

Ax b

63

Example

Find the condition numbers for the two systems considered at the start of this section. Also, for the ill-conditioned

system, verify that the actual relative errors caused by the perturbations in A and b are within the bounds

predicted by the condition number. Use the infinity norm.

Solution

For the well-conditioned system,

11 1 0.5 0.5 and

1 1 0.5 0.5

− = = − −

A A

hence

( ) 1 2 1 2−∞ ∞ ∞

κ = = × =A A A (which is low)

For the ill-conditioned system,

11 1 50.5 50 and

0.99 1.01 49.5 50

− − = = −

A A

64

hence

( ) 1 2 100.5 201−∞ ∞ ∞

κ = = × =A A A (which is high)

The matrix was changed from

1 1

0.99 1.01

=

A to 1 1.01

0.99 1.01

+ δ =

A A

such that

0 0.01

0 0

δ =

A

This caused the solution to change from

1

1

=

x to 0

1.98

+ δ =

x x

such that

1

0.98

− δ =

x

65

We therefore expect

( )

1 0.01201

1.98 2

0.505 1.005

∞ ∞∞

∞ ∞

δ δ≤ κ

+ δ

≤ ×

≤

x AA

x x A

which checks out.

The right-hand side was changed from

2

2

=

b to 2.02

2

+ δ =

b b

such that

0.02

0

δ =

b

This caused the solution to change from

1

1

=

x to 2.01

0.01

+ δ =

x x

66

such that

1.01

0.99

δ = − x

We therefore expect

( )

1.01 0.02201

1 2

1.01 2.01

∞ ∞∞

∞ ∞

δ δ≤ κ

≤ ×

≤

x bA

x b

which again checks out.

Documents

4 lectures, Michaelmas Term 2010 Prof. Steve Roberts sjrob ...4 The function f(x) = ax satisfies both requirements, whereas functions like x , log( x) and x + 2 clearly do not: they