Page 1:

Least Squares example

There are 3 mountains u, y, z whose heights from one site have been measured as 2474 ft., 3882 ft., and 4834 ft. But from u, y looks 1422 ft. taller and z looks 2354 ft. taller, and from y, z looks 950 ft. taller. Set up the overdetermined system.

          1  0  0              2474
          0  1  0              3882
          0  0  1     u        4834
   Ax =  -1  1  0     y   ≈    1422   = b
         -1  0  1     z        2354
          0 -1  1               950

Want to minimize ||Ax − b||₂²
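The minimization can be carried out directly; here is a minimal sketch assuming NumPy (the slides themselves use Matlab notation later):

```python
import numpy as np

# The six measurements: three direct heights and three observed differences.
A = np.array([[ 1,  0, 0],
              [ 0,  1, 0],
              [ 0,  0, 1],
              [-1,  1, 0],
              [-1,  0, 1],
              [ 0, -1, 1]], dtype=float)
b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)

# lstsq minimizes ||Ax - b||_2 over x = (u, y, z).
x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)
u, y, z = x
```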

Page 2:

Exponential example

Assume you are given the following data:

t:  1     2     3     4     5     6     7     8     9     10
b:  1.3   0.85  0.5   0.4   0.3   0.2   0.15  0.14  0.12  0.11

You think the data fit the model

b = u + w e^(−0.5t)

for some values of u and w, which you wish to determine so that the residual is as small as possible.

The matrix A would look like:

         1   e^(−0.5)
         1   e^(−1.0)
    A =  1   e^(−1.5)      with  A (u, w)ᵀ ≈ b
         .      .
         1   e^(−5.0)

[Figure: plot of the data b against t = 1, …, 10, decaying from 1.3 toward about 0.1]
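A minimal sketch, assuming NumPy, of building this A and fitting u and w:

```python
import numpy as np

t = np.arange(1, 11, dtype=float)
b = np.array([1.3, 0.85, 0.5, 0.4, 0.3, 0.2, 0.15, 0.14, 0.12, 0.11])

# Row i of A is [1, e^(-0.5 t_i)], so that A @ [u, w] = u + w e^(-0.5 t).
A = np.column_stack([np.ones_like(t), np.exp(-0.5 * t)])
(u, w), *_ = np.linalg.lstsq(A, b, rcond=None)
```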

Page 3:

Approaches to solve Ax ≈ b

• Normal equations - quick and dirty

• QR - standard in libraries; uses an orthogonal decomposition

• SVD - decomposition that also indicates how linearly independent the columns are

• Conjugate gradient - no decompositions; good for large sparse problems

Page 4:

Quick and Dirty Approach

Multiply by Aᵀ to get the normal equations:

AᵀA x = Aᵀb

For the mountain example the matrix AᵀA is 3 x 3.

The matrix AᵀA is symmetric.

However, sometimes AᵀA can be nearly singular or singular.

                      1  1
Consider the matrix   e  0   = A
                      0  e

                     1+e²    1
The matrix AᵀA =
                      1     1+e²

becomes singular if e is less than the square root of the machine precision.
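This loss of information can be seen numerically; a small sketch (assuming NumPy) in which the Gram matrix AᵀA of the A above loses rank once e drops below sqrt(machine epsilon) ≈ 1.5e-8:

```python
import numpy as np

def gram(e):
    A = np.array([[1.0, 1.0], [e, 0.0], [0.0, e]])
    return A.T @ A                 # equals [[1 + e**2, 1], [1, 1 + e**2]]

# e above sqrt(eps): 1 + e**2 is still distinguishable from 1.
ok = np.linalg.matrix_rank(gram(1e-7))    # rank 2
# e below sqrt(eps): 1 + e**2 rounds to exactly 1, so A^T A = [[1, 1], [1, 1]].
bad = np.linalg.matrix_rank(gram(1e-9))   # rank 1
```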

Page 5:

Normal equations for least squares - slide 2

For the hill problem:

            3  -1  -1     u       -1302
AᵀA x =    -1   3  -1     y   =    4354    = Aᵀb
           -1  -1   3     z        8138

Solution is u = 2472, y = 3886, z = 4832.
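The quick-and-dirty route might look like this in code (a NumPy sketch; the slides do not prescribe a library):

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)

# Form and solve the normal equations A^T A x = A^T b.
AtA = A.T @ A        # the symmetric 3 x 3 matrix [[3,-1,-1],[-1,3,-1],[-1,-1,3]]
Atb = A.T @ b        # the right-hand side [-1302, 4354, 8138]
x = np.linalg.solve(AtA, Atb)
```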

Page 6:

QR: A more stable approach

Assume the m x n matrix A and the m-vector b can be partitioned into

        R               c
  A =        and  b =
        0               d

where R is a nonsingular n x n matrix and c has length n.

Then ||Ax − b||₂² = ||Rx − c||₂² + ||d||₂²

So pick x such that Rx = c, i.e. ||Rx − c||₂² = 0, which implies ||Ax − b||₂² = ||d||₂². This is the best one can do.

Page 7:

QR for least squares-slide 2

But most A matrices do not have this form; using orthogonal transformations, however, we can transform any matrix of full column rank to this form.

A matrix Q is orthogonal if QᵀQ = I.

If Q is orthogonal then for any x, ||Qx||₂² = ||x||₂²;

that is, an orthogonal matrix preserves the 2-norm.

Examples of orthogonal matrices:

   1  0        0  1        cos ψ   sin ψ
   0  1        1  0       −sin ψ   cos ψ

the last, for some angle ψ, being a Givens rotation.

Page 8:

QR for least squares -slide 3

We wish to pick a sequence of orthogonal matrices such that A is transformed into upper triangular form:

  x x x      x x x      x x x      x x x
  x x x      0 x x      0 x x      0 x x        R
  x x x  →   0 x x  →   0 0 x  →   0 0 x   =
  x x x      0 x x      0 0 x      0 0 0        0

The transformations are then applied to the data vector b. The solution is now found by solving Rx = c, where c is the first n elements of the transformed b.
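For the hill problem this procedure might be sketched as follows (assuming NumPy; np.linalg.qr returns the reduced factorization):

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)

Q, R = np.linalg.qr(A)     # Q: 6 x 3 with orthonormal columns, R: 3 x 3 upper triangular
c = Q.T @ b                # the first n elements of the transformed b
x = np.linalg.solve(R, c)  # back substitution would suffice since R is triangular
```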

Page 9:

QR for least squares -slide 4

Householder transformations of the form I − 2uuᵀ/uᵀu can easily eliminate whole columns.
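To illustrate, a tiny sketch (assuming NumPy) of one Householder transformation zeroing everything below the first entry of a vector:

```python
import numpy as np

def householder(a):
    """Return H = I - 2uu^T/u^Tu such that H @ a = (±||a||, 0, ..., 0)."""
    u = a.astype(float).copy()
    # Adding sign(a[0]) * ||a|| to u[0] avoids cancellation.
    u[0] += (1.0 if a[0] >= 0 else -1.0) * np.linalg.norm(a)
    return np.eye(len(a)) - 2.0 * np.outer(u, u) / (u @ u)

a = np.array([3.0, 4.0])
Ha = householder(a) @ a    # -> [-5, 0]
```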

If the A matrix is almost square the QR approach and the normal equations approach require about the same number of operations.

If A is tall and skinny, the QR approach takes about twice the number of operations.

Good project: Investigate structure of R if A is sparse.

Most good least squares solvers use the QR approach.

In Matlab: x= A \b.

Page 10:

Singular Value Decomposition

The singular value decomposition (SVD) of a matrix A is given by

A = USVᵀ

where U and V are orthogonal and S is a diagonal matrix of the form

S = diag(s₁, s₂, s₃, …)

If any of the sᵢ are 0, the matrix is singular. Thus one can determine how close A is to a singular matrix by looking at the smallest sᵢ.

Good project: Investigate an application of SVD
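For instance (a NumPy sketch), the singular values of the hill problem's 6 x 3 matrix A are 2, 2, 1, all well away from 0, so that A is comfortably full rank:

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# s is returned in decreasing order; s[-1] is the distance (in the 2-norm)
# from A to the nearest rank-deficient matrix.
```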

Page 11:

SVD for least squares

If A is an m x n matrix of rank n, then

                         S₁
  A = USVᵀ = [ U₁ U₂ ]        Vᵀ = U₁S₁Vᵀ
                         0

where U₁ has the first n columns of U and S₁ is n x n.

The solution to the least squares problem Ax ≈ b is given by

  x = V S₁⁻¹ U₁ᵀ b

Requires 4 to 10 times more work than QR but shows dependencies in the model.
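A sketch of this formula assuming NumPy, where full_matrices=False returns exactly U₁, the diagonal of S₁, and Vᵀ:

```python
import numpy as np

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)

U1, s1, Vt = np.linalg.svd(A, full_matrices=False)  # U1: m x n, s1: diagonal of S1
x = Vt.T @ ((U1.T @ b) / s1)                        # x = V S1^{-1} U1^T b
```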

Page 12:

Conjugate gradient

• Does not require a decomposition of the matrix

• Good for large sparse problems, like PET

• Iterative method that requires matrix-vector multiplication by A and Aᵀ each iteration

• In exact arithmetic, for n variables, guaranteed to converge in n iterations - so 2 iterations for the exponential fit and 3 iterations for the hill problem

• Does not zigzag

Page 13:

Conjugate gradient algorithm

Initialization:

  Guess x

  Set r = b − Ax, p = Aᵀr  (p will be a direction)

  Set γ = pᵀp

Until convergence:

  Set q = Ap, α = γ/qᵀq

  Reset x to x + αp

  Reset r to r − αq

  Set s = Aᵀr, γnew = sᵀs, β = γnew/γ, γ = γnew

  Reset p to s + βp
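The algorithm above translates almost line for line into NumPy (a sketch; the stopping test here is simply a fixed iteration count):

```python
import numpy as np

def cgls(A, b, iters):
    """Conjugate gradient for min ||Ax - b||, using only products with A and A^T."""
    x = np.zeros(A.shape[1])
    r = b - A @ x
    p = A.T @ r            # p will be a direction
    gamma = p @ p
    for _ in range(iters):
        if gamma == 0.0:   # already converged exactly
            break
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

# Hill problem: 3 variables, so at most 3 iterations in exact arithmetic.
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([2474, 3882, 4834, 1422, 2354, 950], dtype=float)
x = cgls(A, b, iters=3)
```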

Page 14:

Nonlinear example

Assume in the exponential example that the model was

b = u + w e^(−kt) with k added to the list of unknowns.

The variable k is nonlinear while u and w are linear.

Possible approaches:

• Treat all three variables as nonlinear and use a nonlinear solver

• Use a 1-dimensional solver for k; each time a function value is requested, solve for u and w with a linear least squares solver, plug in their best values, and return the residual
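A sketch of the second approach, assuming NumPy; the crude grid scan below is only a stand-in for a proper 1-dimensional solver:

```python
import numpy as np

t = np.arange(1, 11, dtype=float)
b = np.array([1.3, 0.85, 0.5, 0.4, 0.3, 0.2, 0.15, 0.14, 0.12, 0.11])

def best_residual(k):
    """For a fixed k, solve the *linear* problem for u, w; return the residual norm."""
    A = np.column_stack([np.ones_like(t), np.exp(-k * t)])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.linalg.norm(A @ x - b)

# Stand-in for a 1-dimensional minimizer: scan a grid of k values.
ks = np.linspace(0.1, 1.0, 91)
k_best = ks[np.argmin([best_residual(k) for k in ks])]
```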

Page 15:

One dimensional minimization of f(x)

Assumption:

Given an interval [a,b] where one “knows” there is a minimum, assume f is unimodal on [a,b], i.e. there is only one local minimum.

With this assumption we can find the minimum by sampling and discarding portions of the interval that cannot have the solution.

The best algorithms are combinations of golden section search and quadratic interpolation through 3 points, taking the minimum of the quadratic (Brent's method).
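A sketch of plain golden section search in Python (Brent's quadratic-interpolation refinement is omitted):

```python
import math

def golden_section(f, a, b, tol=1e-6):
    """Shrink [a, b] around the minimum of a unimodal f, reusing one evaluation per step."""
    r = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:          # minimum cannot lie in (x2, b]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = f(x1)
        else:                # minimum cannot lie in [a, x1)
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# The example traced on the next slide: -(2x^3 - 9x^2 + 12x + 6) on [0, 2], minimum at x = 1.
xmin = golden_section(lambda x: -(2*x**3 - 9*x**2 + 12*x + 6), 0.0, 2.0)
```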

Page 16:

Golden section on −(2x³ − 9x² + 12x + 6)

Originally a = 0, x1 = .7639, x2 = 1.2361, b = 2.0

Subsequent values of a and b

a b

.7639 2.0

.7639 1.5279

.7639 1.2361

.9443 1.2361

.9443 1.1246

.9443 1.0557

.9868 1.0557

.9868 1.0294 etc.

Page 17:

Unconstrained minimization of f(x) where x has n elements

• Steepest descent - requires first derivatives (gradient)

    might zigzag

    good beginning strategy

    sped up by conjugate gradient

• Newton - requires first and second derivatives (Hessian)

    requires solution of a linear system with n variables

    fast if close to solution

• Quasi-Newton (most practical) - requires first derivative

    no linear system to solve

    builds up an approximation to the inverse of the Hessian

Page 18:

Newton’s method for minimization

Let g = gradient, H = matrix of second partials.

Taylor's Theorem:

f(x+s) ≈ f(x) + gᵀs + 0.5 sᵀHs

This quadratic function in s is minimized when

s = −H⁻¹g

Algorithm: guess x

Until convergence

Solve H(x) s = −g(x)   {needs H and a solver}

Reset x to x + s
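A sketch of this loop in NumPy, on the quadratic f(x) = 0.5x₁² + 2.5x₂² used in the comparison slide later, for which g(x) = (x₁, 5x₂) and H = diag(1, 5); one Newton step lands exactly on the minimizer:

```python
import numpy as np

def grad(x):
    return np.array([x[0], 5.0 * x[1]])     # gradient of 0.5*x1^2 + 2.5*x2^2

def hess(x):
    return np.diag([1.0, 5.0])              # constant Hessian of the quadratic

x = np.array([5.0, 1.0])                    # guess x
for _ in range(20):                         # "until convergence"
    s = np.linalg.solve(hess(x), -grad(x))  # needs H and a linear solver
    x = x + s
    if np.linalg.norm(s) < 1e-12:
        break
```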

Page 19:

Quasi-Newton Method

• Builds up an approximation to the inverse Hessian in directions that have been searched

• Almost as fast as Newton

Initial: pick x, set B = I.

Until convergence:

  set s = −Bg  (no linear system to solve)

  set xnew = x + s

  let γ = g(xnew) − g(x); δ = xnew − x; x = xnew

  reset B to B + δδᵀ/(δᵀγ) − Bγ(Bγ)ᵀ/(γᵀBγ)
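A NumPy sketch of this loop, with the plain step s = −Bg and the rank-two update exactly as written on the slide; a robust code would add a line search:

```python
import numpy as np

def quasi_newton(g, x, iters=100, tol=1e-10):
    """Minimize via the update above: B approximates the inverse Hessian."""
    B = np.eye(len(x))
    gx = g(x)
    for _ in range(iters):
        s = -B @ gx                  # no linear system to solve
        x_new = x + s
        g_new = g(x_new)
        gamma = g_new - gx           # change in gradient
        delta = x_new - x            # change in position
        x, gx = x_new, g_new
        if np.linalg.norm(gx) < tol:
            break
        Bg = B @ gamma
        B = B + np.outer(delta, delta) / (delta @ gamma) \
              - np.outer(Bg, Bg) / (gamma @ Bg)
    return x

# On f = 0.5*x1^2 + 2.5*x2^2 (gradient [x1, 5*x2]) this converges in a handful of steps.
x = quasi_newton(lambda v: np.array([v[0], 5.0 * v[1]]), np.array([5.0, 1.0]))
```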

Page 20:

Comparison of steepest descent, Newton, and quasi-Newton on f(x) = 0.5x₁² + 2.5x₂²

Iteration   Steepest         Newton         Quasi-Newton
0           5.00    1.00     5.00   1.00     5.00   1.00
1           3.333  -0.667    0.00   0.00     0.00  -4.00
2           2.222   0.444                   -2.222  0.444
3           1.481  -0.296                    0.816  0.082
4           0.988   0.198                   -0.009 -0.015
5           0.658  -0.132                   -0.001  0.001
6           0.439   0.088
7           0.293  -0.059

Page 21:

Large Scale Problems: Conjugate Gradient vs. Limited Memory Quasi-Newton

Conjugate gradient - each step is a linear combination of the previous step and the current gradient

Limited Memory (Nocedal, Schnabel, Byrd, Kaufman) - do not multiply B out but keep the vectors; need to keep 2 vectors per iteration. After k steps (k is about 5) reset B to I and start again.

Experimentation favors LMQN over CG.

Good project: How should LMQN be done?