Linear Least Squares Fitting with Microsoft Excel
by: Antony Kaplan
04/13/10


Abstract

Microsoft Excel is a popular spreadsheet program used widely in industry, academia, and education. Whether in a high school physics classroom or in the accounting departments of large Wall Street firms, people rely on Microsoft Excel to give them accurate results. One of the most used functions of Excel is least squares fitting, or finding a best-fitting curve for a given set of points by minimizing the squared error. This paper explores four different ways in which a user can calculate a least squares linear fit with Excel, and analyzes how the four methods perform on mostly ill-conditioned data. We compare Excel's results to those of another popular program, Matlab.

The "Trendline" Function

Suppose we have 6 datasets of the form:

    x             y
    1 + 41/2^p    1
    1 + 42/2^p    2
    1 + 43/2^p    3
    1 + 44/2^p    4
    1 + 45/2^p    5
    1 + 46/2^p    6
    1 + 47/2^p    7
    1 + 48/2^p    8

where p ∈ {25, 26, 27, 28, 51, 52}. The task is seemingly simple: find a least squares linear fit for the 6 datasets. Let us try Excel and see how it fares with the task. First inputting the data into a spreadsheet, then plotting it on X-Y scatter plots, and finally using the Excel function "Add Trendline", we get the results shown in Figures 1-6 for the least squares linear fits of the datasets:

FIGURE 1: Trendline fit for the p = 25 dataset

FIGURE 2: Trendline fit for the p = 26 dataset

FIGURE 3: Trendline fit for the p = 27 dataset

FIGURE 4: Trendline fit for the p = 28 dataset

FIGURE 5: Trendline fit for the p = 51 dataset

FIGURE 6: Trendline fit for the p = 52 dataset

[Each figure is an X-Y scatter plot of one dataset with Excel's fitted trendline and its displayed equation and R² value; the chart text did not survive extraction.]


Something strange is happening here. The only dataset for which Excel seems to perform well is p = 25; it gives a linear fit with an R² value of 1.0. R² is known in statistics as the coefficient of determination, whose value (which ranges from 0 to 1) describes the "goodness of fit" of the model; more specifically, it is the proportion of the variability in the data explained by the model over the total variability of the data. So the R² that Excel shows in Figure 1 denotes that it found a perfect linear fit. Analytically, we can confirm that this should be the case; the points are evenly spaced on both the x-axis and the y-axis, so they should all fall on one line. In fact this is true of all 6 datasets.

For the datasets p = 26 and p = 27, Excel obtains fits which it claims have R² values greater than 1; mathematically this is meaningless. For datasets with 28 ≤ p ≤ 51, Excel (without any warning message) refuses to give a linear fit and display an equation or an R² value (Figures 4 and 5 show the endpoints of this interval). We can attempt to explain this behavior by saying that Excel cannot distinguish between the x-values for 28 ≤ p ≤ 51, that is, that the numbers 1 + 41/2^p and 1 + 42/2^p are equal in Excel, and that it therefore refuses to fit a linear function to the data because a vertical line is not a mathematical function. This explanation is contradicted by the fact that for p = 28 Excel can clearly distinguish between the x-values (refer to the plot of Figure 4). Furthermore, for p = 52 Excel no longer has any problems displaying the equation and R² of the linear fit (although it displays no actual line on the plot). For p > 52, Excel's behavior becomes unpredictable: for some values of p it displays the equation and R² value, while for others it does not. It seems that this behavior should be added to the long list of mysterious behaviors that plague Excel.

On another interesting note, with the Trendline function a user can choose to display up to 99 decimal places in the values of the slope, intercept, and R²! However, Excel can actually only display up to 15 significant decimal digits [2]. Although this bug may seem harmless, significant digits are of great importance in scientific computation (e.g., uncertainty analysis).


"By Hand"

Unsatisfied with the results we got from the "Trendline" function in Excel, we can take another approach to calculating the linear least squares fit for our data: input our own formulas into the Excel spreadsheet to calculate the fit. In any introductory statistics textbook [5], we can find that the formulas for the slope and intercept of the linear least squares fit are:

    b2 = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]

    b1 = [(Σy)(Σx²) - (Σx)(Σxy)] / [n(Σx²) - (Σx)²]

where b2 and b1 are the slope and y-intercept respectively, n is the number of data points, and the sums run over the data points (x, y). We implement these formulas in steps as follows:

(1) Calculate the intermediate products (e.g. xy, x²)
(2) Calculate the intermediate sums (e.g. Σx, Σx², Σy, Σxy)
(3) Calculate the numerators, n(Σxy) - (Σx)(Σy) and (Σy)(Σx²) - (Σx)(Σxy)
(4) Calculate the denominator, n(Σx²) - (Σx)²
(5) Calculate b1 and b2 from (3) and (4), as in the sketch below
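As a minimal sketch of steps (1)-(5), here is the same computation written as Matlab code; the paper performs these steps with spreadsheet formulas in Excel, so Matlab serves only as executable pseudocode here, and the variable names (Sx, Sxy, den, etc.) are ours:

    % Steps (1)-(5) of the textbook formulas, for dataset p
    p   = 25;                    % any of 25, 26, 27, 28, 51, 52
    x   = 1 + (41:48)'/2^p;      % the eight x-values of dataset p
    y   = (1:8)';                % the corresponding y-values
    n   = numel(x);
    Sx  = sum(x);    Sy  = sum(y);       % intermediate sums
    Sxy = sum(x.*y); Sx2 = sum(x.^2);    % products and their sums
    den = n*Sx2 - Sx^2;                  % the common denominator
    b2  = (n*Sxy - Sx*Sy) / den;         % slope
    b1  = (Sy*Sx2 - Sx*Sxy) / den;       % intercept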

Having calculated b2 and b1, we can calculate R² using the following formula:

    R² ≡ 1 - SSerr/SStot

    SStot = Σ_i (y_i - ȳ)²

    SSerr = Σ_i (y_i - f_i)²

where ȳ is the mean of the observed data (all data points y), the f_i are the values predicted by the linear fit, i.e. f_i = b2·x_i + b1, and the index i runs over all data points. The formulas for R² are similarly implemented in steps, as in the continuation of the sketch below.
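Continuing the Matlab sketch above (again as stand-in for the spreadsheet steps, using b1, b2, x, and y from the previous snippet):

    % R^2 from SSerr and SStot
    f     = b2*x + b1;                  % values predicted by the fit
    SSerr = sum((y - f).^2);
    SStot = sum((y - mean(y)).^2);
    R2    = 1 - SSerr/SStot;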

Here are the results for the 6 datasets:

    Dataset (p)    b2            b1             R²
    25             33 554 432    -33 554 472    1
    26             #DIV/0!       #DIV/0!        #DIV/0!
    27             #DIV/0!       #DIV/0!        #DIV/0!
    28             #DIV/0!       #DIV/0!        #DIV/0!
    51             #DIV/0!       #DIV/0!        #DIV/0!
    52             #DIV/0!       #DIV/0!        #DIV/0!

Our fit performs even worse than Excel's native "Trendline" function! The results for the p = 25 dataset are the same as those produced by Trendline (refer to Figure 1); however, for every other dataset we get a "division by zero" error. The division by 0 occurs when calculating the slope and intercept: the denominator n(Σx²) - (Σx)² is calculated to be 0 for the other 5 datasets. Where is our error? Can we fix this?

It turns out that the error is not our own but Excel's! Prof. Velvel Kahan found that for some computations in Excel, an extra set of parentheses around the entire expression (which analytically does not change the value of the expression) changed the result of the computation [2]. Can a few parentheses really change the results of our fit? After placing an extra set of parentheses around the formula of each of the steps taken above to calculate b2, b1, and R², the results were the following:

    Dataset (p)    b2                    b1                     R²
    25             33 554 432.0000000    -33 554 472.0000000    1.000000000000000
    26             70 464 307.2000000    -70 464 349.6000000    0.991666667198851
    27             176 160 768.000000    -176 160 824.000000    0.0673363095238095
    28             #DIV/0!               #DIV/0!                #DIV/0!
    51             #DIV/0!               #DIV/0!                #DIV/0!
    52             0.000000000000000     8.00000000000000       -2.33333333333333

Every slope and intercept in the table above matches those produced by the Trendline function! Datasets in the interval between p = 28 and p = 51 still produce the #DIV/0! error, but those are exactly the datasets for which Trendline could not compute the linear fit. Notice, however, that the R² values differ for the datasets p = 26, 27, 52. This is because Excel uses another, less general but (in the case of simple linear regression) analytically equivalent formula to calculate R², namely:

    R² = SSreg/SStot

where SSreg = Σ_i (f_i - f̄)² and f̄ is the mean of the values predicted by the model. Implementing this formula in the spreadsheet, we reproduce exactly those values of R² that the Trendline function produced. That two analytically equivalent forms of R² produce drastically different results hints at a large numerical instability in our algorithms and at the ill-conditioning of our data.
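For completeness, the same quantity in the terms of the Matlab sketch above (this assumes f and SStot from the earlier snippets):

    % Excel's regression-sum-of-squares form of R^2
    SSreg  = sum((f - mean(f)).^2);
    R2_alt = SSreg / SStot;   % analytically equals 1 - SSerr/SStot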

Had it not been for Kahan's paper, such a bug would never have been considered, and hence never discovered, in our implementation of the linear fit. The problem that the linear fit fails in the interval 28 ≤ p ≤ 51 still persists in our implementation; this could be a limitation of our method in conjunction with finite-precision arithmetic.


Reformulating the Least Squares Problem

Let us reformulate the problem of a linear least squares fit in matrix notation. Consider the 8 × 2 matrix A:

    A = | 1    1 + 41/2^p |
        | 1    1 + 42/2^p |
        | 1    1 + 43/2^p |
        | 1    1 + 44/2^p |
        | 1    1 + 45/2^p |
        | 1    1 + 46/2^p |
        | 1    1 + 47/2^p |
        | 1    1 + 48/2^p |

the 2 × 1 vector x:

    x = (b1, b2)^T

and the 8 × 1 vector b:

    b = (1, 2, 3, 4, 5, 6, 7, 8)^T

Our task is to find a vector x that minimizes the squared Euclidean norm of the residual r = b - Ax, or: min_x ||b - Ax||₂². We can extend this to the general case of A of size m × n (m ≥ n), x of size n × 1, and b of size m × 1.

The most straightforward approach to solving the least squares problem is called the method of Normal Equations. The derivation of the method is as follows:

We can define the residual as a vector function of x,

    r(x) = b - Ax

We are trying to minimize the squared Euclidean norm of the residual, or:

    E(x) = ||b - Ax||₂² = Σ_{i=1}^m r_i²(x)

To minimize E(x) we need to find an x such that the gradient of E(x) is zero, that is, the partial derivative with respect to each x_j is zero:

    ∂E(x)/∂x_j = 0 = 2 Σ_{i=1}^m r_i (∂r_i/∂x_j)

By definition, r_i = b_i - (Ax)_i = b_i - Σ_{j=1}^n A_ij x_j, and so:

    ∂r_i/∂x_j = ∂/∂x_j (b_i - Σ_{k=1}^n A_ik x_k) = -A_ij

and so:

    ∂E(x)/∂x_j = 0 = 2 Σ_{i=1}^m r_i (∂r_i/∂x_j) = -2 Σ_{i=1}^m A_ij (b_i - Σ_{k=1}^n A_ik x_k) = -2 Σ_{i=1}^m A_ij b_i + 2 Σ_{i=1}^m Σ_{k=1}^n A_ij A_ik x_k = 0

So we just have to solve, for each j:

    Σ_{i=1}^m Σ_{k=1}^n A_ij A_ik x_k = Σ_{i=1}^m A_ij b_i    (1)

which in matrix notation is:

    (A^T A) x = A^T b    (2)

These are the normal equations.

The equations for the slope (b2) and y-intercept (b1) of the linear fit that we used in the previous section are derived directly from the Normal Equations, by expanding equation (1) above.
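As a small illustration (a sketch under the dataset definition above, not the paper's spreadsheet implementation), equation (2) can be solved directly in Matlab:

    % Solve the normal equations (A^T A) x = A^T b for dataset p
    p = 25;
    A = [ones(8,1), 1 + (41:48)'/2^p];
    b = (1:8)';
    x = (A'*A) \ (A'*b);    % x = [b1; b2]

As the next section shows, forming A'*A squares the condition number, so this direct solve inherits the instability of the normal equations.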


Conditioning and the Pseudo-Inverse

It is fairly straightforward to show that the method of Normal Equations is numerically unstable. Suppose that the matrix A (also called the Vandermonde matrix) is ill-conditioned, that is:

    cond(A) >> 1

To solve the linear least squares problem using the method of normal equations we have to solve the following system:

    (A^T A) x = A^T b

and we can show that:

    cond(A^T A) ≈ cond(A)²

and so:

    cond(A^T A) >> cond(A) >> 1

Clearly, there is large growth in the algorithm. The method of Normal Equations is known to be numerically unstable and to perform poorly on ill-conditioned problems. Perhaps this is the reason why the Excel implementation of the linear fit fails in the interval 28 ≤ p ≤ 51.

In the discussion above we referred to the condition number of the matrix A, cond(A). For a square matrix, the condition number is defined as:

    cond(A) = ||A|| · ||A^-1||

But A is an m × n matrix with m ≥ n; that is, A is in general rectangular. If A is non-square, it is not invertible. How, then, do we calculate the condition number of A?

We introduce the concept of a generalized inverse, more specifically the Moore-Penrose pseudoinverse A^+. For a general m × n matrix A, A^+ is the n × m matrix with the following properties [1]:

(1) A A^+ A = A
(2) A^+ A A^+ = A^+
(3) (A A^+)* = A A^+
(4) (A^+ A)* = A^+ A

where * denotes the conjugate transpose.

For general m × n matrices A we define the condition number to be:

    cond(A) = ||A|| · ||A^+||

Using this definition we can find how ill-conditioned each of our 6 datasets is:

    Dataset matrix A_p    cond(A_p)
    A_25                  2.928874824058564e+07
    A_26                  5.857745763834818e+07
    A_27                  1.171548756109411e+08
    A_28                  2.343097090872832e+08
    A_51                  1.925820207179830e+15
    A_52                  3.404401319607318e+15

It is interesting to note that the condition numbers of the datasets seem to follow the general trend:

    cond(A_{p+1}) / cond(A_p) ≈ 2

Our Excel-implemented solution stops giving meaningful results at around p = 26 and breaks down completely for p > 27. We must remember, however, that since we are essentially using the method of normal equations to solve the least squares problem, our condition number is roughly squared, and so we can say our solution breaks down at about:

    cond(A_26^T A_26) = 3.431318543372478e+15 ≈ cond(A_26)²
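This check is easy to reproduce (a sketch; Matlab's cond() uses the 2-norm, the ratio of the largest to the smallest singular value, which is well defined for rectangular A):

    % Conditioning of A and of the normal-equations matrix A^T A
    p = 26;
    A = [ones(8,1), 1 + (41:48)'/2^p];
    cond(A)       % about 5.86e7
    cond(A'*A)    % about cond(A)^2, i.e. 3.43e15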


QR Decomposition

Since our datasets are substantially ill-conditioned, it would be wise to use a more numerically stable algorithm to solve the linear least squares problem. One such algorithm is implemented using QR decomposition (or factorization). With QR decomposition, we can factor a general m × n matrix A into the product of an m × m orthogonal matrix Q and an m × n matrix R partitioned into two parts: an n × n upper triangular block R_n, and an (m-n) × n block of zeros. We can express this mathematically as:

    A = Q | R_n |
          |  0  |

Using this factorization we can derive a solution to the linear least squares problem. The residual, as previously defined, is:

    r = b - Ax

We can multiply the expression above by Q^T to get:

    Q^T r = Q^T b - Q^T A x = Q^T b - (Q^T Q) R x = Q^T b - R x

Partitioning Q^T r into the first n and the last m - n components, with c1 = (Q^T b)_n - R_n x and c2 = (Q^T b)_{m-n}:

    ||r||₂² = r^T r = r^T Q Q^T r = (c1; c2)^T (c1; c2) = ||c1||₂² + ||c2||₂²

Since the value of x does not affect the value of c2, to minimize ||r||₂² we must choose an x that minimizes c1, that is, makes it equal to zero. Therefore the equation governing the solution of the linear least squares problem becomes:

    (Q^T b)_n - R_n x = 0

or:

    R_n x = (Q^T b)_n

To solve the linear least squares problem, we must solve the equation above. The QR factorization is accomplished via orthogonal transformations which, by definition, preserve Euclidean norms. Therefore, we expect the algorithm to be more numerically stable than the method of Normal Equations, because there is no growth in the QR algorithm; that is:

    cond(A) ≈ cond(R_n)
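A minimal sketch of this solve, assuming A and b as built in the earlier snippet and using Matlab's economy-size QR factorization rather than the full factorization written above:

    % Least squares via QR: solve R_n x = (Q^T b)_n
    [Q, R] = qr(A, 0);     % Q is 8x2 with orthonormal columns, R is 2x2
    x = R \ (Q' * b);      % x = [b1; b2] by back-substitution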


LINEST

Starting with the 2003 version, the developers of Excel implemented a least squares algorithm using QR decomposition in a function called "LINEST" (before Excel 2003, LINEST used the method of Normal Equations). Instead of the R² value, to evaluate the "goodness of fit" we provide the squared Euclidean norm of the residual r = b - Ax, that is, the very thing we are trying to minimize with the least squares solution. Using LINEST we get the following results for our 6 datasets:

    Dataset (p)    b2                   b1                  ||r||₂²
    25             0.000000000000000    4.50000000000000    42.000000000000000
    26             0.000000000000000    4.50000000000000    42.000000000000000
    27             0.000000000000000    4.50000000000000    42.000000000000000
    28             0.000000000000000    4.50000000000000    42.000000000000000
    51             0.000000000000000    4.50000000000000    42.000000000000000
    52             0.000000000000000    4.50000000000000    42.000000000000000

An algorithm that is supposed to be more numerically stable and perform better on ill-conditioned problems gives the same bad linear fit for all 6 of our datasets! What is wrong here? Is it the QR algorithm that is causing the problem, or Excel's implementation?

To compare results we can use Matlab's implementation of QR decomposition to find the linear least squares fits for our datasets. Here are the results produced by Matlab's implementation of linear least squares with QR decomposition:

    Dataset (p)    b2                      b1                       ||r||₂²
    25             3.355443200800436e7     -3.355447200800437e7     3.3307e-16
    26             6.710886380847121e7     -6.710890380847107e7     1.998401444325282e-15
    27             1.342177264337416e8     -1.342177664337410e8     5.595524044110790e-14
    28             2.684354541191144e8     -2.684354941191140e8     7.105427357601003e-15
    51             4.499999999999912       0                        41.999999999999844
    52             4.499999999999957       0                        41.999999999999908

The Matlab-implemented algorithm performs very well: it gives good linear fits up to p = 51. On the other hand, Excel's implementation gives meaningful fits only up to p = 23. Furthermore, the fits that it gives for p > 23 have a larger residual than any fit given by Matlab. So the developers of Excel implemented a more numerically stable QR algorithm to perform better on ill-conditioned data, yet implemented it in such a way that it performs worse than the most naive algorithms (even Excel's own Trendline).
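For reference, the Matlab comparison amounts to a one-liner (a sketch, with A and b as before; for rectangular systems the backslash operator uses a QR-based least squares solver):

    % Matlab's built-in least squares solve and its squared residual
    x   = A \ b;              % x = [b1; b2]
    res = norm(b - A*x)^2;    % ||r||_2^2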


Rank Deficiency

When using QR to solve linear least squares on datasets with p ≥ 51, Matlab warns the user with the following message: "Warning: Rank deficient, rank = 1". A matrix M with n columns is rank deficient if rank(M) < n, that is, if it has fewer than n linearly independent columns. Indeed, for p ≥ 51 Matlab considers the matrix A defined above to have at most 1 linearly independent column (the operation rank() on the matrix returns 1). It is interesting to note that for square matrices the condition number is a measure of how close the matrix is to being singular, while for general rectangular matrices it is a measure of how close the matrix is to being rank deficient. According to the Matlab documentation, for rank deficient matrices M, linear least squares implemented with QR decomposition no longer returns the minimal-length solution, as it is bound by the semantics of QR factorization to return a solution with rank(M) non-zero values.
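This rank behavior is easy to check (a sketch; Matlab's rank() counts singular values above a default tolerance):

    % Numerical rank of the design matrix for dataset p
    p = 51;
    A = [ones(8,1), 1 + (41:48)'/2^p];
    rank(A)    % returns 1 for p >= 51 with the default tolerance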

Meanwhile, the method of Normal Equations breaks down completely for rank deficient matrices A. The solution for x from the Normal Equations is:

    x = (A^T A)^-1 A^T b

If A is rank deficient, then A^T A is singular, and (A^T A)^-1 does not exist. This is the reason that our implementations of the method of Normal Equations for the linear least squares problem begin to break down at high condition numbers: the matrix A becomes numerically rank deficient. Is there any way to calculate a linear least squares solution for rank deficient matrices?


Pseudoinverse and the SVD

Yes! The way to calculate a linear least squares solution for rank deficient matrices is by using the Singular Value Decomposition. Before getting into the details of the algorithm, let's first look at the motivation.

Notice that from the normal equations (eq. 2 above) we can find an expression for the solution x,

    x = (A^T A)^-1 A^T b

and the n × m matrix (A^T A)^-1 A^T is actually A^+, the pseudoinverse of A: it satisfies each of the four properties listed above. We will show that the four properties hold:

(1) A A^+ A = A (A^T A)^-1 A^T A = A ((A^T A)^-1 (A^T A)) = A    QED

(2) A^+ A A^+ = ((A^T A)^-1 A^T) A ((A^T A)^-1 A^T) = ((A^T A)^-1 (A^T A)) (A^T A)^-1 A^T = (A^T A)^-1 A^T = A^+    QED

(3) (A A^+)* = (A A^+)^T = (A^+)^T A^T = ((A^T A)^-1 A^T)^T A^T = A ((A^T A)^-1)^T A^T = A ((A^T A)^T)^-1 A^T = A (A^T A)^-1 A^T = A A^+    QED
(Note: since in our case we are working over the reals, we can say that M* = M^T.)

(4) (A^+ A)* = (A^+ A)^T = A^T (A^+)^T = A^T ((A^T A)^-1 A^T)^T = A^T A ((A^T A)^T)^-1 = (A^T A) (A^T A)^-1 = I = (A^T A)^-1 (A^T A) = A^+ A    QED

Since (A^T A)^-1 A^T satisfies the four properties above, it is indeed the pseudoinverse of A. In a sense, the problem of linear least squares is reduced to finding the pseudoinverse of the Vandermonde matrix A.

With the Singular Value Decomposition, we can factor a matrix A as follows:

    A = U S V^T, where S = diag(s_1, s_2, ..., s_p)

The s_i satisfy s_1 ≥ s_2 ≥ ... ≥ s_p ≥ 0 and are called the singular values of A. The columns of U are called the left singular vectors, and the columns of V are called the right singular vectors. Using the SVD, we can calculate the pseudoinverse A^+ by the relation:

    A^+ = V S^+ U^T, where S^+ = diag(1/s_1, 1/s_2, ..., 1/s_n) for all nonzero s_i.

If s_i = 0, then we set the corresponding entry of S^+ to 0 [1].

What makes the SVD so powerful is that it allows us to manually "fiddle" with the singular values. For instance, if a matrix is rank deficient, at least one of its singular values is zero; hence, when calculating the pseudoinverse, we set the corresponding entry of S^+ to zero. By "manually" changing the singular value, we avoid the erroneous infinite result we would have gotten from the division by zero. The SVD is powerful for nearly rank-deficient matrices as well: we can set a certain threshold value and ensure that if a singular value is smaller than the threshold, we set it to zero (this is the way the algorithm is implemented in Matlab). By setting small singular values to zero, we are essentially making the matrix less ill-conditioned (one of the definitions of the condition number is s_1/s_n, where s_1 and s_n are the largest and smallest non-zero singular values, respectively).
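A minimal sketch of this truncated-SVD pseudoinverse, with A and b as before; the threshold tol here is our assumption (Matlab's pinv() uses a comparable default based on machine epsilon and the matrix size):

    % Pseudoinverse via the SVD, zeroing singular values below tol
    [U, S, V] = svd(A, 'econ');
    s    = diag(S);
    tol  = max(size(A)) * eps(max(s));    % assumed cutoff for "zero"
    sinv = zeros(size(s));
    sinv(s > tol) = 1 ./ s(s > tol);      % invert only significant s_i
    Aplus = V * diag(sinv) * U';          % A+ = V S+ U^T
    x = Aplus * b;                        % minimum-norm least squares fit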

Using Matlab's pinv() routine, we find the least squares linear fit for our datasets:

    Dataset matrix A_p    b2                      b1                       ||r||₂²
    A_25                  3.355443195571761e7     -3.355447195571753e7     2.164934898019056e-15
    A_26                  6.710886391143523e7     -6.710890391143516e7     1.776356839400251e-15
    A_27                  1.342177268800614e8     -1.342177668800610e8     5.240252676230738e-14
    A_28                  2.684354499888867e8     -2.684354899888856e8     7.105427357601003e-14
    A_51                  2.249999999999956       2.250000000000000        41.999999999999080
    A_52                  2.249999999999977       2.249999999999999        41.999999999999571

Unfortunately, unlike any serious statistical or mathematical software, Excel does not have its own routine to calculate the SVD or the pseudoinverse. We were, however, able to find an open-source macro called Biplot [3] which is capable of calculating the SVD. Using Excel with the Biplot macro, we find roughly the same (not very good) fit for each of our 6 datasets:

    Dataset matrix A_p         b2                   b1                   ||r||₂²
    A_25,26,27,28,51,52        2.249997055159946    2.250000039113575    41.999999999999834


Conclusion

In the course of this paper, we have explored 4 different methods for calculating the least squares linear fit in Excel. Using Excel's native Trendline function, we found meaningful results only for datasets with p ≤ 25 (or cond(A) ≤ 2.928874824058564e+07). We then implemented commonly used statistical formulas for the least squares linear fit in the spreadsheet, and essentially reproduced the results of Excel's Trendline function. We confirmed that both of these fits use the method of Normal Equations, which we showed to be numerically unstable (it has large growth) and hence to perform poorly on ill-conditioned data.

We went on to discuss a more numerically stable algorithm, which calculates the least squares linear fit using QR decomposition. Using Matlab, we were able to get reasonably good fits for datasets with p < 51 (or cond(A) < 1.925820207179830e+15), at which point Matlab considered our matrix A to be rank deficient. Unfortunately, Excel's built-in LINEST function, which supposedly uses QR decomposition to give a linear fit, does not fare as well. LINEST gives good fits for datasets with p ≤ 23. For p > 23, LINEST returns the same (very poor) fit. LINEST thus performs even worse than the first algorithms we discussed (e.g. Trendline), which are just naive implementations of the method of Normal Equations.

Finally, we discussed how to solve the least squares linear fit for the most ill-conditioned problems (those that are, or nearly are, rank deficient) using the concept of the pseudoinverse and the Singular Value Decomposition. Using Matlab, we were able to use the SVD to find reasonable linear fits for our most ill-conditioned datasets with p = 51, 52 (cond(A) ≈ 10¹⁵). Using the method of SVD, we find a linear fit with a slightly smaller (but comparable) residual than QR decomposition. Since Excel does not implement a native routine to calculate the SVD, we found an open-source macro which did the job. However, just as with LINEST, the macro only gives meaningful fits on datasets with p < 25. For p > 25, the macro returns roughly the same linear fit for all datasets, with a residual that is greater than the largest residual returned by Matlab's QR and SVD algorithms for any of the datasets.
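
To see why Matlab treats the p = 51 problem as rank deficient, one can inspect the singular values directly; the following is a sketch, and the magnitudes in the comments are approximate rather than guaranteed:

    x51 = 1 + (41:48)' / 2^51;             % dataset A51
    A51 = [ones(8,1), x51];
    s = svd(A51);                          % s(1) is O(1); s(2) is on the order of 1e-15
    tol = max(size(A51)) * eps(max(s));    % pinv's default drop tolerance
    numerical_rank = sum(s > tol)          % expected: 1 (the tiny singular value is dropped)
    b = pinv(A51) * (1:8)'                 % ≈ [2.25; 2.25], with ||r||^2 ≈ 42, as in the table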

It is hard to say why Excel performs so poorly without delving deep into the source code, to which we of course do not have access. Even over the course of this project we discovered at least two substantial bugs in the software. One has to do with the fact that Excel displays up to 99 decimal digits in the Trendline function (30 digits elsewhere), while it really only carries 15 significant decimal digits of precision. The other, more serious, bug has to do with Excel evaluating two mathematically equivalent expressions differently because of an extra set of parentheses surrounding the expression. This second bug caused our code to fail, producing "#DIV/0!" for all datasets with p > 25; without being aware of this bug (the only place we found it mentioned is in Prof. Kahan's paper [2]), the code would have been virtually undebuggable. Perhaps all of these bugs stem from the fact that Excel tries to make its arithmetic appear decimal rather than binary; one direct side effect of this is that Excel can only use 15 significant digits of precision as opposed to 17. Perhaps developers (both at Microsoft and elsewhere) have caught on to the fact that Excel does not handle floating point arithmetic well (especially on very ill-conditioned data), and have therefore written their algorithms to refuse to execute on ill-conditioned data, avoiding unpredictable results (this could be the case with LINEST and the open-source macro discussed above). In any case, one thing is certain: unless your data is very well conditioned, you should not use Excel to find a least squares linear fit.


Bibliography

(1) Burdick, Prof. Joel. "The Moore-Penrose Pseudo Inverse." Web. <http://robotics.caltech.edu/~jwb/courses/ME115/handouts/pseudo.pdf>.

(2) Kahan, William. "How Futile Are Mindless Assessments of Roundoff in Floating-Point Computation?" Web. <http://www.cs.berkeley.edu/~wkahan/Mindless.pdf>.

(3) Lipkovich, Ilya, and Eric P. Smith. "Biplot and Singular Value Decomposition Macros for Excel." Virginia Tech Department of Statistics. Web. <http://filebox.vt.edu/artsci/stats/vining/keying/biplot.doc>.

(4) Markovsky, Prof. Ivan. "Least Squares and Singular Value Decomposition." University of Southampton. Web. <http://users.ecs.soton.ac.uk/im/bari08/svd.pdf>.

(5) Taylor, John R. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. Sausalito, Calif.: University Science, 1997. Print.
