
PARALLEL VARIABLE DISTRIBUTION FOR TOTAL LEAST

SQUARES

HONGBIN GUO AND ROSEMARY A. RENAUT

Abstract. A novel parallel method for determining an approximate to-

tal least squares (TLS) solution is introduced. Based on domain dis-

tribution, the global TLS problem is partitioned into several dependent

TLS subproblems. A convergent algorithm using the parallel variable

distribution technique (Ferris and Mangasarian, 1994) is presented. Nu-

merical results support the development and analysis of the algorithms.

Keywords: Total least squares, Variable distribution, Errors-in-variables

regression.

AMS(MOS) subject classification: 65F20, 90C06, 90C26, 90C30

1. Introduction

Total least squares (TLS), which has existed in statistics under the name “errors-

in-variables regression” for a long time as a natural generalization of least squares

(LS), was introduced to numerical specialists in 1973 by Golub ([7]). In the last

two decades, the TLS problem has been researched from many different numerical

perspectives, [14, 2, 8, 5, 16, 13, 20], excepting the development of a parallel algo-

rithm. Among these references, [14] includes a complete introduction and analysis

of basic algorithms. Here, we consider large-scale well-posed problems; for ill-posed

TLS problems refer to [8, 5, 13, 20].

In contrast to the standard LS model, in the TLS model both the matrix A and the right-hand side b in the overdetermined linear system Ax ≈ b, A ∈ R^{m×n}, m > n, are assumed to contain errors. The TLS solution x_TLS solves the following optimization problem:

(1.1)   min_{E,f} ‖[E, f]‖²_F   subject to   (A + E)x = b + f,

Date: September 10, 2004.

This work was partially supported by the Arizona Center for Alzheimer’s Disease

Research, by NIH grant EB 2553301 and for the second author by NSF CMG-02223.

Email:hb [email protected].

Email:[email protected].


where ‖·‖_F denotes the Frobenius norm. If UΣV^T is a singular value decomposition of the augmented matrix [A, b], the smallest singular value σ_{n+1} is simple and v(n+1, n+1) ≠ 0 (the generic case), then x_TLS = −v(1:n, n+1)/v(n+1, n+1) and [E, f] = −σ_{n+1} u_{n+1} v_{n+1}^T, where u_i, v_i are the columns of the matrices U and V, respectively. Moreover, x_TLS satisfies

(1.2)   f = (A x_TLS − b) / (1 + ‖x_TLS‖²),

(1.3)   E = −f · x_TLS^T,

and minimizes the sum of squared normalized residuals,

(1.4)   x_TLS = argmin_x φ(x) = argmin_x ‖Ax − b‖² / (1 + ‖x‖²),

[10, 14]. Here φ is the Rayleigh quotient (RQ) of the matrix [A, b]^T [A, b]. For convenience we denote the TLS problem for the augmented matrix [A, b] by TLS(A, b) and say that x_TLS solves the problem TLS(A, b).
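For reference, both the SVD characterization and the Rayleigh quotient formulation (1.4) can be realized in a few lines. The following is a minimal NumPy sketch (our own illustration, not the authors' code; the helper names tls_svd and phi are ours) that computes x_TLS from the right singular vector of [A, b] and evaluates φ.

```python
import numpy as np

def tls_svd(A, b):
    """Generic TLS solution of Ax ~ b via the SVD of the augmented matrix [A, b].

    Assumes the smallest singular value of [A, b] is simple and that the last
    component of the corresponding right singular vector is nonzero (the
    generic case described in the text).
    """
    Ab = np.column_stack([A, b])
    _, s, Vt = np.linalg.svd(Ab, full_matrices=False)
    v = Vt[-1, :]                       # right singular vector for sigma_{n+1}
    if abs(v[-1]) < np.finfo(float).eps:
        raise ValueError("nongeneric problem: v(n+1, n+1) is numerically zero")
    x_tls = -v[:-1] / v[-1]
    return x_tls, s[-1] ** 2            # sigma_{n+1}^2 = phi(x_tls)

def phi(A, b, x):
    """Rayleigh quotient objective (1.4): ||Ax - b||^2 / (1 + ||x||^2)."""
    r = A @ x - b
    return (r @ r) / (1.0 + x @ x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 5))
    b = A @ rng.standard_normal(5) + 0.01 * rng.standard_normal(30)
    x, phimin = tls_svd(A, b)
    assert abs(phi(A, b, x) - phimin) < 1e-12    # (1.4) attains sigma_{n+1}^2
```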

For a large-scale problem, the singular value decomposition (SVD) calculation

is expensive. Because we only need one singular vector, iterative methods can

be used to improve the computational speed. For example, inverse power, in-

verse Chebyshev iteration [14] and Rayleigh quotient iteration (RQI) [2] could be

used. Related methods for computing smallest singular value(s) (and vector(s))

are Jacobi-Davidson [22], trace minimization [21] and Inverse Rayleigh Ritz itera-

tion [12]. We also note that block Lanczos and block Davidson methods, based on

parallel sparse matrix-vector multiplication, were discussed for computing the sin-

gular subspace associated with the smallest singular values [18]. Here, rather than

computing the singular vectors, we use the TLS minimization formulation (1.4),

and solve the optimization problem by application of the parallel variable distri-

bution (PVD) approach of Ferris and Mangasarian, [4]. This approach is suitable

for both sparse and dense data structures. The specialization of PVD to the linear

least squares problem was presented in [19], [3] and a more general approach was

presented in [6].

To apply the PVD approach we introduce the decomposition of the space R^n as a Cartesian product of lower-dimensional subspaces R^{n_j}, j = 1, · · · , p, where Σ_{j=1}^p n_j = n. Accordingly, any vector x ∈ R^n is decomposed as (Px)^T = (x_1^T, x_2^T, · · · , x_p^T), where P is a permutation matrix. Matrices A and E are partitioned


consistently

AP = [A1, A2, · · · , Ap],

EP = [E1, E2, · · · , Ep].

Without loss of generality, we ignore the permutation matrix P in the analysis and

theoretical development of the algorithms. For ease we introduce the notation x̄_i for the complement of x_i, namely the vector x with zeros in block i.

Consistent with application of the PVD approach we assume for the iterative

algorithm that there are p processors, each of which can update a different block-

component, xj . Each of these processors may be designated a slave processor, which

is coordinated by a single master processor. The slaves solve the local problems and

are coordinated for solution of the global problem by the master processor. If there

are insufficient available processors, the algorithms are modified appropriately to

consider the separate processes, rather than processors. If Σ_{j=1}^p R^{n_j} = R^n and Σ_{j=1}^p n_j > n, the spatial decomposition is designated as an overlapped decomposition; otherwise it is without overlap. In the theoretical analysis we assume that

the subproblems are not overlapped, although our numerical experiments consider

both situations.
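As an illustration only (the paper does not prescribe a particular indexing scheme, and the helper name column_blocks is ours), the following sketch builds contiguous column-index blocks, optionally extended by a fixed number of overlapping indices on each side; this is one way the overlap parameter "ovlp" reported in Section 5 could be realized.

```python
import numpy as np

def column_blocks(n, p, ovlp=0):
    """Split the column indices {0, ..., n-1} into p contiguous blocks.

    With ovlp = 0 the blocks partition the index set (the block sizes sum to n);
    with ovlp > 0 each block is extended by up to `ovlp` indices on each side,
    so the block sizes sum to more than n (an overlapped decomposition).  The
    permutation P of the text is taken to be the identity here.
    """
    edges = np.linspace(0, n, p + 1, dtype=int)
    blocks = []
    for j in range(p):
        lo = max(edges[j] - ovlp, 0)
        hi = min(edges[j + 1] + ovlp, n)
        blocks.append(np.arange(lo, hi))
    return blocks

if __name__ == "__main__":
    blks = column_blocks(160, 4, ovlp=5)
    print([len(b) for b in blks])   # [45, 50, 50, 45]: the sizes sum to 190 > 160
```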

In the presentation of the basic algorithms we do not give all details of the

parallel implementation for the linear algebra operations. For example, calculation

of matrix vector products, where the matrix is distributed over several processors

with local memory, requires local computation, global communication and global

update. Such operations are by now well documented, see for example [10], and are dependent on the local architecture employed for problem solution.

We reemphasize that the focus of this work is the development of a domain

decomposition approach and the study of its feasibility. The remaining sections of

this paper are organized as follows. We provide the development of the algorithms

and analyze their properties in Section 2. Convergence analysis is presented in

Section 3, computational considerations in Section 4, numerical experiments in

Section 5 and conclusions in Section 6.

2. Algorithms


2.1. Development. Domain decomposition applied to the Rayleigh quotient formulation for the TLS problem, (1.4), suggests the local problems are given by

(2.5)   z_i = argmin_{z ∈ R^{n_i}}  ‖A_i z − b(x̄_i)‖₂² / (1 + ‖z‖₂² + Σ_{j≠i} ‖x_j‖₂²),

where

(2.6)   b(x̄_i) = b − Σ_{j≠i} A_j x_j = r(x) + A_i x_i,   r(x) = b − Ax.

Setting

(2.7)   β_i² = 1 + Σ_{j≠i} ‖x_j‖²,   w_i = z_i/β_i   and   b_i = b(x̄_i)/β_i,

(2.5) is replaced by

(2.8)   w_i = argmin_{w ∈ R^{n_i}}  ‖A_i w − b_i‖₂² / (1 + ‖w‖₂²),

which now has the form of (1.4) for TLS(A_i, b_i). In other words, w_i solves

min_{E_i, f_i} ‖[E_i, f_i]‖_F   subject to   (A_i + E_i) w_i = b_i + f_i.

Equivalently, z_i is the solution of the weighted TLS subproblem:

(2.9)   min_{E_i, f_i} ‖[E_i, β_i² f_i]‖²_F   subject to   (A_i + E_i) z_i = b(x̄_i) + β_i² f_i.
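To make the reduction concrete, the sketch below (our own helper, local_tls_step, using a dense SVD in place of the SVD update of [11] that is used for efficiency in Section 4) assembles b(x̄_i) and β_i as in (2.6) and (2.7), solves TLS(A_i, b_i) of (2.8), and rescales to recover z_i. A non-overlapped decomposition is assumed.

```python
import numpy as np

def local_tls_step(A_blocks, x_blocks, b, i):
    """Solve the local problem (2.5) for block i by reduction to TLS(A_i, b_i).

    A_blocks : list of the column blocks A_1, ..., A_p of A (non-overlapped)
    x_blocks : list of the current block components x_1, ..., x_p
    Returns z_i, the minimizer of (2.5).
    """
    # r(x) = b - A x, accumulated block by block.
    r = b - sum(Aj @ xj for Aj, xj in zip(A_blocks, x_blocks))
    b_bar = r + A_blocks[i] @ x_blocks[i]                       # (2.6): b(x_bar_i)
    beta2 = 1.0 + sum(xj @ xj for j, xj in enumerate(x_blocks) if j != i)
    beta = np.sqrt(beta2)                                       # (2.7)
    b_i = b_bar / beta
    # TLS(A_i, b_i) via a dense SVD of [A_i, b_i], cf. (2.8).
    Ab = np.column_stack([A_blocks[i], b_i])
    _, _, Vt = np.linalg.svd(Ab, full_matrices=False)
    v = Vt[-1, :]
    w_i = -v[:-1] / v[-1]     # generic case: last component assumed nonzero
    return beta * w_i         # z_i = beta_i * w_i
```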

2.2. Global Update. The parallel algorithm requires both a mechanism to find

local solutions Y_i, i = 1, . . . , p, where Y_i is the vector x with component x_i replaced by z_i,

for each local problem, the parallelization phase, and the process by which a global

update is obtained, the synchronization phase. Of course, if parallelism is not

required, then a Gauss-Seidel update can be formed, in which the local problems

are solved using the most up-to-date information from each subproblem. On the

other hand, for true parallelism, there are various approaches for the global update

at the synchronization step. These mimic the alternatives presented in [19] for the

multisplitting solution of the least squares problem.

Before listing all synchronization approaches, we introduce a mechanism for an

optimal global update at synchronization.

Theorem 2.1. Let D ∈ R^{n×p} and γ ∈ R^p, and define

ψ(γ) = ‖A(x + Dγ) − b‖₂² / (1 + ‖x + Dγ‖₂²).


Then γ_min = argmin_{γ ∈ R^p} ψ(γ) satisfies

(2.10)   (D^T A^T A D − ψ · D^T D) γ = D^T A^T r + ψ · D^T x,

where r = b − Ax.

Proof. We introduce the notation ψ = ψ₁/ψ₂, where

(2.11)   ψ₁(γ) = ‖A(x + Dγ) − b‖₂² ≥ 0,

(2.12)   ψ₂(γ) = 1 + ‖x + Dγ‖₂² ≥ 1.

Then at a stationary point

ψ₁′ − ψ · ψ₂′ = 0.

Replacing ψ₁′, ψ₂′ by the corresponding gradient vectors, we immediately obtain condition (2.10). □

This result shows how we find an optimal update using the updated local solu-

tions; specifically, it leads to the updates S1 and Sp given below.

Synchronization approaches: At iteration k, the solution x(k) is updated by one of

the following four methods.

• Block Jacobi (BJ): Given the local solutions z_i(k), i = 1, . . . , p, form the block update x(k) = z(k).

• Convex Update: Form the global update using the convex combination

(2.13)   x_i(k) = (1 − α_i) x_i(k−1) + α_i z_i(k)

for each block i, where the weights α_i satisfy 0 < α_i < 1, with Σ_{i=1}^p α_i = 1; without any optimality imposed it is practical to choose α_i = 1/p.

• Line Search (S1): Define the search direction d(k) = z(k) − x(k−1), and find the scalar α such that the solution x(k) = x(k−1) + αd(k) solves the global minimization

PVDTLS-S1

(2.14)   min_{α ∈ R} ψ(α),   ψ(α) = ‖A(x(k−1) + αd(k)) − b‖₂² / (1 + ‖x(k−1) + αd(k)‖₂²).


The solution α is a root of the quadratic equation ϱ₁α² + ϱ₂α + ϱ₃ = 0, where

ϱ₁ = (d(k))^T x(k−1) · ‖Ad(k)‖² + (Ad(k))^T r(k−1) · ‖d(k)‖²,

ϱ₂ = ‖Ad(k)‖² · β²(x(k−1)) − ‖r(k−1)‖² · ‖d(k)‖²,

ϱ₃ = −(Ad(k))^T r(k−1) · β²(x(k−1)) − ‖r(k−1)‖² · (d(k))^T x(k−1),

with β²(x) = 1 + ‖x‖², and is chosen such that ψ(α) is minimal. Here we introduce the use of the residual r(k) = b − Ax(k). (A minimal sketch of this root selection is given after this list.)

• p-dimensional update (Sp): Search in a p-dimensional subspace Sp as follows. Consider the update x(k) = x(k−1) + Σ_{i=1}^p γ_i d_i(k), where the d_i(k) = [0, · · · , 0, (z_i(k))^T − (x_i(k−1))^T, 0, · · · , 0]^T are the components of the global search direction, and the parameters γ_i determine the weight in each subdirection. We use D(k) to denote the matrix with columns d_i(k), γ the vector with components γ_i, and solve the global minimization

PVDTLS-Sp

(2.15)   min_{γ ∈ R^p} ‖A(x(k−1) + D(k)γ) − b‖₂² / (1 + ‖x(k−1) + D(k)γ‖₂²).

By Theorem 2.1, the conditions for optimality yield

(2.16)   ((D(k))^T A^T A D(k) − ψ · (D(k))^T D(k)) γ = (D(k))^T A^T r(k−1) + ψ · (D(k))^T x(k−1),

where

(2.17)   ψ = ‖A D(k) γ − r(k−1)‖₂² / (1 + ‖x(k−1) + D(k)γ‖₂²).

Because p is assumed small relative to n we can assume that problems (2.16) and (2.17) are small. Again γ is chosen such that ψ(γ) is minimal (see the sketch following this list).
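Both optimized updates follow directly from the formulas above. The following sketch (our own illustration, not the authors' implementation; the helper names are hypothetical) selects the S1 step length as the root of the quadratic that minimizes ψ, and runs a short fixed-point iteration on (2.16) and (2.17) for Sp, initialized with ψ₀ = 0 as in Remark 2.1 below.

```python
import numpy as np

def rq(A, b, x):
    """Global objective phi(x) = ||Ax - b||^2 / (1 + ||x||^2)."""
    r = A @ x - b
    return (r @ r) / (1.0 + x @ x)

def s1_alpha(A, b, x, d):
    """Line search S1: root of rho1*alpha^2 + rho2*alpha + rho3 = 0 minimizing psi."""
    r = b - A @ x
    Ad = A @ d
    beta2 = 1.0 + x @ x                                   # beta^2(x)
    rho1 = (d @ x) * (Ad @ Ad) + (Ad @ r) * (d @ d)
    rho2 = (Ad @ Ad) * beta2 - (r @ r) * (d @ d)
    rho3 = -(Ad @ r) * beta2 - (r @ r) * (d @ x)
    roots = np.roots([rho1, rho2, rho3])
    roots = roots[np.abs(roots.imag) < 1e-12].real        # keep real roots only
    if roots.size == 0:
        return 0.0
    return min(roots, key=lambda a: rq(A, b, x + a * d))  # root with smaller psi

def sp_gamma(A, b, x, D, inner_iters=5):
    """Sp update: fixed-point iteration on (2.16)-(2.17), initialized with psi = 0."""
    r = b - A @ x
    AD = A @ D
    M, G = AD.T @ AD, D.T @ D
    c1, c2 = AD.T @ r, D.T @ x
    psi, gamma = 0.0, np.zeros(D.shape[1])
    for _ in range(inner_iters):                    # L <= 5 suffices in practice (Sec. 4)
        gamma = np.linalg.solve(M - psi * G, c1 + psi * c2)   # linear system (2.16)
        num = AD @ gamma - r
        den = x + D @ gamma
        psi = (num @ num) / (1.0 + den @ den)       # update psi by (2.17)
    return gamma, psi
```

In either case the resulting step is accepted only if it decreases the global objective, cf. (2.18) in Algorithm 1 below.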

Remark 2.1. A secular equation with ψ as variable can be derived from equations

(2.16) and (2.17). Because we want to minimize the function ψ(γ), the smallest

root of the secular equation is of interest. Thus the iteration may be initialized with

ψ0 = 0.

Remark 2.2. Observe that the two search techniques, S1 and Sp, are more general

forms of the update (2.13). In particular S1 corresponds to (2.13) with αi = α, but

without the constraint on each αi, and Sp finds an optimal set of αi in (2.13) each

step to minimize the global objective. Clearly, S1 is the specialization of Sp for the


case p = 1. As presented, the optimal updates S1 and Sp do not impose convexity, which is important for convergence in the LS case but, as we will see, is irrelevant for TLS because of its lack of convexity.

Remark 2.3. We do not expect that use of the BJ update at each step would

generate a convergent algorithm. However, as in [19], it may speed up the iteration

to adopt the BJ update at a given step when it generates a smaller objective value than that given by (2.13), so that the sequence of objective function values remains

decreasing.

2.3. Algorithm.

Algorithm 1. (PVDTLS) Given a tolerance τ and an initial vector x(0), set k = 0, and calculate r(0) = b − Ax(0). Compute solutions x(k) iteratively until |φ(x(k)) − φ(x(k−1))|/φ(x(k)) < τ as follows:

(1) While not converged Do

(a) Parallelization (slave processor i):

(i) k = k + 1.

(ii) Calculate b(x̄_i) = r(k−1) + A_i x_i(k−1).

(iii) Find the solution w_i(k) of TLS(A_i, b_i), and calculate z_i using (2.7).

(b) Synchronization (master processor):

(i) Use a global update algorithm to find y_opt and update x(k) = x(k−1) + y_opt, where

(2.18)   φ(x(k)) < φ(x(k−1)).

(ii) Test for convergence. Break if converged.

(iii) Else update necessary data values over all processors. Continue.

End Do
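A serial simulation of Algorithm 1 (each "slave" step executed in a loop on one process, with the Sp synchronization step) might look as follows. This is a sketch under the assumption that the hypothetical helpers column_blocks, local_tls_step, sp_gamma and phi from the earlier sketches are available, and that the decomposition is non-overlapped.

```python
import numpy as np

def pvdtls(A, b, p=4, ovlp=0, tol=1e-5, max_outer=500):
    """Serial sketch of Algorithm 1 (PVDTLS) using the Sp synchronization step.

    Relies on column_blocks, local_tls_step, sp_gamma and phi from the earlier
    sketches; ovlp = 0 matches the setting of the convergence analysis.
    """
    m, n = A.shape
    blocks = column_blocks(n, p, ovlp)
    A_blocks = [A[:, idx] for idx in blocks]
    x = np.zeros(n)
    phi_old = phi(A, b, x)
    for k in range(1, max_outer + 1):
        x_blocks = [x[idx] for idx in blocks]
        # Parallelization phase: each block solves its local TLS problem.
        z_blocks = [local_tls_step(A_blocks, x_blocks, b, i) for i in range(p)]
        # Synchronization phase: assemble D^(k) and take the optimized Sp step.
        D = np.zeros((n, p))
        for i, idx in enumerate(blocks):
            D[idx, i] = z_blocks[i] - x[idx]
        gamma, _ = sp_gamma(A, b, x, D)
        x_new = x + D @ gamma
        if phi(A, b, x_new) < phi_old:        # accept only if (2.18) holds
            x = x_new
        phi_cur = phi(A, b, x)
        if abs(phi_cur - phi_old) / max(phi_cur, np.finfo(float).tiny) < tol:
            break
        phi_old = phi_cur
    return x, phi(A, b, x)
```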

3. Convergence analysis

Discussion of convergence for the TLS PVD algorithms is far more complex than

that for the least squares PVD for which the objective function is convex. Consider


Figure 1. Function φ(x) for (3.19) with maximum point (2, −1, 1), minimum point (−1, −1, 0.1) and saddle point (0, 1, 0.5).

the following low-dimensional example, for which φ(x) is illustrated in Figure 1.

(3.19)   [A, b] =

    [ −2/√6    0      0    √3/3 ]   [ 1    0     0  ]   [ −2/√6    0     √3/3 ]T
    [  1/√6   √2/2    0    √3/3 ]   [ 0   0.5    0  ]   [  1/√6   √2/2   √3/3 ]
    [  1/√6  −√2/2    0    √3/3 ]   [ 0    0    0.1 ]   [  1/√6  −√2/2   √3/3 ]  .
    [   0      0     −1     0   ]   [ 0    0     0  ]

We see that not only is φ(x) not convex but it also possesses saddle points. Fig-

ure 2 illustrates two curves which cross the saddle point (0, 1) of φ(x) for (3.19).

Obviously, if the iteration starts from x(0) = (0, 1), the update will not move from

the saddle point, neither in the parallelization nor the synchronization step, (in the

latter D(1) = 0). Thus, even if the given algorithms converge to some point, this

point may not be at the global minimum, and may be a saddle point which traps

the subsequent iterative updates.

The difficulty exhibited by the example illustrated in Figure 1 is general, as

demonstrated by the following result for the stationary points of the nonconvex

function φ(x).

Lemma 3.1. ( FACT 1-8 in [17] ) The Rayleigh quotient of a symmetric matrix is

stationary at, and only at, the eigenvectors of the matrix.


Figure 2. Curves crossing the saddle point (0, 1) in coordinate directions, φ((0, x2)) and φ((x1, 1)).

Lemma 3.2. ([20]) If the extreme singular values of the matrix [A, b] are simple

then φ(x) has one unique maximum point, one unique minimum point and n-1

saddle points.

In particular, this result shows that if the iteration defined by the preceding

algorithm starts at a stationary point, the iteration may not converge to the global

minimum of the objective.

3.1. Convergence Proof. First, note that when we relate a right singular vector

v_i(1 : n + 1) of the matrix [A, b] to (x^T, −1)^T we assume v_i(n + 1) ≠ 0.

We use the notation ∇²_i φ(x) for the n_i-dimensional Hessian matrix with respect to x_i ∈ R^{n_i}. Also, for a bounded sequence {x(k)} with accumulation point x∗, we use the indices k_j, j = 1, 2, · · · , for the subsequence {x(k_j)} → x∗ as j → ∞.

Theorem 3.3. Suppose {x(k)} is the sequence generated by Algorithm 1. Then

(1) {φ(x(k))} converges.

(2) If {x(k)} is bounded, any accumulation point x∗ is a stationary point of φ(x) and ∇²_i φ(x∗) ≥ 0 on the subspace R^{n_i} for i = 1, · · · , p.

(3) If {x(k)} is bounded and the singular values of the matrix [A, b] are distinct, then {x(k)} converges.

Proof. (1) By the synchronization step it is guaranteed that the objective func-

tion decreases with {x(k)}. Moreover, the objective function is bounded

below by 0 and thus converges.


(2) Because {x(k)} is bounded there exists at least one accumulation point x∗.

Denote φ(x∗) by φ∗. Suppose that this accumulation point is not at a

minimum of φ with respect to its first block component. Then there exists

x_1^new ∈ R^{n_1} such that

(3.20)   φ(x_1^new, x∗_1̄) < φ∗.

Here 1̄ denotes the complement of the index 1 in {1, · · · , p}, i.e., 1̄ = {2, · · · , p}.

By the requirement of objective function decrease, however, we must have

φ(x(k_j)) ≤ φ(x(k_{j−1}+1)) ≤ φ(z_1(k_{j−1}+1), x_1̄(k_{j−1})) ≤ φ(x_1^new, x_1̄(k_{j−1})).

In the limit j → ∞ this yields

φ∗ ≤ φ(x_1^new, x∗_1̄),

which contradicts (3.20). The same argument applies with respect to the

other block components.

(3) We first prove that {x(k)} has a unique accumulation point provided that the singular values of the matrix [A, b] are distinct. Suppose the contrary; then there exist two accumulation points x∗ and x∗∗, which by the previous statement are stationary points of φ(x). Then, by Lemma 3.1, ((x∗)^T, −1)^T and ((x∗∗)^T, −1)^T are two eigenvectors of the matrix [A, b]^T [A, b] corresponding to two distinct eigenvalues σ_k² < σ_l². Moreover, φ(x∗) = σ_k² < σ_l² = φ(x∗∗). But, by statement (1), {φ(x(k))} converges, so φ(x∗) = φ(x∗∗), a contradiction. Thus {x(k)} must converge to this unique accumulation point: otherwise, because of the boundedness of the sequence, we could construct another accumulation point, which by the above is not possible. □

4. Computational Considerations

We assess the maximal theoretical efficiency of PVDTLS for solution of (1.4)

through analysis of its computational cost in comparison to the traditional serial

direct method and its parallelization using PARPACK [23]. Comparison with indi-

rect techniques, [14], is not so immediate because the cost of each indirect algorithm

depends on both its specific implementation and the condition of the underlying


problem. See, for example, [2] for a thorough discussion of approaches for imple-

mentation of the RQI to solve (1.4). Also, the use of an indirect algorithm for (2.5)

confounds the discussion in this paper, for which the major intent is to assess the

overall viability and stability of PVDTLS for large scale applications. Thus, while

we do not specifically exclude that indirect techniques can be useful for solving

each of the local problems, here we propose the use of a very efficient SVD update

algorithm.

Measuring computational cost only in terms of number of flops, without con-

sideration of any hardware-related issues for message passing, or memory or cache usage, the cost of the direct SVD solution of (1.4) is Cs = 2mn² + 12n³, see Algorithm

12.3.1 in [10]. It is efficient to use this same direct SVD algorithm for the solution

of each (2.5), because the first n_i columns of the local augmented matrix [A_i, b_i] are fixed over all outer

iterations. Hence, having calculated the initial SVD, the SVD update algorithm,

[11], can be utilized, yielding a total local cost Cl(K), where K is the number of

outer iterations to convergence,

(4.21)   C_l(K) ≈ 2n_i²(m + 6n_i) + K(2n_i(m + 5n_i) + 4n_i m),

Here the last term is the cost of updating b(x̄_i) and generating A_i d_i as needed for the global update. The cost of the global update, other than by Sp, is negligible. The major costs associated with the solution of (2.15) are Lp³/3, where L is the number of iterations for the minimization, and 2m(p + 1), which accounts for the initialization of the products (AD(k))^T AD(k) and (r(k−1))^T AD(k). In the implementation, L is limited to L ≤ 5 under the assumption, verified numerically, that an exact global solution at each outer iteration is unnecessary. Hence,

(4.22)   C_Sp(K) ≈ K(2m(p + 1) + (L/3)p³),

and, using L = 5, the total parallel cost used in our theoretical estimates is

(4.23)   C_p(K) ≈ C_l(K) + C_Sp(K)

(4.24)           ≈ 2n_i²(m + 6n_i) + 2K(5n_i² + 3n_i m + (p + 1)m + (5/6)p³).

Not surprisingly, there must be a balance between the size of the local problems, in

order to reduce local cost, and maintaining p small enough that p3 stays relatively

small. This is illustrated in Figure 3, in which the maximal efficiency,

(4.25)   Eff(p, K) = C_s / (p C_p(K)),


Figure 3. Theoretical efficiency of PVD-Sp against number of iterations to convergence for different problem sizes (panels for m = 162, n = 160; m = 3200, n = 1600; m = 2050, n = 2048; curves for p = 2, 4, 8, 16, 32), ignoring overheads of parallel communication and memory access.


is calculated for problems of increasing size and different numbers of domain blocks.

Note that the ideal efficiency Eff(p) = 1 is potentially exceeded for problems with size n = O(10³), with p ≤ 32, and 500 ≤ K ≤ 1000.
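The curves of Figure 3 follow from the flop-count model (4.21) through (4.25); the sketch below evaluates it (our own code; equal-sized subdomains n_i = n/p assumed, and communication and memory overheads ignored). The IRLM cost (4.26), given below, could be evaluated analogously for comparison.

```python
def pvdtls_efficiency(m, n, p, K, L=5):
    """Theoretical efficiency (4.25) of PVDTLS-Sp from the flop counts (4.21)-(4.24)."""
    ni = n / p                                      # equal-sized subdomains assumed
    Cs = 2 * m * n**2 + 12 * n**3                   # serial direct SVD cost
    Cl = 2 * ni**2 * (m + 6 * ni) + K * (2 * ni * (m + 5 * ni) + 4 * ni * m)   # (4.21)
    Csp = K * (2 * m * (p + 1) + (L / 3) * p**3)    # (4.22)
    Cp = Cl + Csp                                   # (4.23)-(4.24)
    return Cs / (p * Cp)                            # (4.25)

if __name__ == "__main__":
    # One point of the model for the m = 3200, n = 1600 panel of Figure 3.
    print(round(pvdtls_efficiency(m=3200, n=1600, p=8, K=100), 2))
```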

A standard alternative to the design of a new algorithm for large scale prob-

lems is the parallelization of the most computationally demanding steps of a se-

rial algorithm. Here we consider the cost of PARPACK for the parallel implicitly restarted Lanczos method (IRLM), [23], to calculate the smallest eigenpair of the matrix [A, b]^T [A, b]. We use the default number, 20, of Lanczos basis vectors and

request only one eigenvalue needed for the explicit representation of the TLS solu-

tion. The parallelizable cost of the IRLM includes the initial cost of generating a

20-dimensional Krylov subspace, 20(n + 1)(4m + 9 + 2·20), and the per-iteration cost of 20(n + 1)(4m + (6 + 9) + 4·20), [1]. The serial cost in each iteration for 19 implicitly shifted QR steps is 6·20²·19, [10]. Hence the total parallel cost for K

iterations is

(4.26)   C_p^IRLM(K) = [80(K + 1)mn + (960 + 1900K)n] / p + 45600K,

where, unlike PVDTLS, K is independent of the number of processors. This esti-

mate is used to compare the efficiencies for the numerical experiments reported in

Section 5.

5. Numerical Experiments

5.1. Evaluation of the PVDTLS Algorithms. To evaluate the algorithms we

use both a standard test problem, staar, [24], and three different versions of a test

problem which can be modified to change the underlying condition of the problem.

In each case the problem size and number of subdomains are chosen so that each

subdomain is of equal size. All tests are run with Matlab 7.0 with x(0) = 0. We

design the test data in such a way that the exact TLS solution xTLS and optimal

objective function value φmin = σ_{n+1}² are known. For the stopping criterion we use tolerance τ = 0.00001. In each case we report the number of outer iterations K to

the given convergence, the relative errors of the converged x(K) to xTLS and of the

converged φ(x(K)) to φmin. Each case is tested for solution by Gauss-Seidel and

PVD techniques PVDTLS-S1 and PVDTLS-Sp. A selection strategy, denoted “Sel” in the results, is also tested, in which at each step the

update chosen is that which gives the greatest reduction in the objective function

when chosen from the local solutions Yi, the BJ update z(k) and the convex update


given by (2.13). This is consistent with the approach presented for the convex cases

in [6], [19], and permits evaluation of whether the optimal update is necessary for

obtaining a good solution. For the second test problem, the third situation leads

to a case in which one block converges to a poor solution. In this case we also evaluate the use of truncated TLS for regularization of the local solution on this block.

Test 1.

We consider the problem staar [24] for m = 162. The Rayleigh quotient is φ(xTLS) =

m, [14] (Chapter 2.4). For all cases the iteration converges to the correct solution

and minimal φ in just two steps.

Test 2.

Based on the sensitivity analysis presented in [9], Bjorck et al. [2] suggest that

the ratio σ′_1/(σ′_n − σ_{n+1}) be used as an approximate measure of the condition of a TLS problem. Here σ_i and σ′_i represent the ith singular values of the matrices

[A, b] and A respectively. The approximate condition numbers partially reveal the

local structure of the surface of φ(x) near the minimum point. A large condition

number predicts that locally the surface is flat, which makes the problem difficult

to solve, while a small condition number indicates that the surface is locally steep.

The TLS condition number for Test 1 is just 30.7, suggesting that staar is rather

well-conditioned. In contrast, the following setup leads to examples with different condition numbers, so as to evaluate PVDTLS with respect to the condition of

the problem.

We construct [A, b] ∈ R^{m×(n+1)} as

[A, b] = U Σ V^T,

U = I_m − 2χχ^T,

V = I_{n+1} − 2ςς^T,

Σ = [ Σ_1 ; 0 ],

where

χ(i) = sin(4πi/m),   i = 0, 1, · · · , m − 1,

ς(j) = cos(4πj/(n + 1)),   j = 0, 1, · · · , n,


Figure 4. The four lowest stationary points of φ(x) for Test 2. The lower right is the solution at the lowest stationary point, xTLS.

and χ and ς are normalized such that ‖χ‖₂ = ‖ς‖₂ = 1. In the tests we set m = 162 and n = 160, and Σ_1 is one of the following three diagonal matrices (a sketch constructing such a test problem is given after this list):

a: Σ_1 = diag(v_1, v_2, v_3, v_4, 0.001), where v_1, v_2, v_3 and v_4 are constant row vectors of length n/4 with values 4/n, 2/n, 4/(3n) and 1/n respectively. This test has condition 4.76 and φmin = σ_{n+1}² = 1.0 × 10⁻⁶.

b: Σ_1 = diag(1, 1/2, · · · , 1/n, 0.001). The condition is 190.5 and φmin = σ_{n+1}² = 1.0 × 10⁻⁶.

c: Σ_1 = diag(1, 1/2, · · · , 1/(n + 1)). The condition is 2576 and φmin = σ_{n+1}² = (1/161)².
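For concreteness, here is a sketch of this construction in NumPy (our own code; the reported experiments were run in Matlab, and the helper name make_test2 is hypothetical). Because U and V are Householder reflections, the exact xTLS and φmin can be read off from the column of V associated with the smallest diagonal entry of Σ_1.

```python
import numpy as np

def make_test2(m=162, n=160, case="a"):
    """Construct the Test 2 data [A, b] = U Sigma V^T with Householder U and V."""
    i = np.arange(m)
    j = np.arange(n + 1)
    chi = np.sin(4 * np.pi * i / m)
    chi /= np.linalg.norm(chi)
    vsig = np.cos(4 * np.pi * j / (n + 1))
    vsig /= np.linalg.norm(vsig)
    U = np.eye(m) - 2 * np.outer(chi, chi)
    V = np.eye(n + 1) - 2 * np.outer(vsig, vsig)
    if case == "a":
        q = n // 4
        d = np.concatenate([np.full(q, 4 / n), np.full(q, 2 / n),
                            np.full(q, 4 / (3 * n)), np.full(q, 1 / n), [0.001]])
    elif case == "b":
        d = np.concatenate([1.0 / np.arange(1, n + 1), [0.001]])
    else:   # case "c"
        d = 1.0 / np.arange(1, n + 2)
    Sigma = np.zeros((m, n + 1))
    np.fill_diagonal(Sigma, d)
    Ab = U @ Sigma @ V.T
    A, b = Ab[:, :n], Ab[:, n]
    k = np.argmin(d)                  # position of the smallest singular value
    v = V[:, k]                       # its right singular vector (column of V)
    x_tls = -v[:n] / v[n]             # exact TLS solution, assuming v[n] != 0
    return A, b, x_tls, d[k] ** 2     # also return phi_min = sigma_{n+1}^2
```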

In these examples, the stationary points are all the same; the ith, i ≤ n, is close to [0, · · · , 0, h_i, 0, · · · , 0], where h_i is a positive number, see Figure 4. In

each example, however, the function values at these stationary points are different.

In turn the matrix spectra are different and the condition increases from Test 2 (a)

to (c). Results of the numerical experiments are detailed in Tables 1-3, in which x.y(−t) indicates x.y × 10^{−t} and N indicates that convergence has not occurred by the 500th iteration.


Table 1. Test 2 (a): Outer iteration steps / relative error to φmin

/relative error to TLS solution.

Method    p   ovlp 0                  ovlp 1                  ovlp 2                  ovlp 5
PVD-S1    2   10/6.5(−06)/7.5(−04)    10/4.4(−06)/6.6(−04)    10/2.5(−06)/5.2(−04)    8/3.6(−06)/6.3(−04)
          4   26/3.4(−05)/3.8(−03)    25/2.0(−05)/3.2(−03)    25/3.4(−05)/4.2(−03)    24/2.5(−05)/3.5(−03)
          8   54/9.8(−05)/6.8(−03)    50/8.8(−05)/6.4(−03)    49/7.9(−05)/6.1(−03)    46/8.2(−05)/6.2(−03)
PVD-Sp    2   8/7.3(−07)/2.9(−04)     8/3.9(−06)/8.8(−04)     8/5.7(−07)/3.5(−04)     7/1.7(−07)/1.9(−04)
          4   2/4.2(−16)/3.7(−12)     2/4.2(−16)/3.7(−12)     2/4.2(−16)/3.7(−12)     2/0.0(+00)/3.7(−12)
          8   2/0.0(+00)/3.7(−12)     2/0.0(+00)/3.7(−12)     2/4.2(−16)/3.6(−12)     2/4.2(−16)/3.7(−12)
PVD-Sel   2   7/7.3(−06)/1.0(−03)     7/3.5(−06)/7.3(−04)     7/1.6(−06)/5.3(−04)     7/1.7(−07)/2.6(−04)
          4   32/6.4(−05)/5.8(−03)    31/2.8(−05)/3.7(−03)    25/5.6(−05)/5.3(−03)    22/3.3(−05)/3.8(−03)
          8   45/1.7(−04)/7.5(−03)    46/4.7(−05)/4.8(−03)    41/6.3(−05)/5.1(−03)    39/9.9(−05)/6.4(−03)
GS-S1     2   6/9.1(−08)/1.1(−04)     6/3.1(−07)/2.0(−04)     6/5.2(−08)/7.6(−05)     5/2.5(−08)/5.8(−05)
          4   8/2.1(−07)/3.1(−04)     8/5.8(−07)/4.1(−04)     6/8.1(−07)/5.8(−04)     7/1.8(−08)/7.5(−05)
          8   12/1.2(−06)/7.6(−04)    11/6.4(−06)/1.4(−03)    11/3.3(−06)/1.0(−03)    11/3.7(−06)/1.1(−03)

Table 2. Test 2 (b): Outer iteration steps / relative error to φmin

/relative error to TLS solution.

Method    p   overlap 0               overlap 1               overlap 2               overlap 5
PVD-S1    2   N/7.5(−03)/2.5(−02)     54/3.7(−05)/1.8(−03)    24/9.0(−06)/8.8(−04)    17/1.3(−05)/1.0(−03)
          4   N/2.9(−01)/3.0(−01)     80/5.6(−05)/3.6(−03)    60/2.3(−04)/8.6(−03)    37/1.3(−03)/2.0(−02)
          8   N/8.2(−01)/5.1(−01)     442/1.8(−02)/8.1(−02)   338/2.2(−03)/2.9(−02)   225/1.3(−02)/6.7(−02)
PVD-Sp    2   364/1.4(−03)/1.1(−02)   17/6.9(−07)/2.6(−04)    13/5.1(−07)/2.4(−04)    8/5.0(−06)/1.4(−03)
          4   N/1.3(−02)/7.2(−02)     57/2.5(−04)/9.2(−03)    28/5.2(−05)/4.6(−03)    24/2.4(−05)/2.9(−03)
          8   79/3.6(−04)/1.0(−02)    202/1.0(−03)/1.9(−02)   219/1.4(−03)/2.2(−02)   196/1.2(−03)/2.0(−02)
PVD-Sel   2   10/1.6(−03)/1.2(−02)    22/1.0(−04)/2.4(−03)    16/4.7(−05)/1.1(−03)    9/9.2(−07)/4.5(−04)
          4   N/6.8(−01)/5.3(−01)     195/1.1(−03)/1.9(−02)   108/1.2(−02)/5.7(−02)   150/3.1(−04)/9.3(−03)
          8   N/1.3(+00)/7.7(−01)     N/6.8(−01)/5.4(−01)     465/2.2(−01)/3.1(−01)   N/3.9(−01)/3.9(−01)
GS-S1     2   33/1.2(−03)/1.0(−02)    14/4.0(−12)/6.4(−07)    8/8.8(−06)/7.2(−04)     5/2.3(−08)/3.3(−05)
          4   91/1.3(−03)/1.8(−02)    32/2.8(−04)/8.3(−03)    26/2.3(−06)/7.0(−04)    12/7.1(−06)/1.5(−03)
          8   119/2.5(−03)/2.7(−02)   73/1.7(−04)/6.5(−03)    67/6.8(−05)/4.6(−03)    69/1.6(−05)/1.9(−03)

Also illustrated in Figure 5 is the estimate of the q-factor in each case, which

demonstrates the q-linear convergence behavior, [15]. Indeed, in most cases the

convergence rate is nearly linear, which is consistent with the result for strongly

convex problems, Theorem 2.3 of [4]. Theoretically, the converged point may be

a saddle point, see Figure 1, but we did not encounter this in any of our tests.


Table 3. Test 2 (c): Outer iteration steps / relative error to φmin

/relative error to TLS solution.

Method    p   overlap 0               overlap 1               overlap 2               overlap 5
PVD-S1    2   329/1.9(−03)/9.1(−02)   29/9.2(−05)/2.0(−02)    21/3.2(−05)/1.1(−02)    5/9.1(−04)/1.3(−01)
          4   454/7.7(−03)/5.3(−01)   78/8.4(−05)/7.6(−02)    35/4.5(−03)/4.2(−01)    41/4.3(−05)/4.6(−02)
          8   410/1.2(−02)/6.4(−01)   54/8.2(−03)/7.2(−01)    47/8.4(−03)/8.0(−01)    33/8.7(−03)/9.6(−01)
PVD-Sp    2   6/1.6(−03)/8.4(−02)     14/3.5(−09)/1.3(−04)    10/3.2(−06)/3.7(−03)    6/5.7(−06)/8.8(−03)
          4   16/6.3(−03)/6.7(−01)    38/3.2(−04)/1.4(−01)    28/2.8(−04)/1.2(−01)    18/6.8(−05)/6.1(−02)
          8   20/2.5(−03)/8.0(−01)    8/2.7(−03)/8.3(−01)     9/2.6(−03)/8.0(−01)     10/2.9(−03)/8.1(−01)
PVD-Sel   2   8/4.4(−05)/2.1(−02)     8/2.0(−05)/1.3(−02)     8/7.9(−06)/1.0(−02)     2/3.4(−02)/7.7(−01)
          4   3/1.5(−02)/1.1(+00)     10/1.1(−02)/8.6(−01)    48/2.9(−03)/3.7(−01)    157/1.3(−04)/6.9(−02)
          8   2/7.7(−03)/1.5(+00)     2/7.7(−03)/1.5(+00)     3/7.7(−03)/1.5(+00)     3/7.7(−03)/1.5(+00)
GS-S1     2   24/1.2(−03)/7.4(−02)    10/1.3(−05)/7.3(−03)    5/2.0(−06)/2.6(−03)     4/5.6(−08)/5.8(−04)
          4   8/2.2(−01)/2.2(+00)     14/2.6(−02)/1.5         10/4.0(−02)/1.9(+00)    14/1.7(−04)/9.2(−02)
          8   6/6.0(−02)/6.5(+00)     4/1.1(−01)/9.9(+01)     2/1.1(−01)/8.8(+01)     2/1.1(−01)/6.7(+01)

Instead, we did find some cases that converged to a point which is neither a saddle

nor the true solution. Illustrated in Figure 6 is one such case, Test 2 (c), using

S1 for the global update, with p = 2 and overlap of 5. Here, it is apparent that

the converged solution is close to the true solution, and even after 50000 iterations the solution is still not improved. This does not contradict Theorem 3.3, because the objective function is so flat near the true solution that (2.18) cannot be satisfied to double precision accuracy. In this case, it is possible to obtain a solution which does converge with relative error less than 1.0 × 10^{−11} if we pick the initial guess to be the true TLS solution with 10% noise added.

5.1.1. Discussion. The numerical results show

(1) a simple selection strategy, “Sel”, does not provide satisfactory convergence

behavior for PVDTLS. When the TLS condition number is not small, and

the number of blocks increases, this approach does not succeed, even if

regularization is applied to the “bad” block. In particular it is essential

that the global update is optimized using an approach such as S1 or Sp.

This contrasts with the LS case, in which convex updates or BJ updates were

often sufficient to maintain satisfactory convergence at less cost than the

optimal recombination global update.


Figure 5. Convergence history for 4 examples (panels: Test1(b), PVD-S1, p = 2, ovlp = 0, q-factor = 0.994; Test1(a), PVD-Sp, p = 2, ovlp = 5, q-factor ≈ 0.34; Test1(b), PVD-Sp, p = 2, ovlp = 5, q-factor ≈ 0.60; Test1(c), PVD-Sp, p = 2, ovlp = 5, q-factor ≈ 0.42). In each case the solid line indicates the relative error of x(k) to xTLS and the dotted line the relative error of φ(k) to φmin. The approximate q-factors of linear convergence are also presented.

Figure 6. Test 2 (c), PVD-S1, p = 2, overlap 5. Left: convergence history; the solid line indicates the relative error of x(k) to xTLS, the dotted line the relative error of φ(k) to φmin, and the crosses the estimated relative error of the objective, |φ(x(k)) − φ(x(k−1))|/φ(x(k)). Right: converged solution and true TLS solution.


(2) Use of a p-dimensional optimal update reduces the total number of iterations and is more reliable at obtaining a good converged solution than the line search S1, for minimal extra cost when p is small.

(3) When using a Gauss-Seidel approach, results show that it is essential to do

some global update optimization, as indicated here by use of the line search

S1. Moreover, as for the selection strategy, the Gauss-Seidel approach

is not successful when the TLS condition number increases and a given

block needs some regularization. Use of Sp may be useful to improve the

convergence of GS.

(4) As p increases the number of iterations to convergence increases.

(5) Increasing overlap can be used to reduce the number of iterations to con-

vergence.

These latter two results are completely expected for any parallel approach. The

former results are specific to the PVD for TLS.

Our implementation does not use the forget-me-not term in the global update

as used in [4, 3]. While its inclusion might increase the speed of the convergence,

it would also exclude the use of the efficient SVD update scheme. Moreover, the

formulation as presented easily permits, independently of other domains, the

inclusion of regularization for a specific subdomain, [13], [5].

5.2. Comparison with IRLM. To compare with parallel IRLM we also solve the

second test case to achievable accuracy by both the parallel IRLM and PVDTLS-

Sp and evaluate the relative efficiency. We present the number of iterations to find

in each case the best achievable accuracy measured in terms of the relative error

to the true TLS solution. Representative cases for PVDTLS-Sp with overlap of 5,

and 2, 4 and 8 subdomains are given in Table 4. From these results we observe

that the PVD has better achievable accuracy for the Test 2 (b), regardless of the

number of subdomains. Moreover, it is apparent by the high number of iterations

required to achieve a converged solution, that for this specific test subdomain size is

too small when the global problem is split into 8 local problems. The convergence

history for parallel IRLM, shown in Figure 7, confirms that the IRLM does not

improve with iteration after a certain number of iterations. On the other hand, in

terms of accuracy parallel IRLM outperforms PVDTLS in two cases. Moreover, the

convergence history for PVDTLS in these cases has also stagnated, Figure 5 (a)-(d).


Figure 7. Convergence history for Test 2 (b) and Test 2 (c) with IRLM.

Table 4. Best reachable accuracy for PVDTLS Sp with overlap

5 and IRLM: iteration steps / relative error to TLS solution.

Method             Test 2 (a)       Test 2 (b)        Test 2 (c)
IRLM               2/5.36(−15)      41/9.61(−12)      814/1.72(−08)
PVDTLS   p = 2     25/1.03(−12)     60/9.12(−14)      12/8.71(−05)
         p = 4     2/3.67(−12)      161/7.78(−14)     72/9.78(−05)
         p = 8     2/3.67(−12)      4446/6.61(−13)    9450/5.48(−05)

[Author note: may need some other estimates of efficiency; perhaps not relative to Cs but relative to the p = 2 cases, or IRLM relative to PVD.]

Table 5 shows the efficiency to reach the optimal accuracy, demonstrating that parallel IRLM is less efficient than PVDTLS-Sp for the examples with higher condition number.

[Author note: still to be addressed.] However, it is too costly, if not impossible, to determine whether the converged solution is a saddle point; to do so, we would have to solve an eigenproblem for a Hessian matrix of the same size as the original problem.


Table 5. Efficiency calculated by (4.25) using iteration steps for

which the best accuracy is reached, refer to Table 4.

Method             Test 2 (a)   Test 2 (b)   Test 2 (c)
IRLM     p = 2     8.02         4.75         0.03
         p = 4     7.82         4.61         0.028
PVDTLS   p = 2     2.44         1.71         2.90
         p = 4     10.26        1.38         2.68

6. Conclusion

We have presented a domain decomposition strategy for solution of the TLS

problem, using PVD techniques combined with optimal global update. The method

can be implemented in a serial GS manner, or a parallel BJ approach. To obtain

a satisfactory convergence history it is essential to optimize the global update at

each outer iteration. The use of inexact local solutions is not needed because the

local solutions use an optimal SVD update scheme. The algorithm can be adapted

to handle problems in which some regularization is needed to on a local subdo-

main because of ill-conditioning of the local problem. As compared to a parallel

implementation of the implicitly restarted Lanczos algorithm, PVDTLS-Sp offers a

viable strategy for those large problems which are also poorly conditioned.

References

1. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Templates for the solution

of algebraic eigenvalue problems: A practical guide, SIAM, Philadelphia, 2000.

2. A. Bjorck, P. Heggernes, and P. Matstoms, Methods for large scale total least squares problems,

SIAM Journal on Matrix Analysis and Applications 22 (2000), no. 2, 413–429.

3. J.E. Dennis and T. Steihaug, A Ferris-Mangasarian technique applied to linear least squares

problems, Tech. Report CRPC-TR 98740, Rice University, 1998.

4. M. C. Ferris and O. L. Mangasarian, Parallel variable distribution, SIAM Journal on Opti-

mization 4 (1994), no. 4, 815–832.

5. R. D. Fierro, G. H. Golub, P. C. Hansen, and D. P. O’Leary, Regularization by truncated

total least squares, SIAM Journal on Scientific Computing 18 (1997), 1223–1241.

6. A. Frommer and R. Renaut, A unified approach to parallel space decomposition methods,

Journal of Computational and Applied Mathematics (1999), 205–223.

7. G. Golub, Some modified matrix eigenvalue problems, SIAM Review 15 (1973), 318–334.

8. G. H. Golub, P. C. Hansen, and D. P. O’Leary, Tikhonov regularization and total least squares,

Numerical Linear Algebra with Applications 21 (1999), no. 1, 185–194.


9. G. H. Golub and C. Van Loan, An analysis of the total least squares problem, SIAM Journal

on Numerical Analysis 17 (1980), 883–893.

10. G. H. Golub and C. van Loan, Matrix computations, third ed., Johns Hopkins University Press, Baltimore,

1996.

11. M. Gu and S. C. Eisenstat, A stable and fast algorithm for updating the singular value de-

composition, Tech. Report YALEU/DCS/RR-966, Yale University, 1994.

12. H. Guo, IRR: An algorithm for computing the smallest singular value of large-scale matrices,

International Journal of Computer Mathematics 77 (2001), no. 2, 89–104.

13. H. Guo and R. A. Renaut, A regularized total least squares algorithm, in S. Van Huffel and P. Lemmerling (eds.), Total least squares and errors-in-variables modeling: Analysis, algorithms and applications, pp. 57–66, Kluwer Academic Publishers, 2002.

14. S. Van Huffel and J. Vandewalle, The total least squares problem: computational aspects and

analysis, SIAM, Philadelphia, 1991.

15. C. T. Kelley, Iterative methods for optimization, Frontiers in Applied Mathematics, SIAM,

1999.

16. N. Mastronardi, P. Lemmerling, and S. Van Huffel, Fast structured total least squares al-

gorithm for solving the basic deconvolution problem, SIAM Journal on Matrix Analysis and

Applications 22 (2000), no. 2, 533–553.

17. B. N. Parlett, The symmetric eigenvalue problem, Prentice-Hall, 1980.

18. B. Philippe and M. Sadkane, Computation of the singular subspace associated with the small-

est singular values of large matrices, Tech. Report 754, IRISA, 1993.

19. R. A. Renaut, A parallel multisplitting solution of the least squares problem, Numerical Linear

Algebra with Applications 5 (1998), 11–31.

20. R. A. Renaut and H. Guo, Efficient solution of regularized total least squares, SIAM Journal

on Matrix Analysis and Applications, to appear.

21. A. H. Sameh and J. A. Wisniewski, A trace minimization algorithm for the generalized eigen-

value problem, SIAM Journal on Numerical Analysis 19 (1982), no. 6, 1243–1259.

22. G. L. G. Sleijpen and H. A. Van der Vorst, A Jacobi–Davidson iteration method for linear

eigenvalue problems, SIAM Review 42 (2000), no. 2, 267–293.

23. D. C. Sorensen, Implicit application of polynomial filters in a k-step Arnoldi method, SIAM

J. Matrix Anal. Appl. 13 (1992), 357–385.

24. J. Staar, Concepts for reliable modelling of linear systems with application to on-line identifi-

cation of multivariable state space descriptions, Ph.D. thesis, Katholieke Universiteit Leuven,

Leuven, Belgium, June 1982.

Department of Mathematics and Statistics, Arizona State University, Tempe, AZ

85287-1804