

Chapter 12

FACTOR ANALYSIS

Charles Spearman (1863-1945). British psychologist. Spearman was an officer in the British Army in India and, upon his return at the age of 40 and influenced by the work of Galton, he decided to do his doctoral thesis on the objective measurement of intelligence. He proposed the first factor analysis model, based on a common factor, the g factor, and a specific component. Upon receiving his PhD he was named full professor and occupied the first Chair in Psychology at University College, London.

12.1 INTRODUCTION

Factor analysis is used to explain the variability among a set of observed variables in terms of a small number of latent, or unobserved, variables called factors. For example, suppose that we take twenty body measurements from a person: height, length of torso and extremities, shoulder width, weight, etc. It is intuitive to think that these measurements are not independent of each other, and that knowing one of them we can predict the others with a small margin of error. One explanation for this fact is that these measurements are determined by the same genes and are therefore highly correlated; thus, if we know some of them, we can predict the values of the other variables with small error. As a second example, suppose we are interested in studying human development around the world and that we have many economic, social and demographic variables available, all of them, in general, interdependent and related to development. We can ask whether the development of a country depends on a small number of factors such that, if we knew their values, we could predict the whole set of variables for each country. As a third example, we use different tests to measure the intellectual capacity of an individual to process information and solve problems. We can ask whether there are factors, not directly observable, which explain the set of observed results. The set of these factors is what we will call intelligence, and it is important to know how many different dimensions this concept has and how to characterize and measure them. Factor analysis came about thanks to the interest of Karl Pearson and Charles Spearman in understanding the dimensions of human intelligence in the 1930s; as a result, many of their advances were produced in the area of psychometrics.

Factor analysis is related to principal components, but there are certain differences. First, principal components are constructed to explain variances, whereas factors are constructed to explain the covariances or correlations between the variables. Second, principal components is a descriptive tool, while factor analysis assumes a formal statistical model. On the other hand, principal components can be seen as a particular case of factor analysis, as we will see later.

12.2 THE FACTOR MODEL

12.2.1 Basic Hypotheses

Suppose that we observe a vector of variables x, of dimension (p × 1), on the elements of a population. The factor analysis model establishes that this vector is generated by the equation:

x = μ + Λf + u      (12.1)

where:

1. f is a vector (m × 1) of latent variables, or unobserved factors. We assume that it follows an N_m(0, I) distribution, that is, the factors have zero mean, are independent of each other and have a normal distribution.

2. Λ is a (p × m) matrix of unknown constants (m < p). It contains the coefficients which describe how the factors, f, affect the observed variables, x, and is called the loading matrix.

3. u is a vector (p × 1) of unobserved perturbations. It includes the effect of all those variables, other than the factors, which influence x. We assume that u has an N_p(0, ψ) distribution, where ψ is diagonal, and that the perturbations are uncorrelated with the factors f. (A small simulation sketch of this model is given below.)
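To make the notation concrete, the following is a minimal simulation sketch of model (12.1) in Python, assuming illustrative (hypothetical) values for Λ and for the diagonal of ψ:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, m = 500, 4, 2                       # sample size, variables, factors
    Lambda = np.array([[0.9, 0.0],            # hypothetical loading matrix (p x m)
                       [0.8, 0.1],
                       [0.1, 0.7],
                       [0.0, 0.9]])
    psi = np.array([0.3, 0.4, 0.5, 0.2])      # diagonal of the perturbation covariance
    mu = np.zeros(p)

    F = rng.standard_normal((n, m))                    # factors f ~ N_m(0, I)
    U = rng.standard_normal((n, p)) * np.sqrt(psi)     # perturbations u ~ N_p(0, psi)
    X = mu + F @ Lambda.T + U                          # data matrix X = 1 mu' + F Lambda' + U

    # the sample covariance of X should be close to Lambda Lambda' + diag(psi)
    print(np.round(np.cov(X, rowvar=False) - (Lambda @ Lambda.T + np.diag(psi)), 2))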


With these three hypotheses we deduce that:

(a) μ is the mean of the variable x, since both the factors and the perturbations have zero mean;

(b) x has a normal distribution, being the sum of normal variables, and, letting V be its covariance matrix, x ∼ N_p(μ, V).

Equation (12.1) implies that, given a random sample of n elements generated by the factor model, each piece of data x_ij can be written as:

x_ij = μ_j + λ_j1 f_1i + ... + λ_jm f_mi + u_ij ,   i = 1, ..., n,  j = 1, ..., p,

which decomposes x_ij, the observed value of variable j in individual i, as a sum of m + 2 terms. The first is the mean of variable j, μ_j; the terms from the second to the (m+1)-th contain the effects of the m factors; and the last is a perturbation specific to each observation, u_ij. The effects of the factors on x_ij are the products of the coefficients λ_j1, ..., λ_jm, which depend on the relationship between each factor and variable j (and which are the same for all the items in the sample), times the values of the m factors in the sampling item i, f_1i, ..., f_mi. Joining the equations for all of the observations, the data matrix X, (n × p), can be written as:

X = 1μ′ + FΛ′ + U

where 1 is an (n × 1) vector of ones, F is an (n × m) matrix which contains the m factors for the n items of the population, Λ′ is the transpose of the loading matrix, (m × p), whose constant coefficients relate the variables and the factors, and U is an (n × p) matrix of perturbations.

12.2.2 Properties

The loading matrix Λ contains the covariances between the factors and the observed variables. Note that the covariance matrix (p × m) between the variables and the factors is obtained by multiplying (12.1) by f′ on the right and taking expected values:

E[(x − μ)f′] = Λ E[ff′] + E[uf′] = Λ

since, by hypothesis, the factors are uncorrelated (E[ff′] = I), have zero mean, and are uncorrelated with the perturbations (E[uf′] = 0). This equation indicates that the terms λ_ij of the loading matrix Λ represent the covariances between the variable x_i and the factor f_j and, as the factors have unit variance, they are also the regression coefficients when we explain the observed variables by the factors. In the particular case in which the x variables are standardized, the terms λ_ij are also the correlations between the variables and the factors.

The covariance matrix of the observations verifies, according to (12.1):

V = E[(x − μ)(x − μ)′] = Λ E[ff′] Λ′ + E[uu′]

since E[fu′] = 0, the factors and the noise being uncorrelated. Thus, we obtain the fundamental property:

V = ΛΛ′ + ψ,      (12.2)


which establishes that the covariance matrix of the observed data can be decomposed as the sum of two matrices:

(1) The first, ΛΛ′, is a symmetric matrix of rank m < p. This matrix contains the part which is common to the set of variables and depends on the covariances between the variables and the factors.

(2) The second, ψ, is diagonal, and contains the specific part of each variable, which is independent of the rest.

This decomposition implies that the variances of the observed variables can be written as:

$$\sigma_i^2 = \sum_{j=1}^{m}\lambda_{ij}^2 + \psi_i^2, \qquad i = 1, \ldots, p,$$

where the first term is the sum of the effects of the factors and the second is the variance of the perturbation. Letting

$$h_i^2 = \sum_{j=1}^{m}\lambda_{ij}^2$$

be the sum of the factor effects, which we will call the communality, we get

σ²_i = h²_i + ψ²_i ,   i = 1, ..., p.      (12.3)

This equality can be interpreted as a decomposition of the variance into:

Observed variance = Common variance (communality) + Specific variance

which is analogous to the classical decomposition of variability into an explained part and an unexplained part which is carried out in the analysis of variance. In the factor model the explained part is due to the factors and the unexplained part is due to noise. This equation is the basis for the analysis which follows.

Example: Let us assume that we have three variables generated by two factors. The covariance matrix must verify

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13}\\ \sigma_{21} & \sigma_{22} & \sigma_{23}\\ \sigma_{31} & \sigma_{32} & \sigma_{33}\end{pmatrix} = \begin{pmatrix} \lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\\ \lambda_{31} & \lambda_{32}\end{pmatrix}\begin{pmatrix} \lambda_{11} & \lambda_{21} & \lambda_{31}\\ \lambda_{12} & \lambda_{22} & \lambda_{32}\end{pmatrix} + \begin{pmatrix} \psi_{11} & 0 & 0\\ 0 & \psi_{22} & 0\\ 0 & 0 & \psi_{33}\end{pmatrix}$$

This equality provides 6 different equations (remember that since V is symmetric it has only 6 different terms). The first is:

$$\sigma_{11} = \lambda_{11}^2 + \lambda_{12}^2 + \psi_{11}$$

We let h²₁ = λ²₁₁ + λ²₁₂ be the contribution of the two factors to variable 1. The six equations are:

$$\sigma_{ii} = h_i^2 + \psi_i^2, \qquad i = 1, 2, 3$$

$$\sigma_{ij} = \lambda_{i1}\lambda_{j1} + \lambda_{i2}\lambda_{j2}, \qquad i, j = 1, 2, 3,\; i \neq j$$
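The following short check, with illustrative (hypothetical) values of Λ and ψ, verifies these equations numerically:

    import numpy as np

    Lambda = np.array([[0.8, 0.3],
                       [0.6, 0.5],
                       [0.2, 0.7]])              # hypothetical loadings, p = 3, m = 2
    psi = np.diag([0.27, 0.39, 0.47])            # hypothetical specific variances

    V = Lambda @ Lambda.T + psi                  # implied covariance matrix
    h2 = np.sum(Lambda ** 2, axis=1)             # communalities h_i^2
    print(np.allclose(np.diag(V), h2 + np.diag(psi)))   # sigma_ii = h_i^2 + psi_ii
    print(V[0, 1], Lambda[0] @ Lambda[1])                # sigma_12 = l11*l21 + l12*l22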


12.2.3 Uniqueness of the model

In the factor model, neither the loading matrix, Λ, nor the factors, f, are observable. This poses a problem of indeterminacy: we will say that two representations (Λ, f) and (Λ*, f*) are equivalent if

Λf = Λ*f*

This situation leads to two types of indeterminacy:

(1) A set of data can be explained with the same accuracy with correlated or with uncorrelated factors.

(2) The factors are not determined uniquely.

Let us analyze these two indeterminacies. To show the first, note that if H is any nonsingular matrix, the representation (12.1) can be written as

x = μ + ΛHH⁻¹f + u      (12.4)

and, letting Λ* = ΛH be the new loading matrix and f* = H⁻¹f the new factors:

x = μ + Λ*f* + u,      (12.5)

where the new factors f* now have a distribution N(0, H⁻¹(H⁻¹)′) and, thus, they are correlated. Analogously, starting from correlated factors, f ∼ N(0, V_f), we can always find an equivalent expression of the variables using a model with uncorrelated factors. To show this, let A be a matrix such that V_f = AA′ (this matrix always exists if V_f is positive definite); then A⁻¹V_f(A⁻¹)′ = I, and writing

x = μ + (ΛA)(A⁻¹f) + u,

and taking Λ* = ΛA as the new coefficient matrix of the factors and f* = A⁻¹f as the new factors, the model is equivalent to another with uncorrelated factors. This indeterminacy is solved in the hypotheses of the model by always taking uncorrelated factors.

The second type of indeterminacy appears because, if H is orthogonal, the models x = μ + Λf + u and x = μ + (ΛH)(H′f) + u are indistinguishable. Both contain uncorrelated factors with an identity covariance matrix. In this sense, we say that the factor model is indeterminate with respect to rotations. This indeterminacy is solved by imposing restrictions on the components of the loading matrix, as we will see in the next section.

Example: We assume that x = (x₁, x₂, x₃)′ and the following factor model M1:

$$x = \begin{pmatrix} 1 & 1\\ 0 & 1\\ 1 & 0\end{pmatrix}\begin{pmatrix} f_1\\ f_2\end{pmatrix} + \begin{pmatrix} u_1\\ u_2\\ u_3\end{pmatrix}$$

with uncorrelated factors. We are going to write this model as another equivalent model with uncorrelated factors. Taking

$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix},$$

this matrix is orthogonal, since H⁻¹ = H′ = H. Thus

$$x = \begin{pmatrix} 1 & 1\\ 0 & 1\\ 1 & 0\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}\begin{pmatrix} f_1\\ f_2\end{pmatrix} + u.$$


Denoting this model as M2, it can also be written as:

$$x = \begin{pmatrix} \frac{2}{\sqrt{2}} & 0\\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\end{pmatrix}\begin{pmatrix} g_1\\ g_2\end{pmatrix} + u$$

and the new factors, g, are related to the previous ones, f, by:

$$\begin{pmatrix} g_1\\ g_2\end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}\begin{pmatrix} f_1\\ f_2\end{pmatrix}$$

and are thus a rotation of the initial factors. We can check that these new factors are also uncorrelated. Their covariance matrix is:

$$V_g = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix} V_f \begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}\frac{1}{\sqrt{2}}$$

and if V_f = I then V_g = I, from which it is deduced that the models M1 and M2 are indistinguishable.
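A quick numerical check of this equivalence (the common part ΛΛ′ is identical for M1 and M2, so both imply the same covariance for x) can be sketched as:

    import numpy as np

    L1 = np.array([[1.0, 1.0],
                   [0.0, 1.0],
                   [1.0, 0.0]])                 # loadings of M1
    H = np.array([[1.0, 1.0],
                  [1.0, -1.0]]) / np.sqrt(2)    # orthogonal rotation used above
    L2 = L1 @ H                                 # loadings of M2

    print(np.allclose(H @ H.T, np.eye(2)))      # H is orthogonal
    print(np.allclose(L1 @ L1.T, L2 @ L2.T))    # same common part Lambda Lambda'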

12.2.4 Normalization of the factor model

Since the factor model is indeterminate with respect to rotations, the matrix Λ is not identified. This implies that, even if we observe the whole population and μ and V are known, we cannot determine Λ uniquely. The solution is to impose restrictions on its terms. The two principal estimation methods which we will study next use one of the two following normalizations.

Criterion 1:

Requires:

Λ′(m×p) Λ(p×m) = D = diagonal      (12.6)

With this normalization, the vectors which define the effect of each factor on the p observed variables are orthogonal. In this way, besides being uncorrelated, the factors produce the most distinct effects on the variables. We are going to prove that this normalization defines a loading matrix uniquely. First, assume that we have a matrix Λ such that the product Λ′Λ is not diagonal. We transform the factors with Λ* = ΛH, where H is the matrix which contains the eigenvectors of Λ′Λ in its columns. Then:

Λ*′Λ* = H′Λ′ΛH      (12.7)

and since H diagonalizes Λ′Λ, the matrix Λ* verifies condition (12.6); we now see that this is the only matrix which does so. Suppose that we rotate this matrix and let Λ** = Λ*C, where C is orthogonal. Then the matrix Λ**′Λ** = C′Λ*′Λ*C will not be diagonal. Analogously, if we start with a matrix which verifies (12.6) and we rotate it, it will no longer verify this condition.


When this normalization is verified, postmultiplying equation (12.2) by Λ allows us to write

(V − ψ)Λ = ΛD,

which means that the columns of Λ are eigenvectors of the matrix V − ψ, which has the diagonal terms of D as its eigenvalues. This property is used in the estimation by the principal factor method.

Criterion 2:

Requires:

Λ′ψ⁻¹Λ = D = diagonal      (12.8)

With this normalization, the effects of the factors on the variables, weighted by the variances of the perturbations of each equation, become uncorrelated. As before, this normalization defines a loading matrix uniquely. To show this, assume that Λ′ψ⁻¹Λ is not diagonal, and transform with Λ* = ΛH. Then:

Λ*′ψ⁻¹Λ* = H′(Λ′ψ⁻¹Λ)H      (12.9)

and since Λ′ψ⁻¹Λ is a symmetric, non-negative definite matrix, it can always be diagonalized if we choose as H the matrix which contains the eigenvectors of Λ′ψ⁻¹Λ in its columns. Analogously, if (12.8) is verified from the beginning and we rotate the loading matrix, this condition is no longer verified. This is the normalization used in maximum likelihood estimation. Its justification is that in this way the factors are conditionally independent given the data, as is shown in Appendix 12.4.

With this normalization, postmultiplying equation (12.2) by ψ⁻¹Λ we get

Vψ⁻¹Λ − Λ = ΛD

and premultiplying by ψ⁻¹ᐟ², the result is:

ψ⁻¹ᐟ²Vψ⁻¹Λ − ψ⁻¹ᐟ²Λ = ψ⁻¹ᐟ²ΛD

which implies

ψ⁻¹ᐟ²Vψ⁻¹ᐟ² (ψ⁻¹ᐟ²Λ) = ψ⁻¹ᐟ²Λ (D + I),

and we conclude that the matrix ψ⁻¹ᐟ²Vψ⁻¹ᐟ² has eigenvectors ψ⁻¹ᐟ²Λ with eigenvalues given by the diagonal of D + I. This property is used in maximum likelihood estimation.

12.2.5 Maximum number of factors

If we replace the theoretical covariance matrix, V, in (12.2) with the sample matrix, S, the system will be identified if it is possible to solve it uniquely. For this there is a restriction on the number of possible factors. The number of equations which we obtain from (12.2) is equal to the number of distinct terms of S, which is p + p(p − 1)/2 = p(p + 1)/2. The number of unknowns on the right-hand side is pm, the coefficients of the matrix Λ, plus the p terms of the diagonal of ψ, minus the restrictions imposed in order to identify the matrix Λ. Assuming that Λ′ψ⁻¹Λ is diagonal, this imposes m(m − 1)/2 restrictions on the terms of Λ.

For the system to be determined there must be a number of equations equal to or greater than the number of unknowns. If there are fewer equations than unknowns it is impossible to find a single solution and the model is unidentified. If the number of equations is exactly equal to the number of unknowns there will be a single solution. If there are more equations than unknowns, we can solve the system using least squares, finding the values of the parameters which minimize the estimation errors. Therefore:

p + pm − m(m − 1)/2 ≤ p(p + 1)/2

which implies

p + m ≤ p² − 2pm + m²,

that is,

(p − m)² ≥ p + m.

The reader can prove that this inequality implies that, when p is not large (less than 10), the maximum number of factors is approximately less than half the number of variables minus one. For example, the maximum number of factors with 7 variables is 3.
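A small helper, shown as a sketch, computes this bound for a given number of variables p:

    # largest m satisfying the identification bound (p - m)^2 >= p + m derived above
    def max_factors(p):
        m = 0
        while (p - (m + 1)) ** 2 >= p + (m + 1):
            m += 1
        return m

    print([(p, max_factors(p)) for p in range(3, 11)])
    # for example, with p = 7 variables the maximum number of factors is 3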

12.3 THE PRINCIPAL FACTOR METHOD

The principal factor method is a method for estimating the loading matrix based on principal components. It avoids the need to solve the maximum likelihood equations, which are more complex. It has the advantage that the dimension of the system (the number of factors) can be identified approximately. Because of its simplicity it is used in many computer programs. Its basis is the following: suppose that we can obtain an initial estimate of the covariance matrix of the perturbations, ψ̂. Then we can write

S − ψ̂ = ΛΛ′,      (12.10)

and as S − ψ̂ is symmetric, it can always be decomposed as:

S − ψ̂ = HGH′ = (HG^{1/2})(HG^{1/2})′      (12.11)

where H is square of order p and orthogonal, and G is also of order p, diagonal, and contains the eigenvalues of S − ψ̂. The factor model establishes that G must be diagonal of the type:

$$G = \begin{pmatrix} G_1\,(m\times m) & O\,(m\times(p-m))\\ O\,((p-m)\times m) & O\,((p-m)\times(p-m))\end{pmatrix}$$

since S − ψ̂ has rank m. Thus, if we let H₁ be the p × m matrix which contains the eigenvectors associated with the non-null eigenvalues, those in G₁, we can build an estimator of Λ by the p × m matrix:

Λ̂ = H₁G₁^{1/2}      (12.12)


Note that the resulting normalization is:

Λ̂′Λ̂ = G₁^{1/2}H₁′H₁G₁^{1/2} = G₁ = diagonal      (12.13)

since the eigenvectors of symmetric matrices are orthogonal, so that H₁′H₁ = I_m. Therefore, with this method we obtain an estimate of the matrix Λ̂ with columns which are orthogonal to each other.

In practice, the estimation is carried out iteratively as follows (a sketch in code is given after the steps):

1. Start (i = 1) from an initial estimate Λ̂_i and compute ψ̂_i by using ψ̂_i = diag(S − Λ̂_iΛ̂_i′).

2. Calculate the square, symmetric matrix Q_i = S − ψ̂_i.

3. Obtain the spectral decomposition of Q_i so that

Q_i = H_{1i}G_{1i}H_{1i}′ + H_{2i}G_{2i}H_{2i}′

where G_{1i} contains the m greatest eigenvalues of Q_i and H_{1i} their eigenvectors. We choose m so that the remaining eigenvalues, contained in G_{2i}, are all small and of similar size. The matrix Q_i may not be positive definite and some of its eigenvalues can be negative. This is not a serious problem if these eigenvalues are small and we can assume them to be near zero.

4. Take Λ̂_{i+1} = H_{1i}G_{1i}^{1/2} and go back to step 1. Iterate until reaching convergence, that is, until ‖Λ̂_{i+1} − Λ̂_i‖ < ε.
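The following is a compact sketch of these four steps, working from a covariance (or correlation) matrix S; the initial ψ̂ uses the second option discussed in the next subsection, and the function name is illustrative:

    import numpy as np

    def principal_factor(S, m, eps=1e-4, max_iter=100):
        """Principal factor estimation of Lambda and psi from S, following steps 1-4."""
        psi = 1.0 / np.diag(np.linalg.inv(S))         # initial communality estimate
        Lam = np.zeros((S.shape[0], m))
        for _ in range(max_iter):
            Q = S - np.diag(psi)                      # step 2
            vals, vecs = np.linalg.eigh(Q)            # step 3 (eigenvalues ascending)
            vals, vecs = vals[::-1][:m], vecs[:, ::-1][:, :m]
            new_Lam = vecs * np.sqrt(np.clip(vals, 0, None))   # step 4
            if np.linalg.norm(new_Lam - Lam) < eps:
                Lam = new_Lam
                break
            Lam = new_Lam
            psi = np.diag(S - Lam @ Lam.T)            # step 1
        return Lam, psi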

The estimators obtained will be consistent but not efficient, as the maximum likelihood estimators are. Neither are they invariant to linear transformations, as the ML estimators are; that is, the same results are not necessarily obtained with the covariance matrix and with the correlation matrix.

To put this idea into practice, we have to specify how to obtain the initial estimate Λ̂₁ or ψ̂₁, a problem which is known as communality estimation.

12.3.1 Communality estimation

Estimating the terms ψ²_i is equivalent to defining values for the diagonal terms, h²_i, of ΛΛ′, since h²_i = s²_i − ψ̂²_i. The following alternatives are used:

1. Take ψ̂_i = 0. This is equivalent to extracting the principal components of S. It amounts to taking ĥ²_i = s²_i (in the case of correlations, ĥ²_i = 1), which is clearly its maximum value, so that we may begin with a significant bias.

2. Take ψ̂²_j = 1/s*_jj, where s*_jj is the j-th diagonal element of the precision matrix S⁻¹. According to Appendix 3.2, this is equivalent to taking ĥ²_j as:

ĥ²_j = s²_j − s²_j(1 − R²_j) = s²_jR²_j ,      (12.14)

where R²_j is the multiple correlation coefficient between x_j and the rest of the variables. Intuitively, the greater R²_j is, the greater the communality ĥ²_j. With this method we start with a downward-biased estimate of h²_i, since ĥ²_i ≤ h²_i. To show this, suppose, for example, that the true model for the variable x₁ is

$$x_1 = \sum_{j=1}^{m}\lambda_{1j} f_j + u_1 \qquad (12.15)$$

which is associated with the decomposition σ²₁ = h²₁ + ψ²₁. The proportion of explained variance is h²₁/σ²₁. If we write the regression equation

x₁ = b₂x₂ + ... + b_px_p + ε₁

and replace each variable with its expression in terms of the factors, we have:

$$x_1 = b_2\Big(\sum_j \lambda_{2j} f_j + u_2\Big) + \cdots + b_p\Big(\sum_j \lambda_{pj} f_j + u_p\Big) + \epsilon, \qquad (12.16)$$

which leads to a decomposition of the variance σ²₁ = ĥ²₁ + ψ̂²₁. Clearly ĥ²₁ ≤ h²₁, since in (12.16) we force the noises u₂, ..., u_p of the other equations to enter the prediction along with the factors, whereas in (12.15) only the factors appear. Moreover, it is possible that a factor affects x₁ but not the rest of the variables, so that it will not appear in equation (12.16). To summarize, the estimated communality in (12.16) is a lower bound for the true value of the communality.

Example: In this example we will show in detail the iterations of the principal factor algorithm for the ACCIONES data in Annex I. The covariance matrix of these data, in logarithms, is

$$S = \begin{pmatrix} 0.35 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix}$$

In order to estimate the loading matrix we carry out the steps of the principal factor algorithm described above. Before starting the algorithm we need to set the bound used to decide convergence. We make ε large, 0.05, so that the algorithm converges in few iterations despite the accumulated rounding errors.

Step 1. Taking the second alternative for the initial estimation of the communalities, ψ̂²_j = 1/s*_jj, where s*_jj is the j-th diagonal element of the matrix S⁻¹:

$$S^{-1} = \begin{pmatrix} 52.094 & -47.906 & 52.88\\ -47.906 & 52.094 & -47.12\\ 52.88 & -47.12 & 60.209\end{pmatrix}$$

$$\hat\psi_i = \begin{pmatrix} 1/52.094 & 0 & 0\\ 0 & 1/52.094 & 0\\ 0 & 0 & 1/60.209\end{pmatrix} = \begin{pmatrix} 0.019 & 0 & 0\\ 0 & 0.019 & 0\\ 0 & 0 & 0.017\end{pmatrix}$$


Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i:

$$Q_i = \begin{pmatrix} 0.13 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix} - \begin{pmatrix} 0.019 & 0 & 0\\ 0 & 0.019 & 0\\ 0 & 0 & 0.017\end{pmatrix} = \begin{pmatrix} 0.111 & 0.15 & -0.19\\ 0.15 & 0.111 & -0.03\\ -0.19 & -0.03 & 0.143\end{pmatrix}$$

Step 3. Spectral decomposition of Q_i and separation into the two terms H_{1i}G_{1i}H_{1i}′ and H_{2i}G_{2i}H_{2i}′. The eigenvalues of Q_i are 0.379, 0.094 and −0.108. We observe that one of them is negative, thus the matrix is not positive definite. Since there is an eigenvalue that is much larger than the rest, we will take a single factor. This implies the decomposition

$$\begin{pmatrix} 0.111 & 0.15 & -0.19\\ 0.15 & 0.111 & -0.03\\ -0.19 & -0.03 & 0.143\end{pmatrix} = \begin{pmatrix} -0.670\\ -0.442\\ 0.596\end{pmatrix}\times 0.379 \times \begin{pmatrix} -0.670\\ -0.442\\ 0.596\end{pmatrix}' + \begin{pmatrix} -0.036 & 0.741\\ -0.783 & -0.438\\ -0.621 & 0.508\end{pmatrix}\begin{pmatrix} 0.094 & 0\\ 0 & -0.108\end{pmatrix}\begin{pmatrix} -0.036 & 0.741\\ -0.783 & -0.438\\ -0.621 & 0.508\end{pmatrix}'$$

Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2}:

$$\hat\Lambda_{i+1} = \begin{pmatrix} -0.670\\ -0.442\\ 0.596\end{pmatrix}\times\sqrt{0.379} = \begin{pmatrix} -0.412\\ -0.272\\ 0.367\end{pmatrix}$$

This is the first estimate of the loading matrix. To improve this estimate we iterate, returning to Step 1.

Step 1. We estimate the terms of the diagonal of ψ̂_i using ψ̂_i = diag(S − Λ̂Λ̂′):

$$\hat\psi_i = \mathrm{diag}\left\{\begin{pmatrix} 0.13 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix} - \begin{pmatrix} -0.412\\ -0.272\\ 0.367\end{pmatrix}\begin{pmatrix} -0.412 & -0.272 & 0.367\end{pmatrix}\right\} = \begin{pmatrix} 0.180 & 0 & 0\\ 0 & 0.056 & 0\\ 0 & 0 & 0.0253\end{pmatrix}$$

Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i:

$$Q_i = \begin{pmatrix} 0.13 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix} - \begin{pmatrix} 0.180 & 0 & 0\\ 0 & 0.056 & 0\\ 0 & 0 & 0.0253\end{pmatrix} = \begin{pmatrix} -0.05 & 0.15 & -0.19\\ 0.15 & 0.074 & -0.03\\ -0.19 & -0.03 & 0.135\end{pmatrix}$$


Step 3. Spectral decomposition of Q_i = H_{1i}G_{1i}H_{1i}′ + H_{2i}G_{2i}H_{2i}′:

$$\begin{pmatrix} -0.05 & 0.15 & -0.19\\ 0.15 & 0.074 & -0.03\\ -0.19 & -0.03 & 0.135\end{pmatrix} = \begin{pmatrix} -0.559\\ -0.450\\ 0.696\end{pmatrix}\times 0.307 \times\begin{pmatrix} -0.559\\ -0.450\\ 0.696\end{pmatrix}' + \begin{pmatrix} 0.081 & 0.825\\ 0.806 & -0.385\\ 0.586 & 0.414\end{pmatrix}\begin{pmatrix} 0.067 & 0\\ 0 & -0.215\end{pmatrix}\begin{pmatrix} 0.081 & 0.825\\ 0.806 & -0.385\\ 0.586 & 0.414\end{pmatrix}'$$

Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2}:

$$\hat\Lambda_{i+1} = \begin{pmatrix} -0.559\\ -0.450\\ 0.696\end{pmatrix}\times\sqrt{0.307} = \begin{pmatrix} -0.310\\ -0.249\\ 0.386\end{pmatrix}$$

and check whether the convergence criterion, ‖Λ̂_{i+1} − Λ̂_i‖ < ε, is fulfilled:

$$\left\|\begin{pmatrix} -0.310\\ -0.249\\ 0.386\end{pmatrix} - \begin{pmatrix} -0.412\\ -0.272\\ 0.367\end{pmatrix}\right\| = 0.106 \geq \epsilon = 0.05,$$

so we return to Step 1 and continue until the criterion is fulfilled.

Step 1. We estimate again ψ̂_i = diag(S − Λ̂Λ̂′):

$$\hat\psi_i = \mathrm{diag}\left\{\begin{pmatrix} 0.35 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix} - \begin{pmatrix} -0.310\\ -0.249\\ 0.386\end{pmatrix}\begin{pmatrix} -0.310 & -0.249 & 0.386\end{pmatrix}\right\} = \begin{pmatrix} 0.254 & 0 & 0\\ 0 & 0.068 & 0\\ 0 & 0 & 0.011\end{pmatrix}$$

Step 2. We calculate the square, symmetric matrix Q_i = S − ψ̂_i:

$$Q_i = \begin{pmatrix} 0.13 & 0.15 & -0.19\\ 0.15 & 0.13 & -0.03\\ -0.19 & -0.03 & 0.16\end{pmatrix} - \begin{pmatrix} 0.254 & 0 & 0\\ 0 & 0.068 & 0\\ 0 & 0 & 0.011\end{pmatrix} = \begin{pmatrix} -0.124 & 0.15 & -0.19\\ 0.15 & 0.062 & -0.03\\ -0.19 & -0.03 & 0.149\end{pmatrix}$$

Step 3. Spectral decomposition of Q_i. We indicate only the first eigenvector and eigenvalue:

$$\begin{pmatrix} -0.124 & 0.15 & -0.19\\ 0.15 & 0.062 & -0.03\\ -0.19 & -0.03 & 0.149\end{pmatrix} = \begin{pmatrix} -0.499\\ -0.425\\ 0.755\end{pmatrix}\times 0.291\times\begin{pmatrix} -0.499\\ -0.425\\ 0.755\end{pmatrix}' + H_{2i}G_{2i}H_{2i}'$$


Step 4. We calculate Λ̂_{i+1} = H_{1i}G_{1i}^{1/2}:

$$\hat\Lambda_{i+1} = \begin{pmatrix} -0.499\\ -0.425\\ 0.755\end{pmatrix}\times\sqrt{0.291} = \begin{pmatrix} -0.269\\ -0.229\\ 0.407\end{pmatrix}$$

and check whether the convergence criterion ‖Λ̂_{i+1} − Λ̂_i‖ < ε is fulfilled:

$$\left\|\begin{pmatrix} -0.269\\ -0.229\\ 0.407\end{pmatrix} - \begin{pmatrix} -0.310\\ -0.249\\ 0.386\end{pmatrix}\right\| = 0.05, \text{ equal to } \epsilon = 0.05.$$

The convergence criterion is considered fulfilled and the model with the estimated parameters is:

$$x = \begin{pmatrix} -0.269\\ -0.229\\ 0.407\end{pmatrix} f_1 + \begin{pmatrix} u_1\\ u_2\\ u_3\end{pmatrix}, \qquad \begin{pmatrix} u_1\\ u_2\\ u_3\end{pmatrix} \sim N_3\left(\begin{pmatrix} 0\\ 0\\ 0\end{pmatrix}, \begin{pmatrix} 0.254 & 0 & 0\\ 0 & 0.068 & 0\\ 0 & 0 & 0.011\end{pmatrix}\right)$$

We see that the equation of the factor obtained is quite different from the first principal component that was obtained in exercise 5.1.

Example: For the INVEST database, a descriptive analysis was carried out in Chapter 4, in which a logarithmic transformation of all the variables and the elimination of the US data was proposed. Using this set of data, once it has been standardized, we are going to illustrate the calculation of a single factor via the principal factor method (in the following example 2 factors are considered). We are going to compare the two proposed methods for initializing the algorithm with the standardized data. In the first case we start the iterations with

ψ̂_j = 0  ⇒  ĥ²(0) = 1,

and the number of iterations needed before converging is 6. The stopping criterion at step k of the algorithm is, in this case, that the maximum difference between the communalities at k and k − 1 be less than 0.0001. The following table shows the estimates of the communalities for steps i = 0, 1, 2, 3, 6.

            h²(0)   h²(1)   h²(2)   h²(3)   h²(6)
INTER.A       1      0.96    0.96    0.96    0.96
INTER.B       1      0.79    0.76    0.75    0.75
AGRIC.        1      0.94    0.94    0.94    0.94
BIOLO.        1      0.92    0.91    0.91    0.91
MEDIC.        1      0.97    0.97    0.97    0.97
CHEMIS.       1      0.85    0.83    0.82    0.82
ENGIN.        1      0.90    0.88    0.88    0.88
PHYSIC.       1      0.94    0.93    0.93    0.93


The final result, once the algorithm has converged, is shown in the last column. If we begin the algorithm with the second method,

ψ̂_j = 1 − R²_j  ⇒  ĥ²(0) = R²_j,

the number of iterations before convergence is 5. The following table shows how the estimates of the communalities vary for steps i = 0, 1, 2, 3, 5.

            h²(0)   h²(1)   h²(2)   h²(3)   h²(5)
INTER.A      0.98    0.96    0.96    0.96    0.96
INTER.B      0.82    0.76    0.75    0.75    0.75
AGRIC.       0.95    0.94    0.94    0.94    0.94
BIOLO.       0.97    0.92    0.91    0.91    0.91
MEDIC.       0.98    0.97    0.97    0.97    0.97
CHEMIS.      0.85    0.82    0.82    0.82    0.82
ENGIN.       0.93    0.89    0.88    0.88    0.88
PHYSIC.      0.97    0.94    0.93    0.93    0.93

The figures in the last column show the final result once convergence of the algorithm has been reached. Having initialized the algorithm at a point nearer the final solution, the convergence was faster, and by the second iteration the result is already quite close to the final one. We can also see that the initial estimate of the communalities, h²(0), is an upper bound for the final estimate, h²(5). In the following table we present the estimate of Λ(0) from which we started in both methods, and the estimate of the final loadings obtained.

            ψ̂_j = 0    ψ̂_j = 1 − R²_j    Final
            Factor 1      Factor 1       Factor 1
INTER.A       0.97          0.97          0.98
INTER.B       0.89          0.87          0.87
AGRIC.        0.97          0.97          0.97
BIOLO.        0.96          0.96          0.95
MEDIC.        0.98          0.98          0.99
CHEMIS.       0.92          0.90          0.91
ENGIN.        0.94          0.94          0.94
PHYSIC.       0.96          0.97          0.97

The second method gives us a Λ(0) closer to the final result, especially for those variables where the specific variability is greater.

12.3.2 Generalizations

The principal factor method is a procedure for minimizing the function:

F = tr[(S − ΛΛ′ − ψ)²].      (12.17)


Note that this function can be written as

$$F = \sum_{i=1}^{p}\sum_{j=1}^{p} (s_{ij} - v_{ij})^2 \qquad (12.18)$$

where v_ij are the elements of the matrix V = ΛΛ′ + ψ. However, by the spectral decomposition, given a square, symmetric and non-negative definite matrix S, the best least squares approximation in the sense of (12.18) by a matrix of rank m, AA′, is obtained by taking A = HD^{1/2}, where H contains the eigenvectors and D^{1/2} the square roots of the eigenvalues of S (see Appendix 5.2), which is what the principal factor method does.

Harman (1976) developed the MINRES algorithm, which minimizes (12.17) more efficiently than the principal factor method, and Joreskog (1976) proposed ULS (unweighted least squares), which is based on taking the derivative in (12.17), obtaining Λ̂ as a function of ψ and then minimizing the resulting function with a Newton-Raphson type nonlinear algorithm.

Example: With the INVEST data, used in the previous example, we present the factor analysis for two factors performed with a computer program using the principal factor method. Table 12.1 indicates the variability of both factors. The second factor explains little variability (2%) but has been included because of its clear interpretation.

                          Factor 1   Factor 2
Variability                 7.18       0.17
Proportion of variance      0.89       0.02
Cumulative proportion       0.89       0.91

Table 12.1: Variability explained by the first two factors estimated using the principal factor method.

The principal factor algorithm begins with ψ̂_j = 1 − R²_j, and 14 iterations were performed before converging to the loadings presented in Table 12.2.

            Factor 1   Factor 2   ψ²_i
INTER.A       0.97      -0.06     0.04
INTER.B       0.87       0.16     0.22
AGRIC.        0.97      -0.03     0.06
BIOLO.        0.95      -0.24     0.02
MEDIC.        0.99      -0.10     0.02
CHEMIS.       0.91      -0.09     0.17
ENGIN.        0.94       0.21     0.06
PHYSIC.       0.97       0.17     0.03

Table 12.2: Loading matrix of the factors and communalities

The first factor is the sum of the publications in all the databases, and it gives an idea of their volume. According to this factor the countries would be ordered according to their scientific output. The second factor contrasts biomedical research with technological research. This second component separates Japan and the UK, countries with heavy scientific output.

Figure 12.1 shows a graph of the distribution of the countries over these two factors. The reader should compare these results with those obtained in Chapter 5 (exercises 5.6 and 5.10) with principal components.

Figure 12.1: Representation of the countries on the plane formed by the first two factors.

12.4 MAXIMUM LIKELIHOOD ESTIMATION

12.4.1 ML estimation of parameters

Direct approach

The parameter matrices can be estimated formally by maximum likelihood. The density function of the original observations is N_p(μ, V); therefore the likelihood is the one we saw in Chapter 10. Replacing μ with its estimator, x̄, the support function for V is:

$$L(V|X) = -\frac{n}{2}\log|V| - \frac{n}{2}\,\mathrm{tr}\big(SV^{-1}\big), \qquad (12.19)$$

and replacing V with (12.2), the support function of Λ and ψ is:

$$L(\Lambda,\psi) = -\frac{n}{2}\Big(\log|\Lambda\Lambda' + \psi| + \mathrm{tr}\big(S(\Lambda\Lambda' + \psi)^{-1}\big)\Big). \qquad (12.20)$$


The maximum likelihood estimators are obtained by maximizing (12.20) with respect to the matrices Λ and ψ. Taking derivatives with respect to these matrices and after certain algebraic manipulations, which are shown in Appendix 12.1 (see Anderson, 1984, pp. 557-562, or Lawley and Maxwell, 1971), the following equations are obtained:

ψ̂ = diag(S − Λ̂Λ̂′)      (12.21)

(ψ̂⁻¹ᐟ²(S − ψ̂)ψ̂⁻¹ᐟ²)(ψ̂⁻¹ᐟ²Λ̂) = (ψ̂⁻¹ᐟ²Λ̂)D      (12.22)

where D is the normalization matrix

Λ̂′ψ̂⁻¹Λ̂ = D = diagonal.      (12.23)

These three equations allow us to solve the system by a Newton-Raphson type iterative algorithm. The numerical solution is difficult to find because there may not be a solution in which ψ̂ is positive definite, and it is then necessary to turn to estimation with restrictions. We observe that (12.22) is an eigenvalue equation: it tells us that ψ̂⁻¹ᐟ²Λ̂ contains the eigenvectors of the symmetric matrix ψ̂⁻¹ᐟ²(S − ψ̂)ψ̂⁻¹ᐟ² and that D contains the eigenvalues.

The iterative algorithm for solving these equations is as follows (a sketch in code is given after the steps):

1. Start with an initial estimate. If we have an estimate Λ̂_i (i = 1 the first time), from the principal factor method for example, the matrix ψ̂_i is calculated using ψ̂_i = diag(S − Λ̂_iΛ̂_i′). Alternatively, we can estimate the matrix ψ̂_i directly, as in the principal factor method.

2. The square, symmetric matrix A_i is calculated as A_i = ψ̂_i⁻¹ᐟ²(S − ψ̂_i)ψ̂_i⁻¹ᐟ² = ψ̂_i⁻¹ᐟ²Sψ̂_i⁻¹ᐟ² − I. This matrix weights the terms of S by their importance in terms of the specific components.

3. The spectral decomposition of A_i is obtained, such that

A_i = H_{1i}G_{1i}H_{1i}′ + H_{2i}G_{2i}H_{2i}′

where the m greatest eigenvalues of A_i are in the (m × m) diagonal matrix G_{1i}, the p − m smallest are in G_{2i}, and H_{1i} and H_{2i} contain the corresponding eigenvectors.

4. Take Λ̂_{i+1} = ψ̂_i^{1/2}H_{1i}G_{1i}^{1/2} and substitute it in the likelihood function, which is then maximized with respect to ψ. This part is easy to do with a nonlinear optimization algorithm. With this result, go back to step 2, iterating until convergence is reached.
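A sketch of this iteration is given below; the inner maximization over ψ in step 4 is carried out numerically with scipy's general-purpose optimizer, which is one possible choice rather than the method used by any particular program, and the function names are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    def neg_support(lmbda, psi, S):
        """Minus (2/n) times the support (12.20), dropping constants."""
        V = lmbda @ lmbda.T + np.diag(psi)
        sign, logdet = np.linalg.slogdet(V)
        return logdet + np.trace(S @ np.linalg.solve(V, np.eye(len(S))))

    def lambda_from_psi(psi, S, m):
        """Steps 2-4: eigen-decomposition of A = psi^{-1/2}(S - psi)psi^{-1/2}."""
        d = 1.0 / np.sqrt(psi)
        A = d[:, None] * (S - np.diag(psi)) * d[None, :]
        vals, vecs = np.linalg.eigh(A)          # ascending order
        vals, vecs = vals[::-1][:m], vecs[:, ::-1][:, :m]
        vals = np.clip(vals, 0, None)           # guard against negative eigenvalues
        return np.sqrt(psi)[:, None] * vecs * np.sqrt(vals)

    def ml_factor(S, m, n_iter=200, tol=1e-6):
        psi = 1.0 / np.diag(np.linalg.inv(S))   # start as in the principal factor method
        for _ in range(n_iter):
            lmbda = lambda_from_psi(psi, S, m)  # loading update (step 4)
            # step 4 (continued): maximize the likelihood over psi with Lambda fixed
            res = minimize(lambda lp: neg_support(lmbda, np.exp(lp), S),
                           np.log(psi), method="L-BFGS-B")
            new_psi = np.exp(res.x)
            if np.max(np.abs(new_psi - psi)) < tol:
                psi = new_psi
                break
            psi = new_psi
        return lambda_from_psi(psi, S, m), psi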

It is possible for this algorithm to converge at a maximum point where some of the terms of the matrix ψ are negative. This inadequate solution is sometimes called a Heywood solution. Existing programs replace these values with positive numbers and attempt to find another maximum point, although the algorithm does not always converge.


Appendix 12.1 shows the proof that the ML estimation is invariant to linear transformations of the variables. Therefore, the result of the estimation does not depend, as it does with principal components, on whether the covariance or the correlation matrix is used. An additional advantage of maximum likelihood is that we can obtain asymptotic variances of the estimators using the information matrix at the optimum.

We see that when the diagonal terms of the matrix ψ̂ are approximately equal, the ML estimation will lead to results similar to those of the principal factor method. Note that, substituting ψ̂ = kI in the ML estimator equations, both methods use the same normalization and equation (12.22) is analogous to (12.11), which is solved in the principal factor method.

The EM algorithm

An alternative procedure for maximizing the likelihood is to consider the factors as missing values and to apply the EM algorithm. The joint likelihood function of the data and the factors can be written as f(x₁, ..., x_n, f₁, ..., f_n) = f(x₁, ..., x_n|f₁, ..., f_n) f(f₁, ..., f_n). The support for the whole sample is

$$L(\psi,\Lambda|X,F) = -\frac{n}{2}\log|\psi| - \frac{1}{2}\sum_i (x_i - \Lambda f_i)'\psi^{-1}(x_i - \Lambda f_i) - \frac{1}{2}\sum_i f_i'f_i, \qquad (12.24)$$

where we assume that the mean of the variables x_i is zero, which is equivalent to substituting the mean by its sample estimator. We observe that, given the factors, the estimation of Λ could be done as a regression. On the other hand, given the parameters, we could estimate the factors, as we will see in section 12.7. In order to apply the EM algorithm we need:

(1) M step: maximize the complete likelihood with respect to Λ and ψ, assuming that the values f_i of the factors are known. This is easy to do, since the rows of Λ are obtained by regressions between each variable and the factors, and the diagonal elements of ψ are the residual variances of these regressions.

(2) E step: calculate the expectation of the complete likelihood with respect to the distribution of the f_i given the data and the current values of the parameters. Developing (12.24), it can be shown that the expectations which appear in the likelihood are the covariance matrix of the factors and the covariance matrix between the factors and the data. The details of this estimation can be seen in Bartholomew and Knott (1999, p. 49). (A sketch of both steps in code is given below.)
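The following sketch implements one version of this EM iteration working from the sample covariance matrix and assuming centred data; the conditional moments of the factors used in the E step follow from the regression formulas of section 12.7, and the names are illustrative:

    import numpy as np

    def em_factor(S, m, n_iter=500, tol=1e-8):
        """EM sketch for the factor model, using only the sample covariance S."""
        p = S.shape[0]
        rng = np.random.default_rng(0)
        lmbda = rng.normal(scale=0.1, size=(p, m))
        psi = np.diag(S).copy()
        for _ in range(n_iter):
            V = lmbda @ lmbda.T + np.diag(psi)
            B = np.linalg.solve(V, lmbda).T              # Lambda' V^{-1}, (m x p)
            Cxf = S @ B.T                                # (1/n) sum x_i E[f_i|x_i]'
            Cff = np.eye(m) - B @ lmbda + B @ S @ B.T    # (1/n) sum E[f_i f_i'|x_i]
            new_lmbda = Cxf @ np.linalg.inv(Cff)         # M step: regression update of Lambda
            new_psi = np.diag(S - new_lmbda @ Cxf.T)     # M step: residual variances
            if np.max(np.abs(new_lmbda - lmbda)) < tol:
                lmbda, psi = new_lmbda, new_psi
                break
            lmbda, psi = new_lmbda, new_psi
        return lmbda, psi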

12.4.2 Other estimation methods

Because the maximum likelihood method is complicated, other approximate methods have been proposed which yield estimators with similar asymptotic properties but simpler calculations. One of these is generalized least squares, which we present next. To justify it, we observe that the ML estimation can be reinterpreted as follows: if no restrictions exist on V, the ML estimator of this matrix is S, and substituting this estimate in (12.19) the support function at the maximum is:

−(n/2) log|S| − (n/2) p.


Maximizing the support function is equivalent to minimizing with respect to V the discrepancy function obtained by subtracting the support from the previous maximum value of (12.19). The function obtained from this difference is:

F = (n/2)[tr(SV⁻¹) − p − log|SV⁻¹|]

which indicates that we want to make V as close to S as possible, measuring the distance between both matrices by the trace and the determinant of the product SV⁻¹. We observe that, since V is estimated with restrictions, |SV⁻¹| ≤ 1 and the logarithm will be negative or null. If we concentrate on the first two terms and drop the determinant, the function to be minimized is:

F₁ = tr(SV⁻¹) − p = tr(SV⁻¹ − I) = tr[(S − V)V⁻¹]

which computes the differences between the observed matrix S and the estimated V, but gives each difference a weight which depends on the size of V⁻¹. This leads to the idea of generalized least squares (GLS), where we minimize

tr{[(S − V)V⁻¹]²}

and it can be proved that if we iterate the GLS procedure we obtain asymptotically efficient estimators.

Example: We are going to illustrate the ML estimation for the INVEST data. Assuming two factors, we obtain the results shown in the following tables:

                          Factor 1   Factor 2
Variability                 6.80       0.53
Proportion of variance      0.85       0.06
Cumulative proportion       0.85       0.91

Table 12.3: Variability explained by the first two factors estimated using maximum likelihood.

            Factor 1   Factor 2   ψ²_i
INTER.A       0.95       0.25     0.02
INTER.B       0.85       0.08     0.26
AGRIC.        0.92       0.26     0.07
BIOLO.        0.88       0.45     0.01
MEDIC.        0.93       0.30     0.02
CHEMIS.       0.86       0.29     0.17
ENGIN.        0.95       0.05     0.09
PHYSIC.       1.00       0.00     0.00

Table 12.4: Loading matrix of the factors


If we compare these results with those obtained using the principal factor method (exercise 12.5), we see that the first factor is similar, although it increases the weight of physics and there are relative differences between the weights of the variables. The second factor shows more changes, but its interpretation is similar as well. The variances of the specific components display few changes between the two approaches. Figures 12.2 and 12.3 show the weights and the projection of the data onto the plane of the factors.

Figure 12.2: Weights of the INVEST variables in the two factors using ML estimation.

12.5 DETERMINING THE NUMBER OF FACTORS

12.5.1 Likelihood test

Suppose that a model with m factors has been estimated. The test of whether the decomposition holds can be formulated as a likelihood ratio test:

H₀ : V = ΛΛ′ + ψ
H₁ : V ≠ ΛΛ′ + ψ.

This test is similar to the partial sphericity test which we studied in Chapter 10, although there are differences due to the fact that we do not require the specific components to have equal variances. Let V̂₀ be the value of the covariance matrix of the data estimated under H₀. Then the likelihood ratio statistic is:

λ = 2 (ln(H₁) − ln(H₀))


Figure 12.3: Projection of the countries onto the plane of the two factors using ML estimation.

By (12.19), the likelihood function under H₀ is:

ln(H₀) = −(n/2) log|V̂₀| − (n/2) tr(SV̂₀⁻¹)      (12.25)

whereas under H₁ the estimator of V is S and we have:

ln(H₁) = −(n/2) log|S| − (n/2) tr(SS⁻¹) = −(n/2) log|S| − np/2,      (12.26)

so the likelihood ratio can be written, using these two expressions, as:

λ = n (log|V̂₀| + tr(SV̂₀⁻¹) − log|S| − p).

It can be proved (Appendix 12.1) that the ML estimator of V under H₀, V̂₀, minimizes the distance to S measured by the trace, that is:

tr(SV̂₀⁻¹) = p      (12.27)

therefore, the likelihood ratio is

λ = n log(|V̂₀|/|S|)      (12.28)

and thus measures the distance between V̂₀ and S in terms of the determinant, −n log|SV̂₀⁻¹|, which is the second term of the likelihood.


The test rejects H₀ when λ is greater than the 1 − α percentile of a χ²_g distribution with g degrees of freedom, given by g = dim(H₁) − dim(H₀). The dimension of the parameter space of H₁ is p + p(p − 1)/2 = p(p + 1)/2, equal to the number of distinct elements of V. The dimension of H₀ is pm, corresponding to the matrix Λ, plus the p elements of ψ, minus the m(m − 1)/2 restrictions resulting from the condition that Λ′ψ⁻¹Λ must be diagonal. Therefore:

g = p + p(p − 1)/2 − pm − p + m(m − 1)/2 = (1/2)((p − m)² − (p + m))      (12.29)

Bartlett (1954) showed that the asymptotic approximation to the χ² distribution improves in finite samples by introducing a correction factor. With this modification, the test rejects H₀ if

$$\left(n - 1 - \frac{2p + 4m + 5}{6}\right)\,\ln\frac{|\hat\Lambda\hat\Lambda' + \hat\psi|}{|S|} > \chi^2_{[(p-m)^2-(p+m)]/2}(1-\alpha) \qquad (12.30)$$

Generally, this test is applied sequentially: the model is estimated with a small value, m = m₁ (which can be m₁ = 1), and H₀ is tested. If it is rejected, we re-estimate with m = m₁ + 1, continuing until H₀ is accepted.

An alternative procedure, proposed by Joreskog (1993), which works better against moderate deviations from normality, is the following: calculate the statistic (12.30) for m = 1, ..., m_max. Let X²₁, ..., X²_{m_max} be their values and g₁, ..., g_{m_max} their degrees of freedom. We calculate the differences X²_m − X²_{m+1} and treat these differences as values of a χ² with g_m − g_{m+1} degrees of freedom. If the value obtained is significant, we increase the number of factors and proceed in this way until we cannot find a significant improvement in the fit of the model. (A sketch of these calculations in code is given below.)
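These calculations are easy to script. The sketch below, with hypothetical inputs (a dictionary of fitted models by number of factors), computes the statistic (12.28), its Bartlett correction (12.30), the degrees of freedom (12.29), and the AIC*/BIC differences defined in the next subsection:

    import numpy as np
    from scipy.stats import chi2

    def factor_number_tests(S, n, fitted):
        """fitted: dict {m: (Lambda, psi_diag)} of models estimated by ML."""
        p = S.shape[0]
        out = {}
        for m, (lam, psi) in fitted.items():
            V0 = lam @ lam.T + np.diag(psi)
            lam_stat = n * (np.linalg.slogdet(V0)[1] - np.linalg.slogdet(S)[1])   # (12.28)
            g = ((p - m) ** 2 - (p + m)) // 2                                     # (12.29)
            corrected = lam_stat * (n - 1 - (2 * p + 4 * m + 5) / 6) / n          # (12.30)
            out[m] = {"lambda": lam_stat,
                      "corrected": corrected,
                      "df": g,
                      "p_value": chi2.sf(corrected, g),
                      "AIC*": lam_stat - 2 * g,
                      "BIC": lam_stat - g * np.log(n)}
        return out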

The test (12.28) allows for an interesting interpretation. The factor model establishes that the difference between the covariance matrix, S, (p × p), and a diagonal matrix of rank p, ψ, is approximately a symmetric matrix of rank m, ΛΛ′, that is:

S − ψ̂ ≈ Λ̂Λ̂′.

Premultiplying and postmultiplying by ψ̂⁻¹ᐟ² we get that the matrix A, given by:

A = ψ̂⁻¹ᐟ²Sψ̂⁻¹ᐟ² − I,      (12.31)

must be asymptotically equal to the matrix:

B = ψ̂⁻¹ᐟ²Λ̂Λ̂′ψ̂⁻¹ᐟ²,      (12.32)

and therefore have rank m asymptotically, instead of rank p. It is shown in Appendix 12.2 that the test (12.28) is equivalent to checking whether the matrix A has rank m, which must be asymptotically true by (12.32), and the test statistic (12.28) can be written as

$$\lambda = -n\sum_{i=m+1}^{p}\log(1 + d_i) \qquad (12.33)$$


where d_i are the p − m smallest eigenvalues of the matrix A. The null hypothesis is rejected if λ is too large compared to the χ² distribution with (1/2)((p − m)² − p − m) degrees of freedom. In Appendix 12.2 it is shown that this test is a particular case of the likelihood test of partial sphericity of a matrix that we presented in section 10.6.

When the sample size is large and m is small compared to p, if the data do not follow a multivariate normal distribution the test generally leads to a rejection of H₀. This is a frequent problem in hypothesis testing with large samples, where we tend to reject H₀. Therefore, when it comes to deciding on the number of factors, it is necessary to differentiate between practical significance and statistical significance, just as with any hypothesis test. This test is very sensitive to deviations from normality, so in practice the statistic (12.28) is used more as a measure of the fit of the model than as a formal test.

12.5.2 Selection criteria

An alternative to selecting the number of factors by a test is to consider the problem as one of model choice. We then estimate the factor model for different numbers of factors, calculate the support function at its maximum for each model and, applying the Akaike criterion, choose the model for which

AIC(m) = −2L(H₀, m) + 2ν_m

is minimum. In this expression L(H₀, m) is the support function of the model with m factors evaluated at the ML estimators, given by (12.25), and ν_m is the number of parameters in the model. We observe that this expression takes into account that by increasing m the likelihood L(H₀, m) increases, or the deviance −2L(H₀, m) decreases, but this effect is balanced by the number of parameters, which penalize the above expression. Using the likelihood equation (12.20) and the condition (12.27), since ν_m = mp + p − m(m − 1)/2, we have

AIC(m) = n log|Λ̂Λ̂′ + ψ̂| + np + 2[mp + p − m(m − 1)/2].

The AIC criterion can be described as minimizing the differences AIC(m) − AIC(H₁), where from all the models we subtract the same quantity, AIC(H₁), the value of AIC for the model which assumes no factor structure and estimates the covariance matrix without restrictions. Then the function to minimize is

AIC*(m) = 2(L(H₁) − L(H₀, m)) − 2g = λ(m) − 2g

where λ(m) is the difference of supports (12.28), in which V̂₀ is estimated with m factors, and g is the number of degrees of freedom given by (12.29).

An alternative criterion is the BIC, which we saw in Chapter 11. With this criterion, instead of penalizing the number of parameters with 2 we do it with log n. This criterion, applied to the factor model using the differences of the supports, is:

BIC(m) = λ(m) − g log n


Example: We apply the maximum likelihood method to the INVEST data to perform a test on the number of factors. If we base the test on equation (12.28) we obtain the following table:

m      λ       g_m    p-value    AIC      BIC
1      31.1    20     0.053      -8.9     -29.79
2      11.73   13     0.55       -14.27   -27.84
3      6.49    7      0.484      -7.51    -14.82
4      5.27    2      0.072      1.27     -0.73

For example, if m = 1 the number of degrees of freedom is (1/2)((7)² − 9) = 20. We see that for α = 0.05 we cannot reject the null hypothesis that one factor is sufficient. Nevertheless, the Akaike criterion indicates that the minimum is obtained with two factors, and the BIC criterion confirms, with little difference, the choice of one factor.

Since the p-value of the test is at the limit, we are going to compare this test with the procedure proposed by Joreskog. The first step is to use the correction proposed by Bartlett, which we carry out by multiplying the statistics χ²_m by (n − 1 − (2p + 4m + 5)/6)/n. For example, the corrected statistic for m = 1 is

X²₁ = ((20 − 1 − (2·8 + 4·1 + 5)/6)/20) × 31.1 = 23.06

and the following table shows the results.

m      X²_m     X²_m − X²_{m+1}    g_m − g_{m+1}    p-value
1      23.06    14.76              7                0.039
2      8.30     3.92               6                0.687
3      4.38     1.00               5                0.962
4      3.38

This method indicates that we reject the hypothesis of one factor, but we cannot reject the hypothesis of two factors; thus we conclude that the number of factors chosen with Joreskog's method is equal to two. As we see, in this example Joreskog's criterion coincides with Akaike's.

the factor model estimated by maximum likelihood. The data have been transformed intologarithms to improve the asymmetry, the same as was done in the Principal Componentsanalysis presented in examples 4.2 and 4.3. For this analysis we have also standardized theobservations.We accept the test of a single factor given that the p-value is 0.242. The estimation of

the weights of this factor is approximately a weighting with less weight given to the sectionsof food, clothing and footwear, as shown in Table 12.5

            X1     X2     X3     X4     X5     X6     X7     X8     X9
Factor 1   0.61   0.64   0.86   0.88   0.82   0.84   0.93   0.89   0.72

Table 12.5: Loading vector of the factor


12.6 FACTOR ROTATION

As we saw in section 12.2.3, the loading matrix is only identified up to multiplication by orthogonal matrices, which are equivalent to rotations. In factor analysis the space spanned by the columns of the loading matrix is defined, but any basis of this space can be a solution. In order to choose from among the possible solutions, the interpretation of the factors is taken into account. Intuitively, it is easier to interpret a factor when it is associated with a block of observed variables. This occurs if the columns of the loading matrix, which represent the effect of each factor on the observed variables, contain high values for certain variables and small ones for the others. This idea can be made precise in different ways, which give rise to different criteria for defining the rotation. The coefficients of the orthogonal matrix which defines the rotation are obtained by optimizing an objective function which expresses the simplicity of the representation we want to obtain as a result of the rotation. The most frequently used criterion is the varimax criterion.

Varimax Criterion

The interpretation of the factors is made easier if those which affect some variables do not affect others, and vice versa. This objective leads to the criterion of maximizing the variance of the coefficients which define the effects of each factor on the observed variables. In order to specify this criterion, let δ_ij be the coefficients of the loading matrix associated with factor j in the i = 1, ..., p equations after the rotation, and let δ_j be the vector which is column j of the loading matrix after the rotation. We want the variance of the squared coefficients of this vector to be maximum. The coefficients are squared in order to eliminate the signs, since we are interested in their absolute values. Letting δ̄_{.j} = Σ_i δ²_ij / p be the mean of the squares of the components of the vector δ_j, the variability for factor j is:

$$\frac{1}{p}\sum_{i=1}^{p}(\delta_{ij}^2 - \bar\delta_{.j})^2 = \frac{1}{p}\sum_{i=1}^{p}\delta_{ij}^4 - \frac{1}{p^2}\Big(\sum_{i=1}^{p}\delta_{ij}^2\Big)^2, \qquad (12.34)$$

and the criterion is to maximize the sum of the variances over all of the factors, given by:

$$q = \frac{1}{p}\sum_{j=1}^{m}\sum_{i=1}^{p}\delta_{ij}^4 - \frac{1}{p^2}\sum_{j=1}^{m}\Big(\sum_{i=1}^{p}\delta_{ij}^2\Big)^2. \qquad (12.35)$$

Let Λ be the loading matrix estimated initially. The problem is to find an orthogonal matrix M such that the matrix δ given by

δ = ΛM,

whose coefficients δ_ij are given by

δ_ij = λ_i′ m_j,

where λ_i′ is row i of the matrix Λ and m_j is column j of the matrix M we are looking for, verifies the condition that these coefficients maximize (12.35). The terms of the matrix M are derived from (12.35) for each of the terms m_ij, taking into account the orthogonality restrictions m_i′m_i = 1 and m_i′m_j = 0 (i ≠ j). The result is the varimax rotation.
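A common way to compute this rotation in practice is the iterative singular-value-decomposition scheme sketched below; it maximizes the raw varimax criterion and is one standard implementation, not necessarily the one used by any particular program:

    import numpy as np

    def varimax(Lambda, gamma=1.0, n_iter=100, tol=1e-8):
        """Return the rotated loadings Lambda @ M and the orthogonal matrix M."""
        p, m = Lambda.shape
        M = np.eye(m)
        d_old = 0.0
        for _ in range(n_iter):
            L = Lambda @ M
            # gradient of the varimax criterion with respect to the rotation
            B = Lambda.T @ (L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0)))
            U, s, Vt = np.linalg.svd(B)
            M = U @ Vt
            d_new = np.sum(s)
            if d_new < d_old * (1 + tol):
                break
            d_old = d_new
        return Lambda @ M, M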

Example: If we apply a varimax rotation to the ML estimation of the INVEST data from example 7.6 we obtain the result shown in Figure 12.4. This new loading matrix is the result of multiplying the loading matrix obtained through the ML estimation, shown in example 12.6, by the orthogonal matrix M that defines the rotation:

$$\delta = \Lambda M = \begin{pmatrix} 0.95 & 0.25\\ 0.85 & 0.08\\ 0.92 & 0.26\\ 0.88 & 0.45\\ 0.93 & 0.30\\ 0.86 & 0.29\\ 0.95 & 0.05\\ 1 & 0\end{pmatrix}\begin{pmatrix} 0.53 & 0.85\\ 0.85 & -0.53\end{pmatrix} = \begin{pmatrix} 0.71 & 0.67\\ 0.52 & 0.68\\ 0.71 & 0.64\\ 0.85 & 0.51\\ 0.75 & 0.63\\ 0.70 & 0.58\\ 0.55 & 0.78\\ 0.53 & 0.85\end{pmatrix}$$

Figure 12.4: The result of applying a varimax rotation to the INVEST factors.

Oblique rotations

The factor model is indeterminate not only with respect to orthogonal rotations but to oblique rotations as well. In fact, as we saw in section 6.1, the model can be established with correlated or with uncorrelated factors. The solution obtained from the estimation of Λ always corresponds to uncorrelated factors, but we can ask whether there is a solution with correlated factors which has a more interesting interpretation. Mathematically this implies the definition of new factors f* = Hf, where H is a nonsingular matrix which can, in general, be interpreted as an oblique rotation. The new covariance matrix of the factors is V*_f = HH′.

There are various procedures for obtaining oblique rotations, such as Quartimin, Oblimax, Promax, etc.; more information on them can be found in the specialized literature. The problem with oblique rotations is that the factors are correlated and thus cannot be interpreted independently.

12.7 FACTOR ESTIMATION

In many problems the interest of factor analysis lies in determining the loading matrix, not in the particular values of the factors for the elements of the sample. Nevertheless, in other cases we want to obtain the values of the factors for the observed elements. There are two procedures for estimating the factors: the first, introduced by Bartlett, supposes that the vector of the values of the factors for each observation is a parameter to be estimated; the second supposes that this vector is a random variable. Next we briefly review both procedures.

12.7.1 The factors as parameters

The (p × 1) vector of the values of the variables for individual i, x_i, has a normal distribution with mean Λf_i, where f_i is the (m × 1) vector of the factors for element i of the sample, and covariance matrix ψ:

x_i ∼ N_p(Λf_i, ψ).

The parameters f_i can be estimated by maximum likelihood, as shown in Appendix 12.3. The resulting estimator is the generalized least squares one, given by

f̂_i = (Λ̂′ψ̂⁻¹Λ̂)⁻¹Λ̂′ψ̂⁻¹x_i ,      (12.36)

which has a clearly intuitive interpretation: if we know Λ, the factor model

x_i = Λf_i + u_i

is a regression model with dependent variable x_i, explanatory variables in the columns of Λ and parameters f_i. Since the perturbation, u_i, is not distributed as N(0, I) but as N(0, ψ), we have to use generalized least squares, which leads to (12.36).

12.7.2 The factors as random variables

The second method is to assume that the factors are random variables, and to look for a linear predictor which minimizes the mean square error of prediction. As before, let f_i be the values of the factors for individual i and x_i the vector of observed variables; the vector (f_i, x_i) will have a multivariate normal distribution and the objective is to find E[f_i|x_i]. Using the results from section 8.5.1 we get:

E[f_i|x_i] = E[f_i] + Cov(f_i, x_i) Var(x_i)⁻¹ (x_i − E(x_i)).

Since E[f_i] = 0 and the covariances between the factors and the variables are the terms of the loading matrix, we can write, assuming the variables have zero mean and that the parameters are known:

f̂_i = E[f_i|x_i] = Λ′V⁻¹x_i      (12.37)

which is the linear regression predictor of the factors on the data. In fact, Λ′ contains the covariances between factors and variables and V the covariances between the variables. This equation can also be written (see Appendix 12.3) as

f̂_i = (I + Λ′ψ⁻¹Λ)⁻¹Λ′ψ⁻¹x_i .      (12.38)

Comparing (12.36) and (12.38) we see that the latter method can be interpreted as a ridge regression: it amounts to adding the identity to the matrix Λ′ψ⁻¹Λ before inverting. This estimator also has an interesting Bayesian interpretation, which is given in Appendix 12.4.

If in equations (12.36) and (12.38) we substitute the estimated values in place of the theoretical values, we obtain a vector f̂_i which represents the estimate of the values of the m factors for individual i. By successively applying these equations to the n sample observations, x₁, ..., x_n, we obtain the values of the factors for the n individuals, f̂₁, ..., f̂_n, where each f̂_i is an (m × 1) vector. (A short sketch computing both sets of scores is given below.)
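Both estimators are simple linear transformations of the data and can be computed together; a minimal sketch, assuming centred data, a loading matrix Λ and a vector of specific variances as inputs, is:

    import numpy as np

    def factor_scores(X, Lambda, psi):
        """Bartlett (12.36) and regression (12.38) factor scores for the rows of X."""
        P = Lambda.T / psi                       # Lambda' psi^{-1}, (m x p)
        A = P @ Lambda                           # Lambda' psi^{-1} Lambda, (m x m)
        bartlett = X @ np.linalg.solve(A, P).T
        regression = X @ np.linalg.solve(np.eye(A.shape[0]) + A, P).T
        return bartlett, regression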

Example: Using the ACCIONES data we estimate the values of the factor, assuming the loading matrix estimated in example 12.3 for the variables in logarithms. We explain in detail how the values are obtained for the first 5 stocks of that example. The data matrix X contains these 5 observations.

$$X = \begin{pmatrix} 1.22 & 4.5 & 3.41\\ 1.63 & 4.02 & 2.29\\ 1.5 & 3.96 & 2.44\\ 1.25 & 3.85 & 2.42\\ 1.77 & 3.75 & 1.95\end{pmatrix}$$

We start with the first method, generalized least squares. The estimates Λ̂ and ψ̂⁻¹ obtained in example 12.3 are

$$\hat\Lambda = \begin{pmatrix} -0.269\\ -0.229\\ 0.407\end{pmatrix}; \qquad \hat\psi^{-1} = \begin{pmatrix} 1.984 & 0 & 0\\ 0 & 3.834 & 0\\ 0 & 0 & 9.534\end{pmatrix}$$

and applying the formulas we get

$$(\hat\Lambda'\hat\psi^{-1}\hat\Lambda)^{-1} = 0.5; \qquad (\hat\Lambda'\hat\psi^{-1}\hat\Lambda)^{-1}\hat\Lambda'\hat\psi^{-1} = \begin{pmatrix} -0.55\\ -1.75\\ 19.24\end{pmatrix}'$$


The first 5 values of the first factor are calculated with f̂ = X((Λ̂'ψ̂^{-1}Λ̂)^{-1} Λ̂'ψ̂^{-1})',

f̂ = [ 1.22  4.50  3.41 ]   [ -0.55 ]   [ 57.062 ]
    [ 1.63  4.02  2.29 ]   [ -1.75 ]   [ 36.128 ]
    [ 1.50  3.96  2.44 ] × [ 19.24 ] = [ 39.191 ]
    [ 1.25  3.85  2.42 ]               [ 39.136 ]
    [ 1.77  3.75  1.95 ]               [ 29.982 ]

To estimate the values using the second method we calculate:

(I + Λ̂'ψ̂^{-1}Λ̂)^{-1} = 0.342;    (I + Λ̂'ψ̂^{-1}Λ̂)^{-1} Λ̂'ψ̂^{-1} = (-0.36, -1.15, 12.65)

and the first 5 values of the first factor are calculated with f̂ = X((I + Λ̂'ψ̂^{-1}Λ̂)^{-1} Λ̂'ψ̂^{-1})',

f̂ = [ 1.22  4.50  3.41 ]   [ -0.36 ]   [ 37.52 ]
    [ 1.63  4.02  2.29 ]   [ -1.15 ]   [ 23.75 ]
    [ 1.50  3.96  2.44 ] × [ 12.65 ] = [ 25.77 ]
    [ 1.25  3.85  2.42 ]               [ 25.73 ]
    [ 1.77  3.75  1.95 ]               [ 19.72 ]

We see that both estimations have the same structure, but the contraction effect of the second method makes the values obtained smaller.
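The arithmetic of this example can be checked with a few lines of numpy (a sketch; the coefficient vectors are the ones printed above):

import numpy as np

X = np.array([[1.22, 4.50, 3.41],
              [1.63, 4.02, 2.29],
              [1.50, 3.96, 2.44],
              [1.25, 3.85, 2.42],
              [1.77, 3.75, 1.95]])

b_gls = np.array([-0.55, -1.75, 19.24])   # first method:  (Lambda' psi^{-1} Lambda)^{-1} Lambda' psi^{-1}
b_reg = np.array([-0.36, -1.15, 12.65])   # second method: (I + Lambda' psi^{-1} Lambda)^{-1} Lambda' psi^{-1}

print(X @ b_gls)   # approx. 57.06, 36.13, 39.19, 39.14, 29.98
print(X @ b_reg)   # approx. 37.52, 23.76, 25.77, 25.73, 19.72 (shrunk towards zero)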

12.8 DIAGNOSIS OF THE MODEL

Residuals of the factors

In order to test whether the model is appropriate it is advisable to calculate the factors f̂ and the residuals e and to study their properties. The model assumes that:

u ∼ N_p(0, ψ).

Therefore, if the covariance matrix of the residuals is not diagonal we have to increase the number of factors until the estimated residuals

û_i = e_i = x_i − Λf̂_i

verify the hypothesis. Specifically, we test whether the residuals have a normal distribution. The residuals can also tell us of the presence of outliers, or of groups of observations which do not fit well with the constructed model.
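A sketch of this check in numpy (the names X, Lam and F are hypothetical: the centered data, the estimated loading matrix and the matrix of estimated factor scores, one row per observation):

import numpy as np

def residual_check(X, Lam, F):
    # Residuals u_i = x_i - Lambda f_i and their covariance matrix
    E = X - F @ Lam.T                          # (n, p) matrix of residuals
    S_e = np.cov(E, rowvar=False)              # should be close to diagonal if m is adequate
    off_diag = S_e - np.diag(np.diag(S_e))
    print("largest off-diagonal residual covariance:", np.abs(off_diag).max())
    return E, S_e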


Example: For the HBS data we calculate the covariance matrix of the residuals of the model estimated in exercise 12.8, where a single factor was estimated:

ψ = [  0.61  0.13 -0.04 -0.03 -0.06 -0.03  0.02 -0.11  0.05 ]
    [  0.13  0.57 -0.01  0.04 -0.01  0.01 -0.05 -0.13  0.04 ]
    [ -0.04 -0.01  0.22 -0.07 -0.07 -0.05 -0.01  0.01 -0.06 ]
    [ -0.03  0.04 -0.07  0.19 -0.03 -0.01 -0.05 -0.03  0.02 ]
    [ -0.06 -0.01 -0.07 -0.03  0.30  0.00 -0.02 -0.04 -0.01 ]
    [ -0.03  0.01 -0.05 -0.01  0.00  0.26 -0.06 -0.05  0.10 ]
    [  0.02 -0.05 -0.01 -0.05 -0.02 -0.06  0.10 -0.01 -0.10 ]
    [ -0.11 -0.13  0.01 -0.03 -0.04 -0.05 -0.01  0.18 -0.05 ]
    [  0.05  0.04 -0.06  0.02 -0.01  0.10 -0.10 -0.05  0.45 ]

The specific variance appears in the diagonal and we can see that the terms outside the diagonal are relatively small. If we compare this matrix with the result of estimating two factors, which in the test has a p-value of 0.64, the new variance matrix of the residuals is:

ψ = [  0.60  0.12 -0.04 -0.04 -0.07 -0.06  0.02 -0.10  0.01 ]
    [  0.12  0.53  0.03  0.02 -0.01 -0.04 -0.01 -0.09 -0.08 ]
    [ -0.04  0.03  0.22 -0.04 -0.05 -0.01 -0.03  0.01  0.00 ]
    [ -0.04  0.02 -0.04  0.19 -0.02 -0.04 -0.02  0.00 -0.05 ]
    [ -0.07 -0.01 -0.05 -0.02  0.30 -0.01 -0.01 -0.02 -0.03 ]
    [ -0.06 -0.04 -0.01 -0.04 -0.01  0.18 -0.01 -0.01 -0.05 ]
    [  0.02 -0.01 -0.03 -0.02 -0.01 -0.01  0.03 -0.04  0.00 ]
    [ -0.10 -0.09  0.01  0.00 -0.02 -0.01 -0.04  0.19  0.01 ]
    [  0.01 -0.08  0.00 -0.05 -0.03 -0.05  0.00  0.01  0.17 ]

and the specific variability has decreased in variables X7 and X9 and, outside the diagonal, the values are in general quite small. The number of factors could be increased by one more, but this runs the risk of overfitting the model and makes the interpretation of the weights overly subject to the specific data.

Figures 12.5 and 12.6 show the loadings of the factors and the representation of the different provinces in the space formed by these two factors. The weights of the second factor have an analogous interpretation to that described for the second principal component. In Figure 12.7 we show the histograms of the marginal distributions of the residuals. Some of these histograms do not appear to follow a normal distribution.

Residuals of fit

The residuals of fit are defined as the terms of S − V̂. It is often easier to use standardized residuals, where each residual is divided by its asymptotic standard deviation.

Measurements of model fit

We can construct a measurement of fit for the factor model for each variable using

γ_i² = h_i² / s_i² = 1 − ψ_i² / s_i²,

which is usually called the squared correlation coefficient or coefficient of determination between the variable and the factors.

Figure 12.5: Representation of the loading matrix of the first two factors estimated using maximum likelihood.

The coefficient of determination for the whole system can be constructed with

R² = 1 − |ψ̂|^{1/p} / |V̂|^{1/p},

where |ψ̂| is the determinant of the matrix of residual variances and |V̂| that of the matrix estimated by the model.

The χ² statistic given by (12.28) provides another global measure of fit. In order to judge its value we compare it with its degrees of freedom. When the data are not normal the distribution of (12.28) can deviate greatly from the χ², but, in any case, its value can be used as a criterion of fit.

Example: We calculate the squared correlation coefficient between the variable and the factors for the data in example 12.8 with two factors. Since the original data were standardized, s_i² = 1 for i = 1, ..., 9, the coefficients are calculated as 1 minus the specific variance.

        X1    X2    X3    X4    X5    X6    X7    X8    X9
γ_i²    0.4   0.47  0.78  0.81  0.7   0.82  0.97  0.81  0.83


Figure 12.6: Representation of the provinces in the first two factors.

The coefficient of determination is

R² = 1 − (1.781499 × 10^{-8} / 0.0002415762)^{1/9} = 0.652

and we see that it provides an average value of the relationships of dependence in the system.
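Both fit measures are easy to obtain once the estimates are available; a sketch in numpy, assuming standardized variables (s_i² = 1) and hypothetical names psi_hat (matrix of residual variances) and V_hat (covariance matrix estimated by the model):

import numpy as np

def fit_measures(psi_hat, V_hat):
    p = V_hat.shape[0]
    gamma2 = 1.0 - np.diag(psi_hat)   # gamma_i^2 = 1 - specific variance for standardized data
    R2 = 1.0 - (np.linalg.det(psi_hat) / np.linalg.det(V_hat)) ** (1.0 / p)
    return gamma2, R2

# With the determinants quoted in the example,
# 1 - (1.781499e-8 / 0.0002415762) ** (1/9) gives approximately 0.652.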


Figure 12.7: Histograms of the marginal distributions of the residuals.

12.9 Relationship with principal components

In principal components we decompose the covariance matrix of X as:

S = AΓA' = [a_1 ... a_p] [ λ_1  0   ···  0  ] [ a_1' ]
                         [ 0    λ_2 ···  0  ] [  ⋮   ]
                         [ ⋮             ⋮  ] [ a_p' ]
                         [ 0    0   ···  λ_p ]

          = [a_1 ... a_p] [ λ_1 a_1' ]
                          [    ⋮     ]
                          [ λ_p a_p' ]

          = λ_1 a_1 a_1' + ··· + λ_p a_p a_p'.

If λ_j = 0 for j > h, we can reconstruct S with the first h components. Letting H = AΓ^{1/2}, we have:

S = HH'.
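The decomposition can be verified numerically; a small numpy sketch for any sample covariance matrix S:

import numpy as np

def pc_decomposition(S):
    # Spectral decomposition S = A Gamma A' and the equivalent form S = H H'
    eigval, A = np.linalg.eigh(S)                    # ascending order, orthonormal columns
    eigval, A = eigval[::-1], A[:, ::-1]             # reorder from largest to smallest
    H = A @ np.diag(np.sqrt(np.clip(eigval, 0.0, None)))
    assert np.allclose(S, H @ H.T)                   # exact recovery with all p components
    return eigval, A, H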

In factor analysis we decompose S as:

S = ΛΛ' + ψ,


and since the matrix ψ is diagonal it can group the variances of the variables, whereas the loading matrix groups the covariances. This is a significant difference between the two methods: the first tries to explain the variances, while the second explains the covariances or correlations. We observe that if ψ̂ ≈ 0, we get the same result by taking m principal components as by estimating m factors. The smaller ψ̂ is, the smaller the difference.

Another way of studying the relationship between both methods is the following: let X

be the matrix of original data and Z be the matrix of the values of the components. Then:

Z = XA,

or also, since A is orthogonal,

X = ZA',

which allows original variables to be constructed from the components. Writing

x_1 = α_11 z_1 + ... + α_1m z_m + ... + α_1p z_p
 ⋮
x_p = α_p1 z_1 + ... + α_pm z_m + ... + α_pp z_p

with m components we have:

x_1 = α_11 z_1 + ... + α_1m z_m + v_1
 ⋮
x_p = α_p1 z_1 + ... + α_pm z_m + v_p

This representation is apparently analogous to the factor model, since each v_i will be uncorrelated with the factors (z_1, ..., z_m) because it includes only the variables (z_{m+1}, ..., z_p), which are orthogonal to the previous ones. Nevertheless, the basic difference is that in the factor model the errors of the different equations are uncorrelated, whereas in this representation they will not be: since each of (v_1, ..., v_p) contains all the common variables (z_{m+1}, ..., z_p), they will be correlated. For this reason, in general the results of both methods will be different.

These results indicate that if there are m principal components which explain a high proportion of variability, in such a way that the specific variabilities given by the diagonal terms of ψ are small, then factor analysis and principal components analysis over the correlation matrix will be similar. Furthermore, in this case the estimation using the principal factor method will lead to results similar to those of maximum likelihood.

Besides these differences, both techniques have a different interpretation: in principal components we try to graphically represent the data, while in factor analysis we assume that the factors generate the observed variables.
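A short simulation sketch illustrates this point (the one-factor model, loadings and sample size below are arbitrary choices, not taken from the text): when the specific variances are small, the loading of the first principal component essentially reproduces Λ.

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 6
Lam = np.array([0.9, 0.8, 0.85, 0.7, 0.9, 0.75])[:, None]   # one factor, large loadings
psi = np.full(p, 0.05)                                       # small specific variances

f = rng.standard_normal((n, 1))
X = f @ Lam.T + rng.standard_normal((n, p)) * np.sqrt(psi)

S = np.cov(X, rowvar=False)
eigval, A = np.linalg.eigh(S)
pc1 = A[:, -1] * np.sqrt(eigval[-1])        # loading of the first principal component
pc1 *= np.sign(pc1 @ Lam.ravel())           # fix the arbitrary sign of the eigenvector
print(np.round(pc1, 2))                     # close to the generating loadings 0.9, 0.8, ...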

12.10 Confirmatory Factor Analysis

Factor analysis can be applied as an exploratory tool or as a model for testing theories. In the second case, the number of factors is assumed to be known a priori and restrictions are established over the elements of the loading matrix. For example, some may be zero or


equal to each other. Given the existence of additional information, it is usually assumed that the factors have a covariance matrix V_f, not necessarily an identity matrix, although it has restrictions on its terms. The fundamental equation becomes

V_x = ΛV_fΛ' + ψ,

but now the three unknown matrices of the system, Λ, V_f and ψ, have many restrictions such that the total number of free parameters, t, verifies

t ≤ p(p + 1)/2,

so that the model is identified. The estimation is carried out using maximum likelihood, but the restriction Λ'ψ^{-1}Λ = diagonal is not usually imposed if it is not necessary for identifying the model.

The goodness of fit tests for the model are analogous to those studied, but now the number of degrees of freedom is p(p + 1)/2 − t, letting t be the number of free parameters estimated.

Nevertheless, the effects of non-normality here are more serious than when we estimate all the parameters in exploratory factor analysis. We recommend that confirmatory factor analysis always be compared with an exploratory analysis in order to confirm that the model used does not contradict the observed data.
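The degrees-of-freedom bookkeeping of the test is straightforward; a tiny sketch (the parameter count t below is hypothetical, not from the text):

p = 9                          # observed variables
t = 20                         # free parameters kept in the restricted model (hypothetical)
df = p * (p + 1) // 2 - t      # degrees of freedom of the goodness-of-fit test
print(df)                      # 45 - 20 = 25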

12.11 Additional Reading

Most multivariate analysis textbooks contain a chapter on factor analysis. We recommend Jobson (1992), Johnson and Wichern (1998), Mardia et al. (1979), Rencher (1998) and Seber (1984). More extensive presentations can be found in Bartholomew and Knott (1999) and Harman (1980). Maximum likelihood estimation is studied in detail in Jöreskog (1963) and Lawley and Maxwell (1971), although they do not include the more modern estimation methods based on the EM algorithm or Bayesian methods. Bartholomew and Knott (1999) provide a clear presentation of estimation using the EM algorithm. Another good reference on the subject is Schafer (1997). For Bayesian methods see O'Hagan (1994).

In this chapter we have focussed on the linear factor model, but there is an extensive bibliography available on the non-linear extension. Yalcin and Amemiya (2001) review this field and include numerous references.

EXERCISES

Exercise: Given the factorial model x = Λf + u, where x = (x_1, x_2, x_3, x_4) has zero mean and variances (1, 2, 1, 7), and where Λ = (.8, 1, 0, 2)' and Var(f) = 1: (1) calculate the covariances between the variables and the factor; (2) calculate the correlations between the variables and the factor; (3) write the model as a unifactorial model with a factor variance equal to 5.

Exercise: Indicate whether the following factorial model is possible: x = Λf + u, where x = (x_1, x_2, x_3) has zero mean and variances (3, 1, 2), and where Λ = (3, 0, 3)' and Var(f) = 1.


Exercise: Given a factorial model with x = (x_1, x_2, x_3, x_4, x_5, x_6) of zero mean and Λ = (λ_1, λ_2), with λ_1 = (1, 1, 1, 0, 0, 0)', λ_2 = (0, 1, 0, 1, 0, 1)' and

Var(f) = [ 1   .5 ]
         [ .5   2 ],

write it in standard form with the normalization Λ'Λ = diagonal.

Exercise: Using the basic relationship, prove that:

a. If the specific variance is equal to the variance of a variable, the row of the loading matrix which corresponds to that variable must have all null elements.

b. If the covariance between two variables is zero, the rows corresponding to these variables in the loading matrix are orthogonal.

c. If the variables are standardized, the correlation between the variables i and j is the scalar product of the rows of the loading matrix corresponding to these variables.

d. If the variables are standardized, the correlation between variable i and factor j is the term (i, j) of the loading matrix.

Exercise: Using the basic relationship, prove that if the variances of the specific components are identical, the columns of the loading matrix are eigenvectors of the covariance matrix, and obtain the corresponding eigenvalues.

Exercise: Indicate the maximum number of uncorrelated factors that could be estimated with ten variables. And if the factors are correlated?

Exercise: Using an example, prove that with the principal factor method one cannot get the same loading matrix with standardized variables as with unstandardized ones.

Exercise: There are 20 variables and, before carrying out a factorial analysis, principal components are computed and the first five components are chosen. Next a factorial analysis is done on these components. How many factors do we expect to find?

Exercise: Prove that if each column of the loading matrix has a single non-null element, the factorial model is not identified.

Exercise: Prove that if we rotate the factors the communality of each variable does not vary.

Exercise: If Λ = (1, 1, ..., 1)', indicate the equation to estimate the value of the factor in an individual if

a. diag(ψ) = p(1, 1, .., 1),

b. diag(ψ) = (1, 2, .., p)

Exercise: Prove that in the unifactorial model with ψ = σ²I, the determinant of V̂_0 is (Λ'Λ + σ²)(σ²)^{p−1}.

Exercise: Prove that if all the variables have perturbations with the same variance, ψ = ψ_0 I, and assuming Λ'Λ = diagonal = D, the columns of Λ are directly eigenvectors of the matrix V, with eigenvalues d_i + ψ_0, where d_i is the i-th diagonal term of D.

Exercise: Prove that if two factors, η and ξ, are related by η = βξ + u and we measure them with error using the variables y = η + e, x = ξ + v, the observed correlation between x and y


will be smaller than the one existing between the factors (this result was first obtained by Spearman).

APPENDIX 12.1: MAXIMUM LIKELIHOOD ESTIMATION OF THE FACTOR MODEL

ML estimation of Λ and ψ requires that we write the likelihood equation using the restriction Λ'ψ^{-1}Λ = D = diagonal, take the derivative and obtain the first-order conditions. This process leads to the same result as solving the estimation equation by moments

S = Λ̂Λ̂' + ψ̂    (12.39)

with the restrictions that Λ̂ be (p × m) and ψ diagonal. This second condition is satisfied by using:

ψ̂ = diag(S − Λ̂Λ̂').    (12.40)

We assume that starting from an initial value of Λ̂ we obtain the matrix ψ̂ using (12.40). The new estimator of Λ must approximately satisfy the equation S − ψ̂ = Λ̂Λ̂'. This system has p(p + 1)/2 equations and p × m unknowns and, in general, does not have a unique solution. In order to reduce it to a system of p × m equations we postmultiply equation (12.39) by ψ̂^{-1}Λ̂. Reordering terms, we obtain the system of equations:

S ψ̂^{-1}Λ̂ = Λ̂(Λ̂'ψ̂^{-1}Λ̂ + I) = Λ̂(D + I_m)    (12.41)

where I_m is the identity matrix of order m. When ψ̂^{-1} is known this equation provides a non-linear system of (p × m) equations with (p × m) unknowns from which to obtain Λ̂, and suggests that Λ̂ can be obtained from the eigenvectors of Sψ̂^{-1}; but this matrix is not symmetric. In order to solve this problem, premultiplying by ψ̂^{-1/2}, we can write

ψ̂^{-1/2}Sψ̂^{-1/2}(ψ̂^{-1/2}Λ̂) = (ψ̂^{-1/2}Λ̂)(D + I_m)    (12.42)

which shows that we can obtain ψ̂^{-1/2}Λ̂ as eigenvectors of the symmetric matrix ψ̂^{-1/2}Sψ̂^{-1/2}, or also of the symmetric matrix ψ̂^{-1/2}Sψ̂^{-1/2} − I_p.
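A sketch of this alternating scheme in numpy (a simplified illustration of the fixed point defined by (12.40)-(12.42), not a full ML routine; the starting value for ψ̂ and the iteration count are arbitrary choices):

import numpy as np

def ml_factor_iteration(S, m, n_iter=200):
    p = S.shape[0]
    psi = 0.5 * np.diag(S)                                    # arbitrary starting specific variances
    for _ in range(n_iter):
        s_inv = 1.0 / np.sqrt(psi)
        M = s_inv[:, None] * S * s_inv[None, :]               # psi^{-1/2} S psi^{-1/2}
        theta, W = np.linalg.eigh(M)                          # ascending eigenvalues
        theta, W = theta[::-1][:m], W[:, ::-1][:, :m]         # m largest, theta_i = 1 + d_i
        Lam = (np.sqrt(psi)[:, None] * W) * np.sqrt(np.clip(theta - 1.0, 0.0, None))
        psi = np.clip(np.diag(S - Lam @ Lam.T), 1e-6, None)   # eq. (12.40)
    return Lam, psi                                           # Lambda' psi^{-1} Lambda is diagonal by construction

At convergence one can also check numerically the trace property discussed next.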

The ML estimators satisfy two important properties. The first is

tr(S V̂_0^{-1}) = p,

which indicates that, with the distance of the trace, the estimated matrix is as close as possible to the observed covariance matrix S: note that if V̂_0 = S, then tr(SS^{-1}) = tr(I_p) = p.

The second is that the same result is obtained when working with standardized or unstandardized variables. This means that if we estimate the loading matrix with the original variables by ML we get the same results as when (1) we standardize the variables, subtracting the means and dividing by the standard deviations, (2) using ML we estimate the loading matrix, which is then the correlation matrix between the original variables and the factors, and (3) we go


from this correlation matrix to the covariance matrix multiplying by the standard deviations of the variables.

We will show the first property. If V̂_0 = ψ̂ + Λ̂Λ̂' is the ML estimation of the covariance

matrix, then (see section 2.3.4):

V̂_0^{-1} = (ψ̂ + Λ̂Λ̂')^{-1} = ψ̂^{-1} − ψ̂^{-1}Λ̂(I_m + Λ̂'ψ̂^{-1}Λ̂)^{-1}Λ̂'ψ̂^{-1}

and multiplying by S and taking traces:

tr(S V̂_0^{-1}) = tr(S ψ̂^{-1}) − tr(S ψ̂^{-1}Λ̂(I_m + D)^{-1}Λ̂'ψ̂^{-1})

Using the condition (12.41)

tr(S V̂_0^{-1}) = tr(S ψ̂^{-1}) − tr(Λ̂Λ̂'ψ̂^{-1})

and, by the linear properties of the trace,

tr(S V̂_0^{-1}) = tr[(S − Λ̂Λ̂')ψ̂^{-1}] = tr(diag(S − Λ̂Λ̂')ψ̂^{-1}),

where the last step comes from the fact that the product of the two diagonal matrices is diagonal and the trace is simply the sum of the diagonal elements. On the other hand, equation (12.40) implies

diag(S − Λ̂Λ̂')ψ̂^{-1} = I_p

and taking traces in this equation

tr(S V̂_0^{-1}) = tr[(S − Λ̂Λ̂')ψ̂^{-1}] = tr(I_p) = p,

and we have proven that the ML estimator verifies the first property.

To demonstrate the second property, we assume that we carry out the transformation

y = Dx, where D is any diagonal matrix (for example, we standardize the variables, which is the same as working with R, the correlation matrix, instead of with S, the covariance matrix). Then S_y = DS_xD and ψ_y = Dψ_xD. Upon calculating the new matrix whose eigenvalues and eigenvectors must be obtained, we have

ψ_y^{-1/2} S_y ψ_y^{-1/2} = (Dψ_xD)^{-1/2} DS_xD (Dψ_xD)^{-1/2} = ψ_x^{-1/2} S_x ψ_x^{-1/2}

which is identical to the above result.

APPENDIX 12.2: TESTS OF THE RANK OF A MATRIX

In this appendix we show that the test of the number of factors is a particular case of the general test of partial sphericity, which was studied in section 10.6. To do this, we need the following lemma:


Lemma: |I_m − UU'| = |I_p − U'U|. The equality is proved by applying the formula for the determinant of a partitioned matrix to

[ I_m  U   ]
[ U'   I_p ]

whose determinant, computed by conditioning on either diagonal block, equals both |I_m − UU'| and |I_p − U'U|.

We are going to use this lemma to prove that the likelihood ratio (12.28) only takes into

account the p − m smallest eigenvalues of the matrix ψ̂^{-1/2}Sψ̂^{-1/2}. Starting from the ML estimator of V̂_0, we have:

|V̂_0| = |Λ̂Λ̂' + ψ̂| = |ψ̂^{1/2}| |ψ̂^{-1/2}Λ̂Λ̂'ψ̂^{-1/2} + I_p| |ψ̂^{1/2}| = |ψ̂| |Λ̂'ψ̂^{-1/2}ψ̂^{-1/2}Λ̂ + I_m|.

Letting D = Λ̂'ψ̂^{-1}Λ̂, we obtain

|V̂_0| = |ψ̂| |D + I_m| = |ψ̂| ∏_{i=1}^{m} (1 + d_i),    (12.43)

where d_i is the i-th diagonal element of D.

On the other hand, from the estimation of the parameters by the ML method in Appendix

12.1, we saw in equation (12.42) that the matrix ψ̂^{-1/2}Sψ̂^{-1/2} − I_p has m eigenvalues equal to the diagonal terms of D. In general this matrix will have rank p, and we let d_i, for i = m + 1, ..., p, denote the remaining eigenvalues. As a result, the eigenvalues of the matrix ψ̂^{-1/2}Sψ̂^{-1/2} will be 1 + d_i and we can write:

|ψ̂^{-1/2}Sψ̂^{-1/2}| = |S| / |ψ̂| = ∏_{i=1}^{p} (1 + d_i)    (12.44)

and by (12.43) and (12.44)

|V̂_0| / |S| = [ |ψ̂| ∏_{i=1}^{m} (1 + d_i) ] / [ |ψ̂| ∏_{i=1}^{p} (1 + d_i) ] = 1 / ∏_{i=m+1}^{p} (1 + d_i)

so that we finally obtain:

λ_F = n log( |V̂_0| / |S| ) = −n Σ_{i=m+1}^{p} log(1 + d_i)    (12.45)

and the likelihood ratio test depends solely on the smallest eigenvalues of the matrix ψ̂^{-1/2}Sψ̂^{-1/2}.

We are going to show that this test is a specific case of the test of partial sphericity, presented in section 10.6, whose statistic is:

λ_EP = n(p − m) log[ ( Σ_{i=m+1}^{p} λ_i ) / (p − m) ] − n log ∏_{i=m+1}^{p} λ_i    (12.46)


where λ_i are the eigenvalues of S, and that the statistic (12.45) is the result of applying this test to the matrix ψ̂^{-1/2}Sψ̂^{-1/2}. If the factor model is correct, asymptotically S = ψ + ΛΛ'

and, pre- and postmultiplying by ψ^{-1/2}:

ψ^{-1/2}Sψ^{-1/2} = I + ψ^{-1/2}ΛΛ'ψ^{-1/2}    (12.47)

which decomposes the matrix ψ^{-1/2}Sψ^{-1/2} into one of rank m plus the identity matrix. Since the eigenvalues of this matrix are 1 + d_i, we have

λ_EP = n(p − m) log[ ( Σ_{i=m+1}^{p} (1 + d_i) ) / (p − m) ] − n Σ_{i=m+1}^{p} log(1 + d_i)    (12.48)

and what is left now is to prove that the first term is zero. Taking traces in (12.47):

tr(ψ^{-1/2}Sψ^{-1/2}) = p + tr(ψ^{-1/2}ΛΛ'ψ^{-1/2}) = p + tr(Λ'ψ^{-1}Λ) = p + Σ_{i=1}^{m} d_i,

but also, because the eigenvalues of ψ^{-1/2}Sψ^{-1/2} are 1 + d_i, we know that tr(ψ^{-1/2}Sψ^{-1/2}) = Σ_{i=1}^{p} (1 + d_i); equating both results for the trace:

Σ_{i=1}^{p} (1 + d_i) = p + Σ_{i=1}^{m} d_i,

from which we get

Σ_{i=m+1}^{p} d_i = 0

or, equivalently,

Σ_{i=m+1}^{p} (1 + d_i) = p − m.

Plugging this into (12.48), the first term cancels and only the second remains; thus we obtain the likelihood ratio test.

APPENDIX 12.3: ESTIMATION OF THE FACTORS

Let x_i be the (p × 1) vector of the variables in the individual i. Its density function will be:

f(x_i) = |ψ|^{-1/2} (2π)^{-p/2} exp{ −(1/2)(x_i − Λf_i)'ψ^{-1}(x_i − Λf_i) },

assuming that ψ and Λ are known and we are trying to estimate f_i. Then the likelihood function in logarithms will be:

L = log f(x_i) = K − (1/2)(x_i − Λf_i)'ψ^{-1}(x_i − Λf_i),

where K is a constant. Maximizing L is equivalent to minimizing:

M = (x_i − Λf_i)'ψ^{-1}(x_i − Λf_i),


which is the least squares criterion. Then:

M = x_i'ψ^{-1}x_i − 2f_i'Λ'ψ^{-1}x_i + f_i'Λ'ψ^{-1}Λf_i.

Taking the derivative with respect to fi and setting it to zero:

dM/df_i = −2Λ'ψ^{-1}x_i + 2Λ'ψ^{-1}Λf_i = 0,

therefore

f_i = (Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1}x_i;

substituting the ML estimators of Λ and ψ in this equation we obtain the vector f̂_i for each observation.

If the parameter is considered to be a random variable, by the properties of conditional

expectation,

f̂_i = E[f_i | x_i] = Λ'V^{-1}x_i.

Utilizing

V^{-1} = (ψ + ΛΛ')^{-1} = ψ^{-1} − ψ^{-1}Λ(I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1},

then,

Λ'V^{-1} = Λ'ψ^{-1} − Λ'ψ^{-1}Λ(I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1}
Λ'V^{-1} = [I − Λ'ψ^{-1}Λ(I + Λ'ψ^{-1}Λ)^{-1}]Λ'ψ^{-1}
Λ'V^{-1} = (I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1},

and substituting in the expression of f̂_i, we obtain:

f̂_i = (I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1}x_i.
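This chain of identities can be checked numerically for arbitrary admissible values of Λ and ψ; a small numpy sketch (the dimensions and random values are illustrative only):

import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2
Lam = rng.standard_normal((p, m))
psi = np.diag(rng.uniform(0.5, 2.0, size=p))

V = Lam @ Lam.T + psi
lhs = Lam.T @ np.linalg.inv(V)                                   # Lambda' V^{-1}
rhs = np.linalg.solve(np.eye(m) + Lam.T @ np.linalg.inv(psi) @ Lam,
                      Lam.T @ np.linalg.inv(psi))                # (I + Lambda' psi^{-1} Lambda)^{-1} Lambda' psi^{-1}
print(np.allclose(lhs, rhs))                                     # True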

APPENDIX 12.4: BAYESIAN INTERPRETATION OF THE ESTIMATOR OF THE FACTORS

The estimator (12.38) has a clear Bayesian interpretation. Since, a priori, the distribution of the factor, π(f_i), is N(0, I), and the likelihood f(x | f, Λ, ψ) is N(Λf, ψ), the posterior conditional on the parameters is:

f(f_i | x, Λ, ψ) = k f(x | f_i, Λ, ψ) π(f_i)    (12.49)

where k is a constant. The exponent of the posterior distribution is:

(x − Λf)'ψ^{-1}(x − Λf) + f'f = f'(I + Λ'ψ^{-1}Λ)f − 2f'Λ'ψ^{-1}x + x'ψ^{-1}x,

and completing the square, the exponent can be written

f'(I + Λ'ψ^{-1}Λ)f − 2f'(I + Λ'ψ^{-1}Λ)(I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1}x + R,

where R groups the terms that do not depend on f;


that is,

(f − f̂)'(I + Λ'ψ^{-1}Λ)(f − f̂),

where

f̂ = E[f | x, Λ, ψ] = (I + Λ'ψ^{-1}Λ)^{-1}Λ'ψ^{-1}x    (12.50)

and

Var(f̂) = Var[f | x, Λ, ψ] = (I + Λ'ψ^{-1}Λ)^{-1}.    (12.51)

Therefore, the estimator (12.38) can be interpreted as the mean of the posterior distribution of the factors. We observe that the condition Λ'ψ^{-1}Λ = diagonal makes the factors, a posteriori, conditionally independent.
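A sketch of the posterior computation for a single observation (numpy; the loading matrix, specific variances and observation below are arbitrary illustrative values):

import numpy as np

Lam = np.array([[0.8], [0.6], [0.7]])     # p = 3 variables, m = 1 factor
psi = np.diag([0.36, 0.64, 0.51])         # specific variances
x = np.array([1.0, 0.5, 0.9])             # one centered observation

A = np.eye(1) + Lam.T @ np.linalg.inv(psi) @ Lam
post_mean = np.linalg.solve(A, Lam.T @ np.linalg.inv(psi) @ x)   # eq. (12.50)
post_var = np.linalg.inv(A)                                      # eq. (12.51)
print(post_mean, post_var)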