


Linköping Studies in Science and Technology. Dissertations. No. 1255

Studies in Estimation of Patterned Covariance Matrices

Martin Ohlson

Department of Mathematics
Linköping University, SE–581 83 Linköping, Sweden

Linköping 2009

Linköping Studies in Science and Technology. Dissertations. No. 1255

Studies in Estimation of Patterned Covariance Matrices

Martin Ohlson

[email protected]

Mathematical Statistics
Department of Mathematics
Linköping University
SE–581 83 Linköping
Sweden

ISBN 978-91-7393-622-4    ISSN 0345-7524

Copyright © 2009 Martin Ohlson

Printed by LiU-Tryck, Linköping, Sweden 2009

Ágætis byrjun...

Abstract

Ohlson, M. (2009). Studies in Estimation of Patterned Covariance Matrices. Doctoral dissertation. ISBN 978-91-7393-622-4. ISSN 0345-7524.

Many testing, estimation and confidence interval procedures discussed in the multivariate statistical literature are based on the assumption that the observation vectors are independent and normally distributed. The main reason for this is that sets of multivariate observations often are, at least approximately, normally distributed. Normally distributed data can be modeled entirely in terms of their means and variances/covariances. Estimating the mean and the covariance matrix is therefore a problem of great interest in statistics and it is of great significance to consider the correct statistical model. The estimator for the covariance matrix is important since inference on the mean parameters strongly depends on the estimated covariance matrix and the dispersion matrix for the estimator of the mean is a function of it.

In this thesis the problem of estimating parameters for a matrix normal distribution with different patterned covariance matrices, i.e., different statistical models, is studied.

A p-dimensional random vector is considered for a banded covariance structure reflecting m-dependence. A simple non-iterative estimation procedure is suggested which gives an explicit, unbiased and consistent estimator of the mean and an explicit and consistent estimator of the covariance matrix for arbitrary p and m.

Estimation of parameters in the classical Growth Curve model when the covariance matrix has some specific linear structure is considered. In our examples maximum likelihood estimators cannot be obtained explicitly, and one must rely on numerical optimization algorithms. Therefore explicit estimators are obtained as alternatives to the maximum likelihood estimators. From a discussion about residuals, a simple non-iterative estimation procedure is suggested which gives explicit and consistent estimators of both the mean and the linearly structured covariance matrix.

This thesis also deals with the problem of estimating the Kronecker product structure. The sample observation matrix is assumed to follow a matrix normal distribution with a separable covariance matrix, in other words it can be written as a Kronecker product of two positive definite matrices. The proposed estimators are used to derive a likelihood ratio test for spatial independence. Two cases are considered, when the temporal covariance is known and when it is unknown. When the temporal covariance is known, the maximum likelihood estimates are computed and the asymptotic null distribution is given. In the case when the temporal covariance is unknown, the maximum likelihood estimates of the parameters are found by an iterative alternating algorithm and the null distribution for the likelihood ratio statistic is discussed.


Popular Science Summary (Populärvetenskaplig sammanfattning)

In many estimation problems, testing procedures and confidence interval calculations discussed in the multivariate statistical literature, it is assumed that the observed vectors or matrices are independent and normally distributed. The main reason for this is that the observations usually have these properties, at least approximately. Normally distributed data can be modeled entirely through their mean structure and variances/covariances. Estimating the mean and the covariance matrix well is therefore a problem of great interest, and at the same time it is important to assume a correct statistical model. The estimator of the covariance matrix is also important since conclusions about the mean depend on the estimated covariance matrix.

In this thesis the problem of estimating the parameters, that is, the mean and the covariance matrix, of a matrix normal distribution is discussed when the covariance matrix has different patterns, i.e., different statistical models.

Several different structures are considered. First, a p-dimensional random vector is discussed which is assumed to have a covariance matrix with banded structure, corresponding to m-dependence. A simple algorithm is proposed which gives an explicit, unbiased and consistent estimator of the mean and an explicit and consistent estimator of the covariance matrix for arbitrary dimension p and bandwidth m.

Estimation of the parameters in the classical Growth Curve model when the covariance matrix has a linear structure is a problem of great interest. In many examples maximum likelihood estimates cannot be obtained explicitly and must therefore be computed with some numerical optimization algorithm. We derive explicit estimators as a good alternative to the maximum likelihood estimators. Based on a discussion of the residuals, a simple algorithm is given which results in unbiased and consistent estimators of both the mean and the linearly patterned covariance matrix.

The thesis also treats the problem of estimating the Kronecker product structure. The observations are assumed to follow a matrix normal distribution with a separable covariance matrix, that is, one which can be written as a Kronecker product of two positive definite matrices. These matrices can be interpreted as the spatial and the temporal covariance. The main goal is to compute the likelihood ratio test for spatial independence. First it is assumed that the temporal covariance is known and the maximum likelihood estimates are computed. It is shown that the distribution of the likelihood ratio is the same as in the case of independent observations. When the temporal covariance is unknown, an iterative algorithm for finding the maximum likelihood estimates is then derived and the distribution of the likelihood ratio is discussed.

When various tests on covariance matrices are carried out, quadratic forms arise. In this thesis a generalization of the distribution of quadratic forms of matrices is derived. It is shown that the distribution of a quadratic form is the same as the distribution of a weighted sum of non-central Wishart distributed matrices.


Acknowledgments

First of all I would like to thank my supervisor Professor Timo Koski for giving me the opportunity to work on the problems discussed in this thesis. It has been very interesting and I am grateful for all the freedom in my work.

I am also very grateful to my assistant supervisor Professor Dietrich von Rosen for all the ideas and all the discussions we have had. Thank you for showing me the "multivariate world" and for all the inspiration.

During my time as a PhD student I have been visiting some people around the world. My deepest thanks go to Professor Muni S. Srivastava for my time at the Department of Statistics, University of Toronto. During my time in Canada I was invited to give a seminar at the Department of Mathematics & Statistics, University of Maryland, Baltimore County. Thank you, Professor Bimal Sinha, for that opportunity.

I would also like to thank my present and former colleagues at the Department of Mathematics. In particular I wish to thank all the PhD students at the department and my colleagues at the Division of Mathematical Statistics, especially Dr. Eva Enqvist. Thank you for many interesting discussions and ideas.

For the LaTeX layout of this thesis I am grateful to Dr. Gustaf Hendeby. The LaTeX template has been very convenient and easy to work with. Thank you very much.

I am very grateful to my dear friend Dr. Thomas Schön for the support and for proofreading this thesis. We have had a lot of interesting discussions over the years and the coffee breaks with "nice" coffee have been valuable.

Finally, I would like to thank my family for all the support. You have always believed in me and encouraged me. Especially you, Lotta! You are very special in my life, you are my everything!

    Linköping, April 16, 2009

    Martin Ohlson


Contents

1 Introduction 1
    1.1 Background 2
    1.2 Outline 3
        1.2.1 Outline of Part I 3
        1.2.2 Outline of Part II 3
    1.3 Other Publications 5
    1.4 Contributions 5

I Estimation of Patterned Covariance Matrices 7

2 Estimation of the Covariance Matrix for a Multivariate Normal Distribution 9
    2.1 Multivariate Distributions 9
        2.1.1 Matrix Normal Distribution 9
        2.1.2 Wishart and Non-central Wishart Distribution 10
        2.1.3 Results on Quadratic Forms 11
    2.2 Estimation of the Parameters 18
        2.2.1 Maximum Likelihood Estimators 18
        2.2.2 Patterned Covariance Matrix 20
        2.2.3 Estimating the Kronecker Product Covariance 26
        2.2.4 Estimating the Kronecker Product Covariance (One Observation Matrix) 28
        2.2.5 The Distribution of a Special Sample Covariance Matrix 33

3 Estimation of the Covariance Matrix for a Growth Curve Model 37
    3.1 The Growth Curve Model 37
    3.2 Estimation of the Parameters 39
        3.2.1 Maximum Likelihood Estimators 39
        3.2.2 Growth Curve Model with Patterned Covariance Matrix 41

4 Concluding Remarks 45
    4.1 Conclusion 45
    4.2 Future Research 46

Bibliography 49

Due to Copyright restrictions the articles are not included in the electronic version.

II Papers 57

A On Distributions of Matrix Quadratic Forms 59
    1 Introduction 62
    2 Distribution of Multivariate Quadratic Forms 63
    3 Complex Matrix Quadratic Forms 71
    4 A Special Sample Covariance Matrix 73
    References 74

B Explicit Estimators under m-Dependence for a Multivariate Normal Distribution 77
    1 Introduction 80
    2 Definitions and Notation 81
    3 Explicit Estimator of a Banded Covariance Matrix 82
    4 Simulation 90
    References 91

C The Likelihood Ratio Statistic for Testing Spatial Independence using a Separable Covariance Matrix 95
    1 Introduction 98
    2 Known Dependency Structure 99
    3 Unknown Dependency Structure 103
        3.1 Ψ has AR(1) Structure 104
        3.2 Ψ has Intraclass Structure 108
    References 110

D Explicit Estimators of Parameters in the Growth Curve Model with Linearly Structured Covariance Matrices 113
    1 Introduction 116
    2 Main Idea 117
    3 Maximum Likelihood Estimators 120
    4 Growth Curve Model with a Linearly Structured Covariance Matrix 120
    5 Properties of the Proposed Estimators 124
    6 Examples 127
    References 131

1 Introduction

This thesis is concerned with the problem of estimating patterned covariance matrices for different kinds of statistical models. Patterned covariance matrices can arise in a variety of contexts and can be both linear and non-linear. One example of a linearly structured covariance matrix arises in the theory of graphical modeling. In graphical modeling, a bi-directed graph represents marginal independences among random variables that are identified with the vertices of the graph. Gaussian graphical models of this kind are called covariance graph models and can be used for estimation and testing.

If two vertices are not joined by an edge, then the two associated random variables are assumed to be marginally independent. For example, the graph in Figure 1.1 represents the covariance structure for a random vector x = (x1, x2, x3, x4)′.

Figure 1.1: Covariance graph for a random vector x = (x1, x2, x3, x4)′.

Covariance graph models will generate a patterned covariance matrix with some of the covariances equal to zero, i.e., if the vertices i and j in the graph are not connected by a bi-directed edge i ↔ j, then σij = 0. In Figure 1.1 the graph imposes σ12 = σ14 = σ23 = 0 and the covariance matrix for x is given by

    Σ = [ σ11  0    σ13  0
          0    σ22  0    σ24
          σ13  0    σ33  σ34
          0    σ24  σ34  σ44 ].

The covariance graph can also represent more advanced models if some constraints are added.



Another linearly structured covariance matrix is the Toeplitz matrix, and a special case is a Toeplitz matrix with zeros and different variances. The covariance graph for this special Toeplitz structure is given in Figure 1.2. This structure can, for example, represent that the covariances depend on the distance between equally spaced locations or time points. The covariance matrix is then given by

    Σ = [ σ1²  ρ    0    0
          ρ    σ2²  ρ    0
          0    ρ    σ3²  ρ
          0    0    ρ    σ4² ].

Figure 1.2: Covariance graph for a kind of Toeplitz covariance matrix with zeros.

In this thesis, primarily linearly structured covariance matrices are considered. We know that inference on the mean parameters strongly depends on the estimated covariance matrix and that the dispersion matrix for the estimator of the mean is a function of the covariance matrix. Hence, when testing the mean parameters the estimator of the covariance matrix is very important, and it is of great significance to have good estimators for the correct model.
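The two patterns above are easy to write down explicitly. The following small numpy sketch (an illustration only, with arbitrarily chosen numerical values that are not taken from the thesis) constructs one covariance matrix of each kind and checks that the chosen values give positive definite matrices.

    import numpy as np

    # Covariance graph pattern of Figure 1.1: sigma_12 = sigma_14 = sigma_23 = 0.
    Sigma_graph = np.array([
        [2.0, 0.0, 0.5, 0.0],
        [0.0, 1.5, 0.0, 0.3],
        [0.5, 0.0, 1.0, 0.4],
        [0.0, 0.3, 0.4, 2.5],
    ])

    # Banded Toeplitz-type pattern with different variances and common covariance rho.
    rho = 0.4
    variances = np.array([1.0, 1.2, 0.9, 1.1])
    Sigma_band = np.diag(variances) + rho * (np.eye(4, k=1) + np.eye(4, k=-1))

    # Both matrices must be positive definite to be valid covariance matrices.
    for name, S in [("graph", Sigma_graph), ("banded", Sigma_band)]:
        print(name, np.all(np.linalg.eigvalsh(S) > 0))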

    1.1 Background

One of the first to discuss a patterned covariance matrix was Wilks (1946), who considered the uniform (also known as intraclass) covariance structure when dealing with measurements on k equivalent psychological tests. The uniform structure is a linear covariance structure with equal diagonal elements and equal off-diagonal elements. This structure is also of interest since it arises as the marginal distribution in a random-effects model. Votaw (1948) extended the uniform model to the compound symmetry structure, a covariance structure similar to the uniform structure, but with blocks each having uniform structure. The intraclass model can also be generalized to the Toeplitz or circular Toeplitz structure, which is discussed by Olkin and Press (1969), Olkin (1973) and later Nahtman (2006). These models are all special cases of the invariant normal models considered by Andersson (1975), and a proper review of the results related to the invariant normal models can be found in Perlman (1987).

More recently, linearly structured covariance matrices have been discussed in the framework of graphical models, see, e.g., Drton and Richardson (2004) and Chaudhuri et al. (2007), where an iterative algorithm for the maximum likelihood estimators was given. Originally, many estimators of the covariance matrix were obtained from non-iterative least squares methods. With increasing computational power, iterative methods such as maximum likelihood and restricted maximum likelihood, among others, were introduced to estimate the patterned covariance matrix. Nowadays, the data sets and the dimensions of the models are very large and non-iterative methods are preferable again. In this thesis we


mainly consider explicit estimators for patterned covariance matrices in the multivariate linear model, but an iterative estimator for the Kronecker product covariance structure is also considered.

The Growth Curve model introduced by Potthoff and Roy (1964) has been extensively studied over the years. The mean structure for the Growth Curve model is bilinear instead of linear as in the ordinary multivariate linear model. Potthoff and Roy (1964) originally derived a class of weighted estimators for the parameter matrix which are functions of an arbitrary positive definite matrix. Khatri (1966a) extended this result and showed that the maximum likelihood estimator is also a weighted estimator.

Patterned covariance matrices for the Growth Curve model have been discussed in the literature; e.g., the intraclass covariance structure has been considered by Khatri (1973), Arnold (1981) and Lee (1988). Since the mean structure is bilinear, we will have a decomposition of the space generated by the design matrices into tensor spaces instead of linear spaces as in the linear model case. This decomposition makes maximum likelihood estimation of a patterned covariance matrix more complicated. In this thesis we use the decomposition of tensor spaces and derive explicit estimators of linearly structured covariance matrices.

    1.2 Outline

    This thesis consists of two parts and the outline is as follows.

    1.2.1 Outline of Part I

In Part I the background and theory are given. Chapter 2 starts with definitions and some results for the multivariate distributions that are used, i.e., the matrix normal and Wishart distributions. A review of univariate and matrix quadratic forms is also presented. The second part of Chapter 2 discusses estimation of the mean and covariance matrix in a multivariate linear model. The estimators discussed are maximum likelihood estimators for the non-patterned case and several methods for various structures.

In Chapter 3 we consider the Growth Curve model. The maximum likelihood estimators are given for the ordinary non-patterned case and various structures for the covariance matrix are discussed.

    Part I ends with a conclusion and some pointers for future work in Chapter 4.

    1.2.2 Outline of Part II

    Part II consists of four papers. Below follows a short summary for each of the papers.

    Paper A: On Distributions of Matrix Quadratic Forms

Ohlson, M. and Koski, T. (2009b). On distributions of matrix quadratic forms. Submitted to Communications in Statistics - Theory and Methods.


A characterization of the distribution of the multivariate quadratic form given by XAX′, where X is a p × n normally distributed matrix and A is an n × n symmetric real matrix, is presented. We show that the distribution of the quadratic form is the same as the distribution of a weighted sum of non-central Wishart distributed matrices. This is applied to derive the distribution of the sample covariance between the rows of X when the expectation is the same for every column and is estimated with the regular mean.

Paper B: Explicit Estimators under m-Dependence for a Multivariate Normal Distribution

Ohlson, M., Andrushchenko, Z., and von Rosen, D. (2009). Explicit estimators under m-dependence for a multivariate normal distribution. Accepted for publication in Annals of the Institute of Statistical Mathematics.

The problem of estimating parameters of a multivariate normal p-dimensional random vector is considered for a banded covariance structure reflecting m-dependence. A simple non-iterative estimation procedure is suggested which gives an explicit, unbiased and consistent estimator of the mean and an explicit and consistent estimator of the covariance matrix for arbitrary p and m.

Paper C: The Likelihood Ratio Statistic for Testing Spatial Independence using a Separable Covariance Matrix

Ohlson, M. and Koski, T. (2009a). The likelihood ratio statistic for testing spatial independence using a separable covariance matrix. Technical Report LiTH-MAT-R-2009-06, Department of Mathematics, Linköping University.

This paper deals with the problem of testing spatial independence for dependent observations. The sample observation matrix is assumed to follow a matrix normal distribution with a separable covariance matrix, in other words it can be written as a Kronecker product of two positive definite matrices. Two cases are considered, when the temporal covariance is known and when it is unknown. When the temporal covariance is known, the maximum likelihood estimates are computed and the asymptotic null distribution is given. In the case when the temporal covariance is unknown, the maximum likelihood estimates of the parameters are found by an iterative alternating algorithm and the null distribution for the likelihood ratio statistic is discussed.

Paper D: Explicit Estimators of Parameters in the Growth Curve Model with Linearly Structured Covariance Matrices

Ohlson, M. and von Rosen, D. (2009). Explicit estimators of parameters in the Growth Curve model with linearly structured covariance matrices. Submitted to Journal of Multivariate Analysis.

Estimation of parameters in the classical Growth Curve model when the covariance matrix has some specific linear structure is considered. In our examples maximum likelihood estimators cannot be obtained explicitly, and one must rely on optimization algorithms.


Therefore explicit estimators are obtained as alternatives to the maximum likelihood estimators. From a discussion about residuals, a simple non-iterative estimation procedure is suggested which gives explicit and consistent estimators of both the mean and the linearly structured covariance matrix.

    1.3 Other Publications

    Other publications from conferences are listed below.

• Distribution of Quadratic Forms at MatTriad 2007, 22-24 March 2007, Bedlewo, Poland.

• More on Distribution of Quadratic Forms at The 8th Tartu Conference on Multivariate Statistics and The 6th Conference on Multivariate Distributions with Fixed Marginals, 26-29 June 2007, Tartu, Estonia.

• Explicit Estimators under m-Dependence for a Multivariate Normal Distribution at LinStat 2008, 21-25 April 2008, Bedlewo, Poland.

• The Likelihood Ratio Statistic for Testing Spatial Independence using a Separable Covariance Matrix at Swedish Society for Medical Statistics, Spring conference, 27 March 2009, Uppsala, Sweden.

    1.4 Contributions

    The main contributions of the thesis are as follows.

• In Paper A a characterization of the distribution of a matrix quadratic form is derived. This characterization is similar to the one for the univariate case and can be used to prove several properties of the matrix quadratic form.

• Estimation of a banded covariance matrix, i.e., a matrix with zeros outside an arbitrarily large band, is considered in Paper B. A non-iterative estimation procedure is suggested which gives explicit, unbiased and consistent estimators.

• In Paper C an iterative alternating algorithm for the maximum likelihood estimators is derived for the case when we have a Kronecker product covariance structure and only one observation matrix. The cases with intraclass and autoregressive structure of order one are handled.

• Linearly structured covariance matrices for the Growth Curve model are discussed in Paper D. Using the residuals, a simple non-iterative estimation procedure is suggested which gives explicit and consistent estimators of both the mean and the linearly structured covariance matrix.


Part I

Estimation of Patterned Covariance Matrices


2 Estimation of the Covariance Matrix for a Multivariate Normal Distribution

Many testing, estimation and confidence interval procedures discussed in the multivariate statistical literature are based on the assumption that the observation vectors are independent and normally distributed (Anderson, 2003, Srivastava and Khatri, 1979, Muirhead, 1982). There are two main reasons for this. Firstly, sets of multivariate observations are often, at least approximately, normally distributed. Secondly, the multivariate normal distribution is mathematically tractable. Normally distributed data can be modeled entirely in terms of their means and variances/covariances. Estimating the mean and the covariance matrix are therefore problems of great interest in statistics, as well as in many related, more applied areas.

    2.1 Multivariate Distributions

    2.1.1 Matrix Normal Distribution

Suppose that X : p × n is a random matrix. Let the expectation of X be E(X) = M : p × n and the covariance matrix be cov(X) = Ω : (pn) × (pn).

Definition 2.1. A positive definite matrix A is said to be separable if it can be written as a Kronecker product of two positive definite matrices B and C,

    A = B ⊗ C.

Here ⊗ is the Kronecker product, see Kollo and von Rosen (2005). Suppose that the covariance matrix Ω is separable, i.e., Ω = Ψ ⊗ Σ, where Ψ : n × n and Σ : p × p are two positive definite covariance matrices. Assume that X is matrix normal distributed, denoted by X ∼ Np,n(M, Σ, Ψ), which is equivalent to

    vec X ∼ Npn(vec M, Ψ ⊗ Σ),



where vec( · ) is the vectorization operator (Kollo and von Rosen, 2005). The covariance matrix Σ can be interpreted as the covariance between the rows of X and Ψ can be interpreted as the covariance between the columns of X. Since Σ and Ψ are positive definite, written as Σ > 0 and Ψ > 0, the density function of X is

    f(X) = (2π)^{-pn/2} |Σ|^{-n/2} |Ψ|^{-p/2} etr{ -(1/2) Σ^{-1}(X − M)Ψ^{-1}(X − M)′ },     (2.1)

which is the same density function as for vec X, since we can write

    |Ψ ⊗ Σ| = |Ψ|^p |Σ|^n,  and
    vec′(X)(Ψ ⊗ Σ)^{-1}vec(X) = tr{Σ^{-1}XΨ^{-1}X′},

where etr(A) = exp(tr(A)) and |Σ| is the determinant of the matrix Σ.

One of the most important properties of the matrix normal distribution is that it is invariant under bilinear transformations.

Theorem 2.1
Let X ∼ Np,n(M, Σ, Ψ). For any matrices B : q × p and C : m × n,

    BXC′ ∼ Nq,m(BMC′, BΣB′, CΨC′).
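As an illustration of the separable covariance structure, the following sketch (my own example, not part of the thesis) draws one observation from a matrix normal distribution X ∼ Np,n(M, Σ, Ψ) by using the defining relation vec X ∼ Npn(vec M, Ψ ⊗ Σ); the numerical values of M, Σ and Ψ are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 3, 5
    M = np.zeros((p, n))
    Sigma = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])       # covariance between the rows of X
    Psi = 0.5 * np.eye(n) + 0.5               # covariance between the columns (intraclass)

    # X = Sigma^{1/2} Z Psi^{1/2} + M with Z having i.i.d. N(0,1) entries has
    # cov(vec X) = Psi kron Sigma, i.e., X ~ N_{p,n}(M, Sigma, Psi).
    A = np.linalg.cholesky(Sigma)
    B = np.linalg.cholesky(Psi)
    Z = rng.standard_normal((p, n))
    X = A @ Z @ B.T + M
    print(X.shape)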

    2.1.2 Wishart and Non-central Wishart Distribution

The Wishart distribution is the multivariate generalization of the χ2 distribution. The matrix W : p × p is said to be Wishart distributed if and only if W = YY′ for some matrix Y ∼ Np,n(M, Σ, I), where Σ is positive definite (Kollo and von Rosen, 2005). If the mean is zero, M = 0, the Wishart distribution is said to be central and this is denoted by W ∼ Wp(n, Σ). Otherwise, if M ≠ 0, the Wishart distribution is non-central, W ∼ Wp(n, Σ, Ω), where Ω = MM′.

The density function for W ∼ Wp(n, Σ) exists if n ≥ p and is given by

    fW(W) = ( 2^{pn/2} Γp(n/2) |Σ|^{n/2} )^{-1} |W|^{(n−p−1)/2} exp{ −(1/2) tr{Σ^{-1}W} },

for W > 0, where the multivariate gamma function Γp(n/2) is given by

    Γp(n/2) = π^{p(p−1)/4} ∏_{i=1}^{p} Γ( (n + 1 − i)/2 ).     (2.2)

If W ∼ Wp(n, Σ, Ω), then the characteristic function of W can be found in Muirhead (1982) and is given by

    φW(T) = |I − iΓΣ|^{-n/2} etr( −(1/2) Σ^{-1}Ω ) etr( (1/2) Σ^{-1}Ω (I − iΓΣ)^{-1} ),

where T = (tij), i, j = 1, . . . , p, Γ = (γij) = ((1 + δij) tij), δij is the Kronecker delta and tij = tji. Hence, for the case of a central Wishart distribution it is nothing more than

    φW(T) = |I − iΓΣ|^{-n/2}.


It is known that if a Wishart distributed matrix W is transformed as BWB′ for some matrix B : q × p, we obtain a new Wishart distributed matrix (Kollo and von Rosen, 2005, Muirhead, 1982).

Theorem 2.2
Let W ∼ Wp(n, Σ, Ω) and let B : q × p be a real matrix. Then

    BWB′ ∼ Wq(n, BΣB′, BΩB′).

It is also easy to see from the definition of a Wishart distribution that the sum of independent Wishart distributed variables is again Wishart distributed.

Theorem 2.3
Let the random matrix W1 ∼ Wp(n, Σ, Ω1) be independent of the matrix W2 ∼ Wp(m, Σ, Ω2). Then

    W1 + W2 ∼ Wp(n + m, Σ, Ω1 + Ω2).
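The defining relation W = YY′ also makes the central Wishart distribution easy to simulate. The sketch below (an illustration under arbitrary assumed values of Σ, not taken from the thesis) forms W for Y ∼ Np,n(0, Σ, I) and checks empirically that E(W) = nΣ.

    import numpy as np

    rng = np.random.default_rng(1)
    p, n, reps = 3, 10, 5000
    Sigma = np.array([[1.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 1.0]])
    L = np.linalg.cholesky(Sigma)

    W_mean = np.zeros((p, p))
    for _ in range(reps):
        Y = L @ rng.standard_normal((p, n))   # Y ~ N_{p,n}(0, Sigma, I)
        W_mean += Y @ Y.T / reps              # W = Y Y' ~ W_p(n, Sigma)

    print(np.round(W_mean / n, 2))            # should be close to Sigma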

    2.1.3 Results on Quadratic Forms

In this section we will study the distribution of a multivariate quadratic form Q = XAX′, where X : p × n is a random matrix and A : n × n is a real symmetric matrix. The distribution of Q has been studied by many authors, see for example Khatri (1962), Hogg (1963), Khatri (1966b), Hayakawa (1966), Shah (1970), Rao and Mitra (1971), Hayakawa (1972) and more recently Gupta and Nagar (2000), Vaish and Chaganty (2004), Kollo and von Rosen (2005) and the references therein. Wong and Wang (1993), Mathew and Nordström (1997), Masaro and Wong (2003) and Hu (2008) discussed Wishartness of the quadratic form when the covariance matrix is non-separable, i.e., when it cannot be written as a Kronecker product.

    Univariate Quadratic Forms

The quadratic form Q = XAX′ is the multivariate generalization of the univariate quadratic form

    q = x′Ax,

where x has a multivariate normal distribution x ∼ Np(µ, Σ), e.g., see Rao (1973), Graybill (1976), Srivastava and Khatri (1979), Muirhead (1982). We will start by considering the univariate case when the variables are independent. Later on we consider the dependent case with a covariance matrix Σ. The results are given below for the sake of completeness. There is a lot of literature on this topic and the results quoted below can be found in, for example, Graybill (1976), Srivastava and Khatri (1979) and Muirhead (1982). In the subsequent section these results are recapitulated for the multivariate case too. An application of the distribution of a specific quadratic form is the sample variance s².


Example 2.1

Let

    C = In − (1/n)11′,

where 1 = (1, . . . , 1)′ : n × 1 and In : n × n is the identity matrix. Then we have

    q = x′Cx = x′( I − (1/n)11′ )( I − (1/n)11′ )x
      = (x − x̄1)′(x − x̄1) = Σ_{i=1}^{n} (xi − x̄)² = (n − 1)s²,

where the second equality follows from the fact that C = C², i.e., C is idempotent. Furthermore, x̄ = (1/n)1′x and the distribution of q = x′Cx is the same as the distribution of (n − 1)s².

In the simplest case we suppose that the variables are standard normal, x ∼ Np(0, I). The distribution of q is central χ2 under some restrictions on A.

Theorem 2.4
Suppose that x ∼ Np(0, I) and let q = x′Ax. Then the distribution of q is central χ2, i.e., q ∼ χ2_k, if and only if A is a symmetric idempotent matrix and rank(A) = k.

Again, if we assume the variables to be independent, but with nonzero means, the distribution of q is non-central χ2, under the same restrictions on A as in Theorem 2.4. The non-centrality parameter will depend on A and µ.

Theorem 2.5
Suppose that x ∼ Np(µ, I) and let q = x′Ax. Then the distribution of q is non-central χ2, i.e., q ∼ χ2_k(δ), where δ = (1/2)µ′Aµ, if and only if A is an idempotent matrix and rank(A) = k.

Assume now that the variables in x are dependent, i.e., let x ∼ Np(µ, Σ), where Σ > 0. Since Σ is positive definite, Σ can be decomposed as Σ = Γ′Γ, where Γ : p × p and rank(Γ) = p. Using this decomposition the following theorem from Graybill (1976) is easy to prove.

Theorem 2.6
Suppose that x ∼ Np(µ, Σ), where rank(Σ) = p, and let q = x′Ax. Then the distribution of q is non-central χ2, i.e., q ∼ χ2_k(δ), where δ = (1/2)µ′Aµ, if and only if any of the following three conditions is satisfied:

• AΣ is an idempotent matrix of rank k.
• ΣA is an idempotent matrix of rank k.
• Σ is a generalized inverse of A and rank(A) = k.


Here Σ is a generalized inverse of A, denoted Σ = A⁻, if and only if AΣA = A. A special case of Theorem 2.6 is the following corollary (Graybill, 1976), where the covariance matrix is supposed to be a diagonal matrix and the mean is the same for every variable.

Corollary 2.1
Suppose that x ∼ Np(µ, D), where D is a diagonal matrix of rank p. Then q = x′Ax ∼ χ2_{p−1}(δ), where δ = (1/2)µ′Aµ, if

    A = D^{-1} − (1′D^{-1}1)^{-1} D^{-1}11′D^{-1}.

Also δ = 0 if µ = µ1 for any scalar µ.

Corollary 2.1 gives the distribution of the ordinary sample variance.

Example 2.2

Let x1, . . . , xn be a random sample and let xi ∼ N(µ, σ²) for i = 1, . . . , n. Furthermore, let the variables be independent. We can now write

    x = (x1, . . . , xn)′ ∼ Nn(µ1, σ²I)

and using Corollary 2.1 we have D = σ²I and

    A = D^{-1} − (1′D^{-1}1)^{-1} D^{-1}11′D^{-1} = σ^{-2}( I − n^{-1}11′ ).

Using this we see that the quadratic form x′Ax has the following χ2 distribution,

    x′Ax = x′( I − n^{-1}11′ )x / σ² = (x − x̄1)′(x − x̄1) / σ²
         = Σ_{i=1}^{n} (xi − x̄)² / σ² = (n − 1)s² / σ² ∼ χ2_{n−1},

and this is as we expected.

The next theorem is the most general one regarding the distribution of the quadratic form. It states that every quadratic form x′Ax has the same distribution as a weighted sum of independent non-central χ2 variables, see for example Baldessari (1967), Tan (1977) and Graybill (1976) for more details.

Theorem 2.7
Suppose that x ∼ Np(µ, Σ), where rank(Σ) = p. The random variable q = x′Ax has the same distribution as the random variable w = Σ_{i=1}^{p} di wi, where di are the latent roots of the matrix AΣ, and where wi are independent non-central χ2 random variables, each with one degree of freedom.

Theorem 2.7 and the fact that a sum of independent non-central χ2 variables is again non-central χ2 distributed imply that all the theorems above are special cases.
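Theorem 2.7 is straightforward to verify numerically in the central case µ = 0, where the wi are ordinary central χ2(1) variables. The sketch below (an assumed numerical illustration, not from the thesis) simulates q = x′Ax and the weighted sum with weights equal to the latent roots of AΣ, and compares their first two moments.

    import numpy as np

    rng = np.random.default_rng(2)
    p, reps = 4, 200_000
    A = np.array([[2.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.3, 0.0],
                  [0.0, 0.3, 1.5, 0.2],
                  [0.0, 0.0, 0.2, 1.0]])
    Sigma = 0.5 * np.eye(p) + 0.5                  # intraclass covariance
    L = np.linalg.cholesky(Sigma)

    x = L @ rng.standard_normal((p, reps))
    q = np.einsum('ij,ik,kj->j', x, A, x)          # q_r = x_r' A x_r for each replicate

    d = np.linalg.eigvals(A @ Sigma).real          # latent roots of A*Sigma
    w = (d[:, None] * rng.chisquare(1, size=(p, reps))).sum(axis=0)

    print(np.mean(q), np.mean(w))                  # both close to tr(A*Sigma)
    print(np.var(q), np.var(w))                    # both close to 2*tr((A*Sigma)^2)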


One of the first to consider independence between quadratic forms was Cochran (1934).

Theorem 2.8 (Cochran's Theorem)
Given x ∼ Np(0, I), suppose that x′x is decomposed into k quadratic forms, qi = x′Bix, i = 1, 2, . . . , k, where rank(Bi) = ri and Bi ≥ 0. Then any one of the following conditions implies the other two:

• Σ_{i=1}^{k} ri = p,
• each qi ∼ χ2_{ri},
• all the qi are mutually independent.

Many authors have generalized Cochran's theorem, see for example Chipman and Rao (1964), Styan (1970), Tan (1977) and the references therein.

    Multivariate Quadratic Forms

Several authors, see for example Khatri (1962), Shah (1970) and Vaish and Chaganty (2004), have investigated the conditions under which the quadratic form Q = XAX′, where X ∼ Np,n(M, Σ, Ψ), has a Wishart distribution. Rao (1973) answered the question with the relation

    XAX′ ∼ Wp  ⇔  l′XAX′l ∼ χ2,

for any fixed vector l. Hence, the theory of univariate quadratic forms can be applied to the multivariate case. If M = 0 and Ψ = I, we have a multivariate version of Theorem 2.4 (see for example Rao (1973) and Gupta and Nagar (2000) for the details).

Theorem 2.9
Let Y ∼ Np,n(M, Σ, I) and let A : n × n be a symmetric real matrix. Then Q = YAY′ is Wishart distributed if and only if A is an idempotent matrix.

Example 2.3

Let Y ∼ Np,n(µ1′, Σ, I), with µ = (µ1, . . . , µp)′, i.e., Y consists of n independent p-vectors, Y = (y1, . . . , yn). The sample mean vector ȳ and sample covariance matrix S are defined by

    ȳ = (1/n) Σ_{i=1}^{n} yi = (1/n) Y1,
    S = (1/n)(Y − ȳ1′)(Y − ȳ1′)′ = (1/n) YCY′,

where C is the centralization matrix, i.e., C = I − (1/n)11′. We know that C is idempotent and rank(C) = n − 1. Using Theorem 2.9 we have

    nS ∼ Wp(n − 1, Σ),

since MCM′ = µ1′C1µ′ = 0.

In the non-central case when M ≠ 0 and Ψ ≠ I, Q is non-central Wishart if and only if AΨ is an idempotent matrix.

Corollary 2.2
Let X ∼ Np,n(M, Σ, Ψ) and let A : n × n be a symmetric real matrix. Then

    Q = XAX′ ∼ Wp(r, Σ, MAM′),

where r = rank(AΨ), if and only if AΨ is an idempotent matrix.

The next theorem, from Kollo and von Rosen (2005) (Theorem 2.2.4), gives the necessary and sufficient conditions for two quadratic forms to be independent.

Theorem 2.10
Let X ∼ Np,n(M, Σ, Ψ), Y ∼ Np,n(0, Σ, Ψ) and let A : n × n and B : n × n be non-random matrices. Then

(i) YAY′ is independent of YBY′ if and only if

    ΨAΨB′Ψ = 0,  ΨA′ΨBΨ = 0,
    ΨAΨBΨ = 0,  ΨA′ΨB′Ψ = 0;

(ii) YAY′ is independent of YB if and only if

    B′ΨA′Ψ = 0,
    B′ΨAΨ = 0.

Theorem 2.10 can be used to show the independence between the sample mean and the sample covariance.

Example 2.4

Let Y ∼ Np,n(µ1′, Σ, I), with µ = (µ1, . . . , µp)′. Using Theorem 2.10 (ii) we see that ȳ and S given in Example 2.3 are independent.

Khatri (1962) extended Cochran's theorem (Theorem 2.8) to the multivariate case by discussing conditions for Wishartness and independence of second degree polynomials. Several others have also generalized Cochran's theorem to the multivariate case, see for example Rao and Mitra (1971), Khatri (1980), Vaish and Chaganty (2004) and Tian and Styan (2005).

Khatri (1966b) derived the density for Q = XAX′, when X ∼ Np,n(0, Σ, Ψ), i.e., for the central case M = 0, as

    2^{-pn/2} ( Γp(n/2) )^{-1} |AΨ|^{-p/2} |Σ|^{-n/2} |Q|^{(n−p−1)/2}
        × exp{ −(1/2) q^{-1} tr{Σ^{-1}Q} } 0F0( T, (1/2) q^{-1} Σ^{-1}Q ),     (2.3)


where q > 0 is an arbitrary constant and T = In − q A^{-1/2} Ψ^{-1} A^{-1/2}. The density function involves the hypergeometric function of matrix argument 0F0 and is relatively cumbersome to handle. The hypergeometric function can be expanded in terms of zonal polynomials Cκ( · ) (see Muirhead (1982) for details about the zonal polynomials) as

    0F0(R, S) = Σ_{k=0}^{∞} Σ_{κ} Cκ(R) Cκ(S) / ( Cκ(I) k! ),     (2.4)

which is slowly convergent, and the expansion in terms of Laguerre polynomials may be preferable for computational purposes.

The probability density function (2.3) is written as the product of a Wishart density function and a generalized hypergeometric function. This form is not always convenient for studying properties of Q. For M = 0 and Ψ = In, both Hayakawa (1966) and Shah (1970) derived the probability density function for Q. Using the moment generating function of Q, Shah (1970) expressed the density function of Q in terms of Laguerre polynomials with matrix argument. Hayakawa (1966) gave the probability density function for Q when M = 0 and Ψ = In as

    2^{-pn/2} ( Γp(n/2) )^{-1} |A|^{-p/2} |Q|^{(n−p−1)/2} 0F0( A^{-1}, −(1/2) Σ^{-1}Q ).     (2.5)

Hayakawa (1966) also showed that any quadratic form Q can be decomposed into a linear combination of independent central Wishart or pseudo-Wishart matrices with coefficients equal to the eigenvalues of A.

The Laplace transform was used in Khatri (1977) to generalize the results of Shah (1970) to the non-central case. When A = I, Khatri (1977) also obtained a similar representation of the non-central Wishart density in terms of the generalized Laguerre polynomial with matrix argument.

    Ohlson and Koski (2009b) characterized the distribution ofQ using the characteristicfunction. The characteristic function forQ = YAY′ is given by the following theoremdue to Ohlson and Koski (2009b).

Theorem 2.11
Let Y ∼ Np,n(M, Σ, I). Then the characteristic function of Q = YAY′ is

    φQ(T) = ∏_{j=1}^{r} { |Ip − iλjΓΣ|^{-1/2} etr( −(1/2) Σ^{-1}Ωj ) etr( (1/2) Σ^{-1}Ωj (Ip − iλjΓΣ)^{-1} ) },

where T = (tij), i, j = 1, . . . , p, Γ = (γij) = ((1 + δij) tij), tij = tji and δij is the Kronecker delta. The non-centrality parameters are Ωj = mjm′j, where mj = M∆j. The vectors ∆j and the values λj are the latent vectors and roots of A, respectively.


Ohlson and Koski (2009b) showed that the distribution of Q coincides with the distribution of a weighted sum of non-central Wishart distributed matrices, similar to the case when M = 0 and Ψ = I treated by Hayakawa (1966). A multivariate version of Theorem 2.7 was given by Ohlson and Koski (2009b) for the matrix quadratic form Q = YAY′.

Theorem 2.12
Suppose Y ∼ Np,n(M, Σ, I) and let Q be the quadratic form Q = YAY′, where A : n × n is any real matrix. Then the distribution of Q is the same as for W = Σ_j λjWj, where λj are the nonzero latent roots of A and Wj are independent non-central Wishart matrices, i.e.,

    Wj ∼ Wp(1, Σ, mjm′j),

where mj = M∆j and ∆j are the corresponding latent vectors.

Since this class of distributions is rather common, Ohlson and Koski (2009b) defined the distribution and gave several properties.

Definition 2.2. Assume Y ∼ Np,n(M, Σ, I). Define the distribution of the multivariate quadratic form Q = YAY′ to be Qp(A, M, Σ).

If we transform a Wishart distributed matrix W as BWB′, we obtain a new Wishart distributed matrix. In the same way we can transform our quadratic form.

Theorem 2.13
Let Q ∼ Qp(A, M, Σ) and let B : q × p be a real matrix. Then

    BQB′ ∼ Qq(A, BM, BΣB′).

More theorems and properties for the distribution of Q = YAY′ can be found in Ohlson and Koski (2009b).

The standard results about distributions of quadratic forms, such as Theorem 2.9, follow from Theorem 2.12 and the fact that idempotent matrices only have latent roots equal to zero or one (Horn and Johnson, 1990). The expectation of the quadratic form YAY′ is another property that is easy to derive using Theorem 2.12.

Theorem 2.14
Let Q = YAY′ ∼ Qp(A, M, Σ). Then E(YAY′) = tr(A)Σ + MAM′.

Now, suppose that X ∼ Np,n(M, Σ, Ψ), i.e., the columns are dependent as well.

Corollary 2.3
The distribution of Q = XAX′ is the same as that of

    W = Σ_j λjWj,

where λj are the nonzero latent roots of Ψ^{1/2}AΨ^{1/2} and Wj are independent non-central Wishart matrices, i.e.,

    Wj ∼ Wp(1, Σ, mjm′j),

where mj = MΨ^{-1/2}∆j and ∆j are the corresponding latent vectors.

Hence, we see that the distribution of Q is Qp(Ψ^{1/2}AΨ^{1/2}, MΨ^{-1/2}, Σ).

Remark 2.1. The latent roots are the same for Ψ^{1/2}AΨ^{1/2} and AΨ.
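The moment formula in Theorem 2.14 is convenient for a quick Monte Carlo check. The following sketch (illustrative values only, not from the thesis) uses the centralization matrix of Example 2.3 as A and compares the simulated mean of YAY′ with tr(A)Σ + MAM′.

    import numpy as np

    rng = np.random.default_rng(3)
    p, n, reps = 3, 6, 20_000
    M = np.outer(np.array([1.0, -0.5, 2.0]), np.ones(n))   # mean of the form mu*1'
    Sigma = np.array([[1.0, 0.4, 0.1],
                      [0.4, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])
    L = np.linalg.cholesky(Sigma)
    A = np.eye(n) - np.ones((n, n)) / n                    # centralization matrix C

    Q_mean = np.zeros((p, p))
    for _ in range(reps):
        Y = M + L @ rng.standard_normal((p, n))            # Y ~ N_{p,n}(M, Sigma, I)
        Q_mean += Y @ A @ Y.T / reps

    print(np.round(Q_mean, 2))
    print(np.round(np.trace(A) * Sigma + M @ A @ M.T, 2))  # tr(A)*Sigma + M A M'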

    2.2 Estimation of the Parameters

Originally, many estimators of the covariance matrix were obtained from non-iterative least squares methods such as the ANOVA and MINQUE approaches, for example when estimating variance components. When computer resources became more powerful, iterative methods such as maximum likelihood, restricted maximum likelihood and generalized estimating equations, among others, were introduced. However, nowadays one is interested in applying covariance structures, including variance components models, to very large data sets. These are found for example in QTL analysis in genetics, or in time series with densely sampled observations in meteorology or in EEG/EKG studies in medicine.

    2.2.1 Maximum Likelihood Estimators

Let x ∼ Np(µ, Σ), where µ = (µ1, . . . , µp)′ and Σ > 0. Furthermore, let xi, i = 1, . . . , n, be an independent random sample on x. The observation matrix can then be written as

    X = (x1, . . . , xn) ∼ Np,n(µ1′, Σ, I).

From (2.1) the likelihood function is given by

    L(µ, Σ) = (2π)^{-pn/2} |Σ|^{-n/2} etr{ −(1/2) Σ^{-1}(X − µ1′)(X − µ1′)′ }.     (2.6)

Since

    (X − µ1′)(X − µ1′)′ = S + n(x̄ − µ)(x̄ − µ)′,

where

    x̄ = (1/n) X1

and

    S = X( I − 1(1′1)^{-1}1′ )X′ = X( I − n^{-1}11′ )X′,     (2.7)


the likelihood function (2.6) can be written as

    L(µ, Σ) = (2π)^{-pn/2} |Σ|^{-n/2} etr{ −(1/2) Σ^{-1}( S + n(x̄ − µ)(x̄ − µ)′ ) }.     (2.8)

Hence, from the Fisher-Neyman factorization theorem (Schervish (1995), page 89), (x̄, S) is a sufficient statistic for (µ, Σ). The maximum likelihood estimators can be established through a series of inequalities:

    L(µ, Σ) = (2π)^{-pn/2} |Σ|^{-n/2} etr{ −(1/2) Σ^{-1}( S + n(x̄ − µ)(x̄ − µ)′ ) }     (2.9)
            ≤ (2π)^{-pn/2} |Σ|^{-n/2} etr{ −(1/2) Σ^{-1}S }     (2.10)
            ≤ (2π)^{-pn/2} |(1/n)S|^{-n/2} exp(−pn/2),     (2.11)

where the first inequality holds since (x̄ − µ)′Σ^{-1}(x̄ − µ) ≥ 0, and the second inequality holds since we can use

    |Σ|^{-n/2} etr{ −(1/2) Σ^{-1}S } ≤ |(1/n)S|^{-n/2} exp(−pn/2),

given in Srivastava and Khatri (1979), page 25. Equality in (2.9)-(2.11) holds if and only if µ = x̄ and Σ = (1/n)S. Hence, the maximum likelihood estimators for µ and Σ are

    µ̂ML = (1/n) X1     (2.12)

and

    Σ̂ML = (1/n) X( I − 1(1′1)^{-1}1′ )X′.     (2.13)

Since I − 1(1′1)^{-1}1′ is the projection onto the space orthogonal to the column space generated by the vector 1, using Theorem 2.10 (ii) it can be seen that µ̂ML and Σ̂ML are independent and distributed according to the following theorem (Anderson (2003), pages 77 and 255).

Theorem 2.15
If X ∼ Np,n(µ1′, Σ, I), then the maximum likelihood estimators given in (2.12) and (2.13) are independently distributed as

    µ̂ML ∼ Np( µ, (1/n)Σ )

and

    nΣ̂ML ∼ Wp(n − 1, Σ).
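The explicit estimators (2.12) and (2.13) are trivial to compute. The short sketch below (with simulated data and arbitrary parameter values, purely for illustration) evaluates µ̂ML = (1/n)X1 and Σ̂ML = (1/n)X(I − 1(1′1)^{-1}1′)X′; note the divisor n rather than n − 1.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n = 3, 50
    mu = np.array([1.0, 0.0, -1.0])
    Sigma = np.array([[1.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 1.0]])
    X = mu[:, None] + np.linalg.cholesky(Sigma) @ rng.standard_normal((p, n))

    one = np.ones(n)
    mu_hat = X @ one / n                                  # (2.12)
    C = np.eye(n) - np.outer(one, one) / n                # projection orthogonal to 1
    Sigma_hat = X @ C @ X.T / n                           # (2.13)

    print(np.round(mu_hat, 2))
    print(np.round(Sigma_hat, 2))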


    2.2.2 Patterned Covariance Matrix

Patterned covariance matrices arise in a variety of contexts and have been studied by many authors. In a seminal paper, Wilks (1946) considered patterned structures when dealing with measurements on k equivalent psychological tests. This led to a covariance matrix with equal diagonal elements and equal off-diagonal elements, i.e., a covariance matrix given by

    Σ = σ²( (1 − ρ)I + ρ11′ ) : p × p,     (2.14)

with −1/(p − 1) < ρ < 1. This structure is called the uniform, complete symmetry or intraclass covariance structure. Wilks (1946) developed statistical test criteria for testing equality in means, equality in variances and equality in covariances. The structure implies that both the inverse and the determinant have closed form expressions (Muirhead (1982), page 114) and the maximum likelihood estimators are given by

    σ̂² = tr S / ( p(n − 1) )

and

    ρ̂ = ( 1′S1 − tr S ) / ( (p − 1) tr S ),

where S is given in (2.7).
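Written out, the intraclass estimators above require only S. The following sketch (simulated data and arbitrary parameter values, purely illustrative) computes σ̂² and ρ̂ directly from the formulas as stated, with S taken from (2.7).

    import numpy as np

    rng = np.random.default_rng(5)
    p, n, sigma2, rho = 4, 100, 2.0, 0.3
    Sigma = sigma2 * ((1 - rho) * np.eye(p) + rho * np.ones((p, p)))
    X = np.linalg.cholesky(Sigma) @ rng.standard_normal((p, n))

    S = X @ (np.eye(n) - np.ones((n, n)) / n) @ X.T       # S as in (2.7)
    one = np.ones(p)
    sigma2_hat = np.trace(S) / (p * (n - 1))
    rho_hat = (one @ S @ one - np.trace(S)) / ((p - 1) * np.trace(S))
    print(round(sigma2_hat, 3), round(rho_hat, 3))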

Votaw (1948) extended the intraclass model to a model with blocks called compound symmetry, of type I and type II. The compound symmetry covariance matrices are, for the p = 4 case, given as

    ΣI = [ α β β β          ΣII = [ α β κ σ
           β γ δ δ                  β α σ κ
           β δ γ δ                  κ σ γ δ
           β δ δ γ ]                σ κ δ γ ].     (2.15)

Votaw (1948) considered different psychometric and medical research problems where compound symmetry is applicable. In Szatrowski (1982) block compound symmetry was discussed and the models were applied to the analysis of an educational testing problem. In a series of papers, Szatrowski (1985) discussed how to obtain maximum likelihood estimators for the elements of a class of patterned covariance matrices in the presence of missing data.

Another type of block structure arises when the multivariate complex normal distribution is considered. The multivariate complex normal distribution can be defined as in Srivastava and Khatri (1979).

Definition 2.3. Let z = x + iy with mean θ and covariance matrix Q = Σ1 + iΣ2. Then z ∼ CNp(θ, Q) if and only if

    ( x
      y ) ∼ N2p( ( θ1
                    θ2 ), ΣC ),

where

    ΣC = (1/2) [ Σ1  −Σ2
                 Σ2   Σ1 ],   θ = θ1 + iθ2.

Goodman (1963) was one of the first to study the covariance matrix of the multivariate complex normal distribution, which for example arises in spectral analysis of multiple time series. A direct extension is to study quaternions, e.g., see Andersson (1975) and Andersson et al. (1983). The covariance matrix in the quaternion case is a 4p × 4p matrix and is structured as

    ΣQ = [  Σ1   Σ2   Σ3   Σ4
           −Σ2   Σ1  −Σ4   Σ3
           −Σ3   Σ4   Σ1  −Σ2
           −Σ4  −Σ3   Σ2   Σ1 ].

A circular stationary model, where variables are thought of as being equally spaced around a circle, was considered by Olkin and Press (1969). The covariance between two variables in the circular stationary model depends on the distance between the variables, and the covariance matrices for p = 4 and p = 5 have the structures

    ΣCS = σ0² [ 1   ρ1  ρ2  ρ1          ΣCS = σ0² [ 1   ρ1  ρ2  ρ2  ρ1
                ρ1  1   ρ1  ρ2                      ρ1  1   ρ1  ρ2  ρ2
                ρ2  ρ1  1   ρ1                      ρ2  ρ1  1   ρ1  ρ2
                ρ1  ρ2  ρ1  1 ]                     ρ2  ρ2  ρ1  1   ρ1
                                                    ρ1  ρ2  ρ2  ρ1  1 ].

Olkin and Press (1969) considered three symmetries, namely circular, intraclass and spherical, and derived the likelihood ratio tests and the asymptotic distributions under the hypothesis and the alternative. Olkin (1973) generalized the circular stationary model to a multivariate version in which each element is a vector and the covariance matrix can be written as a block circular matrix.

The covariance symmetries investigated by, for example, Wilks (1946), Votaw (1948) and Olkin and Press (1969) are all special cases of the invariant normal models considered by Andersson (1975). The invariant normal models include not only all models specified by symmetries of the covariance matrix, but also the linear models for the mean. The symmetry model defined by a group G is the family of covariance matrices given by

    S+_G = { Σ | Σ > 0, GΣG′ = Σ for all G ∈ G },

i.e., if x is a random vector with cov(x) = Σ such that Σ ∈ S+_G, then there is a symmetry restriction on the covariance matrix, namely that cov(x) = cov(Gx) for all G ∈ G. Perlman (1987) summarized and discussed the group symmetry covariance models suggested by Andersson (1975). Furthermore, several examples of different symmetries, maximum likelihood estimators and likelihood ratio tests are given in Perlman (1987). The following example from Perlman (1987) shows the connection between a certain symmetry and a group representation.


Example 2.5

Let the random vector x be partitioned as x = (x′1, . . . , x′k)′ and assume that cov(x) = Σ has circular block symmetry, i.e.,

    cov(x) = cov(P^r x),  r = 1, . . . , k − 1,     (2.16)

where

    P = [ 0   Iq  0   · · ·  0
          0   0   Iq         0
          0   0   0    . . . Iq
          Iq  0   0   · · ·  0 ] : (qk) × (qk).

Let G1 be the cyclic group of order k given by

    G1 = { I, P, . . . , P^{k−1} }.

For k = 4 one can verify that S+_{G1} consists of all positive definite matrices of the form

    ΣCBS = [ A  B  C  B′
             B′ A  B  C
             C  B′ A  B
             B  C  B′ A ],   A = A′, C = C′,

where A, B, C : q × q, and it follows that the circular block symmetry condition (2.16) is equivalent to Σ ∈ S+_{G1}.
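For q = 1 and k = 4 the circular block symmetry condition is easy to check numerically; the sketch below (with arbitrary values a, b, c, not from the thesis) verifies that PΣP′ = Σ for a matrix of the form ΣCBS above.

    import numpy as np

    a, b, c = 2.0, 0.5, 0.3
    Sigma_cbs = np.array([[a, b, c, b],
                          [b, a, b, c],
                          [c, b, a, b],
                          [b, c, b, a]])
    P = np.roll(np.eye(4), 1, axis=1)                     # cyclic shift permutation
    print(np.allclose(P @ Sigma_cbs @ P.T, Sigma_cbs))    # True: Sigma is in S+_G1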

In Jensen (1988) a class of covariance models that is larger than the class of invariant normal models was obtained. Jensen (1988) considered structures which are linear in both the covariance and the inverse covariance and proved that these can be parameterized by Jordan algebras. The structures which are linear in the covariance are given by

    Σ = Σ_{i=1}^{m} σi Gi,     (2.17)

where Gi, i = 1, . . . , m, are symmetric and linearly independent known matrices and σi, i = 1, . . . , m, are unknown real parameters such that Σ is positive definite. The structure (2.17) was first discussed by Anderson (1969, 1970, 1973), where the likelihood equations, an iterative method for solving these equations and the asymptotic distribution of the estimates were given.
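As a simple illustration of the parameterization (2.17) (and not of Anderson's iterative maximum likelihood procedure), the coefficients σi can be fitted by least squares once a sample covariance matrix S is available, by regressing vec S on the vectors vec Gi. A short sketch, with an assumed intraclass basis and assumed numerical values:

    import numpy as np

    def fit_linear_structure(S, Gs):
        # Stack vec(G_i) as columns and solve min || S - sum_i sigma_i G_i ||_F.
        G_mat = np.column_stack([G.ravel() for G in Gs])
        coef, *_ = np.linalg.lstsq(G_mat, S.ravel(), rcond=None)
        return coef

    # Example basis: the intraclass structure Sigma = sigma_1 * I + sigma_2 * (11' - I).
    p = 4
    Gs = [np.eye(p), np.ones((p, p)) - np.eye(p)]
    S = np.array([[2.1, 0.4, 0.5, 0.3],
                  [0.4, 1.9, 0.6, 0.5],
                  [0.5, 0.6, 2.0, 0.4],
                  [0.3, 0.5, 0.4, 2.2]])
    print(np.round(fit_linear_structure(S, Gs), 3))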

Permutation invariant covariance matrices were considered in Nahtman (2006) and it was proven that permutation invariance implies a specific structure for the covariance matrix. Nahtman and von Rosen (2005) showed that shift invariance implies Toeplitz covariance matrices and that marginal shift invariance gives block Toeplitz covariance matrices.


There exist many papers on Toeplitz covariance matrices, e.g., see Burg et al. (1982), Fuhrmann and Barton (1990), Marin and Dhorne (2002) and Christensen (2007). To have a Toeplitz structure means that certain invariance conditions are fulfilled, e.g., equality of variances and covariances. Toeplitz covariance matrices are all banded matrices. Banded covariance matrices and their inverses frequently arise in biological and economical time series and in time series in engineering, for example in signal processing applications, including autoregressive or moving average image modeling, covariances of Gauss-Markov random processes (Woods, 1972, Moura and Balram, 1992), or numerical approximations of partial differential equations based on finite differences. Banded matrices are also used to model the correlation of cyclostationary processes in periodic time series (Chakraborty, 1998).

One type of Toeplitz matrix is the covariance matrix arising from the theory of time series. Durbin (1959) suggested an efficient estimation procedure for the parameters in a Gaussian moving average time series of order one. Godolphin and De Gooijer (1982) computed the exact maximum likelihood estimates of the parameters for a moving average process. The autoregressive moving average process has been discussed by many authors. Anderson (1975) gave the likelihood equations for the parameters, which are non-linear in most cases, and proposed some iterative solutions. Anderson (1977) derived Newton-Raphson procedures for the same likelihood equations. For more references on time series analysis, see for example Box and Jenkins (1970), Hannan (1970) and Anderson (1971). See Figure 2.2 for the connections between different covariance structures.

Several authors have considered other structures. For example, Jöreskog (1981) considered linear structural relations (LISREL) models and Lauritzen (1996) more sophisticated structures within the framework of graphical models. Browne (1977) reviews patterned correlation matrices arising from multiple psychological measurements.

    Linearly Structured Covariance Matrices with Zeros

Covariance matrices with zeros have been considered by several authors, see for example Grzebyk et al. (2004), Drton and Richardson (2004), Mao et al. (2004) and Chaudhuri et al. (2007).

A new algorithm, called iterative conditional fitting, for the maximum likelihood estimators in the multivariate normal case is derived in Drton and Richardson (2004) and Chaudhuri et al. (2007). The iterative conditional fitting algorithm is based on covariance graph models and is an iterative algorithm for deriving the maximum likelihood estimators. Suppose we have a random vector x = (x1, x2, x3, x4)′ with the covariance matrix

    Σ = [ σ11  0    σ13  0
          0    σ22  0    σ24
          σ13  0    σ33  σ34
          0    σ24  σ34  σ44 ].     (2.18)

    A covariance graph is a graph with a vertex for every random variable in the random vec-tor x. The vertices in the graph are connected by a bi-directed edgei↔ j unlessσij = 0.Hence, the covariance graph corresponding to the covariance matrix (2.18) is given inFigure 2.1. The iterative conditional fitting algorithm developed in Drton and Richardson(2004), Chaudhuri et al. (2007) is given as follows. First fix the marginal distribution for


Figure 2.1: Covariance graph for the covariance matrix (2.18).

the variables different from i. Then estimate, by maximum likelihood, the conditional distribution of variable i given all the other variables under the constraints implied by the covariance graph model. Last, calculate a new estimate of the joint distribution by multiplying together the fixed marginal and the estimated conditional distributions. Repeat until some convergence is obtained.

Let the index set of the variables be V = {1, . . . , p}, the set of all variables but i be −i = V \{i} and sp(i) = {j | i ↔ j}, i.e., the set of variables that are dependent on variable i. Let also nsp(i) = V \(sp(i) ∪ {i}). Assume that we have n observations from a normal distribution with mean zero and a covariance matrix with some zeros. Let the observation matrix be X = (x1, . . . , xn) ∼ Np,n(0, Σ, In). Furthermore, the distribution of the variables X_{−i} is given by

X_{−i} ∼ Np−1,n(0, Σ_{−i,−i}, In),

where X_S is the submatrix of X including the rows given by the index set S and Σ_{S,T} is the submatrix of Σ including the rows and columns given by the index sets S and T, respectively.

The conditional distribution of X_i | X_{−i} is then

X_i | X_{−i} ∼ N1,n( Σ_{i,−i} Σ^{-1}_{−i,−i} X_{−i}, λ_i In ),    (2.19)

where

λ_i = σ_ii − Σ_{i,−i} Σ^{-1}_{−i,−i} Σ_{−i,i}.

Using the fact that σ_ij = 0 if j ∈ nsp(i), we have from (2.19)

X_i | X_{−i} ∼ N1,n( ∑_{j∈sp(i)} σ_ij Z^(i)_j , λ_i In ),    (2.20)

where the pseudo-variables Z^(i)_{sp(i)} are defined as

Z^(i)_{sp(i)} = ( Σ^{-1}_{−i,−i} )_{sp(i),−i} X_{−i}.    (2.21)

    A more precise formulation of the iterative conditional fitting algorithm is as follows.


Algorithm 2.1 (Iterative conditional fitting algorithm)

1. Set the iteration counter r = 0 and choose a starting value Σ̂^(0) ≥ 0, for example the identity matrix.

2. Set Σ̂^(r,0) = Σ̂^(r) and repeat the following steps for all i ∈ V:

   (i) Let Σ̂^(r,i)_{−i,−i} = Σ̂^(r,i−1)_{−i,−i} and calculate from this submatrix the pseudo-variables Z^(i)_{sp(i)} given in (2.21).

   (ii) Compute the maximum likelihood estimates

        Σ̂^(r,i)_{i,sp(i)} = X_i Z^(i)'_{sp(i)} ( Z^(i)_{sp(i)} Z^(i)'_{sp(i)} )^{-1},

        λ̂_i = (1/n) ( X_i − Σ̂^(r,i)_{i,sp(i)} Z^(i)_{sp(i)} ) ( X_i − Σ̂^(r,i)_{i,sp(i)} Z^(i)_{sp(i)} )',

        for the linear regression given in (2.20).

   (iii) Complete Σ̂^(r,i) by solving for σ_ii, i.e., setting

        σ̂^(r,i)_{ii} = λ̂_i + Σ̂^(r,i)_{i,sp(i)} ( ( Σ̂^(r,i)_{−i,−i} )^{-1} )_{sp(i),sp(i)} Σ̂^(r,i)_{sp(i),i}.

3. Set Σ̂^(r+1) = Σ̂^(r,p), increment the counter r to r + 1 and repeat step 2 until convergence is obtained.
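To make the updates in Algorithm 2.1 concrete, below is a rough numpy sketch of one possible implementation, assuming a zero mean and a fixed zero pattern; the names icf, zero_pattern and n_iter are illustrative and not taken from the cited papers.

```python
import numpy as np

def icf(X, zero_pattern, n_iter=50):
    """Sketch of iterative conditional fitting for a covariance matrix with zeros.

    X            : p x n observation matrix (mean assumed to be zero).
    zero_pattern : boolean p x p array, True where sigma_ij is restricted to zero.
    """
    p, n = X.shape
    Sigma = np.eye(p)                                      # step 1: starting value
    for _ in range(n_iter):                                # step 3: repeat until convergence
        for i in range(p):                                 # step 2: cycle over the variables
            mi = np.array([j for j in range(p) if j != i])             # index set -i
            sp = [j for j in mi if not zero_pattern[i, j]]             # sp(i)
            xi = X[i, :]
            if not sp:
                Sigma[i, i] = xi @ xi / n
                continue
            S_mi_inv = np.linalg.inv(Sigma[np.ix_(mi, mi)])
            rows = [int(np.where(mi == j)[0][0]) for j in sp]          # positions of sp(i) within -i
            Z = S_mi_inv[rows, :] @ X[mi, :]                           # pseudo-variables, cf. (2.21)
            coef = xi @ Z.T @ np.linalg.inv(Z @ Z.T)                   # step 2(ii), regression (2.20)
            resid = xi - coef @ Z
            lam = resid @ resid / n
            Sigma[i, sp] = coef                                        # update row/column i
            Sigma[sp, i] = coef
            sub = S_mi_inv[np.ix_(rows, rows)]                         # ((Sigma_{-i,-i})^{-1})_{sp,sp}
            Sigma[i, i] = lam + coef @ sub @ coef                      # step 2(iii)
    return Sigma
```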

Ohlson et al. (2009) studied banded matrices with unequal elements except that certain covariances are zero. This banded structure is a special case of the structure considered by Chaudhuri et al. (2007), see Figure 2.2. The basic idea is that widely separated observations in time or space often appear to be uncorrelated. Therefore, it is reasonable to work with a banded covariance structure where all covariances more than m steps apart equal zero, a so called m-dependent structure. Let Σ^(m)_(k) : k × k be an m-dependent banded covariance matrix and partition Σ^(m)_(k) as

Σ^(m)_(k) = ( Σ^(m)_(k−1)   σ_1k
              σ'_k1          σ_kk ),    (2.22)

where

σ'_k1 = (0, . . . , 0, σ_{k,k−m}, . . . , σ_{k,k−1}).

The procedure suggested by Ohlson et al. (2009) to estimate a banded covariance matrix Σ^(m)_(k) is given by the following algorithm.


Algorithm 2.2 (Explicit estimators for a banded covariance matrix)

Let X ∼ Np,n(µ1'_n, Σ^(m)_(p), In), with arbitrary integer m. The estimators of µ and Σ^(m)_(p) are given by the following two steps.

(i) Use the maximum likelihood estimators for µ_1, . . . , µ_{m+1} and Σ^(m)_(m+1).

(ii) Calculate the following estimators for k = m + 2, . . . , p in increasing order, where for each k let i = k − m, . . . , k − 1:

    µ̂_k = (1/n) x'_k 1_n,    (2.23)

    σ̂_ki = β̂_ki |Σ̂_(k−1)| / |Σ̂_(k−2)|,    (2.24)

    σ̂_kk = (1/n) x'_k ( In − X̂_{k−1}(X̂'_{k−1}X̂_{k−1})^{-1}X̂'_{k−1} ) x_k + σ̂'_k1 Σ̂^{-1}_(k−1) σ̂_1k,    (2.25)

where

    σ̂'_k1 = (0, . . . , 0, σ̂_{k,k−m}, . . . , σ̂_{k,k−1}),

    β̂_k = (β̂_k0, β̂_{k,k−m}, . . . , β̂_{k,k−1})' = (X̂'_{k−1}X̂_{k−1})^{-1}X̂'_{k−1} x_k,    (2.26)

    X̂_{k−1} = (1_n, x̂_{k−1,k−m}, . . . , x̂_{k−1,k−1})

and

    x̂_{k−1,i} = ∑_{j=1}^{k−1} (−1)^{i+j} |M̂^{ji}_(k−1)| / |Σ̂_(k−2)| x_j,

where M̂^{ji}_(k−1) is the matrix obtained when the jth row and ith column have been removed from Σ̂_(k−1).

The estimators in Algorithm 2.2 are fairly natural, but they are ad hoc estimators and Ohlson et al. (2009) motivated them with the following theorem.

Theorem 2.16
The estimator µ̂ = (µ̂_1, . . . , µ̂_p)' given in Algorithm 2.2 is unbiased and consistent, and the estimator Σ̂^(m)_(p) = (σ̂_ij) is consistent.
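As a hedged illustration of how Algorithm 2.2 can be carried out, the following is a rough numpy sketch under the assumptions above; it uses the identity adj(Σ̂_(k−1)) = |Σ̂_(k−1)| Σ̂_(k−1)^{-1} to form the pseudo-variables x̂_{k−1,i}, and the function name banded_estimate is illustrative only.

```python
import numpy as np

def banded_estimate(X, m):
    """Rough sketch of the explicit estimators in Algorithm 2.2 for an
    m-dependent (banded) covariance matrix; X is p x n, rows ordered so
    that the m-dependence applies along the row index."""
    p, n = X.shape
    mu = X.mean(axis=1)                                   # row means, cf. (2.23)
    Xc = X - mu[:, None]
    Sigma = np.zeros((p, p))
    # step (i): ordinary MLE for the leading (m+1) x (m+1) block
    Sigma[: m + 1, : m + 1] = Xc[: m + 1] @ Xc[: m + 1].T / n
    ones = np.ones(n)
    for k in range(m + 1, p):                             # 0-based new row; text k = m+2,...,p
        S_prev = Sigma[:k, :k]                            # Sigma_(k-1) in the text
        det_prev = np.linalg.det(S_prev)
        det_prev2 = np.linalg.det(S_prev[:k - 1, :k - 1])  # |Sigma_(k-2)|
        # pseudo-variables: rows of |S_(k-1)|/|S_(k-2)| * S_(k-1)^{-1} X_{1:k-1}
        Z = (det_prev / det_prev2) * np.linalg.solve(S_prev, X[:k, :])
        idx = np.arange(k - m, k)                          # i = k-m, ..., k-1 (0-based)
        D = np.column_stack([ones] + [Z[i] for i in idx])  # design matrix X_hat_{k-1}
        beta, *_ = np.linalg.lstsq(D, X[k], rcond=None)    # (2.26)
        sigma_k = np.zeros(k)
        sigma_k[idx] = beta[1:] * det_prev / det_prev2     # (2.24)
        resid = X[k] - D @ beta
        Sigma[k, :k] = sigma_k
        Sigma[:k, k] = sigma_k
        Sigma[k, k] = resid @ resid / n + sigma_k @ np.linalg.solve(S_prev, sigma_k)  # (2.25)
    return mu, Sigma
```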

    2.2.3 Estimating the Kronecker Product Covariance

If the covariance matrix for a matrix normal distribution is separable, i.e., if it can be written as a Kronecker product between two matrices, the model belongs to the curved exponential family. Thus, under the Kronecker product structure the parameter space


Figure 2.2: Different covariance structures (Σ_WZ = with zeros, Σ^(m) = banded, Σ_T = Toeplitz, Σ_CT = circular Toeplitz, Σ_AR = autoregressive and Σ_IC = intraclass).

is of lower dimension and has other statistical properties, i.e., one has to be careful since estimation and testing can be more complicated. Kronecker product structures in covariance matrices have recently been studied by Dutilleul (1999), Naik and Rao (2001), Lu and Zimmerman (2005), Roy and Khattree (2005a), Mitchell et al. (2005, 2006) and Srivastava et al. (2007).

Let X1, . . . , XN be a random sample from the matrix normal distribution, Xi ∼ Np,n(M, Σ, Ψ) for i = 1, . . . , N. From (2.1) the logarithm of the likelihood function for the sample X1, . . . , XN can, ignoring the normalizing constant, be written as

ln L(M, Σ, Ψ) = −(nN/2) ln|Σ| − (pN/2) ln|Ψ| − (1/2) ∑_{i=1}^{N} tr{ Σ^{-1}(Xi − M) Ψ^{-1} (Xi − M)' }.    (2.27)

The likelihood equations for the maximum likelihood estimators M̂, Σ̂ and Ψ̂ are given by Dutilleul (1999) as

M̂ = (1/N) ∑_{i=1}^{N} Xi = X̄,    (2.28)

Σ̂ = (1/(nN)) ∑_{i=1}^{N} (Xi − X̄) Ψ̂^{-1} (Xi − X̄)'    (2.29)

and

Ψ̂ = (1/(pN)) ∑_{i=1}^{N} (Xi − X̄)' Σ̂^{-1} (Xi − X̄).    (2.30)

There is no explicit solution to (2.29) and (2.30). Dutilleul (1999) derived an estimator for Ψ ⊗ Σ using the flip-flop algorithm given in Algorithm 2.3. Srivastava et al. (2007) pointed out that the Gaussian model with a separable covariance matrix, i.e., X ∼ Np,n(M, Σ, Ψ), belongs to the curved exponential family and that convergence and uniqueness have to be carefully considered.

Algorithm 2.3 (The flip-flop algorithm)

1. Choose a starting value Ψ̂ = Ψ̂^(0).

2. Estimate Σ̂^(r) from (2.29) with Ψ̂ = Ψ̂^(r−1).

3. Estimate Ψ̂^(r) from (2.30) with Σ̂ = Σ̂^(r).

4. Repeat steps 2 and 3 until some convergence criterion is fulfilled.
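A rough numpy sketch of the flip-flop iteration for N observation matrices is given below; the names flip_flop, n_iter and tol are illustrative, and the scale normalization used is one of the constraints discussed below.

```python
import numpy as np

def flip_flop(X_list, n_iter=100, tol=1e-6):
    """Sketch of the flip-flop algorithm for X_i ~ N_{p,n}(M, Sigma, Psi),
    i = 1, ..., N; X_list is a list of p x n arrays."""
    N = len(X_list)
    p, n = X_list[0].shape
    M = sum(X_list) / N                         # (2.28)
    Psi, Sigma = np.eye(n), np.eye(p)           # step 1: starting value
    for _ in range(n_iter):
        Psi_inv = np.linalg.inv(Psi)
        Sigma_new = sum((Xi - M) @ Psi_inv @ (Xi - M).T for Xi in X_list) / (n * N)   # (2.29)
        Sigma_inv = np.linalg.inv(Sigma_new)
        Psi_new = sum((Xi - M).T @ Sigma_inv @ (Xi - M) for Xi in X_list) / (p * N)   # (2.30)
        scale = Psi_new[-1, -1]                 # fix the scale (psi_nn = 1), cf. the
        Psi_new, Sigma_new = Psi_new / scale, Sigma_new * scale   # identifiability discussion below
        done = (np.abs(Psi_new - Psi).max() < tol and np.abs(Sigma_new - Sigma).max() < tol)
        Sigma, Psi = Sigma_new, Psi_new
        if done:
            break
    return M, Sigma, Psi
```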

Another problem with the estimation of Σ and Ψ, respectively, is that the parameters are not uniquely defined, since for every scalar c ≠ 0 we have

Ψ ⊗ Σ = cΨ ⊗ (1/c)Σ.

Srivastava et al. (2007) gave two ways to obtain unique estimates: either set ψ_nn = 1 or ψ_ii = 1 for i = 1, . . . , n, respectively, and then use Algorithm 2.3 under these constraints.

2.2.4 Estimating the Kronecker Product Covariance (One Observation Matrix)

In many situations in statistical analysis the assumption of independence between observations is violated. Observations from a spatio-temporal stochastic process can under some conditions be described by a matrix normal distribution with a separable covariance matrix. Often, when the stochastic process is spatio-temporal, some structure can also be assumed for one or both of the matrices in the Kronecker product. Shitan and Brockwell (1995) and Ohlson and Koski (2009a) have considered the Kronecker product structure in time series analysis and structures for the covariance matrices.


Spatio-temporal processes typically only have one observation matrix. Let a random matrix X : p × n have a matrix normal distribution with a separable covariance matrix, i.e., X ∼ Np,n(M, Σ, Ψ), where the covariance between the rows is Σ : p × p and the covariance between the columns is Ψ : n × n. We will call Σ and Ψ the spatial and temporal covariance matrix, respectively. Furthermore, we will start by assuming that Ψ is known.

Assume that X = (x1, x2, . . . , xn), where xi ∼ Np(µ, Σ). Since the temporal covariance is Ψ, we have n dependent vectors. The expectation of X is given by E(X) = µ1', where µ = (µ1, . . . , µp)'. When N = 1 in the log-likelihood function (2.27), it is easy to prove the following theorem.

Theorem 2.17
Let X ∼ Np,n(µ1', Σ, Ψ), where Σ > 0 and Ψ is known. The maximum likelihood estimators for µ and Σ are given by

µ̂ = (1'Ψ^{-1}1)^{-1} X Ψ^{-1} 1

and

nΣ̂ = X H X',

where H is the weighted centralization matrix

H = Ψ^{-1} − Ψ^{-1}1(1'Ψ^{-1}1)^{-1}1'Ψ^{-1}.    (2.31)

We know that the ordinary sample covariance matrix nS = X(I − (1/n)11')X' is Wishart distributed and we can show that the same is valid for the sample covariance matrix A = XHX'. Since HΨ is idempotent, rank(H) = n − 1, and using Corollary 2.2 we have the following corollary.

Corollary 2.4
Let X ∼ Np,n(µ1', Σ, Ψ), where Ψ is known. Then nΣ̂ = XHX' ∼ Wp(n − 1, Σ).

In most cases, both the spatial and the temporal covariance matrices are unknown. If the two covariance matrices have no structure, it is impossible to estimate all parameters from one sample observation matrix X. There are p(p + 1)/2 + n(n + 1)/2 + p parameters to estimate. In many applications we reduce the number of parameters by assuming that the temporal covariance matrix Ψ has some structure, see for example Chaganty and Naik (2002), Huizenga et al. (2002), Roy and Khattree (2005a,b).

Let X ∼ Np,n(µ1', Σ, Ψ), where Σ > 0 and Ψ > 0. The following corollary given by Ohlson and Koski (2009a) gives the maximum likelihood estimators when the temporal covariance matrix Ψ can be estimated explicitly.

Corollary 2.5
Let X ∼ Np,n(µ1', Σ, Ψ) and assume that the temporal covariance matrix Ψ can be estimated explicitly. The maximum likelihood estimators for µ and Σ are given by

µ̂ = (1'Ψ̂^{-1}1)^{-1} X Ψ̂^{-1} 1

and

nΣ̂ = X Ĥ X',

where Ĥ is the estimated weighted centralization matrix, i.e.,

Ĥ = Ψ̂^{-1} − Ψ̂^{-1}1(1'Ψ̂^{-1}1)^{-1}1'Ψ̂^{-1}.    (2.32)

Ohlson and Koski (2009a) presented an iterative algorithm to find the maximum likelihood estimates of the parameters in a matrix normal distribution, similar to the algorithm in Dutilleul (1999). The big difference is that Ohlson and Koski (2009a) only have one sample observation matrix and assume that the temporal covariance Ψ has an autoregressive or intraclass structure, whereas the spatial covariance matrix is unstructured.

Let Ψ be the covariance matrix of an autoregressive process of order one, i.e., AR(1) (see Brockwell and Davis (2002), Ljung (1999) for more details). The covariance matrix is then given by

Ψ(θ) = 1/(1 − θ²) · [ 1         θ         θ²       · · ·   θ^{n−1}
                       θ         1         θ        · · ·   θ^{n−2}
                       θ²        θ         1                 ...
                       ...                           ...     θ
                       θ^{n−1}   · · ·               θ       1 ],

i.e., the (i, j)th element of Ψ(θ) is θ^{|i−j|}/(1 − θ²).

Further, let the covariance matrix Σ > 0 be unstructured. This model implies that every row in X comes from the same stationary AR(1) time series, but with different variances σ_ii. The model can be written as

x_it − µ_i = θ(x_{i,t−1} − µ_i) + ε_it,   i = 1, . . . , p,  t = 1, . . . , n,

for some expectations µ_i and where ε_it ∼ N(0, σ_ii), |θ| < 1 and ε_it is uncorrelated with x_is for each s < t. The different AR(1) time series are dependent since σ_ij ≠ 0.

A reasonable first estimate of the parameter θ could be the mean of the Yule-Walker estimates

θ̂^(0) = (1/p) ∑_{i=1}^{p} θ̂_i,    (2.33)

where θ̂_i is the Yule-Walker estimate of θ (see Brockwell and Davis (2002) for the theory around the Yule-Walker estimate) in each time series, i.e.,

θ̂_i = ∑_{t=1}^{n−1} (x_it − x̄_i)(x_{i,t+1} − x̄_i) / ∑_{t=1}^{n} (x_it − x̄_i)²,   where  x̄_i = (1/n) ∑_{t=1}^{n} x_it.
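A small numpy sketch of the initial estimate (2.33); the function name theta_initial is illustrative.

```python
import numpy as np

def theta_initial(X):
    """Mean of the row-wise Yule-Walker estimates of the AR(1) parameter, cf. (2.33)."""
    thetas = []
    for x in X:                                 # each row is one time series
        xc = x - x.mean()
        thetas.append(np.sum(xc[:-1] * xc[1:]) / np.sum(xc ** 2))
    return float(np.mean(thetas))
```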

The determinant and the inverse of Ψ can easily be calculated,

|Ψ(θ)| = (1 − θ²)^{-1}

and

Ψ^{-1}(θ) = I + θ²D1 − θD2 = [ 1     −θ        0        · · ·   0
                                −θ    1 + θ²   −θ        · · ·   0
                                0     −θ       1 + θ²             ...
                                ...                      ...      −θ
                                0     0        · · ·     −θ       1 ],

where D1 = diag(0, 1, . . . , 1, 0) and D2 is a tridiagonal matrix with zeros on the diagonal and ones on the super- and subdiagonals. Without the imposed structure on the temporal covariance, the maximum likelihood estimates do not exist explicitly. Ohlson and Koski (2009a) gave the likelihood equations

µ = (1'Ψ^{-1}(θ)1)^{-1} X Ψ^{-1}(θ) 1,    (2.34)

nΣ = X ( Ψ^{-1}(θ) − Ψ^{-1}(θ)1(1'Ψ^{-1}(θ)1)^{-1}1'Ψ^{-1}(θ) ) X' = X H(θ) X'    (2.35)

and

2θp + (1 − θ²) tr{ Υ(θ) (X − µ1')' Σ^{-1} (X − µ1') } = 0,    (2.36)

where H is defined in (2.31) and

Υ(θ) = 2θD1 − D2 = [ 0     −1    0     · · ·   0
                      −1    2θ    −1    · · ·   0
                      0     −1    2θ             ...
                      ...                ...     −1
                      0     0     · · ·  −1      0 ].

Ohlson and Koski (2009a) also showed that the equations (2.34)-(2.36) give the equation

2θp + n(1 − θ²) tr{ X B Υ(θ) B' X' ( X B Ψ^{-1}(θ) B' X' )^{-1} } = 0,    (2.37)

where

B = I − (1'Ψ^{-1}(θ)1)^{-1} Ψ^{-1}(θ)11',

which is a polynomial equation of order less than 3p + 2. Instead of solving the polynomial equation (2.37) exactly, Ohlson and Koski (2009a) showed that the maximum likelihood estimates µ̂, Σ̂ and θ̂ can be calculated by iteratively solving the three equations (2.34), (2.35) and (2.36) above. The algorithm derived by Ohlson and Koski (2009a) is given in Algorithm 2.4.


Algorithm 2.4 (The algorithm given by Ohlson and Koski (2009a))

1. Obtain an initial estimate θ̂^(0) of θ, using the mean of the Yule-Walker estimates given by (2.33).

2. Compute µ̂^(1) and Σ̂^(1) from (2.34) and (2.35), using θ̂^(0).

3. Compute the value of θ̂^(k) by solving the cubic equation (2.36) using the estimates µ̂^(k) and Σ̂^(k). Ensure that |θ̂^(k)| < 1 and that the solution is a maximum.

4. Compute µ̂^(k) and Σ̂^(k) from (2.34) and (2.35), using the estimate θ̂^(k) from the previous step.

5. Repeat steps 3 and 4 until convergence is obtained, i.e., until

   |θ̂^(k) − θ̂^(k−1)| < ε   and   tr( (Σ̂^(k) − Σ̂^(k−1))² ) < ε.    (2.38)
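A rough numpy sketch of Algorithm 2.4, assuming |θ| < 1 and n large enough that Σ̂ is invertible; the names ar1_fit, max_iter and eps are illustrative, and the admissible root of the cubic (2.36) is chosen here by comparing the log-likelihood.

```python
import numpy as np

def ar1_fit(X, theta0, max_iter=100, eps=1e-6):
    """Sketch of Algorithm 2.4 for X ~ N_{p,n}(mu 1', Sigma, Psi(theta)), AR(1) Psi."""
    p, n = X.shape
    ones = np.ones((n, 1))
    D1 = np.diag(np.r_[0.0, np.ones(n - 2), 0.0])
    D2 = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

    def psi_inv(theta):                                   # I + theta^2 D1 - theta D2
        return np.eye(n) + theta ** 2 * D1 - theta * D2

    def mu_sigma(theta):                                  # (2.34) and (2.35)
        Pi = psi_inv(theta)
        denom = float(ones.T @ Pi @ ones)
        mu = X @ Pi @ ones / denom
        H = Pi - Pi @ ones @ ones.T @ Pi / denom
        return mu, X @ H @ X.T / n

    def loglik(theta, mu, Sigma):                         # log-likelihood for N = 1, cf. (2.27)
        R = X - mu @ ones.T
        return (-0.5 * n * np.linalg.slogdet(Sigma)[1] + 0.5 * p * np.log(1 - theta ** 2)
                - 0.5 * np.trace(np.linalg.solve(Sigma, R @ psi_inv(theta) @ R.T)))

    theta = theta0
    mu, Sigma = mu_sigma(theta)                           # step 2
    for _ in range(max_iter):
        R = X - mu @ ones.T
        A = R.T @ np.linalg.solve(Sigma, R)               # (X - mu1')' Sigma^{-1} (X - mu1')
        a, b = np.trace(D1 @ A), np.trace(D2 @ A)
        # (2.36) is the cubic  -2a*theta^3 + b*theta^2 + (2p + 2a)*theta - b = 0
        roots = np.roots([-2 * a, b, 2 * p + 2 * a, -b])
        cand = [r.real for r in roots if abs(r.imag) < 1e-10 and abs(r.real) < 1]
        if not cand:
            break
        theta_new = max(cand, key=lambda t: loglik(t, mu, Sigma))   # step 3
        mu_new, Sigma_new = mu_sigma(theta_new)                     # step 4
        done = (abs(theta_new - theta) < eps and np.sum((Sigma_new - Sigma) ** 2) < eps)  # (2.38)
        theta, mu, Sigma = theta_new, mu_new, Sigma_new
        if done:
            break
    return mu.ravel(), Sigma, theta
```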

Convergence is checked by verifying that the absolute difference between two successive estimates θ̂^(k) and the trace of the squared difference between two successive estimates Σ̂^(k) are both less than a predetermined number ε (ε can be taken to be 10^{-6}). The condition (2.38) is relevant since we have

tr( (Σ̂^(k) − Σ̂^(k−1))² ) = tr( (Σ̂^(k) − Σ̂^(k−1))'(Σ̂^(k) − Σ̂^(k−1)) ) = ∑_i ∑_j (s^(k)_ij − s^(k−1)_ij)² ≡ ‖Σ̂^(k) − Σ̂^(k−1)‖²,

where Σ̂^(k) = (s^(k)_ij) and ‖ · ‖ is the l2 norm (Horn and Johnson, 1990). Hence, we check that the l2 norm of the difference is less than ε. In every run of the algorithm only a cubic equation has to be solved since, given µ and Σ, the last part of equation (2.36),

tr{ Υ(θ) (X − µ1')' Σ^{-1} (X − µ1') },

is a linear function in θ, implying that (2.36) is a cubic equation.

The intraclass covariance structure (2.14) was first considered by Wilks (1946). For the

matrix normal distribution and the Kronecker product structure, several authors have discussed the case with an intraclass covariance structure for Ψ, e.g., see Roy and Khattree (2005a,b). Ohlson and Koski (2009a) showed that the likelihood function is maximized for

µ = (1'Ψ^{-1}(ρ)1)^{-1} X Ψ^{-1}(ρ) 1 = n^{-1} X1 = µ̂,


i.e., the regular mean, and that the resulting likelihood function can be written as

L(µ̂, Σ, ρ) = |Σ|^{-n/2} ( (1 − ρ)^{n−1}(1 + (n − 1)ρ) )^{-p/2} etr{ −(1/2) (1/(1 − ρ)) Σ^{-1} X C X' },

where

C = I − 1(1'1)^{-1}1'.    (2.39)

The likelihood function is decreasing in ρ and does not have a maximum for any permissible ρ. Hence, ρ is estimated, for all n, with the smallest value of ρ that is allowed. For large n, and since −1/(n − 1) < ρ, we choose ρ̂ = 0. Hence, in the intraclass model with one sample observation matrix, the best we can do for large n is to estimate Ψ with Ψ̂ = In, i.e., assume independence. The estimate of the spatial covariance Σ is then the ordinary sample covariance matrix given in (2.13).

    2.2.5 The Distribution of a Special Sample Covariance Matrix

Assume X ∼ Np,n(µ1', Σ, Ψ), where Σ > 0 and Ψ is known. The maximum likelihood estimators are given in Theorem 2.17 and the distribution of the sample covariance matrix in Corollary 2.4.

Now suppose that, for some reason, we estimate the expectation µ with the regular mean µ̂ = (1/n)X1 = x̄, i.e., we use the same estimator as if the observations were independent. This estimator is also an unbiased and consistent estimator of the mean and is intuitively reasonable. There are several reasons for using the simpler estimator. For example, the estimator µ̂ is more robust than µ̂_ML for a large number of observations, i.e., for large n. Another reason could be that we only know the centralized observations, X − µ̂1'. However, when we estimate the covariance matrix Σ, we use the dependent model with Ψ. The estimator of Σ is then given as

nΣ̂ = (X − µ̂1') Ψ^{-1} (X − µ̂1')' = X C Ψ^{-1} C X',

where C is given in (2.39). Hence, the distribution of the sample covariance matrix is

(X − µ̂1') Ψ^{-1} (X − µ̂1')' ∼ Qp( Ψ^{1/2} C Ψ^{-1} C Ψ^{1/2}, MΨ^{-1/2}, Σ ),    (2.40)

where the distribution Qp is defined in Definition 2.2. To simplify the distribution (2.40) we need the eigenvalues and eigenvectors of the matrix Ψ^{1/2}CΨ^{-1}CΨ^{1/2}. Let λ(A) denote the eigenvalues of the matrix A. We know that the eigenvalues of the matrix Ψ^{1/2}CΨ^{-1}CΨ^{1/2} are the same as the eigenvalues of CΨ^{-1}CΨ (see Horn and Johnson (1990) for details). Furthermore, the matrix CΨ^{-1}CΨ can be written as

CΨ^{-1}CΨ = I − (1/n)( 11' + CΨ^{-1}11'Ψ ).

The following two lemmas will give the distribution of nΣ̂. The first lemma gives the eigenvalues of Ψ^{1/2}CΨ^{-1}CΨ^{1/2}.


Lemma 2.1
The eigenvalues of Ψ^{1/2}CΨ^{-1}CΨ^{1/2} are

λ( Ψ^{1/2}CΨ^{-1}CΨ^{1/2} ) = ( 0, 1 − (1/n)1'ΨCΨ^{-1}1, 1, . . . , 1 ).

Proof: The proof is straightforward. Using known properties of the eigenvalues we have

λ( Ψ^{1/2}CΨ^{-1}CΨ^{1/2} ) = λ( CΨ^{-1}CΨ ) = λ( I − (1/n)( 11' + CΨ^{-1}11'Ψ ) )

    = 1 − (1/n) λ( 11' + CΨ^{-1}11'Ψ )

    = 1 − (1/n) λ( ( 1   CΨ^{-1}1 ) ( 1'
                                       1'Ψ ) )

    = 1 − (1/n) ( λ( ( 1'
                        1'Ψ ) ( 1   CΨ^{-1}1 ) ), 0, . . . , 0 )

    = 1 − (1/n) ( λ( ( n       0
                        1'Ψ1    1'ΨCΨ^{-1}1 ) ), 0, . . . , 0 )

    = 1 − (1/n) ( n, 1'ΨCΨ^{-1}1, 0, . . . , 0 ) = ( 0, 1 − (1/n)1'ΨCΨ^{-1}1, 1, . . . , 1 )

and the proof of the Lemma follows.

Let λ∗ = 1 − (1/n)1'ΨCΨ^{-1}1, which is the eigenvalue of Ψ^{1/2}CΨ^{-1}CΨ^{1/2} not equal to zero or one.

Lemma 2.2
Let the vectors h∗ and h0 be h∗ = Ψ^{1/2}CΨ^{-1}1 and h0 = Ψ^{-1/2}1, respectively. Then h∗ is an eigenvector of Ψ^{1/2}CΨ^{-1}CΨ^{1/2} with eigenvalue λ∗ and h0 is an eigenvector of Ψ^{1/2}CΨ^{-1}CΨ^{1/2} with eigenvalue 0, i.e.,

Ψ^{1/2}CΨ^{-1}CΨ^{1/2} h∗ = λ∗ h∗,

Ψ^{1/2}CΨ^{-1}CΨ^{1/2} h0 = 0.

The distribution of nΣ̂ is given in the following corollary.

Corollary 2.6
Assume X ∼ Np,n(µ1', Σ, Ψ), where Σ > 0 and Ψ is known. The distribution of nΣ̂ = XCΨ^{-1}CX' is the same as the distribution of W = W1 + λ∗W∗, where W1 and W∗ are independent and

W1 ∼ Wp(n − 2, Σ),

W∗ ∼ Wp(1, Σ)

and λ∗ = 1 − (1/n)1'ΨCΨ^{-1}1.

Proof: We have that

W1 ∼ Wp( n − 2, Σ, µ1'Ψ^{-1/2} ( ∑_{i=1}^{n−2} ∆i ∆i' ) Ψ^{-1/2} 1 µ' ),

where ∆i, i = 1, . . . , n − 2, are the orthonormal eigenvectors with eigenvalue 1 and

∑_{i=1}^{n−2} ∆i ∆i' = I − ( ∆0 ∆0' + ∆∗ ∆∗' ).

The noncentrality parameter is given by

µ1'Ψ^{-1/2} ( I − ( ∆0 ∆0' + ∆∗ ∆∗' ) ) Ψ^{-1/2} 1 µ'

    = µ1'Ψ^{-1/2} ( I − (h0'h0)^{-1} h0 h0' − (h∗'h∗)^{-1} h∗ h∗' ) Ψ^{-1/2} 1 µ'

    = µ1'Ψ^{-1}1µ' − (h0'h0)^{-1} µ1'Ψ^{-1/2} h0 h0' Ψ^{-1/2} 1 µ' − (h∗'h∗)^{-1} µ1'Ψ^{-1/2} h∗ h∗' Ψ^{-1/2} 1 µ'    (2.41)

and we have, since h0 = Ψ^{-1/2}1 and 1'Ψ^{-1}1 is a scalar,

(h0'h0)^{-1} µ1'Ψ^{-1/2} h0 h0' Ψ^{-1/2} 1 µ' = (1'Ψ^{-1}1)^{-1} µ 1'Ψ^{-1}1 1'Ψ^{-1}1 µ' = (1'Ψ^{-1}1) µµ'.    (2.42)

Furthermore, since h∗ = Ψ^{1/2}CΨ^{-1}1 and 1'C = C1 = 0,

µ1'Ψ^{-1/2} ∆∗ ∆∗' Ψ^{-1/2} 1 µ' = (h∗'h∗)^{-1} µ1'Ψ^{-1/2} h∗ h∗' Ψ^{-1/2} 1 µ' = (h∗'h∗)^{-1} µ 1'CΨ^{-1}1 1'Ψ^{-1}C1 µ' = 0.    (2.43)

Finally, from (2.41), (2.42) and (2.43), we have

µ1'Ψ^{-1/2} ( I − ( ∆0 ∆0' + ∆∗ ∆∗' ) ) Ψ^{-1/2} 1 µ' = 0

and the Corollary follows.

The expectation of XCΨ^{-1}CX' can be computed straightforwardly,

E( XCΨ^{-1}CX' ) = ( n − 1 − (1/n)1'ΨCΨ^{-1}1 ) Σ.

Hence, an unbiased estimator of Σ is

Σ̂ = ( n − 1 − (1/n)1'ΨCΨ^{-1}1 )^{-1} XCΨ^{-1}CX'.
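As a small numerical check of Lemma 2.1 and of the unbiased estimator above, the following numpy snippet uses an AR(1)-type matrix as the known Ψ; the particular Ψ, Σ and µ are illustrative choices only, and unbiasedness would of course only show up as an average over many replicates.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, theta = 3, 6, 0.5

# a known temporal covariance (AR(1)-type matrix used purely as an example)
Psi = theta ** np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / (1 - theta ** 2)
Psi_inv = np.linalg.inv(Psi)
ones = np.ones((n, 1))
C = np.eye(n) - ones @ ones.T / n                     # centering matrix (2.39)

# Lemma 2.1: eigenvalues of C Psi^{-1} C Psi should be 0, lambda*, 1, ..., 1
lam_star = 1 - float(ones.T @ Psi @ C @ Psi_inv @ ones) / n
print(np.sort(np.linalg.eigvals(C @ Psi_inv @ C @ Psi).real), lam_star)

# one draw X ~ N_{p,n}(mu 1', Sigma, Psi) and the unbiased estimator of Sigma
Sigma_true = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
mu = np.array([[1.0], [0.0], [-1.0]])
X = mu @ ones.T + np.linalg.cholesky(Sigma_true) @ rng.standard_normal((p, n)) @ np.linalg.cholesky(Psi).T
c = n - 1 - float(ones.T @ Psi @ C @ Psi_inv @ ones) / n
Sigma_hat = X @ C @ Psi_inv @ C @ X.T / c             # unbiased over repeated draws
print(Sigma_hat)
```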

3 Estimation of the Covariance Matrix for a Growth Curve Model

The Growth Curve model is a generalization of the multivariate analysis of variance model (MANOVA). The Growth Curve model belongs to the curved exponential family and was introduced by Potthoff and Roy (1964). The mean structure for the Growth Curve model is bilinear instead of linear as for the ordinary MANOVA model. For details and references connected to the model see Srivastava and Khatri (1979), Kshirsagar and Smith (1995), Srivastava and von Rosen (1999) or Kollo and von Rosen (2005).

    3.1 The Growth Curve Model

In this section we will give the definition of the Growth Curve model and two examples to aid understanding. The Growth Curve model is defined as follows.

Definition 3.1. Let X : p × n and B : q × k be the observation and parameter matrices and let A : p × q and C : k × n be the within and between individual design matrices, respectively. Let also q ≤ p, rank(C) + p ≤ n and Σ : p × p be positive definite. The Growth Curve model is given by

X = ABC + Σ^{1/2}E,    (3.1)

where

E ∼ Np,n(0, Ip, In).

One may note that the Growth Curve model defined above is nothing more than the classical MANOVA model if A = I, and that the design matrix C is the same as in classical univariate and multivariate linear models. The Growth Curve model (3.1) can be rewritten as

vec X = (C' ⊗ A) vec B + (I ⊗ Σ^{1/2}) vec E,

which is a special case of the classical multivariate linear model. However, there is no gain in expressing the model in this way since, as we will see, the interesting part is the tensor space generated by C' ⊗ A and I ⊗ Σ.

The following examples will show how the Growth Curve model can be used. For more examples see Srivastava and Carter (1983), Kshirsagar and Smith (1995).

    Example 3.1: Potthoff & Roy - Dental Data

Dental measurements on eleven girls and sixteen boys at four different ages (8, 10, 12, 14) were taken. Each measurement is the distance, in millimeters, from the center of the pituitary to the pterygomaxillary fissure. Suppose linear growth curves describe the mean growth for both the girls and the boys. Then we may use the Growth Curve model, where the observation, parameter and design matrices are given as follows

X = (x1, . . . , x27) : 4 × 27

  = [ 21   21   20.5 23.5 21.5 20   21.5 23   20   16.5 24.5 26   21.5 23   20   25.5 24.5 22   24   23   27.5 23   21.5 17   22.5 23   22
      20   21.5 24   24.5 23   21   22.5 23   21   19   25   25   22.5 22.5 23.5 27.5 25.5 22   21.5 20.5 28   23   23.5 24.5 25.5 24.5 21.5
      21.5 24   24.5 25   22.5 21   23   23.5 22   19   28   29   23   24   22.5 26.5 27   24.5 24.5 31   31   23.5 24   26   25.5 26   23.5
      23   25.5 26   26.5 23.5 22.5 25   24   21.5 19.5 28   31   26.5 27.5 26   27   28.5 26.5 25.5 26   31.5 25   28   29.5 26   30   25  ],

B = ( b01  b02
      b11  b12 ),    A = ( 1  8
                            1  10
                            1  12
                            1  14 )    and    C = ( 1'_11  0'_16
                                                     0'_11  1'_16 ).

Hence, for example, for an individual in the first group, i = 1, . . . , 11, the Growth Curve model gives the mean

E(x_i) = ( b01 + 8b11
           b01 + 10b11
           b01 + 12b11
           b01 + 14b11 ).


    Example 3.2

Let there be k groups of individuals, with nj individuals in the jth group. The k different groups have been taught with different learning processes. Every individual has been tested at the same p time points tr, r = 1, . . . , p. If we assume that the testing results of an individual are multivariate normally distributed with a covariance matrix Σ and that testing results between different individuals are independent, we can apply the Growth Curve model defined in Definition 3.1. Let the mean for the different groups be a polynomial in time of degree q − 1. The mean for group j can then be written as

µj = β1j + β2j t + · · · + βqj t^{q−1},   j = 1, . . . , k,

where βij are unknown parameters. The Growth Curve model is then given by the matrices

A = ( 1  t1  · · ·  t1^{q−1}
      1  t2  · · ·  t2^{q−1}
      ...  ...  ...  ...
      1  tp  · · ·  tp^{q−1} ),    B = (βij)    and    C = ( 1'_{n1}  0'_{n2}  · · ·  0'_{nk}
                                                              0'_{n1}  1'_{n2}  · · ·  0'_{nk}
                                                              ...      ...      ...    ...
                                                              0'_{n1}  0'_{n2}  · · ·  1'_{nk} ).
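A short numpy sketch of how the design matrices of this example can be constructed; the function name growth_curve_design and its arguments are illustrative.

```python
import numpy as np

def growth_curve_design(timepoints, q, group_sizes):
    """Within-individual polynomial design A of degree q-1 and
    between-individual group design C, cf. Example 3.2."""
    t = np.asarray(timepoints, dtype=float)
    A = np.vander(t, N=q, increasing=True)        # columns 1, t, ..., t^{q-1}
    k, n = len(group_sizes), int(np.sum(group_sizes))
    C = np.zeros((k, n))
    start = 0
    for j, nj in enumerate(group_sizes):
        C[j, start:start + nj] = 1.0              # 1'_{n_j} in block j
        start += nj
    return A, C

# e.g. the dental data of Example 3.1: ages 8, 10, 12, 14, linear growth (q = 2),
# 11 girls and 16 boys
A, C = growth_curve_design([8, 10, 12, 14], 2, [11, 16])
```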

    3.2 Estimation of the Parameters

In this section we will give the maximum likelihood estimators for a Growth Curve model. We will also consider different structures for the covariance matrix in a Growth Curve model. We will discuss the methods existing in the literature for the estimation problem.

    3.2.1 Maximum Likelihood Estimators

When no assumptions about the covariance matrix in the Growth Curve model (3.1) are made, Potthoff and Roy (1964) originally derived a class of weighted estimators for the parameter matrix B as

B̂ = (A'G^{-1}A)^{-1} A'G^{-1} X C'(CC')^{-1},    (3.2)

where the design matrices A and C are assumed to have full rank. The estimator (3.2) is a function of an arbitrary positive definite matrix G. If G is chosen to be the identity matrix, i.e., G = I, the weighted estimator (3.2) becomes the unweighted estimator

B̂ = (A'A)^{-1} A' X C'(CC')^{-1}.    (3.3)

Khatri (1966a) extended the results of Potthoff and Roy (1964) and showed that the maximum likelihood estimator is a weighted estimator as well. The maximum likelihood estimator is a function of the matrix of the sums of squares

S = X( I − C'(CC')^{-}C )X'    (3.4)


and the maximum likelihood estimator is given by

B̂_ML = (A'S^{-1}A)^{-} A'S^{-1} X C'(CC')^{-} + (A')^{o}Z_1 + A'Z_2 C^{o'},    (3.5)

where Z_1 and Z_2 are arbitrary matrices and A^{o} is any matrix of full rank spanning the orthogonal complement to C(A), i.e., C(A^{o}) = C(A)^⊥, where C(A) stands for the linear space generated by the columns of A. If the design matrices A and C have full rank, the estimator is nothing more than

B̂_ML = (A'S^{-1}A)^{-1} A'S^{-1} X C'(CC')^{-1}.    (3.6)
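Under the full-rank assumption, the estimators (3.3), (3.4) and (3.6) can be computed directly; below is a small numpy sketch, where the function name growth_curve_mle is illustrative and S is assumed to be invertible.

```python
import numpy as np

def growth_curve_mle(X, A, C):
    """Sketch of (3.3), (3.4) and (3.6) when the design matrices A and C have full rank."""
    n = X.shape[1]
    PC = C.T @ np.linalg.solve(C @ C.T, C)                  # projection onto the rows of C
    S = X @ (np.eye(n) - PC) @ X.T                          # sums of squares matrix (3.4)
    S_inv = np.linalg.inv(S)
    # unweighted estimator (3.3)
    B_unw = np.linalg.solve(A.T @ A, A.T @ X @ C.T) @ np.linalg.inv(C @ C.T)
    # maximum likelihood estimator (3.6)
    B_ml = np.linalg.solve(A.T @ S_inv @ A, A.T @ S_inv @ X @ C.T) @ np.linalg.inv(C @ C.T)
    return B_unw, B_ml, S
```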

    Furth