ERCIM May 2001 Analysis of variance, general balance and large data sets Roger Payne Statistics Department, IACR-Rothamsted, Harpenden, Herts AL5 2JQ


ERCIM May 2001

Analysis of variance, general balance and large data sets

Roger Payne

Statistics Department, IACR-Rothamsted,

Harpenden, Herts AL5 2JQ

Email: [email protected]


General balance

• is a very useful concept for small to medium-sized data sets
• it caters for several sources of variation
• it leads to very efficient algorithms
• and to clear output, test statistics, etc.
• but what about large to very large data sets?


General balance

• fits a mixed model with several error (or block) terms
• total sum of squares is partitioned into components known as strata, one for each block term; each stratum contains
  • the sum of squares for the treatment terms estimated between the units of that stratum
  • a residual representing the random variability of those units
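The partition into strata can be illustrated with a minimal sketch. It assumes a hypothetical nested block structure (2 blocks, 2 whole-plots per block, 2 subplots per whole-plot) and made-up data; the helper names `pivot` and `ss` are illustrative and not from the talk. Treatment terms are left out, so each stratum here holds only its total sum of squares.

```python
def pivot(factor, v):
    """Replace each value in v by the mean of v over its level of `factor`."""
    totals, counts = {}, {}
    for level, x in zip(factor, v):
        totals[level] = totals.get(level, 0.0) + x
        counts[level] = counts.get(level, 0) + 1
    return [totals[level] / counts[level] for level in factor]


def ss(v):
    """Sum of squares of a vector."""
    return sum(x * x for x in v)


# Hypothetical data: 2 blocks x 2 whole-plots x 2 subplots (8 units).
block = [0, 0, 0, 0, 1, 1, 1, 1]
wplot = [0, 0, 1, 1, 2, 2, 3, 3]      # whole-plots numbered across blocks
y = [12.0, 15.0, 9.0, 11.0, 20.0, 18.0, 14.0, 17.0]

grand = pivot([0] * len(y), y)         # grand mean in every position
by_block = pivot(block, y)
by_wplot = pivot(wplot, y)

# One component per stratum; with nesting the components are orthogonal.
strata = {
    "blocks": [b - g for b, g in zip(by_block, grand)],
    "blocks.wplots": [w - b for w, b in zip(by_wplot, by_block)],
    "blocks.wplots.subplots": [x - w for x, w in zip(y, by_wplot)],
}
stratum_ss = {name: ss(comp) for name, comp in strata.items()}
total_ss = ss([x - g for x, g in zip(y, grand)])
```

In this sketch the stratum sums of squares add up to the total (corrected) sum of squares, which is the property the strata decomposition relies on.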


General balance: example

V3 N3  V3 N2  V3 N2  V3 N3
V3 N1  V3 N0  V3 N0  V3 N1
V1 N0  V1 N1  V2 N0  V2 N2
V1 N3  V1 N2  V2 N3  V2 N1
V2 N0  V2 N1  V1 N1  V1 N2
V2 N2  V2 N3  V1 N3  V1 N0
V3 N2  V3 N0  V2 N3  V2 N0
V3 N1  V3 N3  V2 N2  V2 N1
V1 N3  V1 N0  V1 N2  V1 N3
V1 N1  V1 N2  V1 N0  V1 N1
V2 N1  V2 N0  V3 N2  V3 N3
V2 N2  V2 N3  V3 N1  V3 N0
V2 N1  V2 N2  V1 N2  V1 N0
V2 N3  V2 N0  V1 N3  V1 N1
V3 N3  V3 N1  V2 N3  V2 N2
V3 N2  V3 N0  V2 N0  V2 N1
V1 N0  V1 N3  V3 N0  V3 N1
V1 N1  V1 N2  V3 N2  V3 N3

Source of variation               d.f.      s.s.      m.s.    v.r.   F pr.

blocks stratum                       5   15875.3    3175.1    5.28

blocks.wplots stratum
  variety                            2    1786.4     893.2    1.49   0.272
  Residual                          10    6013.3     601.3    3.40

blocks.wplots.subplots stratum
  nitrogen                           3   20020.5    6673.5   37.69   <.001
  variety.nitrogen                   6     321.7      53.6    0.30   0.932
  Residual                          45    7968.8     177.1

Total                               71   51985.9


General balance: properties

• block (error) terms are mutually orthogonal
• treatment terms are mutually orthogonal
• the contrasts of each treatment term all have equal efficiency factors in each of the strata where they are estimated


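General balance: theory

• mixed model:
  • $y = \sum_\alpha Z_\alpha \varepsilon_\alpha + X\theta$
• dispersion structure:
  • $\mathrm{Var}(y) = V = \sum_\alpha \xi_\alpha \check{S}_\alpha$
  • $\check{S}_\alpha$ known symmetric matrices with
  • $\check{S}_\alpha \check{S}_\beta = \delta_{\alpha\beta} \check{S}_\alpha$ (i.e. orthogonal)
  • $\sum_\alpha \check{S}_\alpha = I$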


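General balance: theory

• random effects model:
  • $y - E(y) = \sum_\alpha Z_\alpha \varepsilon_\alpha$
  • where $E(\varepsilon_\alpha) = 0$ and $\mathrm{Var}(\varepsilon_\alpha) = \sigma_\alpha^2 I$
• then
  • $\mathrm{Var}(y) = \sum_\alpha \sigma_\alpha^2 Z_\alpha Z_\alpha'$
• and if terms orthogonal
  • $S_\alpha = Z_\alpha (Z_\alpha' Z_\alpha)^{-1} Z_\alpha' = (1/n_\alpha) Z_\alpha Z_\alpha'$ (if equal rep.)
  • $S_\alpha$ is the projection operator (form and project means)
  • $\check{S}_\alpha = S_\alpha \bigl( I - \sum_{\beta:\ \text{term } \beta \text{ marginal to term } \alpha} \check{S}_\beta \bigr)$
  • $\xi_\alpha = n_\alpha \sigma_\alpha^2 + \sum_{\beta:\ \text{term } \alpha \text{ marginal to term } \beta} n_\beta \sigma_\beta^2$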
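As a worked instance of the stratum-variance formula, consider a nested structure blocks/wplots/subplots. The sketch assumes 6 blocks, each with 3 whole-plots of 4 subplots (sizes consistent with the degrees of freedom in the example analysis, but an assumption nonetheless), and writes $\xi$ for a stratum variance, $\sigma^2$ for a variance component and $n$ for the number of units in each level of a block term (12, 4 and 1 here). Each stratum variance then collects its own component plus those of the terms nested within it:

```latex
\[
\begin{aligned}
\xi_{\text{blocks}} &= 12\,\sigma^2_{\text{blocks}} + 4\,\sigma^2_{\text{wplots}} + \sigma^2_{\text{subplots}},\\
\xi_{\text{blocks.wplots}} &= 4\,\sigma^2_{\text{wplots}} + \sigma^2_{\text{subplots}},\\
\xi_{\text{blocks.wplots.subplots}} &= \sigma^2_{\text{subplots}}.
\end{aligned}
\]
```

Under these assumptions the stratum variances cannot decrease when moving from the subplots stratum up to the blocks stratum, since every variance component is non-negative.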


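General balance: theory

• treatment structure
  • $E(y) = X\theta = \sum_i X_i \theta_i$   $(X = [\, X_0 \mid X_1 \mid X_2 \mid \ldots \,])$
  • $E(y) = \tau = T\tau$   $(T = X (X'X)^{-} X')$
• treatment terms are orthogonal, so
  • $E(y) = \sum_i \check{T}_i \tau_i$
  • $T_i = X_i (X_i' X_i)^{-1} X_i' = (1/n_i) X_i X_i'$ (if equal rep.)
  • $\check{T}_i = T_i \bigl( I - \sum_{j:\ \text{term } j \text{ marginal to term } i} \check{T}_j \bigr)$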


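General balance: theory

• orthogonal block structure implies
  • independent least squares analysis within each stratum
  • residual s.s. $(\check{S}_\alpha y - \check{S}_\alpha T \tau)' (\check{S}_\alpha y - \check{S}_\alpha T \tau)$
  • normal equations $T \check{S}_\alpha T \hat{\tau} = T \check{S}_\alpha y$
• final condition
  • $\check{T}_i \check{S}_\alpha \check{T}_j = \delta_{ij} \lambda_{\alpha i} \check{T}_i$
  • normal equations now $\sum_i \lambda_{\alpha i} \check{T}_i \hat{\tau}_i = \sum_i \check{T}_i \check{S}_\alpha y$
  • solved by $\hat{\tau}_i = (1/\lambda_{\alpha i}) \check{T}_i \check{S}_\alpha y$
  • with var-cov matrix $\xi_\alpha \check{T}_i / \lambda_{\alpha i}$
• $\lambda_{\alpha i}$ is the efficiency factor of term i in stratum $\alpha$, and an eigenvalue of $\check{T}_i \check{S}_\alpha \check{T}_i$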
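The balance condition can be checked numerically on a small design. The sketch below assumes a hypothetical balanced incomplete block design (3 treatments in 3 blocks of 2 units) and applies "project onto the stratum, then take treatment means" to a treatment contrast; the ratio it returns plays the role of the efficiency factor. The function names are illustrative, not from the talk.

```python
def pivot(factor, v):
    """Replace each value in v by the mean of v over its level of `factor`."""
    totals, counts = {}, {}
    for level, x in zip(factor, v):
        totals[level] = totals.get(level, 0.0) + x
        counts[level] = counts.get(level, 0) + 1
    return [totals[level] / counts[level] for level in factor]


# Hypothetical design: blocks {A,B}, {A,C}, {B,C}.
block = [0, 0, 1, 1, 2, 2]
treat = [0, 1, 0, 2, 1, 2]


def within_blocks(v):
    """Project onto the within-block stratum: remove block means."""
    return [x - m for x, m in zip(v, pivot(block, v))]


def between_blocks(v):
    """Project onto the between-block stratum: block means minus grand mean."""
    means = pivot(block, v)
    grand = sum(means) / len(means)
    return [m - grand for m in means]


def efficiency(contrast, stratum):
    """Ratio lam with (treatment means of stratum projection of v) = lam * v.

    `contrast` holds one value per treatment and must sum to zero.
    """
    v = [contrast[t] for t in treat]          # contrast expanded to the units
    projected = pivot(treat, stratum(v))      # treatment means of the projection
    return sum(p * x for p, x in zip(projected, v)) / sum(x * x for x in v)


lam_within = efficiency([1.0, 0.0, -1.0], within_blocks)
lam_between = efficiency([1.0, 0.0, -1.0], between_blocks)
```

For this design every treatment contrast gives the same pair of ratios, and the two ratios sum to 1, which is what equal efficiency factors within each stratum means in practice.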


Analysis by “sweeps”

• requires first-order balance
  • all effects of each model term have an equal efficiency factor, at each point where the term is estimated
  • (see Wilkinson 1970, Biometrika; Payne & Wilkinson 1977, Applied Statistics)
• similar to general balance, but general balance also requires
  • block (i.e. error) terms mutually orthogonal (note: always true if nested)
  • treatment terms mutually orthogonal
  • (see Payne & Tobias 1992, Scandinavian J. Stats.)


Analysis by “sweeps”

• requires a working vector v which
  • initially contains the data values
  • finally contains the residuals
• terms fitted sequentially: sweeps
  • estimate and remove the effects of a term i in stratum $\alpha$ by $v^{(k+1)} = \{ I - (1/\lambda_{\alpha i}) T_i \} v^{(k)}$
  • and are then followed by a repeat of the sweeps up to this point (a reanalysis sequence)
  • can omit reanalysis sweeps of terms orthogonal to term i (so none if $\lambda_{\alpha i} = 1$, and much simpler if general balance)
• notice
  • projection operator $T_i$ simply calculates tables of means
  • so no matrix inversion (unless there are covariates)
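A single sweep can be sketched in a few lines. The version below assumes a hypothetical orthogonal 2 x 3 layout with one observation per cell, so every efficiency factor is 1 and no reanalysis sweeps are needed; the function name `sweep` and the data are illustrative only.

```python
def sweep(v, factor, efficiency=1.0):
    """One sweep of the working vector v for the term indexed by `factor`.

    Effects are estimated as (1 / efficiency) x table of means of v,
    then removed from v. Returns (effects by level, updated vector).
    """
    totals, counts = {}, {}
    for level, x in zip(factor, v):
        totals[level] = totals.get(level, 0.0) + x
        counts[level] = counts.get(level, 0) + 1
    effects = {lev: totals[lev] / counts[lev] / efficiency for lev in totals}
    swept = [x - effects[level] for level, x in zip(factor, v)]
    return effects, swept


# Hypothetical orthogonal two-way layout.
row = [0, 0, 0, 1, 1, 1]
col = [0, 1, 2, 0, 1, 2]
y = [5.0, 7.0, 9.0, 6.0, 10.0, 11.0]

v = list(y)                                   # working vector starts as the data
mean_eff, v = sweep(v, [0] * len(y))          # grand mean
row_eff, v = sweep(v, row)                    # row effects
col_eff, v = sweep(v, col)                    # column effects; v now holds residuals
```

Each sweep only forms a table of means and subtracts it, so the data are never assembled into a design matrix and nothing is inverted.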


Pictorial representation

[Figure: pictorial representation; efficiency factor = $\sin^2$ of the angle marked in the figure]

(Payne & Wilkinson 1977, Applied Statist.)


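Analysis by “sweeps”

• with general balance: initial working vector for stratum $\alpha$ calculated by
  • $S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $S_\alpha$ is a pivot (calculate means, and insert into vector)
• and fitted values for treatment term i in stratum $\alpha$ calculated by
  • $(1/\lambda_{\alpha i})\, T_i \prod_{j:\ \lambda_{\alpha j}>0;\ j<i} \{ R_{\alpha j} (I - (1/\lambda_{\alpha j}) T_j) \}\; S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $R_{\alpha i} = I$ if $\lambda_{\alpha i} = 1$
  • $R_{\alpha i} = S_\alpha \prod_{\beta<\alpha} (I - S_\beta)$ if $\lambda_{\alpha i} < 1$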
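For a single treatment term the fitted-value formula reduces to one pivot and one scaled table of means. The sketch below assumes a hypothetical balanced incomplete block design (3 treatments, 3 blocks of 2) whose within-block efficiency factor is 0.75, with made-up data; removing block means also removes the grand mean, so the two earlier strata are handled in one step. None of the names are from the talk.

```python
def pivot(factor, v):
    """Replace each value in v by the mean of v over its level of `factor`."""
    totals, counts = {}, {}
    for level, x in zip(factor, v):
        totals[level] = totals.get(level, 0.0) + x
        counts[level] = counts.get(level, 0) + 1
    return [totals[level] / counts[level] for level in factor]


# Hypothetical design and data: blocks {A,B}, {A,C}, {B,C}.
block = [0, 0, 1, 1, 2, 2]
treat = [0, 1, 0, 2, 1, 2]
y = [10.0, 12.0, 11.0, 15.0, 13.0, 16.0]
efficiency = 0.75      # within-block efficiency factor, assumed known


def within_blocks(v):
    """Working vector for the within-block stratum: remove block means."""
    return [x - m for x, m in zip(v, pivot(block, v))]


working = within_blocks(y)
# Treatment means of the working vector, scaled up by 1 / efficiency.
fitted = [m / efficiency for m in pivot(treat, working)]
```

The result is the vector of intra-block treatment effects, one value per unit; for example `fitted[0] - fitted[1]` is the estimated difference between treatments A and B.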


Other issues

• analysis of covariance
  • analyse the response (y) variate and the covariates
  • calculate the covariate regression coefficients (regression of y residuals on covariate residuals)
  • adjust treatment estimates and sums of squares
• combination of information
  • form treatment (and covariate) estimates combining information from all the strata where each is estimated
  • weighted combination of estimates with general balance
  • estimate stratum variances to calculate weights
  • see Payne & Tobias (1992, Scand. J. Stats.)
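The weighted combination can be sketched as follows. This assumes inverse-variance weights, i.e. each stratum's estimate is weighted by its efficiency factor divided by the stratum variance; that choice, the function name `combine` and the numbers are assumptions for illustration, and in practice the stratum variances would themselves have to be estimated.

```python
def combine(estimates, efficiencies, stratum_variances):
    """Combine estimates of one contrast from several strata.

    Assumes the estimate from each stratum has variance proportional to
    (stratum variance / efficiency factor), and weights by the inverse.
    Returns the combined estimate and its variance multiplier.
    """
    weights = [lam / xi for lam, xi in zip(efficiencies, stratum_variances)]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Hypothetical contrast estimated between blocks (efficiency 0.25, variance 4.0)
# and within blocks (efficiency 0.75, variance 1.0).
estimate, variance = combine([2.0, 1.0], [0.25, 0.75], [4.0, 1.0])
```

With these made-up numbers the within-block estimate dominates, because it has both the higher efficiency factor and the smaller stratum variance.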


Workspace requirements

• sweep algorithm
  • working vector: N
  • vectors for effects of each term: $n_\alpha$ or $n_i$
  • analysis of covariance, symmetric matrices: $n_{cov}(n_{cov}+1)/2$, $n_i(n_i+1)/2$
• cf. multiple-regression style algorithms (including REML), which typically require
  • matrix: $n_{effects}(n_{effects}+1)/2$, where $n_{effects}$ is the total number of block and treatment effects excluding the residual
  • vector(s): N
• much more efficient for large models
  • (see Payne & Welham 1990, COMPSTAT)
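A rough count makes the difference concrete. The sketch below assumes a hypothetical model with 20000 units and three terms having 100, 100 and 10000 effects, and counts only the storage items listed above (no covariates); the function names are illustrative.

```python
def sweep_workspace(n_units, effects_per_term):
    """Working vector plus one vector of effects per term."""
    return n_units + sum(effects_per_term)


def regression_workspace(n_units, effects_per_term):
    """Symmetric matrix over all effects plus one vector of length N."""
    n_effects = sum(effects_per_term)
    return n_effects * (n_effects + 1) // 2 + n_units


# Hypothetical large model: two factors with 100 levels and their interaction.
terms = [100, 100, 10000]
sweep_size = sweep_workspace(20000, terms)
regression_size = regression_workspace(20000, terms)
```

Under these assumptions the sweep algorithm needs a few tens of thousands of stored values, while the symmetric matrix alone runs to tens of millions.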


Large data sets

• data may be unbalanced
  • take a balanced sample?
  • adapt the algorithm?
• Wilkinson (1970, Biometrika)
  • “general recursive algorithm”
  • requires as many sweep sequences for each term i in stratum $\alpha$ as $\check{T}_i \check{S}_\alpha \check{T}_i$ has eigenvalues
• Iterative algorithms
  • Hemmerle (1974, JASA)
  • Worthington (1975, Biometrika)


Large data sets

• Hemmerle (1974, JASA)
  • also uses sweep-type operations
  • does not require first-order balance
  • instead performs a sequence of “balanced sweeps” for each term until estimation converges
• but
  • only one error term
  • fits the whole model at once (so a sequence of increasingly large models is required to assess individual terms)
  • data must be “connected” (i.e. no aliasing)
  • does not provide standard errors of differences (s.e.d.s)


Large data sets

• Worthington (1975, Biometrika)
  • performs a sequence of operations analogous to projections and sweeps
  • assumes an additional (unspecified) algorithm to determine the strata (and their projectors)
  • assumes an orthogonal (equally replicated) block structure
  • and assumes equally replicated treatment combinations
  • again fits the whole model at once


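Generalizing the algorithms

• both algorithms are based on the result
  • $(I - M)^{-1} = I + M + M^2 + \ldots$ (when this converges)
• apply this to the general form
  • $T \check{S}_\alpha T \hat{\tau} = T \check{S}_\alpha y$
  • $T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T \hat{\tau} = T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $\hat{\tau} = \bigl( I + \sum_{m \ge 1} (I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)^m \bigr)\, T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $\hat{\tau}^{(1)} = T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $\hat{\tau}^{(m+1)} = \hat{\tau}^{(m)} + (I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)^m\, T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
• relatively straightforward algorithm
  • sequences of sweeps and pivots with efficiency 1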


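Generalizing the algorithms

• iterative scheme
  • $\hat{\tau}^{(1)} = T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
  • $\hat{\tau}^{(m+1)} = \hat{\tau}^{(m)} + (I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)^m\, T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y$
• implementation
  • $\hat{\tau}^{(m+1)} = \hat{\tau}^{(m)} + \delta^{(m)}$
  • $\delta^{(m)} = (I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)^m\, T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, y = (I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)\, \delta^{(m-1)}$
• calculation of $\delta^{(m)}$
  • project $\delta^{(m-1)}$ into the treatment space: $T \delta^{(m-1)}$
  • project into stratum $\alpha$: $S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T \delta^{(m-1)}$
  • project into treatment space: $T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T \delta^{(m-1)}$
  • subtract result from $\delta^{(m-1)}$: $(I - T S_\alpha \prod_{\beta<\alpha} (I - S_\beta)\, T)\, \delta^{(m-1)}$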
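The iterative scheme can be sketched for one treatment term and the within-block stratum. The design below is hypothetical (4 blocks of 3 units, 4 unequally replicated treatments, so not first-order balanced) and the data are made up; `iterate`, its tolerance and its iteration limit are illustrative choices, not part of the talk. Every step is a table of means inserted back into a vector.

```python
def pivot(factor, v):
    """Replace each value in v by the mean of v over its level of `factor`."""
    totals, counts = {}, {}
    for level, x in zip(factor, v):
        totals[level] = totals.get(level, 0.0) + x
        counts[level] = counts.get(level, 0) + 1
    return [totals[level] / counts[level] for level in factor]


def iterate(y, block, treat, tol=1e-12, max_iter=1000):
    """Within-block treatment estimates by repeated projections."""

    def into_stratum(v):
        # Within-block stratum: remove block means (this also removes the grand mean).
        return [x - m for x, m in zip(v, pivot(block, v))]

    def stratum_then_treatment(v):
        return pivot(treat, into_stratum(v))

    delta = stratum_then_treatment(y)      # first approximation
    tau = list(delta)
    for _ in range(max_iter):
        # Treatment space -> stratum -> treatment space, then subtract.
        back = stratum_then_treatment(pivot(treat, delta))
        delta = [d - b for d, b in zip(delta, back)]
        tau = [t + d for t, d in zip(tau, delta)]
        if max(abs(d) for d in delta) < tol:
            break
    return tau


# Hypothetical unbalanced design: treatment 0 appears 5 times, treatment 1 three times.
block = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
treat = [0, 1, 2, 0, 1, 3, 0, 2, 3, 0, 0, 1]
y = [10.0, 12.0, 9.0, 11.0, 14.0, 8.0, 13.0, 10.0, 7.0, 12.0, 11.0, 15.0]
tau = iterate(y, block, treat)
```

For a connected design such as this one the increments shrink geometrically, and the returned vector satisfies the within-stratum normal equations to within the tolerance.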