Statistical Analysis of Compositional Data

Source: ima.udg.edu/Activitats/CoDaWork05/Course_CW_05_Slides.pdf

Carles Barceló Vidal

J. Antoni Martín Fernández

Santiago Thió Fdez-Henestrosa

Dept. d’Informàtica i Matemàtica Aplicada

Universitat de Girona

Campus de Montilivi

E-17071 Girona

Catalunya – Spain.


What is compositional data?

Traditionally, a composition is a positive vector $x = (x_1, \ldots, x_D)'$ whose components are subject to a constant-sum restriction:

$x_1 + \cdots + x_D = \text{constant}$.

Compositional data ≡ closed data


What is compositional data?

A positive vector $w = (w_1, \ldots, w_D)'$ is compositional when our interest lies in the relative magnitudes $w_j/w_k$ of its parts and not in their absolute values.

Scale-invariance property

If a positive vector $w = (w_1, \ldots, w_D)'$ is compositional, the vectors $w$ and $kw$, with $k > 0$, carry the same information.


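USA presidential election, 2000

| State | Bush | Gore | Others | Total |
|---|---|---|---|---|
| Alabama | 943799 | 696741 | 26270 | 1666810 |
| Alaska | 136068 | 64252 | 30347 | 230667 |
| ... | ... | ... | ... | ... |
| Wisconsin | 1235035 | 1240431 | 114415 | 2589881 |
| Wyoming | 147674 | 60421 | 5331 | 213426 |

| State | Bush | Gore | Others | Total |
|---|---|---|---|---|
| Alabama | 56,6% | 41,8% | 1,6% | 100% |
| Alaska | 59,0% | 27,9% | 13,1% | 100% |
| ... | ... | ... | ... | 100% |
| Wisconsin | 47,7% | 47,9% | 4,4% | 100% |
| Wyoming | 69,2% | 28,3% | 2,5% | 100% |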


Activity patterns of a statistician

Daily time (hours) devoted by an academic statistician to different activities: te = teaching; co = consultation; ad = administration; re = research; ot = other wakeful activities; sl = sleep.

| D | te | co | ad | re | ot | sl | Total |
|---|---|---|---|---|---|---|---|
| 1 | 3,5 | 2,0 | 4,5 | 2,5 | 6,5 | 5,0 | 24 |
| 2 | 4,0 | 2,0 | 2,5 | 3,0 | 6,5 | 6,0 | 24 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 19 | 2,5 | 2,5 | 3,0 | 2,0 | 5,0 | 8,5 | 24 |
| 20 | 2,5 | 2,0 | 3,0 | 3,0 | 4,0 | 9,0 | 24 |

| D | te | co | ad | re | ot | sl | Total |
|---|---|---|---|---|---|---|---|
| 1 | 14,4% | 8,3% | 18,8% | 10,4% | 27,1% | 20,8% | 100% |
| 2 | 16,7% | 8,3% | 10,4% | 12,5% | 27,1% | 25,0% | 100% |
| ... | ... | ... | ... | ... | ... | ... | 100% |
| 19 | 10,4% | 10,4% | 12,5% | 8,3% | 20,8% | 35,4% | 100% |
| 20 | 10,5% | 8,3% | 12,5% | 12,5% | 16,7% | 37,5% | 100% |


Arctic lake

Sand, silt, clay composition (% by weight) of 39 sediment samples from an Arctic lake

| Sample | Sand | Silt | Clay | Total |
|---|---|---|---|---|
| S01 | 77.5 | 19.5 | 3.0 | 100% |
| S02 | 71.9 | 24.9 | 3.2 | 100% |
| ... | ... | ... | ... | 100% |
| S39 | 2.0 | 47.8 | 50.2 | 100% |


Volcano H

Percentage of Cl, K2O, P2O5, TiO2 and SiO2 in 46 samples of volcanic rocks from a volcano H

| Sample | Cl | K2O | P2O5 | TiO2 | SiO2 | Total |
|---|---|---|---|---|---|---|
| 1 | 0.0638 | 1.83 | 1.01 | 3.70 | 44.99 | 51.59% |
| 2 | 0.1116 | 1.36 | 0.81 | 3.83 | 43.45 | 49.56% |
| 3 | 0.0611 | 1.36 | 1.09 | 4.10 | 42.98 | 49.59% |
| ... | ... | ... | ... | ... | ... | ... |
| 44 | 0.0477 | 1.67 | 0.96 | 3.64 | 45.14 | 51.46% |
| 45 | 0.0015 | 1.70 | 0.95 | 3.64 | 45.15 | 51.44% |
| 46 | 0.0986 | 3.22 | 0.50 | 1.87 | 54.41 | 60.10% |


Halimba boreholes

Percentages of Al2O3, SiO2, Fe2O3, TiO2, H2O, CaO and MgO in some samples from different drills in the Halimba region (Hungary)

| Al2O3 | SiO2 | Fe2O3 | TiO2 | H2O | CaO | MgO | Total |
|---|---|---|---|---|---|---|---|
| 52,5 | 6,7 | 23,6 | 2,6 | 12,0 | 0,2 | 0,1 | 97,7% |
| 47,7 | 4,6 | 32,1 | 2,3 | 12,0 | 2,0 | 0,0 | 100,7% |
| 50,6 | 8,9 | 25,4 | 2,5 | 11,9 | 1,1 | 0,0 | 100,4% |
| ... | ... | ... | ... | ... | ... | ... | ... |


The space of compositions

Any $D \times 1$ real vector $w = (w_1, \ldots, w_D)'$ with positive components $w_1, \ldots, w_D$ will be called a D-observational vector.

Therefore, the set of these vectors is $\mathbb{R}^D_+$, the positive orthant of $\mathbb{R}^D$.

Definition Two D-observational vectors $w$ and $w^*$ are compositionally equivalent, $w \sim w^*$, when there exists a positive proportionality constant $k$ such that $w = k w^*$.

This relation partitions the vectors of $\mathbb{R}^D_+$ into equivalence classes, called D-compositions.

The composition generated by an observational vector $w$ will be symbolized by $\underline{w}$, i.e.,

$\underline{w} = \{k w : k \in \mathbb{R}_+\}$.


Scale invariance

Definition A function $f$ defined on $\mathbb{R}^D_+$ is said to be scale invariant if

$f(kw) = f(w)$ for every $w \in \mathbb{R}^D_+$ and $k \in \mathbb{R}_+$,

or equivalently

$f(w) = f(w^*)$ when $w \sim w^*$.

Property Any scale invariant function $f(w)$ defined on $\mathbb{R}^D_+$ can be expressed in terms of ratios of the components $w_1, \ldots, w_D$ of $w$, such as

$w_1/w_D, \ldots, w_{D-1}/w_D$ or $w_1/g(w), \ldots, w_D/g(w)$,

where $g(w) = (w_1 w_2 \cdots w_D)^{1/D}$ is the geometric mean of the components of $w$.

Property Any function defined on the compositional space $\mathcal{C}^{D-1}$ arises from a scale invariant function defined on the positive real space $\mathbb{R}^D_+$.


The space of compositions

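A D-part composition can be geometrically interpreted as a ray from the origin in the positive orthant of $\mathbb{R}^D$.

[Figure: ray through $w$ in the positive orthant (axes W1, W2, W3) and its representative $\mathrm{ccl}_L\, w$.]

The set $\mathcal{C}^{D-1}$ of all D-compositions will be called the (D − 1)-dimensional compositional space.

The compositional closure mapping from $\mathbb{R}^D_+$ to $\mathcal{C}^{D-1}$, denoted by ccl, is defined by

$\mathrm{ccl}\, w = \underline{w}$ ($w \in \mathbb{R}^D_+$).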


Representation of a composition

Linear criterion

Definition The linear criterion selects from each D-composition $\underline{w}$ the D-observational vector $w^*$, with components $w^*_1, \ldots, w^*_D$, whose sum is equal to 1. If this vector is symbolized by $\mathrm{ccl}_L\, w$, or by $\mathcal{C} w$, then

$\mathrm{ccl}_L\, w = \mathcal{C} w = w \Big/ \sum_{j=1}^{D} w_j$ ($w \in \mathbb{R}^D_+$).

The set of all the vectors $x = \mathrm{ccl}_L\, w$ ($w \in \mathbb{R}^D_+$) is the well-known (D − 1)-dimensional simplex $\mathcal{S}^D$.
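The linear criterion can be sketched in a few lines of NumPy. The function name `closure` and the `total` argument are illustrative choices, not part of the slides; the data are the Alabama vote counts from the election table.

```python
import numpy as np

def closure(w, total=1.0):
    """Linear criterion ccl_L: rescale a positive vector to a fixed sum."""
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("all components must be strictly positive")
    return total * w / w.sum()

# Alabama votes (Bush, Gore, Others) from the election table
alabama = np.array([943799, 696741, 26270])
x = closure(alabama)                 # representative on the simplex
x_scaled = closure(3.7 * alabama)    # k*w yields the same representative
```

Because `x` and `x_scaled` coincide, any function computed from the closed vector is automatically scale invariant.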


Linear criterion

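[Figure: linear criterion. The ray through $w$ (axes W1, W2, W3) meets the unit simplex at $\mathrm{ccl}_L\, w$; ternary diagram with vertices 1, 2, 3 and coordinates x1, x2, x3.]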


Representation of a composition

Other criteria

Spherical criterion

[Figure: spherical criterion, $\mathrm{ccl}_E\, w$ (axes W1, W2, W3).]

Hyperbolic criterion

[Figure: hyperbolic criterion, $\mathrm{ccl}_H\, w$ (axes W1, W2).]


Subcompositions

Sometimes, given a composition $w$ in $\mathcal{C}^{D-1}$, we may wish to focus attention on the relative magnitudes of a subset of components.

Definition If $S$ is any subset of the indices $1, \ldots, D$ of a given D-composition $w \in \mathcal{C}^{D-1}$, and $w_S$ is the subvector formed from the corresponding components of $w$, then $\underline{w}_S = \mathrm{ccl}\, w_S$ is termed a subcomposition.

If the subset $S$ is formed by $C$ indices, with $2 \le C < D$, the subcomposition $\underline{w}_S$ belongs to the compositional space $\mathcal{C}^{C-1}$.

Definition The formation of a C-subcomposition $w_S$ from a D-composition $w$ may be considered as the mapping $\mathrm{sub}_S$ from $\mathcal{C}^{D-1}$ to $\mathcal{C}^{C-1}$:

$\mathrm{sub}_S\, w = w_S$ ($w \in \mathcal{C}^{D-1}$).
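A minimal sketch of the mapping $\mathrm{sub}_S$, assuming compositions are stored as their closed representatives. The helper names and the 0-based indexing are illustrative; the sample is S01 from the Arctic lake table.

```python
import numpy as np

def closure(w):
    """Rescale a positive vector so that it sums to 1."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def subcomposition(w, parts):
    """Keep the parts listed in `parts` (0-based indices) and re-close."""
    w = np.asarray(w, dtype=float)
    return closure(w[list(parts)])

s01 = np.array([77.5, 19.5, 3.0])        # sand, silt, clay
sand_silt = subcomposition(s01, [0, 1])  # 2-part subcomposition
```

The ratio sand/silt is the same in `s01` and in `sand_silt`, which is exactly the information a subcomposition is meant to retain.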


Subcompositions

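[Figure: a composition $w$ (axes W1, W2, W3), its subvector $w_{12}$, and the closed representatives $\mathrm{ccl}_L\, w$ and $\mathrm{ccl}_L\, w_{12}$ shown in the ternary diagram with vertices 1, 2, 3.]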


Compositional Problems

1. Percentage of Cl, K2O, P2O5, TiO2 and SiO2 in 46 samples of volcanic rocks from a volcano H:

| Num. | Cl | K2O | P2O5 | TiO2 | SiO2 |
|---|---|---|---|---|---|
| 1 | 0.0638 | 1.83 | 1.01 | 3.70 | 44.99 |
| 2 | 0.1116 | 1.36 | 0.81 | 3.83 | 43.45 |
| 3 | 0.0611 | 1.36 | 1.09 | 4.10 | 42.98 |
| ... | ... | ... | ... | ... | ... |
| 44 | 0.0477 | 1.67 | 0.96 | 3.64 | 45.14 |
| 45 | 0.0015 | 1.70 | 0.95 | 3.64 | 45.15 |
| 46 | 0.0986 | 3.22 | 0.50 | 1.87 | 54.41 |



1.a Is it possible to describe the pattern of variability of these volcanic rocks and to define a covariance or correlation structure?

1.b Is it possible to define a measure of total variability of this set of volcanic rocks?

1.c For a new volcanic rock specimen with known composition (Cl, K2O, P2O5, TiO2, SiO2)′ and claimed to be from the same volcano, can we say whether it is fairly typical of this volcano? If not, can we place some measure on its atypicality?

1.d To what extent, if any, does the subcomposition (Cl, K2O, P2O5) explain the pattern of variability of the full composition?

1.e From the ternary diagram it seems that the pattern of (K2O, P2O5, TiO2) can be well fitted by a curve. How can we confirm this?



2. Percentage of Cl, K2O, P2O5, TiO2 and SiO2: 65 samples of volcanic rocks from a volcano A, and 19 samples from another volcano D.

| Num. | Cl | K2O | P2O5 | TiO2 | SiO2 |
|---|---|---|---|---|---|
| 1A | 0.1776 | 0.64 | 0.34 | 1.57 | 49.26 |
| 2A | 0.2050 | 1.55 | 0.43 | 1.61 | 48.22 |
| ... | ... | ... | ... | ... | ... |
| 65A | 0.0391 | 1.70 | 0.55 | 1.63 | 50.91 |
| 1D | 0.0181 | 0.64 | 0.31 | 2.55 | 49.33 |
| 2D | 0.0053 | 1.10 | 0.59 | 2.81 | 46.89 |
| ... | ... | ... | ... | ... | ... |
| 19D | 0.0200 | 0.99 | 0.37 | 3.16 | 45.85 |



2.a Can we detect any differences between the compositional pattern of volcano A and volcano D?

If so, how can we choose a 3-part subcomposition which somehow captures the essence of the two patterns individually and yet emphasizes the differences between the patterns?

2.b Is it possible to establish a classification rule for discriminating between volcanoes A and D?



3. Sand, silt, clay composition (% by weight) of 39 sediment samples at different water depths in an Arctic lake:

| Num. | Sand | Silt | Clay | Depth (m) |
|---|---|---|---|---|
| S01 | 77.5 | 19.5 | 3.0 | 10.4 |
| S02 | 71.9 | 24.9 | 3.2 | 11.7 |
| ... | ... | ... | ... | ... |
| S39 | 2.0 | 47.8 | 50.2 | 103.7 |

3.a Is sediment composition dependent on water depth?

3.b If so, how can we quantify the extent of the dependence?


How to analyze "closed" raw data?

Spurious correlations

Pearson (1897) "If u = f(x, y) and v = g(z, y) be two functions of three variables x, y, z, and these variables be selected at random so that there exists no correlation between x and y, y and z, or z and x, there will still be found to exist correlation between u and v. . . . That is likely to occur when u and v are indices with the same denominator".

Consequence The standard covariance matrix $[s_{ij}]$ of a closed data set from $\mathcal{S}^D$ is always singular because

$\sum_{j=1}^{D} s_{ij} = 0$, for $i = 1, \ldots, D$.
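The zero row sums are easy to check numerically. The following sketch uses simulated data (independent lognormal parts, an arbitrary choice made only for illustration) that are closed to sum 1 before the ordinary covariance matrix is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent positive parts, then closure to the simplex S^4
W = rng.lognormal(size=(200, 4))
X = W / W.sum(axis=1, keepdims=True)

S = np.cov(X, rowvar=False)              # standard covariance of closed data
row_sums = S.sum(axis=1)                 # each row sums to (numerically) zero
rank = np.linalg.matrix_rank(S, tol=1e-10)
```

Since every row of `S` sums to zero, each part must have at least one negative covariance with another part, whatever the dependence among the original parts.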


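How to analyze "closed" raw data?

Subcompositional incoherence

Example

| Scientist A: full compositions from $\mathcal{S}^4$, $(x_1, x_2, x_3, x_4)$ | Scientist B: subcompositions from $\mathcal{S}^3$, $(s_1, s_2, s_3)$ |
|---|---|
| (0.1, 0.2, 0.1, 0.6) | (0.250, 0.500, 0.250) |
| (0.2, 0.1, 0.1, 0.6) | (0.500, 0.250, 0.250) |
| (0.3, 0.3, 0.2, 0.2) | (0.375, 0.375, 0.250) |
| corr{x(1), x(2)} = 0.5 | corr{s(1), s(2)} = −1 |

Any statement that scientists A and B make about the common parts 1, 2 and 3 must agree. Here the two correlations between parts 1 and 2 do not even share a sign, so the ordinary correlation is subcompositionally incoherent.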
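The two correlations in the example can be reproduced directly from the three compositions of the table. The comparison of logratio variances at the end is an added illustration of a quantity on which both scientists do agree.

```python
import numpy as np

# Scientist A: full 4-part compositions from the table
X = np.array([[0.1, 0.2, 0.1, 0.6],
              [0.2, 0.1, 0.1, 0.6],
              [0.3, 0.3, 0.2, 0.2]])

# Scientist B: subcompositions formed from parts 1, 2, 3 and re-closed
S = X[:, :3] / X[:, :3].sum(axis=1, keepdims=True)

corr_full = np.corrcoef(X[:, 0], X[:, 1])[0, 1]   # approximately 0.5
corr_sub = np.corrcoef(S[:, 0], S[:, 1])[0, 1]    # approximately -1

# The variance of log(x1/x2) is unaffected by forming the subcomposition
logratio_var_full = np.var(np.log(X[:, 0] / X[:, 1]), ddof=1)
logratio_var_sub = np.var(np.log(S[:, 0] / S[:, 1]), ddof=1)
```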


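Statistics in $\mathbb{R}^D$

Translation In $\mathbb{R}^D$ the inner operation is translation. If $t \in \mathbb{R}^D$, the translation $t$ moves the random vector $X$ in $\mathbb{R}^D$ to the random vector $X + t$ in such a way that

$E\{X + t\} = E\{X\} + t$ and $\Sigma\{X + t\} = \Sigma\{X\}$.

Scalar product For any random vector $X$ on $\mathbb{R}^D$ and for any $\lambda \in \mathbb{R}$,

$E\{\lambda X\} = \lambda E\{X\}$ and $\Sigma\{\lambda X\} = \lambda^2 \Sigma\{X\}$.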


Perturbations on $\mathcal{C}^{D-1}$

Scale invariance is the property which characterizes compositional data. Therefore, any "operation" involving compositions must be compatible with this property.

Definition We define an inner operation ⊕ in $\mathcal{C}^{D-1}$ as

$w \oplus w^* = \mathrm{ccl}\,(w_1 w^*_1, \ldots, w_D w^*_D)'$.

$(\mathcal{C}^{D-1}, \oplus)$ is a commutative group:

• The composition $1_D = \mathrm{ccl}\,(1, \ldots, 1)'$ is the neutral element.

• The inverse composition $w^{-1}$ of $w = \mathrm{ccl}\,(w_1, \ldots, w_D)'$ is the composition $w^{-1} = \mathrm{ccl}\,(1/w_1, \ldots, 1/w_D)'$.


The group of perturbations in $\mathcal{C}^{D-1}$

Definition Given a composition $p \in \mathcal{C}^{D-1}$, the perturbation associated to $p$ is the transformation from $\mathcal{C}^{D-1}$ to $\mathcal{C}^{D-1}$ defined by

$c \to p \oplus c$ ($c \in \mathcal{C}^{D-1}$).

Then, we say that $p \oplus c$ is the composition which results when the perturbation $p$ is applied to the composition $c$.

Moreover, given two compositions of $\mathcal{C}^{D-1}$

$w = \mathrm{ccl}\,(w_1, \ldots, w_D)'$, $w^* = \mathrm{ccl}\,(w^*_1, \ldots, w^*_D)'$,

there exists a unique perturbation $p$ which transforms $w$ into $w^*$:

$p = w^* \oplus w^{-1} = \mathrm{ccl}\left(\frac{w^*_1}{w_1}, \ldots, \frac{w^*_D}{w_D}\right)'$.
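The group operations can be sketched as follows, with compositions represented by their closed vectors. The function names are illustrative; the two compositions are AFM percentages of the form used later in the Skye lavas example.

```python
import numpy as np

def closure(w):
    """Rescale a positive vector so that it sums to 1."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def perturb(w, p):
    """w (+) p: component-wise product followed by closure."""
    return closure(np.asarray(w, dtype=float) * np.asarray(p, dtype=float))

def inverse(w):
    """Inverse composition: closure of the reciprocals."""
    return closure(1.0 / np.asarray(w, dtype=float))

def difference(w_from, w_to):
    """The unique perturbation p with p (+) w_from = w_to."""
    return perturb(w_to, inverse(w_from))

w = closure([52, 42, 6])
w_star = closure([24, 56, 20])
neutral = closure(np.ones(3))
p = difference(w, w_star)
```

Applying `perturb(p, w)` recovers `w_star`, and `perturb(w, inverse(w))` gives the neutral element.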



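[Figure: ternary diagrams (vertices X1, X2, X3) showing the neutral element e, a composition x, a perturbation p and the perturbed composition p ⊕ x; and two compositions x, x* together with x ⊕ x*⁻¹.]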



Perturbation in compositional space plays the same role as translation plays in real space. The set of all perturbations in $\mathcal{C}^{D-1}$ is a commutative group isomorphic to $(\mathcal{C}^{D-1}, \oplus)$. For this reason, we will also call perturbation the inner operation ⊕ defined on $\mathcal{C}^{D-1}$.

The assumption that the group of perturbations is the operating group on the compositional space is the keystone of the methodology introduced by Aitchison (1986). In fact, it implies accepting that the "difference" between two compositions $w = \mathrm{ccl}\,(w_1, \ldots, w_D)'$ and $w^* = \mathrm{ccl}\,(w^*_1, \ldots, w^*_D)'$ will be based on the ratios $w^*_j / w_j$ between parts instead of on the arithmetic differences $w^*_j - w_j$.



Interpretation

• Some natural processes can be interpreted as a succession of changes from an initial composition $w_0$ to a final composition $w_n$ through the application of successive perturbations:

$w_0 \to p_1 \oplus w_0 = w_1 \to p_2 \oplus w_1 = w_2 \to \cdots \to p_n \oplus w_{n-1} = w_n$.

In this manner,

$w_n = (p_n \oplus p_{n-1} \oplus \cdots \oplus p_1) \oplus w_0$.


Genesis of normal distribution

Particles fall from a funnel onto tips of triangles, where they are deviated to the left or to the right with equal probability (0.5) and finally fall into receptacles. If the tip of a triangle is at distance x from the left edge of the board, triangle tips to the right and to the left below it are placed at x + k and x − k (k constant).


Genesis of lognormal distribution

Particles fall from a funnel onto tips of triangles, where they are deviated to the left or to the right with equal probability (0.5) and finally fall into receptacles. If the tip of a triangle is at distance x from the left edge of the board, triangle tips to the right and to the left below it are placed at x · k and x / k (k constant).



Interpretation

• If $w = \mathrm{ccl}\,(w_{SiO_2}, \ldots, w_{P_2O_5})'$ expresses the percentage composition of major oxides of a rock, its molecular composition will be

$w^* = \mathrm{ccl}\,(w_{SiO_2}/m_{SiO_2}, \ldots, w_{P_2O_5}/m_{P_2O_5})'$,

where $m_j$ symbolizes the molecular weight of oxide $j$.

Therefore, composition $w^*$ can be obtained by applying the perturbation

$m^{-1} = (\mathrm{ccl}\,(m_{SiO_2}, \ldots, m_{P_2O_5})')^{-1}$

to composition $w$:

$w^* = m^{-1} \oplus w$.


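The vector space $(\mathcal{C}^{D-1}, \oplus, \otimes)$

Definition The external operation ⊗ in $\mathcal{C}^{D-1}$ is defined as

$\lambda \otimes w = \mathrm{ccl}\,(w_1^{\lambda}, \ldots, w_D^{\lambda})'$,

for each $\lambda \in \mathbb{R}$ and each $w \in \mathcal{C}^{D-1}$.

$(\mathcal{C}^{D-1}, \oplus, \otimes)$ is a vector space of dimension D − 1.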

[Figure: ternary diagrams (vertices X1, X2, X3) showing the neutral element e, a composition x and its multiples 2 ⊗ x and −2 ⊗ x.]


The log and the exp transformations between $\mathbb{R}^D_+$ and $\mathbb{R}^D$

The logarithmic transformation on $\mathbb{R}^D_+$ transforms the rays from the origin, which represent the compositions of the space $\mathcal{C}^{D-1}$, into straight lines of $\mathbb{R}^D$ parallel to the vector $1_D = (1, \ldots, 1)'$ (D ones).

Inversely, the exponential transformation on $\mathbb{R}^D$ transforms these straight lines of $\mathbb{R}^D$ parallel to the vector $1_D$ into rays from the origin of $\mathbb{R}^D_+$.


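[Figure: a ray ccl w in the positive quadrant (axes W1, W2) and the corresponding straight line z + U in log space (axes Z1, Z2), together with the subspace V.]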


Centered logratio transformation

Definition The centered logratio transformation, denoted by clr, is the one-to-one function from the compositional space $\mathcal{C}^{D-1}$ to the subspace

$V = \{z = (z_1, \ldots, z_D)' \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$

of $\mathbb{R}^D$, defined by

$\mathrm{clr}\, w = \log \frac{w}{g(w)}$ ($w \in \mathcal{C}^{D-1}$).

The inverse transformation, from $V$ to $\mathcal{C}^{D-1}$, is given by

$\mathrm{clr}^{-1} z = \mathrm{ccl}\,(\exp z)$ ($z \in V$).

The logarithmic and the exponential transformations establish a one-to-one correspondence between the simplex $\mathcal{S}^D$ and the hyperplane $V$ in $\mathbb{R}^D$.






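Property The centered logratio transformation is an isomorphism between the vector space $(\mathcal{C}^{D-1}, \oplus, \otimes)$ and the vector subspace

$V = \{z = (z_1, \ldots, z_D)' \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$

of $(\mathbb{R}^D, +, \cdot)$. Therefore,

$\mathrm{clr}\,(w \oplus w^*) = \mathrm{clr}\, w + \mathrm{clr}\, w^*$;

$\mathrm{clr}\,(\lambda \otimes w) = \lambda\, \mathrm{clr}\, w$,

where $w, w^* \in \mathcal{C}^{D-1}$, and $\lambda \in \mathbb{R}$.

Equally,

$\mathrm{clr}^{-1}(z + z^*) = \mathrm{clr}^{-1} z \oplus \mathrm{clr}^{-1} z^*$;

$\mathrm{clr}^{-1}(\lambda z) = \lambda \otimes \mathrm{clr}^{-1} z$,

where $z, z^* \in V$, and $\lambda \in \mathbb{R}$.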
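The isomorphism can be checked numerically with a small sketch. All function names are illustrative, and the two test compositions are arbitrary positive vectors.

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def perturb(w, p):
    """w (+) p"""
    return closure(np.asarray(w, dtype=float) * np.asarray(p, dtype=float))

def power(lam, w):
    """lam (x) w: component-wise power followed by closure."""
    return closure(np.asarray(w, dtype=float) ** lam)

def clr(w):
    """Centered logratio: log components minus their mean."""
    logw = np.log(np.asarray(w, dtype=float))
    return logw - logw.mean()

def clr_inv(z):
    """Inverse clr: closure of the exponentials."""
    return closure(np.exp(np.asarray(z, dtype=float)))

w = np.array([52.0, 42.0, 6.0])
v = np.array([24.0, 56.0, 20.0])
lam = 2.5
```

With these definitions, `clr(perturb(w, v))` equals `clr(w) + clr(v)` and `clr(power(lam, w))` equals `lam * clr(w)`, up to floating-point error.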


Isometric logratio transformation

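Let $\mathcal{V} = \{v_1, \ldots, v_{D-1}\}$ be an orthonormal basis of the subspace

$V = \{z = (z_1, \ldots, z_D)' \in \mathbb{R}^D : z_1 + \cdots + z_D = 0\}$.

Then, since $\mathrm{clr}\, w \in V$, it will always be possible to write

$\mathrm{clr}\, w = u_1 v_1 + \cdots + u_{D-1} v_{D-1}$,

for any $w \in \mathcal{C}^{D-1}$.

Definition The isometric logratio transformation, denoted by $\mathrm{ilr}_{\mathcal{V}}$, is the one-to-one function from the compositional space $\mathcal{C}^{D-1}$ to $\mathbb{R}^{D-1}$ defined by

$\mathrm{ilr}_{\mathcal{V}}\, w = (u_1, \ldots, u_{D-1})'$ ($w \in \mathcal{C}^{D-1}$).

Like clr, the transformation $\mathrm{ilr}_{\mathcal{V}}$ is an isomorphism between the vector spaces $(\mathcal{C}^{D-1}, \oplus, \otimes)$ and $(\mathbb{R}^{D-1}, +, \cdot)$.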


Skye Lavas

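| Sample | Na2O + K2O | Fe2O3 | MgO |
|---|---|---|---|
| S1 | 52 | 42 | 6 |
| S2 | 52 | 44 | 4 |
| S3 | 47 | 48 | 5 |

| Sample | clr (Na2O + K2O) | clr (Fe2O3) | clr (MgO) |
|---|---|---|---|
| S1 | 0,7910 | 0,5775 | -1,3685 |
| S2 | 0,9107 | 0,7436 | -1,6543 |
| S3 | 0,7399 | 0,7609 | -1,5008 |

| Sample | ilr 1 [u1] | ilr 2 [u2] |
|---|---|---|
| S1 | 0,1510 | 1,6760 |
| S2 | 0,1181 | 2,0261 |
| S3 | -0,0149 | 1,8381 |

Hint The orthonormal basis $\mathcal{V}$ of the subspace $V \subset \mathbb{R}^3$ linked to the ilr coordinates is

$v_1 = \left(\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}, 0\right)'$, $v_2 = \left(\tfrac{1}{\sqrt{6}}, \tfrac{1}{\sqrt{6}}, -\tfrac{2}{\sqrt{6}}\right)'$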
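The clr and ilr coordinates of the three samples can be recomputed with the basis given in the hint. This is a sketch with illustrative function names; the ilr coordinates are obtained as inner products of clr w with the basis vectors.

```python
import numpy as np

def clr(w):
    logw = np.log(np.asarray(w, dtype=float))
    return logw - logw.mean()

# Orthonormal basis of V taken from the hint (rows are v1 and v2)
V = np.array([[1 / np.sqrt(2), -1 / np.sqrt(2), 0.0],
              [1 / np.sqrt(6), 1 / np.sqrt(6), -2 / np.sqrt(6)]])

def ilr(w):
    """Coordinates of clr(w) with respect to the basis rows of V."""
    return V @ clr(w)

samples = {"S1": [52, 42, 6], "S2": [52, 44, 4], "S3": [47, 48, 5]}
clr_coords = {name: clr(w) for name, w in samples.items()}
ilr_coords = {name: ilr(w) for name, w in samples.items()}
```

Because the basis is orthonormal, `V.T @ ilr(w)` reconstructs `clr(w)`, and Euclidean norms agree in both coordinate systems.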


Skye Lavas

[Figure: Skye lavas shown in the ternary diagram, in clr-coordinates and in ilr-coordinates.]


$\mathcal{C}^{D-1}$ as a Euclidean space

The clr transformation between $\mathcal{C}^{D-1}$ and the subspace $V$ of $\mathbb{R}^D$ allows us to transfer to $\mathcal{C}^{D-1}$ the real Euclidean structure defined on $V$:

$\langle w, w^* \rangle_C = (\mathrm{clr}\, w)'\, \mathrm{clr}\, w^* = (\log w)' H_D \log w^*$,

$\|w\|_C = \|\mathrm{clr}\, w\| = [(\log w)' H_D \log w]^{1/2}$,

$d_C(w, w^*) = d_{Euc}(\mathrm{clr}\, w, \mathrm{clr}\, w^*) = [(\log w^* - \log w)' H_D (\log w^* - \log w)]^{1/2}$,

where $H_D$ is the $(D \times D)$ centering matrix. This matrix is equal to $I_D - D^{-1} J_D$, where $I_D$ is the identity matrix and $J_D = 1_D 1_D'$.

Therefore, by construction, the transformations clr and $\mathrm{clr}^{-1}$, and also ilr and $\mathrm{ilr}^{-1}$, preserve the distances defined in $\mathcal{C}^{D-1}$ and $\mathbb{R}^{D-1}$.


Compositional geometry in $\mathcal{C}^{D-1}$

• We cannot analyze the simplex $\mathcal{S}^D$ as we analyze the Euclidean real space.

Let

$w_1 = \mathrm{ccl}\,(1.000, 49.500, 39.500)'$

$w_2 = \mathrm{ccl}\,(0.010, 49.995, 39.995)'$

$w_3 = \mathrm{ccl}\,(25.0, 50.0, 25.0)'$

$w_4 = \mathrm{ccl}\,(35.0, 30.0, 35.0)'$

be four compositions from $\mathcal{S}^3$.

Then,

$d_{Euc}(w_1, w_2) \approx 1.21 < 24.49 \approx d_{Euc}(w_3, w_4)$,

whereas

$d_C(w_1, w_2) \approx 3.77 > 0.69 \approx d_C(w_3, w_4)$.
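The four distances can be recomputed as follows. The function names are illustrative, and the Euclidean distances are taken between the vectors exactly as listed on the slide, which is an assumption about how the quoted values were obtained.

```python
import numpy as np

def clr(w):
    logw = np.log(np.asarray(w, dtype=float))
    return logw - logw.mean()

def aitchison_distance(w, v):
    """Compositional distance d_C: Euclidean distance between clr vectors."""
    return np.linalg.norm(clr(w) - clr(v))

w1 = np.array([1.000, 49.500, 39.500])
w2 = np.array([0.010, 49.995, 39.995])
w3 = np.array([25.0, 50.0, 25.0])
w4 = np.array([35.0, 30.0, 35.0])

d_euc_12 = np.linalg.norm(w1 - w2)      # about 1.21
d_euc_34 = np.linalg.norm(w3 - w4)      # about 24.49
d_c_12 = aitchison_distance(w1, w2)     # about 3.77
d_c_34 = aitchison_distance(w3, w4)     # about 0.69
```

The ordering of the two pairs is reversed: a hundredfold change in a small part is large in relative terms even though it is small in absolute terms.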


Compositional geometry in $\mathcal{C}^{D-1}$

• Any linear variety on $\mathcal{C}^{D-1}$ (straight lines, planes, etc.) can always be implicitly expressed by a system of linear equations in $\log w_1, \ldots, \log w_D$ of the form

$a_{11} \log w_1 + \cdots + a_{1D} \log w_D = b_1$

$\vdots$

$a_{m1} \log w_1 + \cdots + a_{mD} \log w_D = b_m$,

with $a_{i1} + \cdots + a_{iD} = 0$, for each $i = 1, \ldots, m$.

• In particular, the parametric equation, varying $t \in \mathbb{R}$, of a straight line on $\mathcal{C}^{D-1}$ is given by

$w(t) = \mathrm{ccl}\,(\exp(\alpha_1 + \lambda_1 t), \ldots, \exp(\alpha_D + \lambda_D t))'$,

where $\sum_{j=1}^{D} \alpha_j = 0$ and $\sum_{j=1}^{D} \lambda_j = 0$.

• As in real space, the concepts of parallelism and orthogonality can be introduced in $\mathcal{C}^{D-1}$.


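Parallelism in $\mathcal{C}^2$

[Figure: two ternary diagrams (vertices 1, 2, 3) with families of parallel compositional lines: $\log w_2 - \log w_3 = k$ for $k > 0$, $k = 0$, $k < 0$; and $\log w_1 - 2 \log w_2 + \log w_3 = k$ for $k = -4, -2, 0, 2, 4$.]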


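Orthogonality in $\mathcal{C}^2$

[Figure: two ternary diagrams (vertices 1, 2, 3), each with a pair of orthogonal compositional lines: $\log w_2 - \log w_3 = 0$ and $-2 \log w_1 + \log w_2 + \log w_3 = 0$; $\log w_1 - 3 \log w_2 + 2 \log w_3 = 0$ and $5 \log w_1 - \log w_2 - 4 \log w_3 = 0$.]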


Circles in $\mathcal{C}^2$

[Figure: compositional circles shown in the simplex $\mathcal{S}^3$ and in clr-space.]


The alr transformation

Definition The additive logratio transformation of index $j$ ($j = 1, \ldots, D$), denoted by $\mathrm{alr}_j$, is the one-to-one transformation from $\mathcal{C}^{D-1}$ to $\mathbb{R}^{D-1}$ defined by

$w \to y = \mathrm{alr}_j\, w = \log \frac{w_{-j}}{w_j}$,

where

$w_{-j} = (w_1, w_2, \ldots, w_{j-1}, w_{j+1}, \ldots, w_D)'$.

The inverse transformation of $\mathrm{alr}_D$, from $\mathbb{R}^{D-1}$ to $\mathcal{C}^{D-1}$, is given by

$\mathrm{alr}_D^{-1}\, y = \mathrm{ccl}\,(\exp y_1, \ldots, \exp y_{D-1}, 1)'$ ($y \in \mathbb{R}^{D-1}$).


The alr transformation

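Property The $\mathrm{alr}_j$ transformations ($j = 1, \ldots, D$) are isomorphisms between the vector spaces $(\mathcal{C}^{D-1}, \oplus, \otimes)$ and $(\mathbb{R}^{D-1}, +, \cdot)$, i.e.,

$\mathrm{alr}_j\,(w \oplus w^*) = \mathrm{alr}_j\, w + \mathrm{alr}_j\, w^*$;

$\mathrm{alr}_j\,(\lambda \otimes w) = \lambda\, \mathrm{alr}_j\, w$;

$\mathrm{alr}_j^{-1}(y + y^*) = \mathrm{alr}_j^{-1} y \oplus \mathrm{alr}_j^{-1} y^*$;

$\mathrm{alr}_j^{-1}(\lambda y) = \lambda \otimes \mathrm{alr}_j^{-1} y$,

where $w, w^* \in \mathcal{C}^{D-1}$, $y, y^* \in \mathbb{R}^{D-1}$ and $\lambda \in \mathbb{R}$.

Property The $\mathrm{alr}_j$ transformations ($j = 1, \ldots, D$) do not preserve the distances defined in the metric spaces $\mathcal{C}^{D-1}$ and $\mathbb{R}^{D-1}$, i.e., in general

$d_C(w, w^*) \neq d_{Euc}(\mathrm{alr}_j\, w, \mathrm{alr}_j\, w^*)$;

$d_{Euc}(y, y^*) \neq d_C(\mathrm{alr}_j^{-1} y, \mathrm{alr}_j^{-1} y^*)$.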
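Both properties can be illustrated with a short sketch. The function names are illustrative and the index `j` is 0-based here (the slides count parts from 1); the two compositions are AFM percentages of the form used in the Skye lavas example.

```python
import numpy as np

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def clr(w):
    logw = np.log(np.asarray(w, dtype=float))
    return logw - logw.mean()

def alr(w, j=-1):
    """Additive logratio with part j (0-based, default: last) as divisor."""
    w = np.asarray(w, dtype=float)
    return np.log(np.delete(w, j) / w[j])

def alr_inv(y):
    """Inverse of alr when the divisor is the last part."""
    return closure(np.append(np.exp(np.asarray(y, dtype=float)), 1.0))

s1 = np.array([52.0, 42.0, 6.0])
s2 = np.array([52.0, 44.0, 4.0])

y1, y2 = alr(s1), alr(s2)               # log(A/M), log(F/M)
d_alr = np.linalg.norm(y1 - y2)         # Euclidean distance in alr space
d_c = np.linalg.norm(clr(s1) - clr(s2)) # compositional distance
```

Here `alr` is additive under perturbation, but `d_alr` and `d_c` differ, which is why alr coordinates are not suitable for distance-based methods.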


Determination of a composition

A composition $w \in \mathcal{C}^{D-1}$ can be determined in several forms:

(i) Giving any D-observational vector belonging to $w$. Usually, we will choose the vector $x = \mathcal{C} w = \mathrm{ccl}_L\, w$ belonging to $\mathcal{S}^D$.

(ii) Giving the components $(z_1, \ldots, z_D)' = z$ of the centered logratio transformed vector $\mathrm{clr}\, w$. Since $z$ belongs to the subspace $V$ of $\mathbb{R}^D$, its components are related by the equality $z_1 + \cdots + z_D = 0$.

(iii) Giving the components $(y_1, \ldots, y_{D-1})' = y$ of the additive logratio transformed vector $\mathrm{alr}_D\, w$. If needed, we can choose the components of any other logratio $\mathrm{alr}_j\, w$ ($j \neq D$).

(iv) Giving the components $(u_1, \ldots, u_{D-1})' = u$ of the isometric logratio transformed vector $\mathrm{ilr}_{\mathcal{V}}\, w$, where $\mathcal{V}$ is a known orthonormal basis of the subspace $V$ of $\mathbb{R}^D$.


Determination of a composition

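• Skye lavas: A = Na2O + K2O, F = Fe2O3, M = MgO

| Sample | A | F | M |
|---|---|---|---|
| S1 | 52 | 42 | 6 |
| S2 | 52 | 44 | 4 |

| Sample | clr (A) | clr (F) | clr (M) |
|---|---|---|---|
| S1 | 0.7910 | 0.5775 | -1.3685 |
| S2 | 0.9107 | 0.7436 | -1.6543 |

| Sample | ilr 1 [u1] | ilr 2 [u2] |
|---|---|---|
| S1 | 0,1510 | 1,6760 |
| S2 | 0,1181 | 2,0261 |

| Sample | alr_M: log(A/M) | alr_M: log(F/M) |
|---|---|---|
| S1 | 2.159 | 1.946 |
| S2 | 2.565 | 2.398 |

| Sample | alr_A: log(F/A) | alr_A: log(M/A) |
|---|---|---|
| S1 | -0.214 | -2.159 |
| S2 | -0.167 | -2.565 |

Hint The orthonormal basis $\mathcal{V}$ of the subspace $V \subset \mathbb{R}^3$ linked to the ilr coordinates is

$v_1 = \left(\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}, 0\right)'$, $v_2 = \left(\tfrac{1}{\sqrt{6}}, \tfrac{1}{\sqrt{6}}, -\tfrac{2}{\sqrt{6}}\right)'$.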



Compositional data set

• Raw data matrix

$W = [w_{ij} : i = 1, \ldots, n;\ j = 1, \ldots, D]$,

or

$X = [x_{ij} : i = 1, \ldots, n;\ j = 1, \ldots, D]$,

where $x_i = (x_{i1}, \ldots, x_{iD})' \in \mathcal{S}^D$.

Example AFM composition of 23 aphyric Skye lavas [A = Na2O + K2O, F = Fe2O3, M = MgO].

| Obs. | A% | F% | M% |
|---|---|---|---|
| S1 | 52 | 42 | 6 |
| ... | ... | ... | ... |
| S23 | 24 | 56 | 20 |

$X = \begin{pmatrix} 52 & 42 & 6 \\ \vdots & \vdots & \vdots \\ 24 & 56 & 20 \end{pmatrix}$.



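• Centered logratio (clr) data matrix

$Z = [z_{ij} : i = 1, \ldots, n;\ j = 1, \ldots, D]$,

where $z_{ij} = \log\,(w_{ij}/g(w_i))$, with $g(w_i) = \left(\prod_{k=1}^{D} w_{ik}\right)^{1/D}$.

| Obs. | A% | F% | M% |
|---|---|---|---|
| S1 | 52 | 42 | 6 |
| ... | ... | ... | ... |
| S23 | 24 | 56 | 20 |

With columns clr A, clr F, clr M:

$Z = \begin{pmatrix} 0.791 & 0.577 & -1.368 \\ \vdots & \vdots & \vdots \\ -0.222 & 0.626 & -0.404 \end{pmatrix}$.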



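• Additive logratio (alr) data matrix

$Y = [y_{ij} : i = 1, \ldots, n;\ j = 1, \ldots, d]$,

where $y_{ij} = \log(w_{ij}/w_{iD})$ and $d = D - 1$.

| Obs. | A% | F% | M% |
|---|---|---|---|
| S1 | 52 | 42 | 6 |
| ... | ... | ... | ... |
| S23 | 24 | 56 | 20 |

With columns log(A/M), log(F/M):

$Y = \begin{pmatrix} 2.159 & 1.946 \\ \vdots & \vdots \\ 0.182 & 1.030 \end{pmatrix}$.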


Center of a compositional data set

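• The center of a set $W$ of $n$ compositions $w_1, \ldots, w_n$ of $\mathcal{C}^{D-1}$ is the composition $g$ defined by

$\mathrm{cen}\, W = g = \left(\tfrac{1}{n} \otimes w_1\right) \oplus \cdots \oplus \left(\tfrac{1}{n} \otimes w_n\right)$.

This center is equal to

$g = \mathrm{ccl}\left(\left(\prod_{i=1}^{n} w_{i1}\right)^{1/n}, \ldots, \left(\prod_{i=1}^{n} w_{iD}\right)^{1/n}\right)'$.

• It satisfies

$\mathrm{clr}\, g = \bar{z} = \sum_{i=1}^{n} \tfrac{1}{n} z_i = \sum_{i=1}^{n} \tfrac{1}{n}\, \mathrm{clr}\, w_i$

and

$\mathrm{alr}_D\, g = \bar{y} = \sum_{i=1}^{n} \tfrac{1}{n} y_i = \sum_{i=1}^{n} \tfrac{1}{n}\, \mathrm{alr}_D\, w_i$.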



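Properties

$\mathrm{cen}\, W = \mathrm{ccl}\left(\left(\prod_{i=1}^{n} w_{i1}\right)^{1/n}, \ldots, \left(\prod_{i=1}^{n} w_{iD}\right)^{1/n}\right)'$.

• $\mathrm{cen}\, W = \arg\min_{\xi \in \mathcal{C}^{D-1}} \left\{ \frac{d_C^2(w_1, \xi) + \cdots + d_C^2(w_n, \xi)}{n} \right\}$.

• $\mathrm{cen}\,\{p \oplus W\} = p \oplus \mathrm{cen}\, W$, where $p \in \mathcal{C}^{D-1}$.

• $\mathrm{cen}\,\{t \otimes W\} = t \otimes \mathrm{cen}\, W$, where $t \in \mathbb{R}$.

• $\mathrm{cen}\,\{W \oplus W^*\} = \mathrm{cen}\, W \oplus \mathrm{cen}\, W^*$.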



Example AFM composition of 23 aphyric Skye lavas

* Compositional ("geometric") center:

$g = \mathrm{ccl}\,(25.85, 56.65, 17.50)'$.

* "Arithmetic" center:

$a = \mathrm{ccl}\,(26.83, 53.74, 19.43)'$.


Centering

To "center" a compositional data set $w_1, \ldots, w_n$ with center $g$, it suffices to consider the new data set $w^*_1 = g^{-1} \oplus w_1, \ldots, w^*_n = g^{-1} \oplus w_n$.

Obviously, the center of the new "centered" data set $w^*_1, \ldots, w^*_n$ is $\mathrm{ccl}\,(1/D, \ldots, 1/D)'$.
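A sketch of the center and of the centering operation, using only three of the Skye lava samples listed earlier (the full 23-sample data set is not reproduced in these slides, so the result is not the center quoted above). Function names are illustrative.

```python
import numpy as np

def closure(W):
    """Close each row (or a single vector) to sum 1."""
    W = np.asarray(W, dtype=float)
    return W / W.sum(axis=-1, keepdims=True)

def center(W):
    """Compositional center: closed vector of column-wise geometric means."""
    W = np.asarray(W, dtype=float)
    return closure(np.exp(np.log(W).mean(axis=0)))

def center_data(W):
    """Perturb every row by the inverse of the center: g^{-1} (+) w_i."""
    W = np.asarray(W, dtype=float)
    return closure(W / center(W))

W = np.array([[52, 42, 6],
              [52, 44, 4],
              [47, 48, 5]], dtype=float)

g = center(W)
W_centered = center_data(W)
g_centered = center(W_centered)   # neutral element (1/3, 1/3, 1/3)
```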


Compositional covariance structure

• Variation matrix

$T = [\tau_{jk}] = \left[\mathrm{var}\left\{\log \frac{w_{(j)}}{w_{(k)}}\right\}\right]$.

* $\tau_{jk} = 0$ means a perfect relationship between $w_{(j)}$ and $w_{(k)}$ in the sense that the ratio $w_{(j)}/w_{(k)}$ is constant.

* The larger the value of $\tau_{jk}$, the greater the departure from proportionality between $w_{(j)}$ and $w_{(k)}$.

* A measure of the degree of proportionality between two parts $j$ and $k$ is given by $\exp(-\sqrt{\tau_{jk}})$. In this way, $\exp(-\sqrt{\tau_{jk}}) = 0$ means zero proportionality, and $\exp(-\sqrt{\tau_{jk}}) = 1$ means perfect proportionality.

* The variation matrix of any subcomposition is simply obtained by picking out of $T$ all the logratio variances $\tau_{jk}$ associated with the parts $j$ and $k$ of the subcomposition.



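• Logratio covariance matrix

$\Sigma = [\sigma_{jk}] = \left[\mathrm{cov}\{y_{(j)}, y_{(k)}\}\right] = \left[\mathrm{cov}\left\{\log \frac{w_{(j)}}{w_{(D)}}, \log \frac{w_{(k)}}{w_{(D)}}\right\}\right]$,

where

$y_{(j)} = \left(\log \frac{w_{1j}}{w_{1D}}, \ldots, \log \frac{w_{nj}}{w_{nD}}\right)'$,

for $j = 1, \ldots, D - 1$.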



• Centered covariance matrix

$\Gamma = [\gamma_{jk}] = \left[\mathrm{cov}\{z_{(j)}, z_{(k)}\}\right]$,

where

$z_{(j)} = (\log\,(w_{1j}/g(w_1)), \ldots, \log\,(w_{nj}/g(w_n)))'$,

for $j = 1, \ldots, D$.

Hint. The correlation $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$ is not a measure of a relationship between parts $j$ and $k$ because it is subcompositionally incoherent.

• Total (relative) variability

$\mathrm{totvar}_C\{W\} = \sum_{i=1}^{n} \tfrac{1}{n}\, d_C^2(w_i, g) = \mathrm{trace}\,\{\Gamma\} = \tfrac{1}{2D}\, 1_D' T 1_D = \tfrac{1}{D} \sum_{i<j} \tau_{ij}$.



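• The centered covariance matrix $\Gamma = [\gamma_{jk}]$ is singular because $\sum_{k=1}^{D} \gamma_{jk} = 0$ ($j = 1, \ldots, D$).

• The relationships between the three covariance matrices $T$, $\Sigma$ and $\Gamma$ are linear.

• The dimensionality of the covariance structure of a compositional raw data matrix from $\mathcal{C}^{D-1}$ is equal to $\tfrac{1}{2} D(D - 1)$.

• The covariance matrix $T$, and also $\Sigma$ and $\Gamma$, is coherent with the algebraic structure of $(\mathcal{C}^{D-1}, \oplus, \otimes)$, i.e.,

$T\{p \oplus W\} = T\{W\}$ and $T\{\lambda \otimes W\} = \lambda^2\, T\{W\}$,

where $W$ is a compositional raw data matrix from $\mathcal{C}^{D-1}$, $p \in \mathcal{C}^{D-1}$ and $\lambda \in \mathbb{R}$.

Therefore,

$\mathrm{totvar}_C\{p \oplus W\} = \mathrm{totvar}_C\{W\}$;

$\mathrm{totvar}_C\{\lambda \otimes W\} = \lambda^2\, \mathrm{totvar}_C\{W\}$.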



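Example AFM composition of 23 Skye lavas

* Variation matrix

$T = \begin{pmatrix} 0 & 0.251 & 1.144 \\ 0.251 & 0 & 0.350 \\ 1.144 & 0.350 & 0 \end{pmatrix}$.

* Logratio covariance matrices

$\Sigma_A = \begin{pmatrix} 0.251 & 0.523 \\ 0.523 & 1.144 \end{pmatrix}$, $\Sigma_F = \begin{pmatrix} 0.251 & -0.271 \\ -0.271 & 0.350 \end{pmatrix}$, $\Sigma_M = \begin{pmatrix} 1.144 & 0.622 \\ 0.622 & 0.350 \end{pmatrix}$.

* Centered covariance matrix

$\Gamma = \begin{pmatrix} 0.271 & 0.013 & -0.284 \\ 0.013 & 0.007 & -0.020 \\ -0.284 & -0.020 & 0.304 \end{pmatrix}$.

* Total variability: $\mathrm{totvar}_C = 0.582$.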
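The linear relationship between $T$ and $\Gamma$ can be made explicit. The sketch below uses the identity $\Gamma = -\tfrac{1}{2} H_D T H_D$, which follows from $\tau_{jk} = \gamma_{jj} + \gamma_{kk} - 2\gamma_{jk}$ and the zero row sums of $\Gamma$; this identity is not stated on the slides and is offered as a worked consequence. Function names are illustrative and variances use the divisor n − 1.

```python
import numpy as np

def variation_matrix(W):
    """T[j, k] = sample variance of log(w_j / w_k) over the rows of W."""
    L = np.log(np.asarray(W, dtype=float))
    diffs = L[:, :, None] - L[:, None, :]    # n x D x D array of logratios
    return diffs.var(axis=0, ddof=1)

def clr_covariance(W):
    """Gamma: sample covariance of the clr-transformed rows."""
    L = np.log(np.asarray(W, dtype=float))
    Z = L - L.mean(axis=1, keepdims=True)
    return np.cov(Z, rowvar=False)

def gamma_from_T(T):
    """Gamma = -1/2 * H T H, with H the centering matrix."""
    D = T.shape[0]
    H = np.eye(D) - np.ones((D, D)) / D
    return -0.5 * H @ T @ H

# Variation matrix of the Skye lavas example
T_skye = np.array([[0.0, 0.251, 1.144],
                   [0.251, 0.0, 0.350],
                   [1.144, 0.350, 0.0]])
Gamma_skye = gamma_from_T(T_skye)
totvar_skye = T_skye.sum() / (2 * T_skye.shape[0])   # (1/D) * sum over j<k
```

Up to rounding, `Gamma_skye` reproduces the centered covariance matrix of the example and `totvar_skye` the total variability 0.582.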


Biplots

In general, the biplot is a simultaneous representation of the rows (observations) and columns (variables) of an $n \times p$ matrix $X$ by means of a rank-2 approximation.

Usually, biplot analysis starts by performing some transformations on $X$, depending on the nature of the data, to obtain a transformed matrix $Z$, which is the one that is actually displayed.


Biplots

• The singular value decomposition (SVD) of $Z$ provides a decomposition of this matrix:

$Z = [u_1 : \ldots : u_r]\ \mathrm{diag}\{\lambda_1, \ldots, \lambda_r\}\ [v_1 : \ldots : v_r]'$,

where

* $r$ is the rank of $Z$;

* $u_1, \ldots, u_r$ are the standardized eigenvectors of $ZZ'$;

* $v_1, \ldots, v_r$ are the standardized eigenvectors of $Z'Z$;

* and $\lambda_1, \ldots, \lambda_r$ are the positive singular values in decreasing order, i.e., the square roots of the corresponding positive eigenvalues of $Z'Z$.

• From this SVD of $Z$, and using only the first two singular vectors, a rank-2 approximation $\tilde{Z}$ is obtained:

$\tilde{Z} = [u_1 : u_2]\ \mathrm{diag}\{\lambda_1, \lambda_2\}\ [v_1 : v_2]'$.


Biplots

• Then $\tilde{Z}$ decomposes as

$\tilde{Z} = F G'$, with $F = [\lambda_1^{\alpha} u_1 : \lambda_2^{\alpha} u_2]$ and $G = [\lambda_1^{1-\alpha} v_1 : \lambda_2^{1-\alpha} v_2]$,

where $\alpha$ is an arbitrary constant.

• The biplot represents simultaneously in $\mathbb{R}^2$ the rows of $F$, which provide the coordinates of $n$ points (in correspondence with the $n$ rows/observations of $Z$), and the rows of $G$, which provide the coordinates of $p$ points (in correspondence with the columns/variables of $Z$).

Conventionally, the biplot depicts the variables by rays and the observations by points.

Depending on the constant $\alpha$, the biplot favours the display of rows (observations) or columns (variables). For $\alpha = 0$, the biplot is called the covariance biplot. In this case, the display of variables is favoured.


Biplots

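• Singular value decomposition (SVD) of $Z$:

$Z = [u_1 : \ldots : u_r]\ \mathrm{diag}\{\lambda_1, \ldots, \lambda_r\}\ [v_1 : \ldots : v_r]'$.

• Rank-2 approximation $\tilde{Z}$:

$\tilde{Z} = F G'$, with $F = [\lambda_1^{\alpha} u_1 : \lambda_2^{\alpha} u_2]$ and $G = [\lambda_1^{1-\alpha} v_1 : \lambda_2^{1-\alpha} v_2]$.

• With $\lambda_1, \ldots, \lambda_r$ the singular values of $Z$ (so that $\lambda_i^2$ are the eigenvalues of $Z'Z$), the ratio

$\frac{\lambda_1^2 + \lambda_2^2}{\lambda_1^2 + \cdots + \lambda_r^2}$

is a measure of the proportion of the "variability" of $Z$ captured by the biplot.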
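The construction of the biplot coordinates can be sketched with `numpy.linalg.svd`. The function name `biplot_coordinates` is illustrative, and the explained proportion is computed from squared singular values, i.e. taking the eigenvalues to be those of $Z'Z$.

```python
import numpy as np

def biplot_coordinates(Z, alpha=0.0):
    """Row markers F, column markers G and explained proportion of a
    rank-2 biplot of Z. alpha = 0 gives the covariance biplot."""
    Z = np.asarray(Z, dtype=float)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)  # s in decreasing order
    F = U[:, :2] * s[:2] ** alpha             # n x 2, one row per observation
    G = Vt[:2].T * s[:2] ** (1.0 - alpha)     # p x 2, one row per variable
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    return F, G, explained

# A matrix of exact rank 2, so the rank-2 approximation is exact
rng = np.random.default_rng(7)
Z = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))
F, G, explained = biplot_coordinates(Z, alpha=0.0)
```

Whatever the value of `alpha`, the product `F @ G.T` is the same rank-2 approximation; `alpha` only changes how the singular values are shared between row and column markers.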


Relative variation diagrams

Definition The relative variation diagram of a compositional data set $w_1, \ldots, w_n$ of $\mathcal{C}^{D-1}$ is the covariance biplot of the matrix $Z_c$ which we obtain after centering the $D$ columns of the centered logratio matrix $Z$.

• Elements

* Origin, labeled O.

* Vertices, one for each of the $D$ parts (variables/columns) of the compositions, labeled $1, \ldots, j, \ldots, D$.

* Case markers, one for each of the $n$ observations (rows), labeled $1, \ldots, i, \ldots, n$.

* Ray: the join $Oj$ of the origin O to a vertex $j$.

* Link: the join $jk$ of two vertices $j$ and $k$.





• The vertices and case markers are both centered at the origin O.

• Rays and inter-ray angles represent the centered logratio covariance matrix $\Gamma$:

$|Oj|^2 = \gamma_{jj}$ = estimate of $\mathrm{var}\{z_{(j)}\}$,

$|Oj| \cdot |Ok| \cos(jOk) = \gamma_{jk}$ = estimate of $\mathrm{cov}\{z_{(j)}, z_{(k)}\}$,

so that

$\cos(jOk)$ = estimate of $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$.

Hint. Remember that the correlation $\mathrm{corr}\{z_{(j)}, z_{(k)}\}$ is not a measure of a relationship between parts $j$ and $k$ because it is subcompositionally incoherent.



• The squared lengths of the links represent the set of estimated relative variances:

$|jk|^2 = \tau_{jk}$ = estimate of $\mathrm{var}\left\{\log \frac{w_{(j)}}{w_{(k)}}\right\}$.

Therefore, if two vertices $j$ and $k$ coincide or are close together, then components $w_{(j)}$ and $w_{(k)}$ are in constant proportion or nearly so.

[Figure: origin O with two nearly coincident vertices j and k.]



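• Links $jl$ and $kl$, with a common vertex $l$, represent the estimated logratio covariance matrix $\Sigma_l$:

$|jl| \cdot |kl| \cos(jlk) \approx \mathrm{cov}\left\{\log \frac{w_{(j)}}{w_{(l)}}, \log \frac{w_{(k)}}{w_{(l)}}\right\}$,

so that

$\cos(jlk) \approx \mathrm{corr}\left\{\log \frac{w_{(j)}}{w_{(l)}}, \log \frac{w_{(k)}}{w_{(l)}}\right\}$.

[Figure: origin O and links jl and kl meeting at the common vertex l.]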



• If the links $jk$ and $lm$ intersect at $R$, then

$|\cos(jRm)| \approx \left|\mathrm{corr}\left\{\log \frac{w_{(j)}}{w_{(k)}}, \log \frac{w_{(l)}}{w_{(m)}}\right\}\right|$.

Therefore, if two links $jk$ and $lm$ intersect at right angles, then the logratios $\log(w_{(j)}/w_{(k)})$ and $\log(w_{(l)}/w_{(m)})$ will be uncorrelated and, within the context of logistic normality, independent, i.e., subcompositions $(j, k)$ and $(l, m)$ are independent.

[Figure: origin O and links jk and lm intersecting at R.]



• The relative variation diagram for any subcomposition $S$ is simply the subdiagram formed by selecting the vertices corresponding to the parts of the subcomposition and taking the centroid $O_S$ of these vertices as the center of the subcompositional biplot.

Therefore, if a subset of vertices, say $1, \ldots, C$, is approximately collinear, then the associated subcomposition has a "compositional" one-dimensional structure.

[Figure: vertices j, k, l with origin O and subcompositional centroid $O_S$; a second diagram with approximately collinear vertices j, k, l.]


Volcano H

• Parts: 1 = Cl; 2 = K2O; 3 = P2O5; 4 = TiO2; 5 = SiO2.

• Variation matrix $T$

$T = \begin{pmatrix} 0 & 2.784 & 4.134 & 3.970 & 2.966 \\ 2.784 & 0 & 0.647 & 0.645 & 0.146 \\ 4.134 & 0.647 & 0 & 0.071 & 0.304 \\ 3.970 & 0.645 & 0.071 & 0 & 0.249 \\ 2.966 & 0.146 & 0.304 & 0.249 & 0 \end{pmatrix}$

• Centered covariance matrix $\Gamma$

$\Gamma = \begin{pmatrix} 2.134 & -0.221 & -0.803 & -0.743 & -0.368 \\ -0.221 & 0.208 & -0.022 & -0.043 & 0.079 \\ -0.803 & -0.022 & 0.394 & 0.337 & 0.094 \\ -0.743 & -0.043 & 0.337 & 0.350 & 0.099 \\ -0.368 & 0.079 & 0.094 & 0.099 & 0.096 \end{pmatrix}$




Dimension-reducing techniques

Compositional PCA

Given a set of compositions $w_1, \ldots, w_n$ of $\mathcal{C}^{D-1}$ with center $g$, the PCA starts by looking for a direction, determined by a C-unitary composition $c_1$, such that the total variability of the C-orthogonal projections of $w_1, \ldots, w_n$ on the compositional straight line through $g$ with direction $c_1$ is maximum. And so on.

Property The compositional principal components of a set of compositions $w_1, \ldots, w_n$ of $\mathcal{C}^{D-1}$ can be determined from the standard principal components of the clr-transformed observations $\mathrm{clr}\, w_1, \ldots, \mathrm{clr}\, w_n$.


Compositional PCA

In this manner, the positive eigenvalues $\lambda_1 \ge \ldots \ge \lambda_{D-1}$ of the centered logratio covariance matrix $\Gamma$ give the decomposition of $\mathrm{totvar}_C$, and the corresponding unitary eigenvectors $z^*_1, \ldots, z^*_{D-1}$ determine the corresponding directions $\mathrm{clr}^{-1} z^*_1, \ldots, \mathrm{clr}^{-1} z^*_{D-1}$ of the principal axes.
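A minimal sketch of this procedure, assuming the data are given as a matrix of positive rows. The function name is illustrative, and the data below are simulated only to exercise the code.

```python
import numpy as np

def compositional_pca(W):
    """Eigen-decomposition of the clr covariance matrix Gamma.

    Returns the D-1 leading eigenvalues (decreasing), the unit eigenvectors
    in clr space (columns) and the principal axes as closed compositions.
    """
    L = np.log(np.asarray(W, dtype=float))
    Z = L - L.mean(axis=1, keepdims=True)        # clr-transformed rows
    Gamma = np.cov(Z, rowvar=False)
    evals, evecs = np.linalg.eigh(Gamma)         # ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    evals, evecs = evals[:-1], evecs[:, :-1]     # drop the null direction 1_D
    axes = np.exp(evecs.T)
    axes = axes / axes.sum(axis=1, keepdims=True)  # clr^{-1} of each eigenvector
    return evals, evecs, axes

rng = np.random.default_rng(11)
W = rng.lognormal(size=(40, 4))
evals, evecs, axes = compositional_pca(W)
share = evals / evals.sum()    # proportion of totvar_C per component
```

Each row of `axes` is a composition summing to 1, which is the form in which principal axes are usually reported.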


Skye Lavas

PC1: 0.4436 log A− 0.8154 log F + 0.3719 log M ≈ −0.7849


Volcano H

Parts: (Cl, K2O, P2O5, TiO2, SiO2)

PC1: (0.4246, 0.1656, 0.1264, 0.1296, 0.1538)′ (87.8%)

PC2: (0.1469, 0.3836, 0.1230, 0.1182, 0.2293)′ (cumulative 98.1%)


Dimension-reducing techniques

Subcomposition analysis

Let $w_1, \ldots, w_n$ be a compositional data set of $\mathcal{C}^{D-1}$, and let $\mathrm{sub}_S\, w_1, \ldots, \mathrm{sub}_S\, w_n$ be the set of the corresponding subcompositions of $\mathcal{C}^{C-1}$ associated to a subset $S$ of the parts $1, \ldots, D$. Then, the ratio

$\frac{\mathrm{totvar}_C\{\mathrm{sub}_S\, w_1, \ldots, \mathrm{sub}_S\, w_n\}}{\mathrm{totvar}_C\{w_1, \ldots, w_n\}}$

gives the proportion of total variability retained by the subcompositions.

If the purpose of subcompositional analysis is to retain as much variability as possible for a given number $C$ of parts, then we have to search for subcompositions of this size which maximize this ratio.




Subcomposition analysis

Example Percentage of Cl, K2O, P2O5, TiO2 and SiO2 in 46 samples of volcanic rocks from an anonymous volcano H

* Total variability: $\mathrm{totvar}_C = 3.1829$.

* Total variability of 3-part subcompositions:

| Subcomposition | Percentage |
|---|---|
| P2O5, TiO2, SiO2 | 6.53% |
| K2O, TiO2, SiO2 | 10.90% |
| K2O, P2O5, SiO2 | 11.48% |
| K2O, P2O5, TiO2 | 14.27% |
| Cl, K2O, SiO2 | 61.74% |
| Cl, TiO2, SiO2 | 75.25% |
| Cl, K2O, TiO2 | 77.48% |
| Cl, P2O5, SiO2 | 77.53% |
| Cl, K2O, P2O5 | 79.21% |
| Cl, P2O5, TiO2 | 85.61% |
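The percentages can be recovered from the variation matrix of Volcano H given earlier, because the total variability of a subcomposition is the sum of its logratio variances divided by its number of parts. The sketch below searches all 3-part subcompositions; names are illustrative and small differences from the table are due to rounding of $T$.

```python
import numpy as np
from itertools import combinations

parts = ["Cl", "K2O", "P2O5", "TiO2", "SiO2"]
# Variation matrix T of Volcano H (rounded to three decimals on the slide)
T = np.array([[0.000, 2.784, 4.134, 3.970, 2.966],
              [2.784, 0.000, 0.647, 0.645, 0.146],
              [4.134, 0.647, 0.000, 0.071, 0.304],
              [3.970, 0.645, 0.071, 0.000, 0.249],
              [2.966, 0.146, 0.304, 0.249, 0.000]])

def totvar(T, idx):
    """Total variability of the subcomposition with part indices idx."""
    idx = list(idx)
    sub = T[np.ix_(idx, idx)]
    return sub.sum() / (2 * len(idx))   # each pair is counted twice in sub

total = totvar(T, range(len(parts)))
ratios = {idx: totvar(T, idx) / total for idx in combinations(range(5), 3)}
best = max(ratios, key=ratios.get)
best_names = [parts[i] for i in best]   # ['Cl', 'P2O5', 'TiO2']
```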


Zeros in compositional data

• Logratio methodology is incompatible with compositions that have zeros in one or more parts.

• Two kinds of zeros:

* Essential zeros: part completely absent.

* Rounded zeros: no quantifiable proportion has been recorded.

• Treatment of essential zeros:

* Is it suitable to amalgamate some parts?

* Pre-classification: create initial groups according to the number and location of zeros, and analyze each group separately.

• Treatment of rounded zeros:

* Consider the zero values as missing values.

* Imputation: replace zero values by a small amount using non-parametric or parametric techniques.

* Apply the log-ratio methodology to the replaced observations of the resulting data set.


Rounded zeros

Multiplicative replacement

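Let $\mathbf{w} = \mathrm{ccl}\,(w_1, \dots, w_D)' \in \mathcal{C}^{D-1}$ be any composition with some $w_j = 0$ (rounded zero).

The multiplicative replacement replaces $\mathbf{w}$ by the composition $\mathbf{w}^{(r)} = \mathrm{ccl}\,(w^{(r)}_1, \dots, w^{(r)}_D)'$ defined by

$$w^{(r)}_j = \begin{cases} \delta_j, & \text{if } w_j = 0, \\[4pt] w_j \Big(1 - \sum_{l:\, w_l = 0} \delta_l\Big), & \text{if } w_j \neq 0, \end{cases}$$

where the $\delta_j$ are the "small" values replacing the zero parts.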
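A minimal sketch of the multiplicative replacement, assuming the composition is closed to unit sum (as in the formula above) and that the small values are supplied by the user. The function name and the example values are hypothetical.

```python
def multiplicative_replacement(w, deltas):
    """Replace rounded zeros in `w` by the matching entries of `deltas`.

    `w` is closed to unit sum first; `deltas[j]` is only used where w[j] == 0.
    Non-zero parts are shrunk by the common factor 1 - (sum of used deltas),
    so the result still sums to 1.
    """
    total = sum(w)
    closed = [x / total for x in w]
    zero_mass = sum(d for x, d in zip(closed, deltas) if x == 0)
    return [d if x == 0 else x * (1.0 - zero_mass)
            for x, d in zip(closed, deltas)]


# Hypothetical example: one rounded zero, replaced by 0.005.
w = [0.6, 0.0, 0.3, 0.1]
r = multiplicative_replacement(w, [0.005] * 4)
print(r, sum(r))
```

Because every non-zero part is multiplied by the same factor, the ratio between any two non-zero parts is the same before and after the replacement.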


Rounded zeros

Multiplicative replacement

• It is a "natural" replacement.

• The ratio between two non-zero parts is preserved.

• It is compatible with subcompositions, perturbation and power transformation.

• The covariance structure of subcompositions with no zeros is preserved.


Modeling compositional data

In practice, many of the probability density functions (pdf) on the compositional space $\mathcal{C}^{D-1}$ will be defined from a pdf on the real space $\mathbb{R}^{D-1}$. The $\mathrm{alr}_j^{-1}$ transformations then induce the corresponding pdf on the simplex $\mathcal{S}^D$.

The most important classes of pdf on $\mathcal{C}^{D-1}$ are:

• The Dirichlet class.

• The (additive) logistic normal class.

• The (additive) logistic skew-normal class.

Definition. A random composition $\mathbf{w}$ on $\mathcal{C}^{D-1}$ is said to have an additive logistic normal (aln) distribution with parameters $\mu$ and $\Sigma$, written $\mathbf{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$, if the random vector $\mathbf{y} = \mathrm{alr}_D\,\mathbf{w} = \log(\mathbf{w}_{-D}/w_D)$ has a $\mathcal{N}^{D-1}(\mu, \Sigma)$ distribution on $\mathbb{R}^{D-1}$.
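The definition suggests a direct way to simulate an aln composition: draw a normal vector and map it back with the inverse alr. The sketch below is one possible illustration in plain Python; the helper names, the hand-written Cholesky factorization and the parameter values are assumptions made for the example.

```python
import math
import random


def alr(w):
    """alr_D: logratios of the first D-1 parts against the last part."""
    return [math.log(x / w[-1]) for x in w[:-1]]


def alr_inv(y):
    """Inverse alr: maps a real vector of length D-1 to a composition."""
    expy = [math.exp(v) for v in y] + [1.0]
    total = sum(expy)
    return [v / total for v in expy]


def cholesky(s):
    """Lower-triangular L with L L' = s (s symmetric positive definite)."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s[i][i] - acc)
            else:
                low[i][j] = (s[i][j] - acc) / low[j][j]
    return low


def sample_aln(mu, sigma, rng):
    """One draw: y ~ N(mu, sigma) on R^(D-1), then w = alr_inv(y)."""
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    y = [m + sum(low[i][k] * z[k] for k in range(i + 1))
         for i, m in enumerate(mu)]
    return alr_inv(y)


# Hypothetical parameters for a 3-part composition (D - 1 = 2).
rng = random.Random(0)
draw = sample_aln([0.5, -0.2], [[1.0, 0.3], [0.3, 0.5]], rng)
print(draw, sum(draw))
```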


Logistic normal distributions on CD−1

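Property. Let $\mathbf{w}$ be a random vector on $\mathcal{C}^{D-1}$. If $\mathrm{alr}_D\,\mathbf{w} \sim \mathcal{N}^{D-1}(\mu, \Sigma)$, then all the other logratio random vectors $\mathrm{alr}_j\,\mathbf{w}$ ($j = 1, \dots, d$) are normally distributed.

Property. Let $\mathbf{w}$ be a random composition on $\mathcal{C}^{D-1}$, and let $\mathbf{w}_S$ be the random subcomposition on $\mathcal{C}^{C-1}$ corresponding to a subset $S$ of $C$ parts of $\mathbf{w}$. If $\mathbf{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$, then $\mathbf{w}_S \sim \mathcal{L}^{C-1}(\mu_S, \Sigma_S)$, where $\mu_S$ and $\Sigma_S$ can be easily calculated from $\mu$ and $\Sigma$.

Property. Let $\mathbf{w}$ be a random vector on $\mathcal{C}^{D-1}$ with $\mathbf{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$. If we perturb $\mathbf{w}$ by a constant composition $\mathbf{p} \in \mathcal{C}^{D-1}$, then the perturbed random vector $\mathbf{p} \oplus \mathbf{w} \sim \mathcal{L}^{D-1}(\mu + \mathrm{alr}_D\,\mathbf{p}, \Sigma)$.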
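The perturbation property rests on the identity alr(p ⊕ w) = alr p + alr w: in alr coordinates a perturbation is a translation, so the mean shifts by alr p and the covariance is unchanged. The following sketch checks the identity numerically; the helper names and the two compositions are hypothetical.

```python
import math


def closure(x):
    """Rescale a positive vector to unit sum."""
    total = sum(x)
    return [v / total for v in x]


def perturb(p, w):
    """Perturbation p (+) w: closed component-wise product."""
    return closure([a * b for a, b in zip(p, w)])


def alr(w):
    """alr_D: logratios of the first D-1 parts against the last part."""
    return [math.log(x / w[-1]) for x in w[:-1]]


p = [0.2, 0.3, 0.5]
w = [0.6, 0.3, 0.1]
lhs = alr(perturb(p, w))
rhs = [a + b for a, b in zip(alr(p), alr(w))]
print(lhs, rhs)
```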


Logistic normal distributions on CD−1

Estimation of parameters

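To estimate the parameters $\mu$ and $\Sigma$ of a random composition $\mathbf{w} \sim \mathcal{L}^{D-1}(\mu, \Sigma)$ from a random sample $\mathbf{w}_1, \dots, \mathbf{w}_n$ of $\mathbf{w}$, we estimate by standard procedures the mean vector and the covariance matrix of a multivariate normal distribution from the $\mathrm{alr}_D$-transformed random sample

$$\mathbf{y}_1 = \mathrm{alr}_D\,\mathbf{w}_1, \; \dots, \; \mathbf{y}_n = \mathrm{alr}_D\,\mathbf{w}_n.$$

The maximum likelihood estimates of $\mu$ and $\Sigma$ are given by

$$\hat\mu_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij}, \qquad \hat\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} (y_{ij} - \hat\mu_j)(y_{ik} - \hat\mu_k),$$

for $j, k = 1, \dots, D-1$.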
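As an illustration, the two estimators can be computed directly from the alr-transformed sample. The function names and the three-observation sample below are hypothetical; the sample is built so that its alr coordinates are (0, 0), (2, 0) and (1, 3).

```python
import math


def alr(w):
    """alr_D: logratios of the first D-1 parts against the last part."""
    return [math.log(x / w[-1]) for x in w[:-1]]


def aln_mle(sample):
    """Maximum likelihood estimates (divisor n) of mu and Sigma
    from the alr-transformed sample."""
    ys = [alr(w) for w in sample]
    n, d = len(ys), len(ys[0])
    mu = [sum(col) / n for col in zip(*ys)]
    sigma = [[sum((y[j] - mu[j]) * (y[k] - mu[k]) for y in ys) / n
              for k in range(d)]
             for j in range(d)]
    return mu, sigma


# alr is scale invariant, so the observations need not be closed.
sample = [
    [1.0, 1.0, 1.0],
    [math.exp(2.0), 1.0, 1.0],
    [math.exp(1.0), math.exp(3.0), 1.0],
]
mu_hat, sigma_hat = aln_mle(sample)
print(mu_hat, sigma_hat)
```

Note the divisor n: this is the maximum likelihood estimate, not the unbiased estimate with divisor n − 1.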


Predictive regions

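Definition. Let $\mathbf{w}$ be a random composition with an aln distribution on $\mathcal{C}^{D-1}$. If $\hat\mu$ and $\hat\Sigma$ are the estimates of the unknown parameters of $\mathbf{w}$ from a random sample of size $n$, the $1-\alpha$ predictive region is defined as

$$\left\{ \mathbf{w}^* \in \mathcal{C}^{D-1} : (\mathrm{alr}_D\,\mathbf{w}^* - \hat\mu)'\, \hat\Sigma^{-1}\, (\mathrm{alr}_D\,\mathbf{w}^* - \hat\mu) \le r^2 \right\},$$

where $r^2$ is a real number such that

$$\mathrm{Prob}\left[ F_{D-1,\, n-(D-1)} \le \frac{n\,(n-(D-1))}{(n^2-1)(D-1)}\; r^2 \right] = 1 - \alpha.$$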
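The condition on $r^2$ can be solved explicitly. Writing $F_{1-\alpha}(\nu_1, \nu_2)$ for the $1-\alpha$ quantile of the $F$ distribution with $\nu_1$ and $\nu_2$ degrees of freedom (this quantile notation is an assumption introduced here, not taken from the slides), the radius is

$$r^2 = \frac{(n^2-1)(D-1)}{n\,(n-(D-1))}\; F_{1-\alpha}\big(D-1,\; n-(D-1)\big).$$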

Figure: 99% predictive region for the Skye lavas data.


Atypicality index

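Definition. If a random composition $\mathbf{w}$ on $\mathcal{C}^{D-1}$ is $\mathcal{L}^{D-1}(\mu, \Sigma)$ distributed, the atypicality index of a composition $\mathbf{w}^* \in \mathcal{C}^{D-1}$ in relation to the random composition $\mathbf{w}$ is defined as

$$\mathrm{Prob}\left[ \chi^2_{D-1} \le (\mathrm{alr}_D\,\mathbf{w}^* - \mu)'\, \Sigma^{-1}\, (\mathrm{alr}_D\,\mathbf{w}^* - \mu) \right].$$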
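For known parameters, the index can be evaluated with the standard library alone. The sketch below is a hedged illustration: the quadratic form is obtained by solving a linear system, and the chi-square distribution function is computed from the power series of the regularized lower incomplete gamma function, which is adequate for moderate arguments. All function names and example values are hypothetical.

```python
import math


def alr(w):
    """alr_D: logratios of the first D-1 parts against the last part."""
    return [math.log(x / w[-1]) for x in w[:-1]]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def chi2_cdf(x, df, max_terms=10000):
    """P(chi2_df <= x) from the series of the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= z / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def atypicality_index(w_star, mu, sigma):
    """Chi-square probability of the squared Mahalanobis distance
    between alr(w_star) and mu (parameters assumed known)."""
    d = [y - m for y, m in zip(alr(w_star), mu)]
    q = sum(di * xi for di, xi in zip(d, solve(sigma, d)))
    return chi2_cdf(q, len(mu))


# Hypothetical example with D = 3, mu = 0 and Sigma = identity.
identity = [[1.0, 0.0], [0.0, 1.0]]
index = atypicality_index([math.e, math.e, 1.0], [0.0, 0.0], identity)
print(index)
```

An index close to 1 marks a composition that is atypical with respect to the distribution; the centre of the distribution has index 0.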

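Definition. Let $\mathbf{w}$ be a random composition with an aln distribution on $\mathcal{C}^{D-1}$. If $\hat\mu$ and $\hat\Sigma$ are the estimates of the unknown parameters of $\mathbf{w}$ from a random sample $\mathbf{w}_1, \dots, \mathbf{w}_n$ of size $n$, the atypicality index of a composition $\mathbf{w}^* \in \mathcal{C}^{D-1}$ in relation to the compositional data set $\mathbf{w}_1, \dots, \mathbf{w}_n$ is defined as

$$\mathrm{Prob}\left[ F_{D-1,\, n-(D-1)} \le k\, (\mathrm{alr}_D\,\mathbf{w}^* - \hat\mu)'\, \hat\Sigma^{-1}\, (\mathrm{alr}_D\,\mathbf{w}^* - \hat\mu) \right],$$

where

$$k = \frac{n\,(n-(D-1))}{(n^2-1)(D-1)}.$$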


Compositional Regression

Arctic lake

Sand, silt, clay composition of 39 sediment samples at different water depths in an Arctic lake:

Sample   Sand   Silt   Clay   Depth
S01      77.5   19.5    3.0    10.4
S02      71.9   24.9    3.2    11.7
...      ...    ...    ...     ...
S39       2.0   47.8   50.2   103.7

a. Is sediment composition dependent on water depth?

b. If so, how can we quantify the extent of the dependence?



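• Compositions $\mathbf{w}_i \in \mathcal{C}^{D-1}$ regressed on a real concomitant $t_i$ ($i = 1, \dots, n$):

$$\mathbf{w}_i = \beta_0 \oplus (t_i \otimes \beta_1) \oplus \varepsilon_i \qquad (i = 1, \dots, n),$$

where

  - $\beta_0$: constant;

  - $\beta_1$: regression coefficient;

  - $\varepsilon_i$ ($i = 1, \dots, n$): errors.

• alr version of the regression model:

$$\mathrm{alr}\,\mathbf{w}_i = \mathrm{alr}\,\beta_0 + t_i\, \mathrm{alr}\,\beta_1 + \mathrm{alr}\,\varepsilon_i \qquad (i = 1, \dots, n),$$

which can be reparametrized as

$$\mathrm{alr}\,\mathbf{w}_i = \alpha_0 + t_i\, \alpha_1 + \varepsilon_i \qquad (i = 1, \dots, n).$$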



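• In the reparametrized model $\mathrm{alr}\,\mathbf{w}_i = \alpha_0 + t_i\, \alpha_1 + \varepsilon_i$ ($i = 1, \dots, n$), the estimates $\hat\alpha_0$ and $\hat\alpha_1$ are obtained by the least squares method. Then

$$\hat\beta_0 = \mathrm{alr}^{-1}\,\hat\alpha_0, \qquad \hat\beta_1 = \mathrm{alr}^{-1}\,\hat\alpha_1.$$

• The error (residual) of $\mathbf{w}_i$ ($i = 1, \dots, n$) is

$$\mathbf{e}_i = \mathbf{w}_i \ominus \hat{\mathbf{w}}_i, \qquad \text{where } \hat{\mathbf{w}}_i = \hat\beta_0 \oplus (t_i \otimes \hat\beta_1).$$

• Sum of squares of errors:

$$\mathrm{SSError} = \sum_{i=1}^{n} \|\mathbf{e}_i\|_C^2 = \sum_{i=1}^{n} \big( d_C(\mathbf{w}_i, \hat{\mathbf{w}}_i) \big)^2.$$

• Proportion of variability explained by the fitted linear regression model:

$$1 - \frac{\mathrm{SSError}}{\mathrm{totvar}_C\{\mathbf{w}_1, \dots, \mathbf{w}_n\}}.$$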
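The fitting steps can be sketched as follows. This is an illustration under stated assumptions rather than the course's own implementation: ordinary least squares is applied to each alr coordinate separately, the Aitchison distance is computed through clr coordinates, and the total variability is taken to be the sum of squared Aitchison distances to the centre. Function names and the synthetic data are hypothetical.

```python
import math


def alr(w):
    """alr_D: logratios of the first D-1 parts against the last part."""
    return [math.log(x / w[-1]) for x in w[:-1]]


def alr_inv(y):
    """Inverse alr: maps a real vector of length D-1 to a composition."""
    expy = [math.exp(v) for v in y] + [1.0]
    total = sum(expy)
    return [v / total for v in expy]


def clr(w):
    """Centred logratio coordinates."""
    logs = [math.log(x) for x in w]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def dist2(x, y):
    """Squared Aitchison distance between two compositions."""
    return sum((a - b) ** 2 for a, b in zip(clr(x), clr(y)))


def fit(ws, ts):
    """Least squares estimates (alpha0, alpha1), one alr coordinate at a time."""
    n = len(ts)
    t_bar = sum(ts) / n
    stt = sum((t - t_bar) ** 2 for t in ts)
    a0, a1 = [], []
    for col in zip(*(alr(w) for w in ws)):
        y_bar = sum(col) / n
        slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, col)) / stt
        a1.append(slope)
        a0.append(y_bar - slope * t_bar)
    return a0, a1


def predict(a0, a1, t):
    """Fitted composition beta0 (+) (t (x) beta1), computed in alr coordinates."""
    return alr_inv([c0 + t * c1 for c0, c1 in zip(a0, a1)])


def explained(ws, ts):
    """1 - SSError / totvar, with totvar as assumed in the lead-in."""
    a0, a1 = fit(ws, ts)
    ss_error = sum(dist2(w, predict(a0, a1, t)) for w, t in zip(ws, ts))
    n = len(ws)
    centre = alr_inv([sum(col) / n for col in zip(*(alr(w) for w in ws))])
    return 1.0 - ss_error / sum(dist2(w, centre) for w in ws)


# Synthetic data lying exactly on a compositional line.
ts = [1.0, 2.0, 3.0, 4.0]
ws = [alr_inv([1.0 + 0.5 * t, -2.0 + 0.25 * t]) for t in ts]
alpha0, alpha1 = fit(ws, ts)
print(alpha0, alpha1, explained(ws, ts))
```

The compositional coefficients follow as `alr_inv(alpha0)` and `alr_inv(alpha1)`, matching the back-transformation in the bullets above.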


Arctic lake



• alr fitted simple linear regression model:

$$\log(\text{sand}/\text{clay}) = 9.697 - 2.743\, \log(\text{depth}) + \varepsilon_1;$$

$$\log(\text{silt}/\text{clay}) = 4.805 - 1.096\, \log(\text{depth}) + \varepsilon_2.$$

• Fitted regression model in $\mathcal{S}^3$:

$$\mathrm{ccl}\,(\text{sand}, \text{silt}, \text{clay})' = (0.992849,\; 0.007145,\; 0.00006)' \oplus \big( \log(\text{depth}) \otimes (0.04604,\; 0.238291,\; 0.71505)' \big).$$

• Proportion of variability explained by the fitted simple linear regression model:

$$1 - \frac{0.7006}{2.4692} = 0.716 \equiv 71.6\%.$$

