Upload
phamkhuong
View
213
Download
0
Embed Size (px)
Citation preview
Statistical Analysis of Compositional Data
Statistical Analysis of
Compositional Data
Carles Barcelo Vidal
J. Antoni Martın Fernandez
Santiago Thio Fdez-Henestrosa
Dept. d’Informatica i Matematica Aplicada
Universitat de Girona
Campus de Montilivi
E-17071 Girona
Catalunya – Spain.
Statistical Analysis of Compositional Data 2
What is compositional data?
Traditionally,
Composition
| |positive vector x = (x1, . . . , xD)′
whose components are subject to
a constant sum restriction:
x1 + . . . + xD = constant.
Compositional data ≡ Closed data
Statistical Analysis of Compositional Data 3
What is compositional data?
A positive vector w = (w1, . . . , wD)′
is compositional when our interest
lies on the relative magnitudes wj/wk of
its parts and not on the absolute values
Scale-invariance property
If a positive vector w = (w1, . . . , wD)′ is
compositional, the vectors w and kw,
with k > 0, give us the same information
Statistical Analysis of Compositional Data 4
USA - President election - 2000
States Bush Gore Others Total
Alabama 943799 696741 26270 1666810
Alaska 136068 64252 30347 230667
......
......
...
Wisconsin 1235035 1240431 114415 2589881
Wyoming 147674 60421 5331 213426
Alabama 56,6% 41,8% 1,6% 100%
Alaska 59,0% 27,9% 13,1% 100%
..
....
..
.... 100%
Wisconsin 47,7% 47,9% 4,4% 100%
Wyoming 69,2% 28,3% 2,5% 100%
Statistical Analysis of Compositional Data 5
Activity patterns of a statistician
Daily time (hours) devoted by an academic statistician to
different activities: te = teaching; co = consultation; ad
= administration; re = research; ot = other wakeful
activities; sl = sleep.
D te co ad re ot sl Total
1 3,5 2,0 4,5 2,5 6,5 5,0 24
2 4,0 2,0 2,5 3,0 6,5 6,0 24
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
19 2,5 2,5 3,0 2,0 5,0 8,5 24
20 2,5 2,0 3,0 3,0 4,0 9,0 24
1 14,4% 8,3% 18,8% 10,4% 27,1% 20,8% 100%
2 16,7% 8,3% 10,4% 2,5% 27,1% 25,0% 100%
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. 100%
19 10,4% 10,4% 12,5% 8,3% 20,8% 35,4% 100%
20 10,5% 8,3% 12,5% 12,5% 16,7% 37,5% 100%
Statistical Analysis of Compositional Data 6
Artic lake
Sand, silt, clay composition (% by weight) of 39sediment samples from an Artic lake
Sample Sand Silt Clay Total
S01 77.5 19.5 3.0 100%
S02 71.9 24.9 3.2 100%
.
.....
.
..... 100%
S39 2.0 47.8 50.2 100%
Statistical Analysis of Compositional Data 7
Volcano H
Percentage of Cl, K2O, P2O5, TiO2 and SiO2 in46 samples of volcanic rocks from a volcano H
Sample Cl K2O P2O5 TiO2 SiO2 Total
1 0.0638 1.83 1.01 3.70 44.99 51.59%
2 0.1116 1.36 0.81 3.83 43.45 49.56%
3 0.0611 1.36 1.09 4.10 42.98 49.59%
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
44 0.0477 1.67 0.96 3.64 45.14 51.46%
45 0.0015 1.70 0.95 3.64 45.15 51.44%
46 0.0986 3.22 0.50 1.87 54.41 60.10%
Statistical Analysis of Compositional Data 8
Halimba boreholes
Percentages of Al2O3, SiO2, Fe2O3, TiO2, H2OCaO and MgO in some samples from differentdrills in Halimba region (Hungary)
Al2O3 SiO2 Fe2O3 TiO2 H2O CaO MgO Total
52,5 6,7 23,6 2,6 12,0 0,2 0,1 97,7%
47,7 4,6 32,1 2,3 12,0 2,0 0,0 100,7%
50,6 8,9 25,4 2,5 11,9 1,1 0,0 100,4%
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Statistical Analysis of Compositional Data 9
The space of compositions
Any D × 1 real vector w = (w1, . . . , wD)′ withpositive components w1, . . . , wD will be called aD-observational vector.
Therefore, the set of these vectors will be IRD+ , the
positive orthant of IRD.
Definition Two D-observational vectors w andw∗ are compositionally equivalent, w ∼ w∗, whenthere exists a positive proportionality constant k
such that w = kw∗.
This relation classifies the vectors of IRD+ in
classes of equivalence, called D-compositions.
The composition generated by an observationalvector w will be symbolized by w, i.e.,
w = {kw : k ∈ IR+}.
Statistical Analysis of Compositional Data 10
Scale invariance
Definition A function f defined on IRD+ is said
to be scale invariant if
f(kw) = f(w) for every w ∈ IRD+ and k ∈ IR+
or equivalently
f(w) = f(w∗) when w ∼ w∗.
Property Any scale invariant function f(w)defined on IRD
+ can be expressed in terms of ratiosof the components w1, . . . , wD of w, such as
w1/wD, . . . wD−1/wD or w1/g(w), . . . , wD/g(w),
where g(w) = (w1w2 . . . wD)1/D is the geometricmean of the components of w.
Property Any function defined on thecompositional space CD−1 arises from a scaleinvariant function defined on the positive realspace IRD
+ .
Statistical Analysis of Compositional Data 11
The space of compositions
A D-part composition can be geometricallyinterpreted as a ray from the origin in the positiveorthant of IRD:
W3
W2
W1
Lccl W
W
W
1
1
1
The set CD−1 of all D-compositions will be calledthe (D − 1)-dimensional compositional space.
The compositional closure mapping from IRD+ to
CD−1 —denoted by ccl — is defined by
cclw = w (w ∈ IRD+).
Statistical Analysis of Compositional Data 12
Representation of a composition
Linear criterion
Definition The linear criterion selects fromeach D-composition w the D-observational vectorw∗ —with components w∗1 , . . . , w∗D— whose sumis equal to 1. If this vector is symbolized bycclL w —or by Cw— then
cclL w = Cw = w/D∑
j=1
wj (w ∈ IRD+).
The set of all the vectors x = cclL w (w ∈ IRD+) is
the well-known (D − 1)-dimensional simplex SD.
Statistical Analysis of Compositional Data 13
Linear criterion
W3
W2
W1
Lccl W
W
W
1
1
1
1
2 3
x2
x3
x1
Statistical Analysis of Compositional Data 14
Representation of a composition
Other criteria
Spherical criterionW3
W2
W1
W
E
1
1
1ccl W
W
Hyperbolic criterion
1
W2
1 W1
W
Hccl W
W
Statistical Analysis of Compositional Data 15
Subcompositions
Sometimes, given a composition w in CD−1, wemay wish to focus attention on the relativemagnitudes of a subset of components.
Definition If S is any subset of the indices1, . . . , D of a given a D-composition w ∈ CD−1,and wS is the subvector formed from thecorresponding components of w, thenwS = cclwS is termed a subcomposition.
If the subset S is formed by C indices, with2 ≤ C < D, the subcomposition wS belongs tothe compositional space CC−1.
Definition The formation of a C-subcompo-sition wS from a D-composition w may beconsidered as the mapping subS from CD−1 toCC−1:
subS w = wS (w ∈ CD−1)
Statistical Analysis of Compositional Data 16
Subcompositions
W3
W2
W1
W
W
1
1
1
W12
W 12
Lccl W
1
2 3
L 12ccl W
Lccl W
Statistical Analysis of Compositional Data 17
Compositional Problems
1. Percentage of Cl, K2O, P2O5, TiO2 and SiO2
in 46 samples of volcanic rocks from a volcanoH:
Num. Cl K2O P2O5 TiO2 SiO2
1 0.0638 1.83 1.01 3.70 44.99
2 0.1116 1.36 0.81 3.83 43.45
3 0.0611 1.36 1.09 4.10 42.98
......
......
......
44 0.0477 1.67 0.96 3.64 45.14
45 0.0015 1.70 0.95 3.64 45.15
46 0.0986 3.22 0.50 1.87 54.41
Statistical Analysis of Compositional Data 18
Compositional Problems
1.a It is possible to describe the pattern ofvariability of these volcanic rocks and todefine a covariance or correlation structure?
1.b Is it possible to define a measure of totalvariability of this set of volcanic rocks?
1.c For a new volcanic rock specimen with knowncomposition (Cl,K2O,P2O5,TiO2,SiO2)’ andclaimed to be from the same volcano, can wesay whether it is fairly typical from thisvolcano? If not, can we place some measureon its atypicality?
1.d To what extent, if any, do the subcomposition(Cl,K2O,P2O5) explain the pattern ofvariability of the full composition?
Statistical Analysis of Compositional Data 19
Compositional Problems
1.e From this ternary diagram it seems that thepattern of (K2O,P2O5,TiO2) can be welladjusted by a curve. How can we confirmthis?
Statistical Analysis of Compositional Data 20
Compositional Problems
2. Percentage of Cl, K2O, P2O5, TiO2 and SiO2:65 samples of volcanic rocks from a volcanoA, and 19 samples from another volcano D.
Num. Cl K2O P2O5 TiO2 SiO2
1A 0.1776 0.64 0.34 1.57 49.26
2A 0.2050 1.55 0.43 1.61 48.22
.
.....
.
.....
.
.....
65A 0.0391 1.70 0.55 1.63 50.91
1D 0.0181 0.64 0.31 2.55 49.33
2D 0.0053 1.10 0.59 2.81 46.89
.
.....
.
.....
.
.....
19D 0.0200 0.99 0.37 3.16 45.85
Statistical Analysis of Compositional Data 21
Compositional Problems
2.a Can we detect any differences between thecompositional pattern of volcano A andvolcano D?
If so, how can we choose a 3-partsubcomposition which somehow captures theessence of the two patterns individually andyet emphasizes the differences between thepatterns?
2.b Is it possible to establish a classification rulefor discriminating between volcanos A and D?
Statistical Analysis of Compositional Data 22
Compositional Problems
3. Sand, silt,clay composition (% by weight) of39 sediment samples at different water depthsin an Artic lake:
Num. Sand Silt Clay Depth (m)
S01 77.5 19.5 3.0 10.4
S02 71.9 24.9 3.2 11.7
.
.....
.
.....
.
..
S39 2.0 47.8 50.2 103.7
3.a Is sediment composition dependent onwater depth?
3.b If so, how can we quantify the extent ofthe dependence?
Statistical Analysis of Compositional Data 23
How to analyze ”closed” raw data?
Spurious correlations
Pearson (1897) ”If u = f(x, y) andv = g(z, y) be two functions of three variablesx, y, z, and these variables be selected at randomso that there exists no correlation between x andy, y and z, or z and x, there will still be found toexist correlation between u and v. . . . That islikely to occur when u and v are indices with thesame denominator”.
Consequence The standard covariance matrix[sij ] of a closed data set from SD is alwayssingular because
D∑
j=1
sij = 0, for i = 1, . . . , D.
Statistical Analysis of Compositional Data 24
How to analyze ”closed” raw data?
Subcompositional incoherence
Example
Scientist A Scientist B
Full compositions from S4 Subcompositions from S3
(x1, x2, x3, x4) (s1, s2, s3)
(0.1, 0.2, 0.1, 0.6) (0.250, 0.500, 0.250)
(0.2, 0.1, 0.1, 0.6) (0.500, 0.250, 0.250)
(0.3, 0.3, 0.2, 0.2) (0.375, 0.375, 0.250)
corr{x(1),x(2)} = 0.5 corr{s(1), s(2)} = −1
Any statement that scientists A and B makeabout the common parts 1,2 and 3 must agree.
Statistical Analysis of Compositional Data 25
Statistics in IRD
Translation In IRD the inner operation istranslation. If t ∈ IRD, the translation t movesthe random vector X in IRD to a random vectorX + t in such a way that
E{X + t} = E{X}+ t and Σ{X + t} = Σ{X}.
Scalar product For any random vector X onIRD and for any λ ∈ R,
E{λX} = λE{X} and Σ{λX} = λ2Σ{X}.
Statistical Analysis of Compositional Data 26
Perturbations on CD−1
Scale invariance is the property which characteri-zes compositional data. Therefore, any ”opera-tion” involving compositions must be compatiblewith this property.
Definition We define an inner operation ⊕ inCD−1 as
w ⊕w∗ = ccl (w1w∗1 , . . . , wDw∗D)′.
(CD−1,⊕) is a commutative group:
• Composition 1D = ccl (1, . . . , 1)′ is theneutral element.
• The inverse composition w−1 ofw = ccl (w1, . . . , wD)′ is the compositionw−1 = ccl (1/w1, . . . , 1/wD)′.
Statistical Analysis of Compositional Data 27
The group of perturbations in CD−1
Definition Given a composition p ∈ CD−1, theperturbation associated to p is the transformationfrom CD−1 to CD−1 defined by
c → p⊕ c (c ∈ CD−1).
Then, we say that p⊕ c is the composition whichresults when the perturbation p is applied to thecomposition c.
Moreover, given two compositions of CD−1
w = ccl (w1, . . . , wD)′, w∗ = ccl (w∗1 , . . . , w∗D)′,
there exists a unique perturbation p which trans-forms w on w∗:
p = w∗ ⊕w−1 = ccl(
w∗1w1
, . . . ,w∗DwD
)′.
Statistical Analysis of Compositional Data 28
The group of perturbations in CD−1
X1
X2 X3
X1
X2 X3
e
x
pox
p
X1
X2 X3
x
x*
xox* −1
Statistical Analysis of Compositional Data 29
The group of perturbations in CD−1
Perturbation in compositional space plays thesame role as translation plays in real space. Theset of all perturbations in CD−1 is a commutativegroup isomorphic to (CD−1,⊕). For this reason,we will also call perturbation the inner operation⊕ defined on CD−1.
The assumption that the group of perturbationsis the operating group on the compositional spaceis the keystone of the methodology introduced byAitchison (1986). In fact, it implies to accept thatthe “difference” between two compositionsw = ccl (w1, . . . , wD)′ and w∗ = ccl (w∗1 , . . . , w∗D)′
will be based on the ratios w∗j /wj between partsinstead of on the arithmetic differences w∗j − wj .
Statistical Analysis of Compositional Data 30
Perturbations on CD−1
Interpretation
• Some natural processes in nature can beinterpreted as a succession of changes fromone initial composition w0 to a finalcomposition wn through the application ofsuccessive perturbations:
w0 −→ p1 ⊕w0 = w1
−→ p2 ⊕w1 = w2
. . .
−→ pn ⊕wn−1 = wn.
In this manner,
wn = (pn ⊕ pn−1 ⊕ . . .p1)⊕w0.
Statistical Analysis of Compositional Data 31
Genesis of normal distribution
Particles fall from a funnel onto tips of triangles, where
they are deviated to the left or to the right with equal
probability (0.5) and finally fall into receptacles. If the tip
of a triangle is at distance x from the left edge of the
board, triangle tips to the right and to the left below it are
placed at x + k and x− k (k constant).
Statistical Analysis of Compositional Data 32
Genesis of lognormal distribution
Particles fall from a funnel onto tips of triangles, where
they are deviated to the left or to the right with equal
probability (0.5) and finally fall into receptacles. If the tip
of a triangle is at distance x from the left edge of the
board, triangle tips to the right and to the left below it are
placed at x/k and x.k (k constant).
Statistical Analysis of Compositional Data 33
Perturbations on CD−1
Interpretation
• If w = ccl (wSiO2 , . . . , wP2O5)′ expresses the
percentage composition of major oxides of arock, its molecular composition will be
w∗ = ccl (wSiO2/mSiO2 , . . . , wP2O5/mP2O5)′,
where mj symbolizes the molecular weight ofoxide j.
Therefore, composition w∗ can be obtainedapplying the perturbation
m−1 = (ccl (mSiO2 , . . . , mP2O5)′)−1
to composition w:
w∗ = m−1 ⊕w.
Statistical Analysis of Compositional Data 34
The vector space (CD−1,⊕,⊗)
Definition The external operation ⊗ in CD−1 isdefined as
λ⊗w = ccl (wλ1 , . . . , wλ
D)′,
for each λ ∈ IR and each w ∈ CD−1.
(CD−1,⊕,⊗) is a vector space of dimension D− 1.
X1
X2 X3
X1
X2 X3
X1
X2 X3
X1
X2 X3
X1
X2 X3
e
x
2.x
−2.x
Statistical Analysis of Compositional Data 35
The log and the exp transformations
between IRD+ and IRD
The logarithmic transformation on IRD+ trans-
forms the rays from the origin —which representthe compositions of the space CD−1—, to straightlines of IRD parallel to vector 1D = (1,
D. . ., 1)′.
Inversely, the exponential transformation on IRD
transforms these straight lines of IRD parallel tovector 1D, to rays from the origin of IRD
+ .
Statistical Analysis of Compositional Data 37
Centered logratio transformation
Definition The centered logratio transforma-
tion —denoted by clr — is the one-to-one functionfrom the compositional space CD−1 to thesubspaceV = {z = (z1, . . . , zD)′ ∈ IRD : z1 + . . . + zD = 0}of IRD, defined by
clrw = logw
g(w)(w ∈ CD−1).
The inverse transformation, from V to CD−1, isgiven by
clr−1 z = ccl (exp z) (z ∈ V ).
The logarithmic and the exponential transforma-tions establish a one-to-one correspondencebetween the simplex SD and the hyperplan V inIRD.
Statistical Analysis of Compositional Data 38
Centered logratio transformation
W2
W1
W
W
ccl W
1
1
U
z+U
Z2
Z1
z
ucl z V
1
1
Statistical Analysis of Compositional Data 39
Centered logratio transformation
Property The centered logratio transforma-
tion is an isomorphism between the vector space(CD−1,⊕,⊗) and the vector subspace
V = {z = (z1, . . . , zD)′ ∈ IRD : z1 + . . . + zD = 0}
of (IRD, +, .). Therefore,
clr (w ⊕w∗) = clrw + clrw∗ ;
clr (λ⊗w) = λ clrw,
where w,w∗ ∈ CD−1, and λ ∈ <.
Equally,
clr−1 (z + z∗) = clr−1 z⊕ clr−1 z∗ ;
clr−1 (λ z) = λ⊗ clr−1 z,
where z, z∗ ∈ V , and λ ∈ IR.
Statistical Analysis of Compositional Data 40
Isometric logratio transformation
Let V = {v1, . . . ,vD−1} an orthonormal basis ofthe subspace
V = {z = (z1, . . . , zD)′ ∈ IRD : z1 + . . . + zD = 0}.Then, since clrw ∈ V , it will be always possibleto write
clrw = u1v1 + . . . + uD−1vD−1,
for any w ∈ CD−1.
Definition The isometric logratio
transformation —denoted by ilrV — is theone-to-one function from the compositional spaceCD−1 to IRD−1 defined by
ilrV w = (u1, . . . , uD−1)′ (w ∈ CD−1).
Like clr , the transformation ilrV is anisomorphism between the vector spaces(CD−1,⊕,⊗) and (IRD−1,+, .).
Statistical Analysis of Compositional Data 41
Skye Lavas
Sample Na2O + K2O Fe2O3 MgO
S1 52 42 6
S2 52 44 4
S3 47 48 5
clr (Na2O + K2O) clr (Fe2O3) clr (MgO)
S1 0,7910 0,5775 -1,3685
S2 0,9107 0,7436 -1,6543
S3 0,7399 0,7609 -1,5008
ilr 1 [u1] ilr 2 [u2]
S1 0,1510 1,6760
S2 0,1181 2,0261
S3 -0,0149 1,8381
Hint The orthonormal basis V of the subspace V ⊂ IR3
linked to the ilr coordinates is
v1 = (1√2
,− 1√2
, 0)′, v2 = (1√6
,1√6
,− 2√6)′
Statistical Analysis of Compositional Data 42
Skye Lavas
Ternary diagram
clr-coordinates ilr-coordinates
Statistical Analysis of Compositional Data 43
CD−1 as an Euclidean space
The clr transformation between CD−1 and thesubspace V of IRD allows to translate to CD−1 thereal Euclidean structure defined on V :
<w,w∗>C = (clrw)′ clrw∗
= (log w)′HD log w∗,
‖w‖C = ‖clrw‖= [(log w)′HD log w]1/2,
d C(w,w∗) = d Euc(clrw, clrw∗)
= [(log w∗ − log w)′HD
(log w∗ − log w)]1/2,
where HD is the (D ×D)-centering matrix. Thismatrix is equal ID −D−1JD, where ID is theidentity matrix and JD = 1D1′D.
Therefore, by construction, transformations clrand clr−1 —and also ilr and ilr−1— preservethe distances defined in CD−1 and IRD−1.
Statistical Analysis of Compositional Data 44
Compositional geometry in CD−1
• We can not analyze the simplex SD as weanalyze the Euclidean real space.
Let
w1 = ccl (1.000, 49.500, 39.500)′
w2 = ccl (0.010, 49.995, 39.995)′
w3 = ccl (25.0, 50.0, 25.0)′
w4 = ccl (35.0, 30.0, 35.0)′
be four compositions from S3.
Then,
d Euc(w1,w2) ≈ 1.21 < 24.49 ≈ d Euc(w3,w4),
whereas
d C(w1,w2) ≈ 3.77 > 0.69 ≈ d C(w3,w4).
Statistical Analysis of Compositional Data 45
Compositional geometry in CD−1
• Any linear variety on CD−1 —straight lines,planes, etc— can always be implicitlyexpressed by a system of linear equations inlog w1, . . . , log wD in the form
a11 log w1+ . . . +a1D log wD = b1
. . . . . . . . .
am1 log w1+ . . . +amD log wD = bm
,
with ai1 + . . . + aiD = 0, for each i = 1, . . . , m.
• In particular, the parametric equation—varying t ∈ IR— of a straight line on CD−1
is given by
w(t) = ccl (exp(α1+λ1t), . . . , exp(αD+λDt))′,
where∑D
j=1 alphaj = 0 and∑D
j=1 λj = 0.
• Similarly to real space, the concepts ofparallelism and orthogonality can beintroduced in CD−1.
Statistical Analysis of Compositional Data 46
Parallelism in C2
1
2 3
k>0 k=0 k<0
log w2 − log w3 = k
1
2 3
k=−4
k=−2
k=0
k=2
k=4
log w1 − 2 log w2 + log w3 = k
Statistical Analysis of Compositional Data 47
Orthogonality in C2
1
2
3
w2 − w3 = 0
−2 log w1 + log w2 + log w3 = 0
1
2 3
log w1 − 3 log w2 + 2 log w3 = 0
5 log w1 − log w2 − 4 log w3 = 0
Statistical Analysis of Compositional Data 48
Circles in C2
Simplex S3
−0.5 0 0.5 1 1.5 2
−1.5
−1
−0.5
0
0.5
clr -space
Statistical Analysis of Compositional Data 49
The alr transformation
Definition The additive logratio transforma-
tion of index j (j = 1, . . . , D) — denoted byalrj — is the one-to-one transformation fromCD−1 to IRD−1 defined by
w −→ y = alrj w = logw−j
wj.
where
w−j = (w1, w2, . . . , wj−1, wj+1, . . . , wD)′.
The inverse transformation of alrD , from IRD−1
to CD−1, is given by
alr−1D y = ccl (exp y1, . . . , exp yD−1, 1)′ (y ∈ IRD−1).
Statistical Analysis of Compositional Data 50
The alr transformation
Property The alrj transformations(j = 1, . . . , D) are isomorphisms between thevector spaces (CD−1,⊕,⊗) and (IRD−1,+, .), i.e.,
alrj (w ⊕w∗) = alrj w + alrj w∗ ;
alrj (λ⊗w) = λ alrj w,
alr−1j (y + y∗) = alr−1
j y ⊕ alr−1j y∗ ;
alr−1j (λ y) = λ⊗ alr−1
j y,
where w,w∗ ∈ CD−1, y,y∗ ∈ IRD−1 and λ ∈ IR.
Property The alrj transformations(j = 1, . . . , D) do not preserve the distancesdefined in the metric spaces CD−1 and IRD−1, i.e.,
d C(w,w∗) 6= d Euc(alrj w, alrj w∗);
d Euc(y,y∗) 6= d C(alr−1j y, alr−1
j y∗).
Statistical Analysis of Compositional Data 51
Determination of a composition
A composition w ∈ CD−1 can be determined inseveral forms:
(i) Giving any D-observational vector belongingto w. Usually, we will choose the vectorx = Cw = cclL w belonging to SD.
(ii) Giving the components (z1, . . . , zD)′ = z ofthe centered logratio transformed vectorclrw. Since z belongs to the subspace V ofIRD, its components are related by theequality z1 + . . . + zD = 0.
(iii) Giving the components (y1, . . . , yD−1)′ = y ofthe additive logratio transformed vectoralrD w. If it is needed, we can choose thecomponents of any other logratio alrjw(j 6= D).
(iv) Giving the components (u1, . . . , uD−1)′ = u ofthe isometric logratio transformed vectorilrV w, where V is a known orthonormal basisof the subspace V of IRD.
Statistical Analysis of Compositional Data 52
Determination of a composition
• Skye lavas: A = Na2O + K2O, F = Fe2O3, M = MgO
Sample A F M
S1 52 42 6
S2 52 44 4
clr (A) clr (F) clr (M)
S1 0.7910 0.5775 -1.3685
S2 0.9107 0.7436 -1.6543
ilr 1 [u1] ilr 2 [u2]
S1 0,1510 1,6760
S2 0,1181 2,0261
alrMA alrMF
S1 2.159 1.946
S2 2.565 2.398
alrAF alrAM
S1 -0.214 -2.159
S2 -0.167 -2.565
Hint The orthonormal basis V of the subspace V ⊂ IR3
linked to the ilr coordinates is
v1 = (1√2
,− 1√2
, 0)′, v2 = (1√6
,1√6
,− 2√6)′.
Statistical Analysis of Compositional Data 54
Compositional data set
• Raw data matrix
W = [wij : i = 1, . . . , n; j = 1, . . . , D],
or
X = [xij : i = 1, . . . , n; j = 1, . . . , D],
where xi = (xi1, . . . , xiD)′ ∈ SD.
Example AFM composition of 23 aphyric Skyelavas [A = Na2O + K2O, F = Fe2O3, M = MgO].
Obs. A% F% M%
S1 52 42 6
..
....
..
....
S23 24 56 20
X =
52 42 6...
......
24 56 20
.
Statistical Analysis of Compositional Data 55
Compositional data set
• Centred logratio (clr ) data matrix
Z = [zij : i = 1, . . . , n; j = 1, . . . , D],
where zij = log (wij/g(wi)), withg(wi) = (
∏Dk=1 wik)1/D.
Example AFM composition of 23 aphyric Skyelavas [A = Na2O + K2O, F = Fe2O3, M = MgO].
Obs. A% F% M%
S1 52 42 6
..
....
..
....
S23 24 56 20
clr A clr F clr M
Z =
0.791 0.577 −1.368...
......
−0.222 0.626 −0.404
.
Statistical Analysis of Compositional Data 56
Compositional data set
• Additive logratio (alr) data matrix
Y = [yij : i = 1, . . . , n; j = 1, . . . , d],
where yij = log(wij/wiD).
Example AFM composition of 23 aphyric Skyelavas [A = Na2O + K2O, F = Fe2O3, M = MgO].
Obs. A% F% M%
S1 52 42 6
.
.....
.
.....
S23 24 56 20
log AM log F
M
Y =
2.159 1.946...
...
0.182 1.030
.
Statistical Analysis of Compositional Data 57
Center of a compositional data set
• The center of a set W of n compositions w1,
. . . ,wn of CD−1, is the composition g definedby
cenW = g = (1n⊗w1)⊕ . . .⊕ (
1n⊗wn).
This center is equal to
g = ccl
(n∏
i=1
wi1
)1/n
, . . . ,
(n∏
i=1
wiD
)1/n′
.
• It verifies that
clrg = z =n∑
i=1
1nzi =
n∑
i=1
1n
clrwi.
and
alrD g = y =n∑
i=1
1nyi =
n∑
i=1
1n
alrD wi.
Statistical Analysis of Compositional Data 58
Center of a compositional data set
Properties
cenW = ccl
(n∏
i=1
wi1
)1/n
, . . . ,
(n∏
i=1
wiD
)1/n′
.
• cenW = argmin︸ ︷︷ ︸ξ∈CD−1
{dC(w1,ξ)+...+dC(wn,ξ)
n
}.
• cen {p⊕W} = p⊕ cenW, where p ∈ CD−1.
• cen {t⊗W} = t⊗ cenW, where t ∈ IR.
• cen {W ⊕W∗} = cenW ⊕ cenW∗.
Statistical Analysis of Compositional Data 59
Center of a compositional data set
Example AFM composition of 23 aphyric Skyelavas
? Compositional (”geometric”) center:
g = ccl (25.85, 56.65, 17.50)′.
? ”Arithmetic” center:
a = ccl (26.83, 53.74, 19.43)′.
Statistical Analysis of Compositional Data 60
Centering
To ”centre” a compositional data set w1, . . .wn
with centre g, it suffices to consider the new dataset w∗
1 = g−1 ⊕w1, . . . ,w∗n = g−1 ⊕wn.
Obviously, the centre of the new ”centered” dataset w∗
1, . . . ,w∗n is ccl (1/D, . . . , 1/D)′.
Statistical Analysis of Compositional Data 61
Compositional covariance structure
• Variation matrix
T = [τjk] =[var
{log
w(j)
w(k)
}].
? τjk = 0 means a perfect relationship between w(j)
and w(k) in the sense that the ratio w(j)/w(k) is
constant.
? The larger the value of τjk the more departure
from proportionality between w(j) and w(k).
? A measure of degree of proportionality between
two parts j and k is given by
exp(−√τjk).
In this way, exp(−√τjk) = 0 means zero
propotionality, and exp(−√τjk) = 1 means
perfect propotionality.
? The variation matrix of any subcomposition is
simply obtained by picking out on T all the
logratio variances τjk associated with the parts j
and k of the subcomposition.
Statistical Analysis of Compositional Data 62
Compositional covariance structure
• Logratio covariance matrix
Σ = [σjk] =[cov
{y(j),y(k)
}]
=[cov
{log w(j)
w(D), log w(k)
w(D)
}].,
where
yj =(
logw1j
w1D, . . . , log
wnj
wnD
)′,
for j = 1, . . . , D − 1.
Statistical Analysis of Compositional Data 63
Compositional covariance structure
• Centered covariance matrix
Γ = [γjk] =[cov
{z(j), z(k)
}].
where
z(j) = (log (w1j/g(w1)) , . . . , log (wnj/g(wn)))′ ,
for j = 1, . . . , D.
Hint. Correlation corr{z(j), z(k)
}is not a
measure of a relationship between parts j andk because is subcompositionally incoherent.
• Total (relative) variability
totvarC{W} =∑n
i=11nd2
C(wi,g) = trace {Γ}
= 12D1
′DT1D = 1
D
∑i<j τij .
Statistical Analysis of Compositional Data 64
Compositional covariance structure
• The centered covariance matrix Γ =[γjk
]is singular
becauseD∑
k=1
γjk = 0 (j = 1, . . . , D).
• The relationships between the three covariance
matrices T,Σ and Γ are linear.
• The dimensionality of the covariance structure of a
compositional raw data matrix from CD−1 is equal to12D(D − 1).
• The covariance matrix T —and also Σ and Γ— is
coherent with the algebraic structure of (CD−1,⊕,⊗),
i.e.,
T{p⊕W} = T{W} and T{λ⊗W} = λ2T{W},
where W is a compositional raw data matrix from
CD−1, p ∈ CD−1 and λ ∈ IR.
Therefore,
totvar C{p⊕W} = totvarC{W};
totvarC{λ⊗W} = λ2totvarC{W}.
Statistical Analysis of Compositional Data 65
Compositional covariance structure
Example AFM composition of 23 Skye lavas
? Variation matrix
T =
0 0.251 1.144
0.251 0 0.350
1.144 0.350 0
.
? Logratio covariance matrix
ΣA =
[0.251 0.523
0.523 1.144
]ΣF =
[0.251 −0.271
−0.271 0.350
]
ΣM =
[1.144 0.622
0.622 0.350
].
? Centered covariance matrix
Γ =
0.271 0.013 −0.284
0.013 0.007 −0.020
−0.284 −0.020 0.304
.
? Total variability: totvarC = 0.582.
Statistical Analysis of Compositional Data 66
Biplots
In general, the biplot is a simultaneous represen-tation of the rows (observations) and columns(variables) of a n× p matrix X by means of arank-2 approximation.
Usually, biplot analysis starts with performingsome transformations on X, depending on thenature of the data, to obtain a transformedmatrix Z which is the one that is actuallydisplayed.
Statistical Analysis of Compositional Data 67
Biplots
• The singular value decomposition (SVD) of Zprovides a decomposition of this matrix:
Z = [u1 : . . . : ur] diag{λ1, . . . , λr} [v1 : . . . : vr]′,
where
? r is the rank of Z;
? u1, . . . ,ur are the standardizedeigenvectors of Z′;
? v1, . . . ,vr are the standardizedeigenvectors of Z;
? and λ1, . . . , λr the corresponding positiveeigenvalues in decreasing order.
• From this SVD of Z, and using only the twofirst eigenvectors, a rank-2 approximation Zis obtained:
Z = [u1 : u2] diag{λ1, λ2} [v1 : v2]′.
Statistical Analysis of Compositional Data 68
Biplots
• Then Z decomposes in
Z = [λα1 u1 : λα
2 u2]︸ ︷︷ ︸F
[λ1−α1 v1 : λ1−α
2 v2]′︸ ︷︷ ︸G′
,
where α is an arbitrary constant.
• The biplot represents simultaneously in IR2
the rows of F, which provides the coordinatesof n points (in correspondence with the n
rows/observations of Z) , and the rows of G,which provides the coordinates of p points (incorrespondence with the columns/variables ofZ).
Conventionally, the biplot depicts thevariables by rays and the observations bypoints.
Depending on the constant α, the biplot favoursthe display of rows (observations) or columns(variables). For α = 0, the biplot is calledcovariance biplot. In this case, the display ofvariables is favoured.
Statistical Analysis of Compositional Data 69
Biplots
• Singular value decomposition (SVD) of Z:
Z = [u1 : . . . : ur] diag{λ1, . . . , λr} [v1 : . . . : vr]′,
• Rank-2 approximation Z:
Z = [λα1 u1 : λα
2 u2]︸ ︷︷ ︸F
[λ1−α1 v1 : λ1−α
2 v2]′︸ ︷︷ ︸G′
,
• The ratioλ1 + λ2
λ1 + . . . + λr
is a measure of the proportion of the”variability” of Z captured by the biplot.
Statistical Analysis of Compositional Data 70
Relative variation diagrams
Definition The relative variation diagram of acompositional data set w1, . . . ,wn of CD−1 is thecovariance biplot of the matrix Zc which weobtain after centering the D columns of thecentered logratio matrix Z.
• Elements
? Origin, labeled O.
? Vertices, for each of the D parts(variables/columns) of compositions,labeled 1, . . . , j, . . .D.
? Case marker, for each of the n
observations (rows), labeled 1, . . . , i, . . . n.
? Ray. Is the join Oj of origin O to a vertexj.
? Link. Is the join jk of two vertices j and k.
Statistical Analysis of Compositional Data 72
Relative variation diagrams
• The vertices and case markers are bothcentered at the origin O.
• Rays and inter-ray angles represent thecentered logratio matrix Γ:
|Oj|2 = γjj = estimate of var{z(j)
},
|Oj| · |Ok| = γjk = estimate of cov{z(j), z(k)
},
so that
cos jOk = estimate of corr{z(j), z(k)
}.
Hint. Remember that correlationcorr
{z(j), z(k)
}is not a measure of a
relationship between parts j and k because issubcompositionally incoherent.
Statistical Analysis of Compositional Data 73
Relative variation diagrams
• The squared lengths of the links represent theset of estimated relative variances:
|jk|2 = τjk = estimate of var{
logw(j)
w(k)
}.
Therefore, if two vertices j and k coincide orare close together then components w(j) andw(k) are in constant proportion or nearly so.
jk
0
Statistical Analysis of Compositional Data 74
Relative variation diagrams
• Links jl and kl, with a common vertex l,represent the estimated logratio covariancematrix Σl:
|jl| · |kl| ≈∣∣∣∣cov
{log
w(j)
w(l), log
w(k)
w(l)
}∣∣∣∣ ,
so that
cos jlk ≈ corr{
logw(j)
w(l), log
w(k)
w(l)
}.
jl
0
k
Statistical Analysis of Compositional Data 75
Relative variation diagrams
• If the links jk and lm intersect at R then∣∣∣cos jRm
∣∣∣ ≈∣∣∣∣corr
{log
w(j)
w(k), log
w(l)
w(m)
}∣∣∣∣ .
Therefore, if two links jk and lm intersect atright angles then the logratios log(w(j)/w(k))and log(w(l)/w(m)) will be uncorrelated and,within the context of logistic normality, inde-pendent, i.e., subcompositions (j, k) and(l,m) are independent.
j
k
0m
l
R
Statistical Analysis of Compositional Data 76
Relative variation diagrams
• The relative variation diagram for any sub-composition S is simply the subdiagramformed by selecting the vertices correspondingto the parts of the subcomposition and takingthe centroide OS of these vertices as thecenter of the subcompositional biplot.
Therefore, if a subset — say 1, . . . , C— ofvertices is approximately collinear then theassociated subcomposition has a ”composi-tional” one-dimensional structure.
j
k0
l0S
jk
0
l
Statistical Analysis of Compositional Data 77
Volcano H
• Parts: 1=Cl; 2=K2O; 3=P2O5; 4=TiO2; 5=SiO2.
• Variation matrix T
0 2, 784 4, 134 3, 970 2, 966
2, 784 0 0, 647 0, 645 0, 146
4, 134 0, 647 0 0, 071 0, 304
3, 970 0, 645 0, 071 0 0, 249
2, 966 0, 146 0, 304 0, 249 0
• Centered covariance matrix Γ
2, 134 −0, 221 −0, 803 −0, 743 −0, 368
−0, 221 0, 208 −0, 022 −0, 043 0, 079
−0, 803 −0, 022 0, 394 0, 337 0, 094
−0, 743 −0, 043 0, 337 0, 350 0, 099
−0, 368 0, 079 0, 094 0, 099 0, 096
Statistical Analysis of Compositional Data 80
Dimension-reducing techniques
Compositional PCA
Given a set of compositions w1, . . . ,wn of CD−1
with center g, the PCA will start looking for adirection —determined by a C-unitary composi-tion c1— such that the total variability of theC-orthogonal projections of w1, . . . ,wn on thecompositional straight line through g with direc-tion c1 will be maximum. And so on.
Property The compositional principal compo-nents of a set of compositions w1, . . . ,wn of CD−1
can be determined from the standard principalcomponents of the clr -transformed observationsclrw1, . . . , clrwn.
Statistical Analysis of Compositional Data 81
Compositional PCA
In this manner, the positive eigenvalues λ1 ≥ . . .
≥ λD−1 of the centered logratio covariance matrixΓ give the decomposition of totvarC , and thecorresponding unitary eigenvectors z∗1, . . . , z
∗D−1
determine the corresponding directionsclr−1z∗1, . . . , clr
−1z∗D−1 of the principal axes.
Statistical Analysis of Compositional Data 82
Skye Lavas
PC1: 0.4436 log A− 0.8154 log F + 0.3719 log M ≈ −0.7849
Statistical Analysis of Compositional Data 83
Volcano H
(Cl, K2O, P2O5, TiO2, SiO2)
PC1: (0.4246,0.1656,0.1264,0.1296,0.1538)’ (87.8%)
PC2: (0.1469,0.3836,0.1230,0.1182,0.2293)’ (acum 98.1%)
Statistical Analysis of Compositional Data 84
Dimension-reducing techniques
Subcomposition analysis
Let w1, . . . ,wn be a compositional data set ofCD−1, and let subS w1, . . . , subS wn be the set ofthe corresponding subcompositions of CC−1
associated to a subset S of parts 1, . . . , D. Then,the ratio
totvarC{subS w1, . . . , subS wn}totvarC{w1, . . . ,wn}
gives the proportion of total variabilitiy retainedby the subcompositions.
If the purpose of subcompositional analysis is toretain as much variability as possible for a givennumber C of parts, then we have to search forsubcompositions of this size which maximize thisratio.
Statistical Analysis of Compositional Data 86
Subcomposition analysis
Example Percentage of Cl, K2O, P2O5, TiO2
and SiO2 in 46 samples of volcanic rocks from ananonymous volcano H
? Total variability: totvarC = 3.1829.
? Total variability of 3-parts subcompositions:
Subcomposition Percentage
P2O5, TiO2, SiO2 6.53%
K2O, TiO2, SiO2 10.90%
K2O, P2O5, SiO2 11.48%
K2O, P2O5, TiO2 14.27%
Cl, K2O, SiO2 61.74%
Cl, TiO2, SiO2 75.25%
Cl, K2O, TiO2 77.48%
Cl, P2O5, SiO2 77.53%
Cl, K2O, P2O5 79.21%
Cl, P2O5, TiO2 85.61%
Statistical Analysis of Compositional Data 87
Zeros in compositional data
• Logratio methodology is incompatible withcomposition with zeros in one or more parts.
• Two kinds of zeros:
? Essential zeros: part completely absent.
? Rounded zeros: no quantifiable proportionhas been recorded.
• Treatment of essential zeros:
? Is it suitable to amalgamate some parts?
? Pre-classification: create initial groupsaccording to the number and location ofzeros, and analyze each group separately.
• Treatment of rounded zeros:
? Consider the zero values as missing values.
? Imputation: replace zero values by a smallamount using non-parametric orparametric techniques.
? Apply log-ratio methodology to replacedobservations of resulting data set.
Statistical Analysis of Compositional Data 88
Rounded zeros
Multiplicative replacement
Let be w = ccl (w1, . . . , wD)′ ∈ CD−1 anycomposition with some wj = 0 (rounded zero).
The multiplicative replacement replaces w bythe composition w(r) = ccl (w(r)
1 , . . . , w(r)D )′
defined by
if wj = 0 → w(r)j = δj ;
if wj 6= 0 → w(r)j = wj
(1−
∑wl=0
δl
).
where δj are the ”small” values replacing zerosparts.
Statistical Analysis of Compositional Data 89
Rounded zeros
Multiplicative replacement
• It is a ”natural” replacement.
• Ratio between two non-zero parts ispreserved.
• It is compatible with subcompositions,perturbation and power transformation.
• Covariance structure of subcompositions withno zeros is preserved.
Statistical Analysis of Compositional Data 90
Modeling compositional data
In practice, many of the probability densityfunctions (pdf) on the compositional space CD−1
will be defined from a pdf on the real space IRD−1.Then the alrj
−1 transformations will allow toinduce on the simplex SD the corresponding pdf.
The most important pdf on CD−1 are:
• The Dirichlet class.
• The (additive) logistic normal class.
• The (additive) logistic skewnormal class.
Definition A random composition w on CD−1
is said to have an additive logistic normaldistribution (aln) of parameters µ and Σ—written w ∼ LD−1(µ,Σ)— if the randomvector y = alrD w = log(w−D/wD) has aND−1(µ,Σ) on IRD−1.
Statistical Analysis of Compositional Data 91
Logistic normal distributions on CD−1
Property Let w be a random vector on CD−1.If alrD w ∼ ND−1(µ,Σ), then all the otherlogratio random vectors alrjw (j = 1, . . . , d) arenormally distributed.
Property Let w be a random composition onCD−1. Let wS be the random subcomposition onCC−1 corresponding to a subset S of C parts ofw. If w ∼ LD−1(µ,Σ), thenwS ∼ LC−1(µS ,ΣS), where µS and ΣS can beeasily calculated from µ and Σ.
Property Let w be a random vector on CD−1,which w ∼ LD−1(µ,Σ). If we perturb w by aconstant composition p ∈ CD−1, then theperturbed random vectorp⊕w ∼ LD−1(µ + alrD p,Σ).
Statistical Analysis of Compositional Data 92
Logistic normal distributions on CD−1
Estimation of parameters
To estimate the parameters µ and Σ of a randomcomposition w ∼ LD−1(µ,Σ) from a randomsample w1, . . . ,wn of w, we estimate by standardprocedures the vector mean and the covariancematrix of a multivariate normal distribution fromthe alrD-transformed random sample
y1 = alrDw1, . . . ,yn = alrD wn.
The maximum likelihood estimations of µ and Σare given by
µj =1n
n∑
i=1
yij ,
σjk =1n
n∑
i=1
(yij − µj)(yik − µk),
for j, k = 1, . . . , D − 1.
Statistical Analysis of Compositional Data 93
Predictive regions
Definition Let w be a random composition aln
distributed on CD−1. If µ and Σ are the estima-tes of the unknown parameters of w from arandom sample of size n, the 1− α predictiveregion is defined as{
w∗ ∈ CD−1 : (alrD w∗ − µ)′Σ−1
(alrD w∗ − µ) ≤ r2}
,
where r2 is a real number such that
Prob
[FD−1,n−(D−1) ≤
n(n− (D − 1))
(n2 − 1)(D − 1)r2
]= 1− α.
99% Predictive region - Skye lavas
Statistical Analysis of Compositional Data 94
Atypicality index
Definition If a random composition w on CD−1
is LD−1(µ,Σ) distributed, the atypicality indexof a composition w∗ ∈ CD−1 in relation to therandom composition w is defined as
Prob[χ2
D−1 ≤ (alrD w∗ − µ)′Σ−1(alrD w∗ − µ)]
.
Definition Let w be a random composition aln
distributed on CD−1. If µ and Σ are the estimatesof the unknown parameters of w from a randomsample w1, . . . ,wn of size n, the atypicality index
of a composition w∗ ∈ CD−1 in relation to thecompositional data set w1, . . . ,wn is defined as
Prob
[FD−1,n−(D−1) ≤ k(alrD w∗ − µ)′Σ
−1(alrD w∗ − µ)
],
where k =n(n−(D−1))
(n2−1)(D−1).
Statistical Analysis of Compositional Data 95
Compositional Regression
Artic lake
Sand, silt,clay composition of 39 sedimentsamples at different water depths in an Artic lake:
Num. Sand Silt Clay Depth (m)
S01 77.5 19.5 3.0 10.4
S02 71.9 24.9 3.2 11.7
.
.....
.
.....
.
..
S39 2.0 47.8 50.2 103.7
a. Is sediment composition dependent on waterdepth?
b. If so, how can we quantify the extent of thedependence?
Statistical Analysis of Compositional Data 96
Compositional Regression
• Compositions wi ∈ CD−1 regressing on a realconcomitant ti (i = 1, . . . , n):
wi = β0 ⊕ (ti ⊗ β1)⊗ εi (i = 1, . . . , n),
where
? β0: constant;
? β1: regression coefficient;
? εi (i = 1, . . . , n): errors.
• alr version of the regression model
alrwi = alr β0+tialr β1+alr εi (i = 1, . . . , n).
Can be reparametrized as
alrwi = α0 + tiα1 + εi (i = 1, . . . , n).
Statistical Analysis of Compositional Data 97
Compositional Regression
alrwi = α0 + tiα1 + εi (i = 1, . . . , n).
• Estimations α0 and α1 are obtained byapplication of the least squares method. Then
β0 = alr−1α0 , β1 = alr−1α1
• The error (residual) of wi (i = 1, . . . , n) willbe
ei = wi ª wi,
where wi = β0 ⊕ (ti ⊗ β1).
• Sum of squares of errors:
SSError =n∑
i=1
‖ei‖2C =n∑
i=1
(dC(wi, wi)
)2.
• Proportion of variability explained by thefitted linear regression model:
1− SSError
totvarC{w1, . . . ,wn} .
Statistical Analysis of Compositional Data 98
Artic lake
Num. Sand Silt Clay Depth (m)
S01 77.5 19.5 3.0 10.4
..
....
..
....
..
.
S39 2.0 47.8 50.2 103.7
• alr fitted simple linear regression model:
log(sand/clay) = 9.697− 2.743 log(depth) + ε1;
log(silt/clay) = 4.805− 1.096 log(depth) + ε2.
• Fitted regression model in S3:
cclL (sand, silt, clay)′ = (0.992849, 0.007145, 0.00006)′⊕
log(depth)⊗ (0.04604, 0.238291, 0.71505)′.
• Proportion of variability explained by the fitted
simple linear regression model:
1− 0.7006
2.4692= 0.716 ≡ 71.6%.