The statistical analysis of compositional data: The Aitchison geometry

Prof. Dr. Vera Pawlowsky-Glahn
Prof. Dr. Juan José Egozcue
Ass. Prof. Dr. René Meziat

Instituto Colombiano del Petróleo, Piedecuesta, Santander, Colombia
March 20–23, 2007

V. Pawlowsky-Glahn and J. J. Egozcue
Outline: CoDa, historical remarks, sample space, Aitchison geometry, final comments



IAMG Distinguished Lecturer – 2007

Prof. Dr. Vera Pawlowsky-Glahn

Department of Computer Science and Applied Mathematics
University of Girona, Spain


recall

compositional data are parts of some whole which only carry relative information

usual units of measurement: parts per unit, percentages, ppm, ppb, concentrations, ...

historically: data subject to a constant sum constraint

examples: geochemical analysis; (sand, silt, clay) composition; proportions of minerals in a rock; ...


historical remarks: end of the XIXth century

Karl Pearson, 1897: “On a form of spurious correlation which may arise when indices are used in the measurement of organs”

he was the first to point out dangers that may befall the analyst who attempts to interpret correlations between ratios whose numerators and denominators contain common parts

the closure problem was stated within the framework of classical statistics, and thus within the framework of Euclidean geometry in real space


the problem: negative bias & spurious correlation

example: scientists A and B record the composition of aliquots of soil samples; A records (animal, vegetable, mineral, water) compositions, B records (animal, vegetable, mineral) after drying the sample; both are absolutely accurate (adapted from Aitchison, 2005)

sample A   x1     x2     x3     x4
1          0.1    0.2    0.1    0.6
2          0.2    0.1    0.2    0.5
3          0.3    0.3    0.1    0.3

sample B   x'1    x'2    x'3
1          0.25   0.50   0.25
2          0.40   0.20   0.40
3          0.43   0.43   0.14

corr A     x1     x2     x3     x4
x1         1.00   0.50   0.00   -0.98
x2                1.00   -0.87  -0.65
x3                       1.00   0.19
x4                              1.00

corr B     x'1    x'2    x'3
x'1        1.00   -0.57  -0.05
x'2               1.00   -0.79
x'3                      1.00
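The sign change between the two correlation tables can be checked with a few lines of code. The following is a minimal sketch, not the authors' computation: the helper names `close`, `pearson` and `column` are illustrative, and scientist B's data are assumed to be the first three parts of A's data, re-closed to sum to 1.

```python
import math

# Sample A: (animal, vegetable, mineral, water), values copied from the slide.
A = [
    [0.1, 0.2, 0.1, 0.6],
    [0.2, 0.1, 0.2, 0.5],
    [0.3, 0.3, 0.1, 0.3],
]

def close(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa (hypothetical helper)."""
    total = sum(z)
    return [kappa * v / total for v in z]

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def column(rows, j):
    """Extract part j across all samples."""
    return [row[j] for row in rows]

# Assumption: B observes the first three parts after drying, re-closed to 1.
B = [close(row[:3]) for row in A]

corr_A_12 = pearson(column(A, 0), column(A, 1))  # x1 vs x2, full composition
corr_B_12 = pearson(column(B, 0), column(B, 1))  # x'1 vs x'2, after drying
print(round(corr_A_12, 2), round(corr_B_12, 2))
```

The same pair of parts is positively correlated for A and negatively correlated for B, although both scientists measured the same material.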


historical remarks: from 1897 to 1980 (and beyond)

the fact that correlations between closed data are induced by numerical constraints caused Felix Chayes to attempt to separate the spurious part from the real correlation

(“On correlation between variables of constant sum”, 1960)

many studied the effects of closure on methods related to correlation and covariance analysis (principal component analysis, partial and canonical correlation analysis) or distances (cluster analysis)

an exhaustive search was initiated within the framework of classical (applied) statistics


historical remarks: end of the XXth century

John Aitchison, 1982, 1986: “The statistical analysis of compositional data”

key idea: compositional data represent parts of some whole; they only carry relative information

by analogy with the log-normal approach, Aitchison projected the sample space of compositional data, the D-part simplex $S^D$, to real space $\mathbb{R}^{D-1}$ or $\mathbb{R}^{D}$, using log-ratio transformations

the log-ratio approach was born ...


compositional data: definition

definition: parts of some whole which carry only relative information ⟺ compositional data are equivalence classes

[Figure: compositional data in $\mathbb{R}^2$ (axes X1 and X2, both marked at 1) and compositional data in $\mathbb{R}^3$]

usual representation: subject to a constant sum constraint


compositional data: usual representation

definition: $\mathbf{x} = [x_1, x_2, \ldots, x_D]$ is a D-part composition

⟺

$x_i > 0$ for all $i = 1, \ldots, D$, and $\sum_{i=1}^{D} x_i = \kappa$ (constant)

$\kappa = 1$ ⟺ measurements in parts per unit
$\kappa = 100$ ⟺ measurements in percent

other frequent units: ppm, ppb, ...

a subcomposition $\mathbf{x}_s$ with $s$ parts is obtained as the closure of a subvector $[x_{i_1}, x_{i_2}, \ldots, x_{i_s}]$ of $\mathbf{x}$
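As an illustration of how a subcomposition is formed, here is a minimal sketch. It assumes that closure means rescaling the selected parts so that they sum to κ again; the function names `closure` and `subcomposition` and the zero-based `indices` argument are illustrative choices, not part of the slides.

```python
def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def subcomposition(x, indices, kappa=1.0):
    """Select parts by zero-based index, then re-close the subvector."""
    return closure([x[i] for i in indices], kappa)

# A 4-part composition in percent (kappa = 100).
x = [10.0, 20.0, 10.0, 60.0]

# Keep the first three parts; they are rescaled to sum to 100 again.
x_s = subcomposition(x, [0, 1, 2], kappa=100.0)
print(x_s)
```

The ratios between the retained parts are unchanged by this operation; only the common scale is reset.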


the simplex as sample space

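$S^D = \{\mathbf{x} = [x_1, x_2, \ldots, x_D] \mid x_i > 0;\ \sum_{i=1}^{D} x_i = \kappa\}$

standard representation for D = 3: the ternary diagram

[Figure: ternary diagram with vertices X1, X2, X3 and a point with coordinates x1, x2, x3]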


example 1: genetic hypothesis

[Figure: diagram with corners labelled MN (top), MM and NN (bottom)]

data: genotypes in the MN system of blood groups; code: Ab = Aborigines; Ch = Chinese; In = Indian; AmIn = American Indian; Es = Eskimo; question: despite the high variability which can be observed, is there any inherent stability in the data? do they follow any genetic law?


requirements for a proper analysis

scale invariance: the analysis should not depend on the closure constant $\kappa$

permutation invariance: the order of the parts should be irrelevant

subcompositional coherence: studies performed on subcompositions should not stand in contradiction with those performed on the full composition
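To make the first and third requirements concrete, the sketch below checks that a log-ratio of two parts takes the same value whatever the closure constant and whether or not a fourth part is present. The choice of the log-ratio as the test quantity and the helper names are assumptions made for illustration.

```python
import math

def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def log_ratio(x, i, j):
    """Natural logarithm of the ratio between parts i and j (zero-based)."""
    return math.log(x[i] / x[j])

x_unit = [0.1, 0.2, 0.1, 0.6]        # parts per unit, kappa = 1
x_percent = closure(x_unit, 100.0)   # same composition in percent
x_sub = closure(x_unit[:3])          # 3-part subcomposition, re-closed

lr_unit = log_ratio(x_unit, 0, 1)
lr_percent = log_ratio(x_percent, 0, 1)
lr_sub = log_ratio(x_sub, 0, 1)
print(lr_unit, lr_percent, lr_sub)
```

By contrast, the raw value of the first part changes from 0.1 to 0.25 when the fourth part is dropped, which is the kind of contradiction the coherence requirement rules out.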


why a new geometry on the simplex?

in real space we add vectors, we multiply them by a constant, we look for orthogonality between vectors, we look for distances between points, ...

possible because $\mathbb{R}^D$ is a linear vector space

BUT Euclidean geometry is not a proper geometry for compositional data because

results might not be in the simplex when we add compositional vectors, multiply them by a constant, or compute confidence regions

Euclidean differences are not always reasonable: from 0.05% to 0.10% the amount is doubled; from 50.05% to 50.10% the increase is negligible
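The percentage example can be put into numbers. This small sketch compares the absolute difference with the logarithm of the ratio for both pairs of values; using the log-ratio as the relative measure is an assumption made here for illustration, in line with the log-ratio approach mentioned earlier.

```python
import math

# Percentages quoted on the slide.
small_before, small_after = 0.05, 0.10
large_before, large_after = 50.05, 50.10

# Absolute (Euclidean) differences: essentially the same in both cases.
abs_small = small_after - small_before
abs_large = large_after - large_before

# Relative differences on a log scale: very different in the two cases.
log_small = math.log(small_after / small_before)
log_large = math.log(large_after / large_before)

print(abs_small, abs_large)
print(log_small, log_large)
```

Both absolute differences are 0.05 percentage points, while the log-ratio is ln 2 in the first case and roughly a thousandth in the second.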


basic operations

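closure of $\mathbf{z} = [z_1, z_2, \ldots, z_D] \in \mathbb{R}^D_+$

$$\mathcal{C}[\mathbf{z}] = \left[ \frac{\kappa \cdot z_1}{\sum_{i=1}^{D} z_i},\ \frac{\kappa \cdot z_2}{\sum_{i=1}^{D} z_i},\ \ldots,\ \frac{\kappa \cdot z_D}{\sum_{i=1}^{D} z_i} \right]$$

perturbation of $\mathbf{x} \in S^D$ by $\mathbf{y} \in S^D$

$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}[x_1 y_1, x_2 y_2, \ldots, x_D y_D]$$

powering of $\mathbf{x} \in S^D$ by $\alpha \in \mathbb{R}$

$$\alpha \odot \mathbf{x} = \mathcal{C}[x_1^{\alpha}, x_2^{\alpha}, \ldots, x_D^{\alpha}]$$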
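The three operations translate almost literally into code. The following is a minimal sketch with κ = 1 as default; the function names `closure`, `perturb` and `power` and the example values are illustrative and not taken from the slides.

```python
def closure(z, kappa=1.0):
    """C[z]: rescale a vector of positive parts so that it sums to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def perturb(x, y, kappa=1.0):
    """x (+) y: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)], kappa)

def power(alpha, x, kappa=1.0):
    """alpha (.) x: component-wise power followed by closure."""
    return closure([a ** alpha for a in x], kappa)

x = [0.2, 0.3, 0.5]
p = [0.1, 0.1, 0.8]

shifted = perturb(x, p)   # plays the role of a translation in the simplex
scaled = power(0.2, x)    # plays the role of multiplication by a scalar
print(shifted)
print(scaled)
```

Both results are again positive vectors summing to 1, so the operations never leave the simplex.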


interpretation of perturbation and powering

[Figure: two ternary diagrams, each with vertices A, B, C]

left: perturbation of initial compositions (◦) by p = [0.1, 0.1, 0.8] resulting in compositions (⋆)

right: powering of compositions (⋆) by α = 0.2 resulting in compositions (◦)


comments

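closure = projection of a point in $\mathbb{R}^D_+$ on $S^D$

points on a ray are projected onto the same point

a ray in $\mathbb{R}^D_+$ is an equivalence class

the point on $S^D$ is a representative of the class

a generalization to other representatives is possible

for $\mathbf{z} \in \mathbb{R}^D_+$ and $\mathbf{x} \in S^D$, $\mathbf{x} \oplus (\alpha \odot \mathbf{z}) = \mathbf{x} \oplus (\alpha \odot \mathcal{C}[\mathbf{z}])$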
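The last identity follows in one step from the definition of closure. The short derivation below is a sketch under one assumption, stated in its first line: powering is applied to an unclosed vector of positive parts with the same formula as for a composition. The symbol $s$ for the sum of the parts is introduced here for convenience.

```latex
% Assumption: for z in R^D_+, alpha \odot z := C[z_1^alpha, ..., z_D^alpha].
\begin{aligned}
s &= \sum_{i=1}^{D} z_i, \qquad \mathcal{C}[\mathbf{z}] = \frac{\kappa}{s}\,\mathbf{z}, \\
\alpha \odot \mathcal{C}[\mathbf{z}]
  &= \mathcal{C}\!\left[\left(\tfrac{\kappa}{s}\right)^{\alpha} z_1^{\alpha}, \ldots,
     \left(\tfrac{\kappa}{s}\right)^{\alpha} z_D^{\alpha}\right]
   = \mathcal{C}\!\left[z_1^{\alpha}, \ldots, z_D^{\alpha}\right]
   = \alpha \odot \mathbf{z}, \\
\mathbf{x} \oplus (\alpha \odot \mathcal{C}[\mathbf{z}])
  &= \mathbf{x} \oplus (\alpha \odot \mathbf{z}).
\end{aligned}
```

The common positive factor $(\kappa/s)^{\alpha}$ cancels inside the closure, which is why every point on a ray gives the same result.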


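vector space structure of $(S^D, \oplus, \odot)$

commutative group structure of $(S^D, \oplus)$
1 commutativity: $\mathbf{x} \oplus \mathbf{y} = \mathbf{y} \oplus \mathbf{x}$
2 associativity: $(\mathbf{x} \oplus \mathbf{y}) \oplus \mathbf{z} = \mathbf{x} \oplus (\mathbf{y} \oplus \mathbf{z})$
3 neutral element: $\mathbf{e} = \mathcal{C}[1, 1, \ldots, 1]$ = barycentre of $S^D$
4 inverse of $\mathbf{x}$: $\mathbf{x}^{-1} = \mathcal{C}[x_1^{-1}, x_2^{-1}, \ldots, x_D^{-1}]$ ⇒ $\mathbf{x} \oplus \mathbf{x}^{-1} = \mathbf{e}$ and $\mathbf{x} \oplus \mathbf{y}^{-1} = \mathbf{x} \ominus \mathbf{y}$

properties of powering
1 associativity: $\alpha \odot (\beta \odot \mathbf{x}) = (\alpha \cdot \beta) \odot \mathbf{x}$
2 distributivity 1: $\alpha \odot (\mathbf{x} \oplus \mathbf{y}) = (\alpha \odot \mathbf{x}) \oplus (\alpha \odot \mathbf{y})$
3 distributivity 2: $(\alpha + \beta) \odot \mathbf{x} = (\alpha \odot \mathbf{x}) \oplus (\beta \odot \mathbf{x})$
4 neutral element: $1 \odot \mathbf{x} = \mathbf{x}$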
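The listed properties can be spot-checked numerically. The sketch below evaluates several of them for one arbitrary pair of compositions and two arbitrary scalars; it is a sanity check, not a proof, and the helper names and test values are illustrative.

```python
def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def perturb(x, y):
    """Perturbation: component-wise product, then closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    """Powering: component-wise power, then closure."""
    return closure([a ** alpha for a in x])

def inverse(x):
    """Inverse element with respect to perturbation."""
    return power(-1.0, x)

def close_to(u, v, tol=1e-12):
    """True when two vectors agree component-wise within tol."""
    return all(abs(a - b) < tol for a, b in zip(u, v))

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])
e = closure([1.0, 1.0, 1.0])   # neutral element (barycentre)
alpha, beta = 0.5, -1.5

checks = {
    "commutativity": close_to(perturb(x, y), perturb(y, x)),
    "neutral element": close_to(perturb(x, e), x),
    "inverse": close_to(perturb(x, inverse(x)), e),
    "associativity of powering": close_to(
        power(alpha, power(beta, x)), power(alpha * beta, x)),
    "distributivity 1": close_to(
        power(alpha, perturb(x, y)), perturb(power(alpha, x), power(alpha, y))),
    "distributivity 2": close_to(
        power(alpha + beta, x), perturb(power(alpha, x), power(beta, x))),
}
print(checks)
```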


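inner product space structure of $(S^D, \oplus, \odot)$

inner product:
$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \ln\frac{x_i}{x_j} \, \ln\frac{y_i}{y_j}, \qquad \mathbf{x}, \mathbf{y} \in S^D$$

norm:
$$\|\mathbf{x}\|_a = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln\frac{x_i}{x_j} \right)^2}, \qquad \mathbf{x} \in S^D$$

distance:
$$d_a(\mathbf{x}, \mathbf{y}) = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2}, \qquad \mathbf{x}, \mathbf{y} \in S^D$$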

Aitchison geometry on the simplex
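The inner product, norm and distance can be computed directly from the double sums over log-ratios. The sketch below does exactly that, without any attempt at efficiency; the function names and example compositions are illustrative.

```python
import math

def inner(x, y):
    """Aitchison inner product via the double sum over log-ratios."""
    D = len(x)
    total = 0.0
    for i in range(D):
        for j in range(D):
            total += math.log(x[i] / x[j]) * math.log(y[i] / y[j])
    return total / (2 * D)

def norm(x):
    """Aitchison norm: square root of the inner product of x with itself."""
    return math.sqrt(inner(x, x))

def distance(x, y):
    """Aitchison distance via the double sum of log-ratio differences."""
    D = len(x)
    total = 0.0
    for i in range(D):
        for j in range(D):
            diff = math.log(x[i] / x[j]) - math.log(y[i] / y[j])
            total += diff * diff
    return math.sqrt(total / (2 * D))

x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]
d_xy = distance(x, y)
print(inner(x, y), norm(x), d_xy)
```

Because only ratios of parts enter the formulas, rescaling either argument by a positive constant leaves all three quantities unchanged.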


properties of the Aitchison geometry

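distance and perturbation: $d_a(\mathbf{p} \oplus \mathbf{x}, \mathbf{p} \oplus \mathbf{y}) = d_a(\mathbf{x}, \mathbf{y})$

distance and powering: $d_a(\alpha \odot \mathbf{x}, \alpha \odot \mathbf{y}) = |\alpha| \, d_a(\mathbf{x}, \mathbf{y})$

compositional lines: $\mathbf{y} = \mathbf{x}_0 \oplus (\alpha \odot \mathbf{x})$ ($\mathbf{x}_0$ = starting point, $\mathbf{x}$ = leading vector)

orthogonal lines: $\mathbf{y}_1 = \mathbf{x}_0 \oplus (\alpha_1 \odot \mathbf{x}_1)$, $\mathbf{y}_2 = \mathbf{x}_0 \oplus (\alpha_2 \odot \mathbf{x}_2)$,

$\mathbf{y}_1 \perp \mathbf{y}_2 \iff \langle \mathbf{x}_1, \mathbf{x}_2 \rangle_a = 0$

(the inner product of the leading vectors is zero)

parallel lines: $\mathbf{y}_1 = \mathbf{x}_0 \oplus (\alpha \odot \mathbf{x}) \parallel \mathbf{y}_2 = \mathbf{p} \oplus \mathbf{x}_0 \oplus (\alpha \odot \mathbf{x})$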
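The two distance properties and the compositional line can be checked numerically. This is a minimal sketch with arbitrary example values; the helper names, the compositions and the choice α = −2 are illustrative assumptions.

```python
import math

def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def perturb(x, y):
    """Perturbation: component-wise product, then closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    """Powering: component-wise power, then closure."""
    return closure([a ** alpha for a in x])

def distance(x, y):
    """Aitchison distance via the double sum of log-ratio differences."""
    D = len(x)
    total = sum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(D) for j in range(D)
    )
    return math.sqrt(total / (2 * D))

x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]
p = [0.1, 0.1, 0.8]
alpha = -2.0

d_xy = distance(x, y)
d_shifted = distance(perturb(p, x), perturb(p, y))         # expected: d_xy
d_powered = distance(power(alpha, x), power(alpha, y))     # expected: |alpha| * d_xy

# Three points on the compositional line through x0 = p with leading vector x.
line = [perturb(p, power(t, x)) for t in (-1.0, 0.0, 1.0)]
print(d_xy, d_shifted, d_powered)
```

The point at parameter 0 coincides with the starting point, and consecutive points at unit parameter steps are one norm of the leading vector apart.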


orthogonal compositional lines

[Figure: two ternary diagrams, each with vertices x, y, z]

orthogonal grids in $S^3$, equally spaced, 1 unit in Aitchison distance; the right grid is rotated 45° with respect to the left grid


circles and other geometric figures

[Figure: ternary diagram with vertices x1, x2, x3]


advantages of Euclidean spaces

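orthonormal basis can be constructed: $\{\mathbf{e}_1, \ldots, \mathbf{e}_{D-1}\}$

coordinates obey the rules of real Euclidean space:

$\mathbf{x} \in S^D \Rightarrow \mathbf{y} = [y_1, \ldots, y_{D-1}] \in \mathbb{R}^{D-1}$, with $y_i = \langle \mathbf{x}, \mathbf{e}_i \rangle_a$

standard methods can be directly applied to coordinates

expressing results as compositions is easy:

if $h : S^D \to \mathbb{R}^{D-1}$ assigns to each $\mathbf{x} \in S^D$ its coordinates, i.e. $h(\mathbf{x}) = \mathbf{y}$, then

$$h^{-1}(\mathbf{y}) = \mathbf{x} = \bigoplus_{i=1}^{D-1} y_i \odot \mathbf{e}_i$$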
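The round trip from a composition to its coordinates and back can be sketched for D = 3. The slides do not specify a basis, so the one used below is an assumption: each basis element is the closure of the exponential of a unit vector whose components sum to zero, which is one of many possible orthonormal bases. The names `V`, `BASIS`, `h` and `h_inv` are illustrative.

```python
import math

def closure(z, kappa=1.0):
    """Rescale positive parts so that they sum to kappa."""
    total = sum(z)
    return [kappa * v / total for v in z]

def perturb(x, y):
    """Perturbation: component-wise product, then closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    """Powering: component-wise power, then closure."""
    return closure([a ** alpha for a in x])

def inner(x, y):
    """Aitchison inner product via the double sum over log-ratios."""
    D = len(x)
    total = sum(
        math.log(x[i] / x[j]) * math.log(y[i] / y[j])
        for i in range(D) for j in range(D)
    )
    return total / (2 * D)

# Assumed basis for D = 3: orthonormal real vectors with zero sum,
# mapped into the simplex by exponentiation and closure.
V = [
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],
]
BASIS = [closure([math.exp(c) for c in v]) for v in V]

def h(x):
    """Coordinates of x: inner products with the basis elements."""
    return [inner(x, e) for e in BASIS]

def h_inv(y):
    """Composition with coordinates y: perturbation of powered basis elements."""
    x = closure([1.0, 1.0, 1.0])  # start from the neutral element
    for y_i, e in zip(y, BASIS):
        x = perturb(x, power(y_i, e))
    return x

x = [0.2, 0.3, 0.5]
coords = h(x)
x_back = h_inv(coords)
print(coords, x_back)
```

Once the coordinates are available, ordinary real-valued statistics can be applied to them and the results mapped back with `h_inv`.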


conclusions

the Aitchison geometry of the simplex offers a new tool to analyse CoDa

the geometry is apparently complex, but it is completely equivalent to standard Euclidean geometry in real space

the key is to use a proper representation in coordinates