eric-stafford
What we Measure vs. What we Want to Know
"Not everything that counts can be counted, and not everything that can be counted counts." - Albert Einstein
Scales, Transformations, Vectors and Multi-Dimensional Hyperspace
• All measurement is a proxy for what is really of interest - The Relationship between them
• The scale of measurement and the scale of analysis and reporting are not always the same - Transformations
• We often make measurements that are highly correlated - Multi-component Vectors
Multivariate Description
Gulls Variables
[Figure: scatterplot matrix of the gulls variables Weight, Wing, Bill and H.and.B]
Scree Plot
[Figure: scree plot of gulls.pca2; Variances (0.0 to 3.0) for Comp.1, Comp.2, Comp.3 and Comp.4]
Output
> gulls.pca2$loadings
Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4
Weight  -0.505 -0.343  0.285  0.739
Wing    -0.490  0.852 -0.143  0.116
Bill    -0.500 -0.381 -0.742 -0.232
H.and.B -0.505 -0.107  0.589 -0.622
> summary(gulls.pca2)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     1.8133342 0.52544623 0.47501980
Proportion of Variance 0.8243224 0.06921464 0.05656722
Cumulative Proportion  0.8243224 0.89353703 0.95010425
Bi-Plot
[Figure: biplot of the 362 gull observations on Comp.1 and Comp.2, with arrows for Weight, Wing, Bill and H.and.B]
Environmental Gradients
Inferring Gradients from Attribute Data (e.g. species)
Indirect Gradient Analysis
• Environmental gradients are inferred from species data alone
• Three methods:
– Principal Component Analysis - linear model
– Correspondence Analysis - unimodal model
– Detrended CA - modified unimodal model
Terschelling Dune Data
PCA gradient - site plot
[Figure: site plot on PCA 1 and PCA 2; sites marked by Management: biodynamic, hobby, nature, standard]
PCA Plot for Dune Species Data
PCA gradient - site/species biplot
[Figure: site/species biplot on Axis 1 and Axis 2; species labels: Ach mil, Agr sto, Alo gen, Ant odo, Bel per, Bro hor, Ele pal, Ely rep, Jun art, Jun buf, Leo aut, Lol per, Pla lan, Poa pra, Poa tri, Ran fla, Rum ace, Sag pro, Tri rep, Bra rut]
Biplot for Dune Species Data
Site groups annotated on the plot: standard; nature; biodynamic & hobby
Making Effective Use of Environmental Variables
Approaches
• Use single responses in linear models of environmental variables
• Use axes of a multivariate dimension reduction technique as responses in linear models of environmental variables
• Constrain the multivariate dimension reduction into the factor space defined by the environmental variables
Dimension Reduction (Ordination) ‘Constrained’ by the Environmental Variables
Constrained?
Working with the Variability that we Can Explain
• Start with all the variability in the response variables.
• Replace the original observations with their fitted values from a model employing the environmental variables as explanatory variables (discarding the residual variability).
• Carry out gradient analysis on the fitted values.
Unconstrained/Constrained
• Unconstrained ordination axes correspond to the directions of the greatest variability within the data set.
• Constrained ordination axes correspond to the directions of the greatest variability of the data set that can be explained by the environmental variables.
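The first two steps of "working with the variability that we can explain" can be sketched in a few lines. This is a minimal illustration, not the deck's own analysis: the `moisture` values, the two species vectors and the helper names are all hypothetical, and a single predictor with ordinary least squares stands in for a full environmental model. The ordination itself would then be run on `fitted` instead of on the raw responses.

```python
def fitted_values(x, y):
    """Least-squares fitted values of y regressed on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [my + slope * (xi - mx) for xi in x]


def sum_of_squares(y):
    """Sum of squared deviations from the mean."""
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 for yi in y)


# Hypothetical data: one environmental variable, two species responses.
moisture = [1, 2, 3, 4, 5]
species = {"sp_a": [2, 4, 6, 8, 10], "sp_b": [5, 3, 6, 2, 4]}

# Replace each response by its fitted values (the residuals are discarded).
fitted = {name: fitted_values(moisture, y) for name, y in species.items()}

total = sum(sum_of_squares(y) for y in species.values())
explained = sum(sum_of_squares(f) for f in fitted.values())
print(f"explained / total variability = {explained / total:.3f}")
```

An unconstrained ordination works with `total`; a constrained one works only with the `explained` part.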
Direct Gradient Analysis
• Environmental gradients are constructed from the relationship between species and environmental variables
• Three methods:
– Redundancy Analysis - linear model
– Canonical (or Constrained) Correspondence Analysis - unimodal model
– Detrended CCA - modified unimodal model
Dune Data Unconstrained
[Figure: unconstrained ordination of the dune data on CA1 and CA2; species shown by abbreviated names (Belper, Empnig, Junbuf, ...) and sites numbered 1 to 20]
Dune Data Constrained
[Figure: constrained ordination of the dune data on CCA1 and CCA2; species shown by abbreviated names, sites numbered 1 to 20, with environmental variables A1, Manure (contrasts Manure.L, Manure.Q, Manure.C, Manure^4; levels Manure0 to Manure4) and Moisture (contrasts Moisture.L, Moisture.Q, Moisture.C; levels Moisture1, Moisture2, Moisture4, Moisture5)]
How Similar are Objects/Samples/Individuals/Sites?
Similarity approaches, or what do we mean by similar?
Different types of data
example
Continuous data : height
Categorical data
ordered (ordinal) : growth rate very slow, slow, medium, fast, very fast
not ordered (nominal) : fruit colour yellow, green, purple, red, orange
Binary data : fruit / no fruit
Different scales of measurement
example
Large Range : soil ion concentrations
Restricted Range : air pressure
Constrained : proportions
Large numbers : altitude
Small numbers : attribute counts
Do we standardise measurement scales to make them equivalent? If so what do we lose?
Similarity matrix
We define a similarity between units, like the correlation between continuous variables.
(also can be a dissimilarity or distance matrix)
A similarity can be constructed as an average of the similarities between the units on each variable.
(can use weighted average)
This provides a way of combining different types of variables.
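As a rough sketch of that averaging idea, the code below scores each variable on a 0 to 1 scale and then takes a (possibly weighted) mean. The per-variable rules are an assumption in the spirit of Gower's coefficient (range-scaled difference for continuous variables, exact match for categorical and binary ones); the function names, the example plants and the height range of 10 are hypothetical.

```python
def variable_similarity(a, b, kind, value_range=None):
    """Similarity in [0, 1] between two units on a single variable."""
    if kind == "continuous":
        # 1 minus the absolute difference, scaled by the variable's range
        return 1.0 - abs(a - b) / value_range
    # categorical or binary: 1 if the values match, 0 otherwise
    return 1.0 if a == b else 0.0


def average_similarity(unit_a, unit_b, kinds, ranges, weights=None):
    """Weighted average of the per-variable similarities."""
    weights = weights or [1.0] * len(kinds)
    total = sum(
        w * variable_similarity(a, b, k, r)
        for a, b, k, r, w in zip(unit_a, unit_b, kinds, ranges, weights)
    )
    return total / sum(weights)


# Hypothetical units: (height, fruit colour, fruit present)
kinds = ["continuous", "categorical", "binary"]
ranges = [10.0, None, None]  # assumed range of heights in the data set
plant_1 = (12.0, "red", 1)
plant_2 = (14.0, "red", 0)

s_equal = average_similarity(plant_1, plant_2, kinds, ranges)
s_weighted = average_similarity(plant_1, plant_2, kinds, ranges, weights=[2, 1, 1])
print(s_equal, s_weighted)
```

Scaling the continuous variable by its range is one answer to the standardisation question raised earlier; what it loses is any information carried by the original units.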
Distance metrics
relevant for continuous variables:
Euclidean
city block or Manhattan
[Figure: two diagrams of points A and B, one for each distance]
(also many other variations)
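The two distances named above differ only in how the per-variable differences are combined. A minimal sketch, with hypothetical points `A` and `B`:

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


A = (0.0, 0.0)
B = (3.0, 4.0)
print(euclidean(A, B))  # 5.0
print(manhattan(A, B))  # 7.0
```

Both functions accept any number of variables, so the same code applies to units measured on more than two continuous variables.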
Similarity coefficients for binary data
simple matching
count if both units 0 or both units 1
Jaccard
count only if both units 1
(also many other variants, eg Bray-Curtis)
simple matching can be extended to categorical data
[Figure: two 2 x 2 grids of the value pairs 0,1 / 1,1 / 0,0 / 1,0, one for each coefficient]
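The difference between the two coefficients is whether joint absences (0,0) count as agreement. The sketch below makes that explicit; the site vectors are hypothetical presence/absence records, and returning 0.0 when neither unit has any presence is an assumed convention.

```python
def match_counts(u, v):
    """Counts of (1,1), (1,0), (0,1) and (0,0) pairs between two binary vectors."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching(u, v):
    """Proportion of variables on which the units agree (both 0 or both 1)."""
    a, b, c, d = match_counts(u, v)
    return (a + d) / (a + b + c + d)


def jaccard(u, v):
    """Shared presences as a proportion of variables present in either unit."""
    a, b, c, _ = match_counts(u, v)
    # Assumed convention: similarity 0.0 when neither unit has any presence.
    return a / (a + b + c) if (a + b + c) else 0.0


site_1 = [1, 1, 0, 0, 0, 1]
site_2 = [1, 0, 0, 0, 1, 1]
print(simple_matching(site_1, site_2))  # 4 agreements out of 6
print(jaccard(site_1, site_2))          # 2 shared presences out of 4
```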
A Distance Matrix
Uses of Distances
Distances/dissimilarities can be used:
• to explore dimensionality in the data using principal coordinate analysis (PCO or PCoA)
• as a basis for clustering/classification
UK Wet Deposition Network
[Figure: two-dimensional ordination (Dim1, Dim2) of the network sites: Goonhilly, Lough Navar, Achanarras, Flatford Mill, Strathvaich Dam, Yarner Wood, Barcombe Mills, Stoke Ferry, Hillsborough Forest, Tycanol Wood, Allt a Mharcaidh, Glen Dye, Driby, Woburn, Balquhidder 2, Compton, High Muffles, Bottesford, Whiteadder, Pumlumon, Loch Dee, Redesdale, Wardlow Hay Cop, Cow Green Reservoir, Bannisdale]
Grouping methods
Cluster Analysis
• hierarchical
– divisive: put everything together and split (monothetic / polythetic)
– agglomerative: keep everything separate and join the most similar points (classical cluster analysis)
• non-hierarchical
– k-means clustering
Clustering methods
Agglomerative hierarchical
• Single linkage or nearest neighbour
– finds the minimum spanning tree: shortest tree that connects all points
– chaining can be a problem
• Complete linkage or furthest neighbour
– compact clusters of approximately equal size (makes compact groups even when none exist)
• Average linkage methods
– between single and complete linkage
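The three linkage rules differ only in how the distance between two clusters is summarised from the distances between their members. The sketch below is a deliberately naive agglomerative loop written for illustration (not how `hclust` is implemented); the function name and the four points on a line are hypothetical.

```python
def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering on a full distance matrix.

    Returns the merges in order as (members_a, members_b, height).
    """
    combine = {
        "single": min,      # nearest neighbour
        "complete": max,    # furthest neighbour
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_d = [dist[p][q] for p in clusters[i] for q in clusters[j]]
                height = combine(pair_d)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical example: four points on a line.
positions = [0, 1, 3, 7]
dist = [[abs(a - b) for b in positions] for a in positions]

for method in ("single", "complete", "average"):
    heights = [h for _, _, h in agglomerate(dist, method)]
    print(method, heights)
```

On these points the merge order is the same for all three rules, but the merge heights grow fastest under complete linkage and slowest under single linkage, with average linkage in between.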
From Alexandria to Suez
Hierarchical Clustering
[Figure: Cluster Dendrogram of sites CSRA11 to CSRA35 and ALEX01 to ALEX10; hclust (*, "complete") on des.dist; vertical axis Height, 0.0 to 1.0]
Hierarchical Clustering
[Figure: sites plotted as + symbols on CA1 and CA2]
Building and testing models
The approach is the same as for multiple regression, so the same issues arise: variable selection, interactions between variables, etc.
However, statistical tests that rely on distributional assumptions are harder to justify here, so randomisation tests and permutation procedures are used much more widely to evaluate the statistical significance of results.
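A permutation procedure can be sketched as follows: compute the statistic on the observed data, recompute it many times after shuffling the response, and report the share of values at least as extreme as the observed one. The choice of statistic (absolute correlation), the data, the 999 permutations and the seed are all assumptions made for illustration.

```python
import random


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=999, seed=1):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(correlation(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between x and y
        if abs(correlation(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # The observed arrangement counts as one of the permutations.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical data with a strong linear relationship.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
p = permutation_p_value(x, y)
print(p)
```

No distribution is assumed for the statistic; its reference distribution comes from the shuffled data, which is why the same recipe carries over to multivariate statistics.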
Some Examples
Part of Fig 4.
What Technique?
Response variable(s) ...   Predictor(s): No                  Predictor(s): Yes
... is one                 • distribution summary            • regression models
... are many               • indirect gradient analysis      • direct gradient analysis
                             (PCA, CA, DCA, MDS)             • constrained cluster analysis
                           • cluster analysis                • discriminant analysis (CVA)
Raw Data
[Figure: Diameter against Height]
Linear Regression
[Figure: Diameter against Height]
Two Regressions
[Figure: Diameter against Height]
Principal Components
[Figure: Diameter against Height]
Models of Species Response
There are (at least) two models:-
• Linear - species increase or decrease along the environmental gradient
• Unimodal - species rise to a peak somewhere along the environmental gradient and then fall again
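[Figure: Linear response curve; gradient axis -0.4 to +0.4, response axis 0.0 to 7.0]
[Figure: Unimodal response curve; gradient axis -2.5 to +3.5, response axis 0.0 to 250.0]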
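The two response models can be written as simple functions of position along the gradient. The Gaussian curve used for the unimodal case is an assumption (a common choice, not something the slides specify), and the parameter names `peak`, `optimum` and `tolerance`, like all the numbers, are hypothetical.

```python
import math


def linear_response(x, intercept, slope):
    """Abundance changes steadily along the gradient."""
    return intercept + slope * x


def unimodal_response(x, peak, optimum, tolerance):
    """Gaussian curve: abundance is highest at `optimum` and falls away on both sides."""
    return peak * math.exp(-((x - optimum) ** 2) / (2 * tolerance ** 2))


gradient = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [linear_response(x, 3.0, 1.5) for x in gradient]
unimodal = [unimodal_response(x, 100.0, 0.0, 1.0) for x in gradient]
print(linear)
print([round(v, 1) for v in unimodal])
```

Over a short stretch of gradient on one side of the optimum, the unimodal curve is close to monotonic, which is one reason a linear method can be adequate for short gradients.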
Non-metric multidimensional scaling
NMDS maps the observed dissimilarities onto an ordination space by trying to preserve their rank order in a small number of dimensions (often 2), but the solution depends on the number of dimensions chosen
it is like a non-linear version of PCO
define a stress function and look for the mapping with minimum stress (e.g. sum of squared residuals in a monotonic regression of the distances in NMDS space on the original dissimilarities)
the fit needs an iterative process and convergence to the best solution is not guaranteed, so try many different starting points
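To make the stress function concrete, the sketch below evaluates it for one candidate configuration; it does not perform the iterative search. It assumes the normalised form usually called Kruskal's stress-1 (the squared residuals divided by the summed squared distances, then square-rooted), fits the monotonic regression with the pool-adjacent-violators algorithm, and ignores ties in the dissimilarities. All names and numbers are hypothetical.

```python
import math


def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def stress(dissimilarities, config_distances):
    """Stress-1 for matching lists of pairwise dissimilarities and distances."""
    order = sorted(range(len(dissimilarities)), key=lambda k: dissimilarities[k])
    d = [config_distances[k] for k in order]  # distances in rank order
    d_hat = isotonic_fit(d)                   # monotonic regression
    residual = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    return math.sqrt(residual / sum(a ** 2 for a in d))


observed = [1.0, 2.0, 3.0]  # dissimilarities for three pairs of units
print(stress(observed, [10.0, 20.0, 50.0]))  # rank order preserved
print(stress(observed, [1.0, 3.0, 2.0]))     # last two pairs out of order
```

Only the rank order of `observed` enters the calculation, which is what makes the method non-metric.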
Procrustes rotation
[Figure: two plots titled "Procrustes errors", each on Dimension 1 and Dimension 2]
used to compare graphically two separate ordinations
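For two-dimensional ordinations the best rotation has a closed form, so the idea can be sketched without a linear algebra library. This simplified version only centres both configurations and rotates one onto the other; the reflection and scaling steps that a full Procrustes analysis may include are left out by assumption. The function names and the four-point example are hypothetical.

```python
import math


def centre(points):
    """Shift a 2-D configuration so that its centroid is at the origin."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(p[0] - mx, p[1] - my) for p in points]


def procrustes_rotate(target, other):
    """Rotate `other` onto `target`; return rotated points, per-point errors, angle."""
    X, Y = centre(target), centre(other)
    s = sum(x[1] * y[0] - x[0] * y[1] for x, y in zip(X, Y))
    c = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(X, Y))
    theta = math.atan2(s, c)  # angle minimising the summed squared errors
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [(y[0] * cos_t - y[1] * sin_t, y[0] * sin_t + y[1] * cos_t) for y in Y]
    errors = [math.hypot(x[0] - r[0], x[1] - r[1]) for x, r in zip(X, rotated)]
    return rotated, errors, theta


# Hypothetical ordinations: `second` is `first` turned by a quarter turn and shifted.
first = [(0, 0), (2, 0), (2, 1), (0, 1)]
second = [(5, 5), (5, 3), (6, 3), (6, 5)]
rotated, errors, theta = procrustes_rotate(first, second)
print(math.degrees(theta), max(errors))
```

The `errors` list holds the length of the arrow that a Procrustes error plot would draw for each unit; here they are essentially zero because the two configurations differ only by rotation and translation.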